diff --git "a/Data/reviews.csv" "b/Data/reviews.csv" new file mode 100644--- /dev/null +++ "b/Data/reviews.csv" @@ -0,0 +1,47881 @@ +id,number,forum,replyto,title,review,rating_int,confidence_int,conf_name +yAY2qHJlsnV,4,2AL06y9cDE-,2AL06y9cDE-,Increasing robustness via optimal control,"Keeping the performance of deep neural networks against data perturbations is an important and open problem. The authors propose an optimal control-based approach by taking dynamical systems perspective. The proposed method sounds intuitive and efficient. Authors supply theoretical analysis and (small) experimental evaluation. Overall, I believe paper is a good. However, I would like to get some points clarified: +a) Authors used manifold assumption (which is a reasonable assumption for many problems) to define running loss (eq 3). (If I am not mistaken) They choose a quadratic loss to have a tractable optimization problem. However, under these assumptions, one may choose many different losses. Would you please comment on the form of the loss and its impact to the method? +b) Let’s assume dynamical systems perspective is a right perspective for analysing deep neural networks (to be honest I don’t have any criticism about this). To use control theoretic tools, one needs to comment on controllability and observability of the controlled system. I suspect these mentioned properties are a function of the neural network architecture or do authors think the proposed method (as shown in Figure 1) makes each and every deep neural network architecture controllable and/or observable? I would like to hear authors perspective on these issues. +c) As I mentioned before, the empirical study is quite small, and I didn’t see any baseline (do I miss something here). Do authors consider extending their empirical study and compare their method with some baselines. + +I would like to emphasize one more time that, I am positive about the paper. However, I would like to note that I am not expert in the field and I am open to change my view in both direction. + + +",7,3.0,ICLR2021 +DqhIHj9Qt9,3,WPO0vDYLXem,WPO0vDYLXem,"Valuable framework, precisions required","The authors propose a new framework for hyperparameter optimization and transfer across incremental modifications to a given algorithm and its search space, a process called developer adjustments in the paper. The authors then propose a few strategies to transfer knowledge from previous HPO runs and evaluate them on a series of simulated benchmarks. Results show the value added by transferring information from previous runs, as well as the surprising efficiency of simply reusing the best found hyperparameters from the previous run. + +Strong points: + +- The framework is simple and clearly introduced. +- Extensive experiments help bring to light the advantage of transferring across adjustments. +- The paper is very well written. + +Weak points: + +- Not enough details on benchmarks, more on this below +- The use of simulated benchmarks with surrogate models introduces noise in the evaluation +- Comparisons with more baselines would be beneficial. RF and/or GP-based HPO methods are extremely popular and would have been easy to integrate with the best-first baseline. + +Recommendation: + +The contributions are simple and incremental, and clearly rooted in machine learning engineering, however I still think they could be beneficial as a whole to the community given the extensive experiments realized. 
I have some issues with the experiments, the lack of details, and the baselines, but those issues are mostly fixable. I'll give the paper a weak accept for now.
+
+Extra comments:
+
+You do not specify which benchmarks are based on lookup tables and which ones are based on surrogate models. From looking at the search spaces, I would assume that the SVM and XGB benchmarks are modeled via surrogate benchmarks and the FCN and NAS benchmarks are lookup tables, but this should be made explicit in the paper (or appendix). Parameters used for the benchmark surrogate model should also be given (if the defaults of Eggensperger are used, simply mention this). It is also not clear what underlying datasets are used; this bears some importance and should be mentioned, even if only in the Appendix.
+
+On surrogate model benchmarks: It can be seen in (Eggensperger et al. 2015, Figure 2) that the ordering of methods can shift due to noise in the surrogate model (a random forest?). This is likely going to have a bigger impact when trying to measure the speedup, which is measured when a method reaches a certain threshold of performance. This threshold is likely to be met during the convergence phase of algorithms, and this phase appears noisier (i.e. looking at how the transition phases differ between the true benchmark and the RF surrogate benchmark in Eggensperger et al. 2015). Have you given this any thought? Have you compared experiments with a few runs on a real benchmark?
+
+The method you end up recommending only has its detailed performance shown in the appendix. This feels counterintuitive to me. This result should be featured in the paper itself. This is perhaps due to the use of those split violin plots, which force you to display only two methods per plot. Maybe you should display a group of X single-sided violin plots where X is the number of methods you are trying to compare.
+
+I think it is misleading to portray everything in terms of speedup or improvement over the ""TPE solution with X iterations"". A more strictly meaningful metric here is accuracy (assuming there is only one dataset per benchmark). Assuming the performance to beat by the original TPE was an 11% error rate, there is a big difference between a method which was able to achieve a 10% error rate and a method which was able to achieve a 5% error rate, yet both will be assessed by how quickly they achieved x < 10% error rate. I can't seem to find such figures in the appendices.
+
+Typos:
+
+- Section 3.1 page 3, argmax g(x) / b(x) << you mean g(x) / l(x)?
+- appendix G, you wrote TPE2 instead of T2PE",6,4.0,ICLR2021
+dAyVns7P2pW,3,3ZeGLibhFo0,3ZeGLibhFo0,"Under the survival analysis setting, this paper concentrates on counterfactual inference for the individualized treatment effect, especially the hazard ratio. The bound proposed in Shalit et al., 2017 is adopted for model learning by minimizing the upper bound, which consists of a factual loss and an integral probability metric. The proposed factual loss is similar to Chapfuwa et al. (2018) for the non-informative censoring case, with extra loss terms added for the informative censoring case. Simulation results are shown. ","Disclosure: I found this paper online during the review process https://arxiv.org/abs/2006.07756
+
+
+This is a comprehensive paper with an interesting application of counterfactual inference under the survival analysis setting. Overall, my recommendation is to accept.
+• It is nice that the proposed nonparametric approach in this paper can adjust for bias from confounding due to covariate-dependent selection bias and censoring (informative or non-informative).
+• Under three criteria [concordance index (C-Index) (Harrell Jr et al., 1984), mean coefficient of variation (COV) and calibration slope (C-slope) (Chapfuwa et al., 2019)] and three datasets [FRAMINGHAM, ACTG, semi-synthetic ACTG], the proposed method is compared with seven others, including survival Bayesian additive regression trees (Surv-BART) (Sparapani et al., 2016) [using a nonparametric Kaplan-Meier based estimator] and the Cox proportional hazards model (using the real HR form, using three normalized weighting schemes).
+• P6, equation (10): the nonparametric form is a natural adaptation of the KM estimator. We know S'(t) = -f(t). I wonder about the motivation for choosing a linear approximation to S, and am curious whether the cardiovascular and HIV data adopted happen to have an S that is not so curved. Could you shed light on these?
+• P3, the assumption of “no unobserved confounders or ignorability” sounds strong. I understand the mathematical challenge of relaxing it. Maybe for future research.
+• The overall presentation is nice. The organization of a few places might be improved to help first-time readers follow, e.g., ITE is initially defined on P3 without an example until two paragraphs below, h_{A} is first mentioned with no prior definition and no explicit mathematical relation to p(T|X), and it should be briefly specified that “Do (A=a)” denotes the effect of an intervention.
+• Minor issues: align the symbols used across the paper, e.g., add subscripts when defining S'_{i}, m_{i}, ..., i=0, 1, to increase clarity.
+",7,4.0,ICLR2021
+NFfoYVJFxjn,2,ptbb7olhGHd,ptbb7olhGHd,On the Robustness of Sentiment Analysis for Stock Price Forecasting,"Pros:
+The paper is clear with a significant contribution. It performed a sentiment analysis task that can predict trends in the stock market and also showed how an adversary may attack the model using tweets, thereby leading to false price predictions. The methodology and the probabilistic forecasting used are excellent in my opinion.
+
+Cons:
+1. The title of the paper does not describe the actual work done. I suggest that the authors consider giving it a new title to reflect the method they applied.
+2. It is unclear why the attack was done at the test stage but was not shown in the implementation. That is, how would this work be reproduced or be beneficial to the community?
+
+The following can be corrected:
+1. The claim on Page 1, Para 2 of the Introduction needs a citation (Thales used his...)
+2. On page 5, para 1, please briefly describe BERT
+",7,4.0,ICLR2021
+Vi7C9X4Tfp,2,E8fmaZwzEj,E8fmaZwzEj,A great paper that defends against texture-related attacks with a simple solution.,"##########################################################################
+
+Summary:
+
+Studies suggest that CNNs that overly rely on texture features are more vulnerable to adversarial attacks. The authors of this paper propose a simple yet effective method, ""defective convolution"", that randomly ""disables"" neurons in the convolution layer. The authors argue that by doing so, the CNN is encouraged to learn less from object texture and more from features such as shape. The authors support this statement by empirically evaluating the proposed model under multiple perturbation methods.
+
+##########################################################################
+
+Reasons for score:
+
+
+Overall, I vote for accepting.
This paper provides a simple yet novel method to keep CNNs from learning texture-based features. But I found it more important to understand why such a simple method would achieve this effect than to use it to defend against adversarial attacks. I hope the authors could provide more motivation and experiments to understand the effect that defective convolution layers have on CNNs.
+
+
+##########################################################################
+
+Pros:
+
+1. This paper is addressing a very fundamental question about CNNs: how can we change convolution filters so that the model will learn certain visual features (shape) while being less likely to learn others (texture)? Answering this question requires a better understanding of the underlying mechanism of CNNs. This work could serve as an initial step toward answering this question.
+
+2. The authors have conducted experiments on synthetically altered images to show that the defective convolution indeed tends to learn less from texture while putting more emphasis on edges. The comparative experiments in section 4 also empirically support the authors' statement that CNNs with such a property are more robust against transfer-based attacks.
+
+3. This paper is well-written. The introduction section provides sufficient background for the problem with clear intuition for the method the authors proposed. The literature review is sufficient and well-organized.
+
+
+##########################################################################
+
+Cons:
+
+1. This paper needs better motivation. Would this work be applicable to other real-world CV applications where the texture of the object of interest is a confounding factor?
+
+2. The authors argue in 3.1 that by using M_defect, neurons of the conv layers are masked out and therefore local features are hard to preserve. While this is intuitive, a more rigorous analysis would increase the credibility of this statement. What would be a proper mathematical definition of texture? How is it related to locality? Why would masking out spatial locations in the conv layers impact locality?
+
+
+##########################################################################
+
+Suggestions:
+
+1. Please address the concerns in the cons section.
+
+2. The ablation study mostly explores the effect of p and the layers where defective convolution is inserted. Intuitively, I think the spatial locations where the defective neurons are placed would also impact the model's behavior. For instance, if all defective neurons are on the edge of the filter, we essentially reduce a large filter to a smaller one. On the other hand, if the defective neurons are at the center of the filters, then we could discourage the model from learning some continuous patterns. It would be interesting to see how this would affect the model's performance.
+
+
+
+#########################################################################",6,3.0,ICLR2021
+S6PlF27iLaS,4,SyxhaxBKPS,SyxhaxBKPS,Official Blind Review #4,"This paper studies mixed-precision quantization in deep networks where each layer can be either binarized or ternarized. The authors propose an adaptive regularization function that can be pushed to either 2-bit or 3-bit through different parameterizations, in order to automatically determine the precision of each layer. Experiments are performed on the small-scale image classification data sets MNIST and CIFAR-10.
+
+The proposed regularization method is simple and straightforward.
However, many details are not stated clearly enough for reproduction. E.g., since the proposed regularization already promotes binary or ternary weights, why is there still a thresholding operation at the end of Section 3? Is it because the proposed regularization cannot provide strictly binary or ternary weights? Does the method require one more hard binarization/ternarization step after \beta is learned? Indeed, tan(x) is not well-defined when x=pi/2, and the derivative tan'(x) = 1+tan^2(x) can be large when x is near pi/2, so does gradient descent work well in this case?
+
+The experiments are only performed on small-scale data sets. Thus it is hard to tell whether the proposed method also works for larger networks or data sets. Moreover, it is not fair to use ""best validation accuracy"" for comparison with other methods, since the validation set is seen during training and it is not clear if the hyper-parameters of the proposed method are tuned for best performance on the seen validation set. It would be fairer to compare the test accuracy as in the BinaryConnect (BC) paper. Yet another concern is that many recent methods that can train mixed-precision networks are not compared. For instance, the HAQ method [1] searches for the precision of each layer using reinforcement learning; how does the proposed method perform when compared with it?
+
+[1]. Wang, Kuan, et al. ""HAQ: Hardware-Aware Automated Quantization with Mixed Precision."" Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
+",3,,ICLR2020
+uLf22SM8Nr8,2,zeFrfgyZln,zeFrfgyZln,"The work has value in its better performance and the open-sourcing of the proposed method, though it should further justify the performance gain. I would like to vote for a weak accept. ","##########################################################################
+
+Summary:
+
+This paper studies the problem of dense text retrieval, which represents texts as dense vectors for approximate nearest neighbor (ANN) search. Dense text retrieval has two phases. The first phase learns a representation model to project semantically similar texts to vectors with large similarity scores (e.g. inner products or cosine similarity scores). The second phase adopts an ANN search algorithm to index these vectors and process queries. The paper claims key contributions in the first phase. Specifically, (1) The paper introduces a better negative sampling method to sample good dissimilar text pairs for training. (2) The new method enables faster convergence of model learning. (3) The new method leads to 100x faster efficiency than a BERT-based baseline, while achieving almost the same accuracy as the baseline.
+
+##########################################################################
+
+Reasons for score:
+
+Overall, I like the idea of this paper and opt for a weak accept. A carefully designed negative sampling method should be able to outperform baselines that use simple heuristics. The 100X efficiency improvement is very promising. However, the paper could be better in experimental comparison and presentation. For the experimental comparison, a stronger baseline using dense vectors should be included to strengthen the performance claim. For the presentation, many important terms require clear definitions, without which the performance gain is not understandable. It will be good if the authors can address the above two issues in the rebuttal.
+
+##########################################################################
+
+Pros:
+
+1. 
The paper proposes a novel negative sampling method. Based on this method, the paper proposes a new dense text retrieval framework, ANCE. ANCE introduces an asynchronous index refresh to select the most dissimilar text pairs for training in a timely manner.
+
+2. The proposed ANCE achieves faster model training and equally accurate text retrieval when compared with a number of baselines. In a TREC 2019 task, ANCE achieves the best NDCG score against 11 baselines.
+
+3. The authors promise to make the code open source. That will greatly improve the reproducibility of the work. The code, together with its performance, will serve as a new state-of-the-art for future study.
+
+##########################################################################
+
+Cons:
+
+1. An important baseline is missing. In section 5, the paper describes Baselines. According to the descriptions, all baselines use BM25 to retrieve samples for training. BM25 may not be the best choice for a strong baseline since it relies on sparse word tokens. An alternative is to use BERT [CLS] dense vectors of all texts and a similarity search algorithm such as locality sensitive hashing as the retriever. It will be good if the authors can add this baseline to the paper.
+
+2. The paper does not explain clearly why the proposed method runs faster than the baselines. The experimental results support that the proposed method outperforms several baselines. However, the paper does not explain the performance superiority. I am not sure whether the dissimilar text pair selection, the index refresher, or something else in the proposed negative sampling leads to the superiority. The uncertainty may be due to the lack of definitions in the paper. For example, “BERT rerank” refers to a baseline but is not defined in the paper. The input and output of “BERT rerank” remain unclear. Similarly, “TREC 2019” is an important benchmark but its inputs and outputs are not defined. It is necessary to explain important concepts for the best readability of the paper.
+
+##########################################################################
+
+Questions during rebuttal:
+
+I would like to see some experiments or discussions to clarify the above cons.
+ ",7,3.0,ICLR2021
+BJ711zqxG,3,Hkn7CBaTW,Hkn7CBaTW,"Interesting, but premature contribution on interpretability ","I found this paper an interesting read for two reasons: First, interpretability is an increasingly important problem as machine learning models grow more and more complicated. Second, the paper aims at generalizing previous work on confounded linear model interpretation in neuroimaging (the so-called filter versus patterns problem). The problem is relevant for discriminative models: if the objective is really to visualize the generative process, the ""filters"" learned by the discriminative process need to be transformed to correct for spatially correlated noise.
+
+Given the focus on extracting visualizations of the generative process, it would have been meaningful to place the discussion in the broader frame of deep generative modelling (VAEs, GANs, etc.). At present the ""state of the art"" discussion appears quite narrow, being confined to recent methods for visualization of discriminative deep models.
+
+The authors convincingly demonstrate for the linear case that their ""PatternNet"" mechanism can produce the generative process (i.e. discard spatially correlated ""distractors"").
The PatternNet is generalized to multi-layer ReLU networks by constructing node-specific pattern vectors and back-propagating these through the network. The ""proof"" (eqs. 4-6) is sketchy and involves uncontrolled approximations. The back-propagation mechanism is very briefly introduced and depicted in figure 1.
+
+Yet, the results are rather convincing: both the anecdotal/qualitative examples and the more quantitative patch elimination experiment in figure 4a (?number missing).
+
+I do not understand the remark: ""However, our method has the advantage that it is not only applicable to image models but is a generalization of the theory commonly used in neuroimaging Haufe et al. (2014)."" What is meant here?
+
+Overall, I appreciate the general idea. However, the contribution could have been much stronger if based on a detailed derivation with testable assumptions/approximations, and on a clear declaration of the aim.
+
+",6,4.0,ICLR2018
+HygVgKtjn7,1,S1zk9iRqF7,S1zk9iRqF7,Differentially private synthetic data set generation via combining the PATE framework and GAN,"The paper studies the problem of generating synthetic datasets (while ensuring differential privacy) via training a GAN.
One natural approach is the teacher-student framework considered in the PATE framework. In the original PATE framework, while the teachers are ensured to preserve differential privacy, the student model (typically a GAN) requires the presence of public data samples. The main contribution of this paper is to get around the requirement of public data by using uniformly random samples in [0,1]^d.
+
+Differentially private synthetic data generation is clearly an important and long-standing open problem. Recently, there has been some work on exploiting differentially private variants of GANs to generate synthetic data. However, the scale of these results is far from satisfactory. The current paper claims to bypass this issue by using the PATE-GAN approach.
+
+I am not an expert on deep learning. The idea of bypassing the use of public data by taking uniformly random samples seems interesting. In my view, these random vectors are used in the GAN as some sort of a basis. It would be interesting to see whether this result extends to high-dimensional settings (i.e., where d is very large).",7,3.0,ICLR2019
+HJgjIIMDtS,1,rJlnxkSYPS,rJlnxkSYPS,Official Blind Review #3,"This paper proposes a method for unsupervised clustering. Similarly to other unsupervised learning (UL) papers like ""Deep Clustering for Unsupervised Learning of Visual Features"" by Caron et al., it proposes an algorithm alternating between a labelling phase and a training phase. However, it has interesting differences. For example, unlike the Caron et al. paper, not all the samples get assigned labels, but only the most confident ones. These samples are determined by the pruning of a graph whose edges are determined by the votes of an ensemble of clustering models. Then, these pseudo labels are used within a supervised loss which acts as a regularizer for the retraining of the clustering models.
+
+Novelties/contributions/good points:
+* Votes from the clustering models to create a graph
+* Using a graph to identify the most important samples for pseudo labelling
+* Modification of the ladder network to be used as a clustering algorithm
+* Good amount of experiments and good results
+
+Weaknesses:
+* The whole experiment leading to Table 1 on page 2 is unclear to me. I have trouble understanding the experiment settings. Could you please rephrase it, about the initial/final clustering for example, and the rest as well. The whole thing puzzles me, whereas the experiments section at the end is much clearer.
+* Lack of motivation for why the Ladder method is used rather than another one. Other recent methods have better results in semi-supervised learning.
+* Algorithm 1 seems quite ad hoc. Do more principled algorithms exist to solve this problem? You could write about it and at least explain why they would not be feasible here. The sentence ""The intuition is that most of the neighbours of that node will also be connected with each other"" is unmotivated: is there no empirical proof for this?
+* The related work section is too light. It is an important section and should really not be hidden or neglected.
+* In the experiments, you could add ""Deep Clustering for Unsupervised Learning of Visual Features"" as a baseline as well; even though it is used for unsupervised learning, it does clustering as well.
+* In the experiments, you use the features extracted from ResNet-50, but what about finetuning this network rather than adding something on top, or even better, starting from scratch? Because here CIFAR-10 benefits greatly from the ImageNet features.
I know that you should reproduce the settings from other papers, but it might be good to go a bit beyond, especially if the settings of previous papers are a bit faulty.
+* Regarding the impact of the number of models in section D of the appendix, there is no saturation at 10 models. So how many models are necessary for the performance to saturate?
+* Minor point: several times, you write ""psuedo"".
+
+Conclusion: the algorithm is novel and represents a nice contribution. However, there are a lot of weaknesses that could be addressed. So, I am putting ""Weak accept"" for the moment, but it could change towards a negative rating depending on the rebuttal.
+",6,,ICLR2020
+HJDy-RKef,2,r1vccClCb,r1vccClCb,"Nice Idea but what about ""Curse of Dimensionality""?","A representation learning framework from unsupervised data, based not on auto-encoding (x in, x out), but on neighbor-encoding (x in, N(x) out, where N(.) denotes the neighbor(s) of x), is introduced.
+
+The underlying idea is interesting: each degree of freedom does not synthesize itself, as in the auto-encoder setting, but rather synthesizes a neighbor, or k neighbors. The authors argue that this form of unsupervised learning is more powerful compared to the standard auto-encoder setting, and some preliminary experimental proof is also provided.
+
+However, I would argue that this is not a completely abstract, unsupervised representation learning setting, since defining what is ""a neighbor"" and what is ""not a neighbor"" requires quite a bit of domain knowledge. As we all know, the Euclidean distance, or any other comparable norm, suffers from the ""Curse of Dimensionality"" as the number of dimensions increases.
+
+For instance, in section 4.3, the 40-dimensional feature vector space is used to define neighbors in. It would be great to see what the neighborhood topology in that space looks like.
+
+All in all, I do like the idea as a concept, but I am wary about its applicability to real data, where defining a good neighborhood metric might be a major challenge of its own. ",6,4.0,ICLR2018
+rkgAfEoeG,2,ByOExmWAb,ByOExmWAb,Very thorough empirical study,"Generating high-quality sentences/paragraphs is an open research problem that is receiving a lot of attention. This text generation task is traditionally done using recurrent neural networks. This paper proposes to generate text using GANs. GANs are notorious for drawing images of high quality, but they have a hard time dealing with text due to its discrete nature. This paper's approach is to use an actor-critic to train the generator of the GAN and the usual maximum likelihood with SGD to train the discriminator. The whole network is trained on the ""fill-in-the-blank"" task using the sequence-to-sequence architecture for both the generator and the discriminator. At training time, the generator's encoder computes a context representation using the masked sequence. This context is conditioned upon to generate missing words. The discriminator is similar and conditions on the generator's output and the masked sequence to output the probability of a word in the generator's output being fake or real. With this approach, one can generate text at test time by setting all inputs to blanks.
+
+Pros and positive remarks:
+--I liked the idea behind this paper. I find it nice how they benefited from context (left context and right context) by solving a ""fill-in-the-blank"" task at training time and translating this into text generation at test time.
+--The experiments were well carried through and very thorough. +--I second the decision of passing the masked sequence to the generator's encoder instead of the unmasked sequence. I first thought that performance would be better when the generator's encoder uses the unmasked sequence. Passing the masked sequence is the right thing to do to avoid the mismatch between training time and test time. + +Cons and negative remarks: +--There is a lot of pre-training required for the proposed architecture. There is too much pre-training. I find this less elegant. +--There were some unanswered questions: + (1) was pre-training done for the baseline as well? + (2) how was the masking done? how did you decide on the words to mask? was this at random? + (3) it was not made very clear whether the discriminator also conditions on the unmasked sequence. It needs to but + that was not explicit in the paper. +--Very minor: although it is similar to the generator, it would have been nice to see the architecture of the discriminator with example input and output as well. + + +Suggestion: for the IMDB dataset, it would be interesting to see if you generate better sentences by conditioning on the sentiment as well. +",7,4.0,ICLR2018 +Hkg0A3UJ67,3,S1ej8o05tm,S1ej8o05tm,Object Detection for OCR,"Unfortunately, the work does not introduce new contributions, with the point of the paper provided in the introduction: +In our experiments, we show that best performing approaches currently available for object detection +on natural images can be used with success at OCR tasks. + +The work is applying established object detection algorithms to OCR. While the work provides a thorough experimental section exploring trade offs in network hyper-parameters, the application of object detection to the OCR domain does not provide enough novelty to warrant publication. +",2,5.0,ICLR2019 +LUrOqKbxe6w,4,f_GA2IU9-K-,f_GA2IU9-K-,Interesting alg that extends DLTV to IQN ,"This paper studies distributional RL and proposed two extensions. One is a method to enforce a non-decreasing ordering of quantile functions by a linear and non-negative increments. The other is extends the idea of DLTV which adds exploration bonus in action selection by using the random network distillation method, which in particular, using a measure of inconsistency between target network and predictor networks as a frequency measure of sampled states. + +How do you convince us enforcing a non-decreasing ordering of the learned quantile functions is helpful? +I understand your arguments, but there is no evidence in the paper showing that doing so is helpful. + +Comparison with DLTV is missing. +The paper argues that DLTV is not applicable to continuous quantiles. However, it would be to include this comparison especially they have results on Atari games as well. + +The empirical results are not very strong, with 13 and 14 ties and losses with/to IQN. It appears the advantage of DLTV over QRDQN is larger than your advantage over IQN. + +The technical quality and presentation of the paper can still be much improved. + +Abstract: +two important problems still remain unsolved. +the other is how to design +an efficient exploration strategy to fully utilize the distribution information +--> Later you showed this is false argument by introducing DLTV (Mavrin et. al. 19) + +We describe the implementation details of the two architectures with +what are they? +you have two ""architectures""? 
+ + (b)(c) a simple incremental structure proposed in(3): +this sentence is confusing. + +What is the circle dot operator? (.) + +Eq 4 is just interpolation to ensure positive increments. +Why do you need to show (3) since it's not good? + +Th 1: +The definition of the \Pi operator isn't clear. + +Before Sec 3.4: + +Is Relu is a good choice? What is your thought on other functions? How to choose g in practice? + + +The prediction error would be high +->The prediction error would be higher + +eq 11: +This is similar to Mavrin's idea: using exploration bonus -- UCT style. + + + + + + + + +",6,5.0,ICLR2021 +BkBvQaFez,2,ryHM_fbA-,ryHM_fbA-,An okay paper that fails to document its contribution,"This paper uses CNNs to build document embeddings. The main advantage over other methods is that CNNs are very fast. + +First and foremost I think this: ""The code with the full model architecture will be released … and we thus omit going into further details here."" is not acceptable. Releasing code is commendable, but it is not a substitute for actually explaining what you have done. This is especially true when the main contribution of the work is a network architecture. If you're going to propose a specific architecture I expect you to actually tell me what it is. + +I'm a bit confused by section 3.1 on language modelling. I think the claim that it is showing ""a direct connection to language modelling"" and that ""we explore this relationship in detail"" are both very much overstated. I think it would be more accurate to say this paper takes some tricks that people have used for language modelling and applies them to learning document embeddings. + +This paper proposed both a model and a training objective, and I would have liked to see some attempt to disentangle their effect. If there is indeed a direct connection between embedding models and language models then I would have also expected to see some feedback effect from document embedding to language modeling. Does the embedding objective proposed here also lead to better language models? + +Overall I do not see a substantial contribution from this paper. The main claims seem to be that CNNs are fast, and can be used for NLP, neither of which are new. +",4,3.0,ICLR2018 +BJxG0Xg9h7,2,HJePRoAct7,HJePRoAct7,"interesting problem of pooling/upsampling graphs, experimental validation and literature review could be significantly improved","This paper proposes pooling and upsampling operations for graph structured data, to be interleaved with graph convolutions, following the spirit of fully convolutional networks for image pixel-wise prediction. Experiments are performed on node classification benchmarks, showing an improvement w.r.t. architectures that do not perform any downsampling/upsampling operations. + +Given that the main contribution of the paper is the introduction of a pooling operation for graph structured data, it might be a good idea to evaluate the operation in a task that does require some kind of downsampling, such as graph classification / regression. Moreover, authors should compare to other graph pooling methods. + +Authors claim that one of the motivations to perform their pooling operation is to increase the receptive field. It would be worth comparing pooling/upsamping to dilated convolutions to see if they have the same effect on the performance when dealing with graphs. + +Some choices in the method seem rather arbitrary, such as the tanh non-linearity in \tilde y. Could the authors elaborate on that? 
How important is the gating? + +It would be interesting to analyze which nodes where selected by the pooling operators. Are those nodes close together or spread out in the previous graph? + +The proposed unpooling operation seems to be the same as unpooling performed to upsample images, that is using skip connections to track indices, by recovering the position where the max value comes from and setting the rest to 0. Have the authors tried other upsampling strategies analogous to the ones typically used for images (e.g. upsampling with nearest neighbors)? + +When skipping information from the downsampling path to the upsampling path, is there a concatenation or a summation? How do both operations compare? (note that concatenation introduces many more parameters) How about only skipping only the indices (no summation nor concatenation)? This kind of analysis, as it has been done in the computer vision literature, would be interesting. + +What is the influence of the first embedding layer to reduce the dimensionality of the features? + +How do the models in Table 2 compare in terms of number of parameters? +
What's the influence of imposing larger weights on self loop in the graph? + +What about experiments in inductive settings? + +Please add references for the following claim ""U-Net models with depth 3 or 4 are commonly used..."" + +Please double check your references, e.g. in the introduction, citations used for CNNs do not always correspond to CNN architectures. + +The literature review could be significantly improved, missing relevant papers to discuss include: +- Gori et al. A new model for learning in graph domains, 2005. +- Scarselli et al. The graph neural network model, 2009. +- Bruna et al. Spectral networks and locally connected networks on graphs, 2014. +- Henaff et al. Deep convolutional networks on graph-structured data, 2015. +- Niepert et al. Learning convolutional neural networks for graphs, 2016. +- Atwood and Towsley. Diffusion-convolutional neural networks, 2016. +- Bronstein et al. Geometric deep learning: going beyond Euclidean data, 2016. +- Monti et al. Geometric deep learning on graphs and manifolds using mixture model cnns, 2017. +- Fey et al. SplineCNN: Fast Geometric Deep Learning with Continuous B-Spline Kernels, 2017. +- Gama et al. Convolutional Neural Networks Architectures for Signals Supported on Graphs, 2018. +As well as other pixel-wise architecture for image-based tasks such as: +- Long et al. Fully Convolutional Networks for Semantic Segmentation, 2015. +- Jegou et al. The one hundred layers tiramisu: fully convolutional densenets for semantic segmentation, 2016. +- Isola et al. Image-to-image translation with conditional adversarial networks, 2016. +- Zhao et al. Stacked What-Where auto-encoders, 2015.",4,4.0,ICLR2019 +fzSCunv1a_A,3,H-AAaJ9v_lE,H-AAaJ9v_lE,Review,"** Summary ** + +The authors present a neural network based method to solve a special class of integral equations. Their approach involves training a neural network with Legendre polynomial based activation functions to approximate the solution $y(x)$ for a given $x$. The network is trained in a supervised fashion to minimize a loss function with two term- (1) the $\ell_2$ error between the true solution and $y(x)$ and (2) the residual of the given integral equation when analysed at $x$. They show impressive numerical results for several instances of VFH-IEs with very low errors. The primary contributions as claimed by the authors are the use of Legendre polynomial based activation functions and creating a differentiable approximation for the integral equation by using Legendre polynomials and Quadrature methods to analyse the integral. + +*** Pros *** +1. The numerical results show great efficiency and same to perform at par or better compared to other numerical methods reported in literature. +2. The use of Legendre polynomials as an activation function to approximate the input domain is an interesting method to introduce well understood approximations from the numerical methods community. + +*** Cons *** +1. The paper lacks comparisons and ablation studies to show how their model compares to simple supervised training. For example, a simple baseline comparison would be to train a network with similar number of parameters and standard loss functions in a supervised fashion and without the IE residual. This would allow us to analyse the efficacy of the various components of the proposed architecture better. + +2. How does the proposed method improve upon traditional numerical methods? 
I also would like to know the timing comparisons between traditional methods and the proposed neural network method. + +3. For Figures 2 and 3, the error between $y_{true}$ and $y_{pred}$ should be plotted as well. + +The paper in its current form is not addressing how and why neural networks improve performance over the traditional methods and is also missing relevant comparisons and ablation studies. I will be willing to change my score if the authors add the required experimental results.",5,2.0,ICLR2021 +V4kUxuNS4Te,2,0aZG2VcWLY,0aZG2VcWLY,"A new algorithm to encode continuous signals into spike trains, and to reconstruct the original signals from those spike trains.","PROS: +* new +* both theoretical and experimental results +* an alternative to matching pursuit, more accurate in certain regimes, which could have a broad range of applications + +CONS: +* fully deterministic (see below) + +The authors propose a new way to encode a broad class of continuous signals (those with finite rate of innovation, which include bandlimited signals) into spike trains, and to reconstruct those signals from the spikes. The reconstruction is exact under certain hypotheses: the signal should be a weighted sum of the kernels used by the neurons with some temporal shifts. In general several signals are compatible with a given spike train, and the algorithm finds the one with minimal energy. + +This algorithm is an alternative to matching pursuit, and is shown to be more accurate in some cases (Fig 4). + +In my opinion these results are new and worth sharing. I have only a few minor suggestions to improve the paper: +* How would the reconstruction degrade with noise in the spike trains, e.g. temporal jitter or extra/missing spikes? +* Noise in the (filtered) signal can be beneficial in some cases, because subthreshold signals can then cause spikes, with a higher rate if the denoised signal is near threshold. So the subthreshold signal can be estimated from the spike rates - something that would be impossible in the absence of noise. This is called stochastic resonance, and I wonder if the theory presented here could shed light on it. +* The author uses a soft refractory period, i.e. an adapting threshold, that is increased after a spike, and then goes back to the baseline linearly. From a biological point of view, an exponential decay would be more realistic. Could the theory cope with such an exponential decay? + +MINOR POINTS: +* The abstract says ""the transformation from external stimuli to +spike trains is essentially deterministic"". This is highly debated! +* Eq 3 = Eq 4 + +",7,3.0,ICLR2021 +z5Y_khAH6en,1,oxRaiMDSzwr,oxRaiMDSzwr,"This paper proposed a feature perturbation procedure with given feature mappings, which can be used to select robust/weak features and generate adversarial attacks. ","The overall quality of the paper is good. This paper proposed a feature perturbation procedure, as a comparison to the commonly used perturbation to the original input data. Given access to a feature mapping and a black-box classifier, the proposed procedure is able to select the most robust/weak features. This then can be used for two important tasks: to determine a robust neighborhood for a data point using the robust features and to design adversarial examples using the weak features. For the first task, the feature-based robust neighborhood proposed by this paper is shown by experiments to contain far more points than the traditional input-based neighborhood. 
For the second task, the feature-based adversarial examples require less query to the black-box classifier and have less distortion from the original data points compared with other competitive methods, and thus are more human-imperceptible. These characteristics make the procedure appealing. +Therefore, the main contribution of the paper (i.e., the perturbation procedure) is important to the ML community and worth further explorations. +The clarity of the paper is good. There is no difficulty in understanding the content and experimental details are provided. + +Cons: +1. The overall running time of Alg 1 is a concern. +2. When generating the adversarial examples, a greedy recovery is performed which may be time-consuming when the data dimension is high. +3. The effectiveness of the proposed procedure seems to strongly depend on the feature mappings. The performance under mapping other than PCA is unknown. + +Other comments: +1. The experiment showed good results with PCA feature mapping. Are there any other feature mappings that might work well with this proposed procedure (Alg 1)?",7,4.0,ICLR2021 +4fW94VG_Hqa,2,LhAqAxwH5cn,LhAqAxwH5cn,An OK paper. Exposition needs to be improved. Interesting Theory. Results not fully convincing.,"Summary: +======= +This paper deals with the problem of complementary label learning, that is, when we know the set of labels which a given observation does not belong to. In particular, the paper proposes a robust loss function and an algorithm for learning from complimentary labels. Results shown on MNIST and CIFAR datasets indicate the superior accuracy using the proposed loss function. + + +Comments: +========== +The paper addresses an important problem but it is written in a hurry which makes it hard to assess its contribution. There are many typos and other writing issues in the paper. The experiments are also weak. Though, the theoretical results are interesting and improve the previous known results for complementary label learning along certain dimensions. + + +1). Typically, in ML robust loss function means a loss function that is robust to outliers, e.g., the Huber Loss. However, the definition of robustness of loss function is different in this paper. However, in this paper it means if the loss function with ordinary and complementary labels has the same minimizer. I am unaware of this definition of robustness of a loss function as it seems very specific to the complementary label learning problem. + + +2). The results in Table 2 are not an apples-to-apples comparison. The numbers for GA, PC, Fwd are copied directly from other papers. In order to be fair, they should also similar base models as the authors. For instance, GA used used MLP which is less complex than the model used by the authors. So, it is unclear whether the improved performance is due to the difference in base architecture or due to the proposed robust loss function. + + +Typos: + +Page 1: ""However, label such a large-scale dataset is time-consuming..."" +Page 1: ""In the view of label noise, complementary labels can also be view as..."" +Page 3: ""...only complementary labels that specific the samples does not..."" +Many others!",5,4.0,ICLR2021 +HJgxdIHqFB,1,B1x9ITVYDr,B1x9ITVYDr,Official Blind Review #3,"The paper studies the problem of the robustness of the neural network-based classification models under adversarial attacks. The paper improves upon the known framework on defending against l_0, l_2 norm attackers. 
+ +The main idea of the algorithm is to use the ""compress sensing"" framework to preprocess the image: Using F, the discrete Fourier transformation matrix, and the algorithm tries to reproduce on every given input x, a vector y with the smallest number of non-zero coordinate such that Fy approximates x. The main algorithms proposed in this paper are sparse iterative hard thresholding (IHT) or base pursuit (BP) which are all quite simple and standardized. + +The intuition of the approach is that l_0, l_2 attackers on the original input x can not allude the sparse vector y by too much, thus the recovered vector Fy could have better robustness property comparing to the original input x. + + +The main concern for me is the experiment in this paper. The author does not provide enough details about how the attacker is trained in their task. It seems that the authors only use the attacker trained on a standard neural network. However, since the authors have a preprocessing algorithm (IHT, BP) on top of the given input, the attacker should in principle tries to attack this pre-processing process as well. Since the pre-processing process is not differentiable, it is, therefore, unclear to me how to define the true robustness of the approach of the authors. + +An analog of my argument is if we create an artificial network that has a pre-processing layer that zeros out most of the input pixel, however, if we train an attacker without this knowledge (so it tries to attack a network without this pre-processing), the l_2, l_0 attacker might not be very good for the true network. + + +After Rebuttal: I have read the authors' responses and acknowledge the sensibility of the statement. However, I still think the algorithm in this paper is merely a ""clever"" version of gradient masking, which does not give the neural networks real robustness, it is just harder to design attacks on all these discrete operations. + +",3,,ICLR2020 +SJlBb0_QYB,1,SylurJHFPS,SylurJHFPS,Official Blind Review #3,"This paper proposes an estimator to quantify the difference in distributions between real and generated text based on a classifier that discriminates between real vs generated text. The methodology is however not particularly well motivated and the experiments do not convince me that this proposed measure is superior to other reasonable choices. Overall, the writing also contains many grammatical errors and confusing at places. + +Major Comments: + +- There are tons of other existing measures of distributional discrepancy that could be applied to this same problem. Some would be classical approaches (eg. Kullback-Leibler or other f-divergence based on estimated densities, Maximum Mean Discrepancy based on a specific text kernel, etc) while others would be highly related to this work through their use of a classifier. Here's just a few examples: + +i) Lopez-Paz & Oquab (2018). ""Revisiting Classifier Two-Sample Tests +"": https://arxiv.org/abs/1610.06545 +ii) the Wasserstein critic in Wasserstein-GAN +iii) Sugiyama et al (2012). ""Density Ratio Estimation in Machine Learning"" + +Given all these existing methods (I am sure there are many more), it is unclear to me why the estimator proposed in this paper should be better. The authors need to clarify this both intuitively and empirically via comparison experiments (theoretical comparisons would be nice to see as well). + +- The authors are proposing a measure of discrepancy, which is essentially useful as a two-sample statistical test. 
As such, the authors should demonstrate a power analysis of their test to detect differences between real vs generated text and show this new test is better than tests based on existing discrepancy measures. + +- The authors claim training a generator to minimize their proposed divergence is superior to a standard language GAN. However, the method to achieve this is quite convoluted, and straightforward generator training to minimize D_phi does not appear to work (the authors do not say why either). + + +Minor Comments: + +- x needs to be defined before equation (1). + +- It is mathematically incorrect to talk about probability density functions when dealing with discrete text. Rather these should be referred to as probability mass functions, likelihoods, or distributions (not ""distributional function"" either). + +",3,,ICLR2020 +BJgXLXSp3m,3,S1llBiR5YX,S1llBiR5YX,Review,"Paper summary: This paper focuses on the case where the finiteness of trajectories will make the underlying process to lose the Markov property and investigates the claims theoretically for a one-dimensional random walk and Wiener process, and empirically on a number of simple environments. + +Comments: The language and the structure of the paper are not on a very good scientific level. The paper should be proofread as it contains a lot of grammatical mistakes. + +Given the assumption that every state's reward is fixed, the theoretical analysis is trivial. + +The comparison of policy gradient methods is too old. The authors should look for more advanced methods to compare. + +The experimental environment is very simple in reinforcement learning tasks, and the authors should look for more complex environments for comparison. The experiment results are hard to interpret. + + + +Q1: In the theoretical analysis, why should the rewards for each state be fixed? + +Q2:Why use r_t – (V(s_t)-\gammaV(s_{t+1})) as the advantage function? + +Q3: What does the “variant” mean in all figures? + +Typos: with lower absolute value -> with lower absolute values +",4,4.0,ICLR2019 +MOOiN4H6ZSR,2,mLcmdlEUxy-,mLcmdlEUxy-,"The paper is okay, but I have concerns.","After rebuttal: +I appreciate authors' detailed responses and an updated version of the paper. They mostly clear my concerns and doubt. I increase my rating to accept. +-------------------------------------- +Summary: +This paper introduces a module that ensembles independent RNNs using a multi-head attention mechanism. This proposed recurrent independent mechanism (RIM) includes multi-head attention, top-k activation section, input attention, and communication modules. +The experiments on a range of diverse tasks show that RIMs generalizes better in many tasks than LSTMs. +-------------------------------------- +Pros: ++ The paper is clearly written. ++ The related works and the difference with the proposed model are explained in details. ++ The experiments cover a wide range of scenarios from copying task to reinforcement learning. The additional experiments in Appendix are helpful. +-------------------------------------- +Concerns: +1. *Novelty:* +In my understanding, the core idea is essentially combining mutli-head top k attentions with RNNs. I appreciate that authors includes necessity of the proposed module and their insights. However, this paper simply combines existing works and thus lacks novelty. I ask authors to clarify it if I missed anything. + +2. 
*Model capacity:* +Authors claim that high performance with RIMs is not due to the increase of model capacity, and the model size is significantly reduced with RIMs when the comparing model has the same number hidden units. +Related to this, I have suggestions: + 1. The model size of the proposed and comparing models should be reported. + 2. Additionally, adding latency and FLOPs would be helpful. + +3. *Sparsity:* +Authors mention that sparsity is necessary in this model. How is the proposed model comparable with other sparse networks? Increase sparsity by adding an existing technique [1-4] in the standard LSTM can be another baseline for this model. Some previous works are listed here: + 1. K-winner-take-all [1], local winner-take-all [2] + 2. Dropout [3,4] + +4. *Missing references* regarding sparsity: [1-4] +-------------------------------------- +Minor comments: +- References of the model are missing in Table 1. +- Page 8: 'allow the RIMs **ot** communicate with' -> 'allow the RIMs **to** communicate with' +-------------------------------------- +[1] Majani, et al., On the K-Winners-Take-All Network, NeurIPS 1988. +[2] Srivastava, et al., Compete to Compute, NeurIPS 2013. +[3] Srivastava, et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014. +[4] Molchanov, et al., Variational Dropout Sparsifies Deep Neural Networks, ICML 2017",7,4.0,ICLR2021 +r1g2n75E3Q,1,S1xiOjC9F7,S1xiOjC9F7,Novel concept of cross-graph attention -- lack of in depth discussion,"The authors present two methods for learning a similarity score between pairs of graphs. They first is to use a shared GNN for each graph to produce independent graph embeddings on which a similarity score is computed. The authors improve this model using pairs of graphs as input and utilizing a cross-graph attention-mechanism in combination with graph convolution. The proposed approach is evaluated on synthetic and real world tasks. It is clearly shown that the proposed approach of cross-graph attention is useful for the given task (at the cost of extra computation). + +A main contribution of the article is that ideas from graph matching are introduced to graph neural networks and it is clearly shown that this is beneficial. However, in my opinion the intuition, effect and limitations of the cross-graph attention mechanism should be described in more detail. I like the visualizations of the cross-graph attention, which gives the impression that the process converges to a bijection between the nodes. However, this is not the case for graphs with symmetries (automorphisms); consider, e.g., two star graphs. A discussion of such examples would be helpful and would make the concept of cross-graph attention clearer. + +The experimental comparison is largely convincing. However, the proposed approach is motivated by graph matching and a connection to the graph edit distance is implied. However, in the experimental comparison graph kernels are used as baseline. I would like to suggest to also use a simple heuristics for the graph edit distance as a baseline (Riesen, Bunke. Approximate graph edit distance computation by means of bipartite graph matching. Image and Vision Computing, 27(7), 2009). + + +There are several other questions that have not been sufficiently addressed in the article. + +* In Eq. 3, self-attention is used to compute graph level representations to ""only focus on important nodes in the graph"". How can this be reconciled with the idea of measuring similarities across the whole graph? 
Can you give more insights in how the attention coefficients vary for positive as well as negative examples? How much does the self-attention affects the performance of the model in contrast to mean or sum aggregation? +* Why do you chose the cross-graph similarity to be non-trainable? Might there be any benefits in doing so? +* The note on page 5 is misleading because two isomorphic graphs will lead to identical representations even if communication is not reduced to zero vectors (this happens neither theoretically nor in practice). +* Although theoretical complexity of the proposed approach is mentioned, how much slower is the proposed approach in practice? As similarity is computed for every pair of nodes across two graphs, the proposed approach, as you said, will not scale. In practice, how would one solve this problem given two very large graphs which do not fit into GPU memory? To what extent can sampling strategies be used (e.g., from GraphSAGE)? Some discussion on this would be very fruitful. + + +In summary, I think that this is an interesting article, which can be accepted for ICLR provided that the cross-graph attention mechanism is discussed in more detail. + + +Minor remarks: + +* p3: The references provided for the graph edit distance in fact consider the (more specific) maximum common subgraph problem.",6,4.0,ICLR2019 +BylesNeRKr,2,Bke61krFvS,Bke61krFvS,Official Blind Review #1,"This paper presents an approach towards extending the capabilities of feedback alignment algorithms, that in essence replace the error backpropagation weights with random matrices. The authors propose a particular type of network where all weights are constraint to positive values except the first layers, a monotonically increasing activation function, and where a single output neuron exists (i.e., for binary classification - empirical evidence for more output neurons is presented but not theoretically supported). This is to enforce that the backpropagation of the (scalar) error signal to affect the magnitude of the error rather than the sign, while preserving universal approximation. The authors also provide provable learning capabilities, and several experiments that show good performance, while also pointing out limitations in case of using multiple output neurons. + +The strong point of the paper and main contribution is in terms of proposing the specific network architecture to facilitate scalar error propagation, as well as the proofs and insights on the topic. The proposed network affects only magnitude rather than sign, and the authors demonstrate that it can do better than current FA and match BP performance. This seems inspired from earlier work [1,2] - where e.g., in [2] improvements are observed when feedback weights share the sign but not the magnitude of feedforward nets. + +Summarizing, I believe that this research is interesting, and can lead to improvements in FA algorithms that could potentially be more biologically plausible, and offer advantages such as full weight update parallelization (although this is more related to the fixed weights rather than the method per-se given my understanding). However, this also seems - at the moment - to be of limited applicability. + +=== +Furthermore, the introduction of the network with positive weights in the 2nd layer and on is remiscent of non-negative matrix factorization algorithms. Can the authors establish a link to these methods, where variants with backprop have also been proposed? + + + +[1] Xiao W. et al. 
Biologically-Plausible Learning Algorithms Can Scale to Large Datasets, 2018 +[2] Qianli Liao, Joel Z Leibo, and Tomaso Poggio. How important is weight symmetry in backprop-agation, AAAI 2016",6,,ICLR2020 +rylD2kolqS,3,HJe-oRVtPB,HJe-oRVtPB,Official Blind Review #3,"The paper presents a non-asymptotic analysis of ResNet stability which determines a 'cutoff' value for the residual block scale factor characterizing stability vs output explosion, and improves upon prior results by finding a larger range for the scale factor that lead to global convergence in non-convex optimization. Theoretical findings are corroborated via experiments confirming the validity of the 'cutoff' value. Additional experiments are conducted indicating that using the cutoff value, ResNet can be trained even without normalization layers, and that the cutoff value is also beneficial with normalization as it allows effective training of very deep ResNet. + +This is an interesting submission which presents important theoretical results while also showing their practical pertinence via experimental validation. + +The paper is well written and the presentation is clear. Prior work is reviewed appropriately. The reviewer found it particularly informative to provide intuition and present a proof sketch next to each theoretical results. + +The convergence analysis leads to a depth dependence of ResNet and the authors claim that this is not a *real* dependence and attributes it to bounding techniques handling non smooth activation. The authors further indicate in their experiments that for a given width the convergence of ResNet does not depends *much* on depth, compared to feedforward networks. So there might still be a dependence though much milder than that found by the current theory. It would be interesting to more precisely characterize the dependence observed in experiments. + +The authors claim that their analysis leading to the non-asymptotic bound on the spectral norm of the feedforward process via martingale theory might be relevant for other problems. It would be useful if the authors could elaborate and in particular indicate if their technique could carry to analyse other structures beyond ResNet. + +Minor comments: +""What else values"" ---> Are there other values +""our first claim is a new spectral norm"" --> is a new bound of the spectral norm +""we does no make such"" --> we do not make + + +",6,,ICLR2020 +SJ1QyNJZf,3,HJjvxl-Cb,HJjvxl-Cb,"The paper considers the problem of Deep RL in continuous-action domains. It implements the well-studied idea of RL with Entropy bonus. The results on the control suit looks very promising, though the paper does not compare with the state-of-the-art variants of baseline. Also the implementation details of the algorithm is not completely provided. So it is difficult to fully assess the empirical results.","Quality and clarity: + +It seems that the authors can do a better job to improve the readability of the paper and its conciseness. The current structure of paper seems a bit suboptimal to me. The first 6.5 page of the paper is used to explain the idea of RL with entropy reward and how it can be extended to the case of parametrized value function and policy and then the whole experimental results is packed in only 1 page. I think the paper could be organized in a more balanced way by providing a more detailed description and analysis of the numerical results, especially given the fact that in my opinion this is the main strength of the paper. 
Finally, some of the claims made in this paper are not really justified. For instance, ""Prior deep RL methods based on this framework have been formulated as either off-policy Q-learning, or on-policy policy gradient methods"" is not true; e.g., see DeepMind's recent work: https://arxiv.org/abs/1704.04651.
+
+Originality and novelty:
+
+I think many of the ideas considered in this paper have already been explored in previous work, as is acknowledged in the paper. However, some of the techniques, such as the way the policy is represented and the way the policy gradient formulation is approximated, seem to be novel in the context of deep RL, though again these ideas have been explored extensively in the control and RL literature.
+
+Significance:
+
+I think the improvement over the baselines in the control suites is very impressive. The problem is that the implementation details of the algorithm (e.g., the neural network architecture, the replay memory size, the schedule on which the target network is updated) are not sufficiently explained in the paper, so it is difficult to evaluate these results. Also, the paper only compares with the original versions of the baseline algorithms. These algorithms have been improved since then, and newer, more efficient algorithms such as distributional policy gradients and NAF have been developed. So it would help in understanding these results if the paper compared with the state-of-the-art baselines.
+
+Minor:
+For some reason different algorithms have been run for different numbers of steps, which is a bit confusing. It would be great if this were fixed in a future version. ",5,4.0,ICLR2018
+sg9LptEQPdx,2,fV4vvs1J5iM,fV4vvs1J5iM,A good paper with solid theoretical improvement,"This paper presents a reduction approach to tackle the optimization problem of constrained RL. They propose a Frank-Wolfe type algorithm for the task, which avoids many shortcomings of previous methods, such as the memory complexity. 
They prove that their algorithm can find an $\epsilon$-approximate solution with $O(1/\epsilon)$ invocations. They also show the power of their algorithm with experiments in a grid-world navigation task, though the tasks look relatively simple.
+
+pros:
+- The application of the Frank-Wolfe algorithm to the constrained RL problem is novel. The method is basically different from that of Miryoosefi et al. (2019). The improvement is mainly due to the algorithm design.
+
+- The theoretical improvement is solid. The paper tackles the memory requirement issue in the previous literature, and only requires constant memory complexity. Further, the number of RL oracle invocations is also reduced from $O(1/\epsilon^2)$ to $O(1/\epsilon)$.
+
+- The paper is well-written. Though I only sketched the proof in the appendix, the algorithm and the analysis in the main part are reasonable and sound.
+
+comments:
+- The algorithm requires a policy set $\mathcal{U}$ and finds a mixed policy $\mu \in \Delta(\mathcal{U})$ to satisfy the constraints. How does one get a policy set with a feasible solution? Is $\mathcal{U}$ predefined? For an MDP with $S$ states and $A$ actions, the number of possible deterministic policies can be $A^S$. Trivially setting $\mathcal{U}$ as the set of all possible policies may lead to exponential computational and memory complexity.
+
+- The constrained RL problem can be formulated from the standard dual LP form of the RL problem, in which the policy $\pi$ can be fully represented as the density over state-action pairs $d(s,a)$ (See e.g. [1]). Is it possible to solve the constrained RL problem under this formulation? What is the advantage of using mixed policies over a fixed policy set $\mathcal{U}$ compared with this formulation?
+
+typos:
+- line 3 of Algorithm 2: $(1-\eta_t w_{t-1})$ -> $(1-\eta_t) w_{t-1}$
+
+[1] Constrained episodic reinforcement learning in concave-convex and knapsack settings",7,2.0,ICLR2021
+SJlkRDV6ur,1,SJxTZeHFPH,SJxTZeHFPH,Official Blind Review #1,"The paper explores how focal loss can be used to improve calibration for classifiers. Focal loss extends the cross-entropy loss, which is -log(p_label), with a multiplicative factor equal to (1 - p_label)^gamma. Intuitively, this downweights the loss for elements where the probability of the correct label p_label is close to 1, relatively increasing the weight of the misclassified examples.
+
+Somewhat surprisingly, this tends to improve the calibration of the model. I say surprisingly because the focal loss is not a Bregman divergence for all values of alpha, so in general the expected minimizer of the focal loss for a fractional label is not the fractional label (i.e. the minimizer wrt x of - p (1-x)^gamma log(x) - (1-p) x^gamma log (1 -x) is not in general p).
+
+The paper shows somewhat thorough experiments on many datasets justifying this observation, but the theoretical part is rather weak since it doesn't seem to address this issue with the focal loss.
+
+It's also not very clear from reading the paper what the p0 should be when using the rule to automatically select the gamma of the focal loss. 
+ +I'd support accepting the paper if the calibration properties of the focal loss itself was better analyzed on a simpler setup (linear models, or single parameter models) so it's easier to understand how it's helping calibration in the deep network setup and if the algorithm for choosing per-example gammas was more clearly stated out.",3,,ICLR2020 +rkgGbTwr5r,2,rkeNqkBFPB,rkeNqkBFPB,Official Blind Review #3,"Deep Automodulators introduces a generative autoencoder architecture that replaces the canonical encoder decoder autoencoder architecture with one inspired by StyleGAN. The encoder interacts with the decoder by modulating layer statistics via Adaptive Instance Normalization (AdaIN) conditioned on the latent. The paper trains this architecture with the loss framework of the Adversarial Generator–Encoder (AGE) and utilizes the progressive growing trick originally introduced in Progressive GAN which is also adapted by the Pioneer models, recent followups to AGE. + +The use of AdaIN conditioning across multiple layers and multiple scales (like StyleGAN) and the ability to directly compute latent codes via the encoder allows the authors introduce a disentanglement objective L_j and also an invariance objective L_inv to help encourage these properties in the models via consistency objectives + +The paper shows results demonstrating StyleGAN style coarse/fine visual transfer on two high quality face datasets (importantly this is demonstrated on real inputs rather than samples as in StyleGAN) as well as respectable sample quality on LSUN Bedrooms and the LSUN Cars dataset. + +My decision is weak reject. Overall, I think the paper is promising and shows a nice combination of efficient latent inference and controllable generation but the authors do not include ablations to validate some of their core contributions such as the L_j objective. Additionally, the improved controllability of the approach seems to unfortunately result in lower reconstruction quality than direct prior work such as Balanced Pioneer and this potential tradeoff is not investigated/discussed. + +To expand a bit, there are three changes from that prior work that that stood out to me. 1) The StyleGAN inspired architecture 2) the disentangling objective L_j and 3) using the loss function dρ of Barron 2019. Successful ablations to demonstrate the importance of 2) to the presented results as well as better motivating / demonstrating the impact of including 3) would raise my score to an weak acceptance. + +My other concern is that the reconstruction quality seems noticeably lower than that of the proceeding work, Balanced Pioneer. This is reflected in its 10% reduction in LPIPS compared to the Automodulator’s paper. In general there also seems to be noticeable grid artifacts in the samples across all datasets samples/reconstructions, which don’t seem as prominent in Balanced Pioneer. It is not immediately clear why this is the case and additional investigation of this, such as checking whether this is due to the introduction of the disentanglement objective, or the inclusion of the Barron 2019 loss function would be informative. + +Additional Comments: + +Each subsection of 3 could be improved by providing a brief introduction to the motivation for and aim of each contribution before launching directly into how it is implemented / achieved. Without that bit of context on the goals of each subsection, it was more difficult to follow along with what was being done and why. 
+ +The presentation of L_j with lots of inlined equations intermixed with text gets a bit difficult to read / follow.",3,,ICLR2020 +iQJ9qulvKEK,1,INhwJdJtxn6,INhwJdJtxn6,Well motivated work with strong experimental results,"This paper proposed a transfer approach for reinforcement learning. The proposed approach leverages a policy pre-trained via Never Give Up (NGU) approach, and can facilitate learning challenging RL tasks including the ones with sparse reward. This paper presents many strong pieces of evidence that this approach can be used to tackle challenging RL problems including hard exploration and multi-task learning. + +Strengths +* Necessity of transferring behavior is well motivated and experiment showed its strength in challenging RL benchmarks. +* Figure 2 is useful to have intuition on how the proposed transfer method can be useful +* Table 1 set a strong unsupervised RL performance for Atari Suite +* Ablation study in Figure 3 is thought-provoking. It is interesting that the significant gain of pretraining come only when both exploitation and exploration method are used jointly. +* Figure 4 provides a useful intuition that more pretraining is beneficial for transfer to hard exploration task. + +Weaknesses +* The ""flights"" technique is not described in detail in the main text. I managed to find the detail in Appendix A, but the pointer does not exist in the main text. +* The paper claim ""coverage"" as the desired objective for RL pretraining and tried to support this claim by showing the transfer performance after pretraining via Never Give Up (NGU). I am convinced that NGU is a good pre-training objective but, it is not clear whether a more general claim for ""coverage"" is supported as well. It is not clear whether NGU is optimizing ""coverage"" well, and the relation between ""coverage"" and transfer performance is not studied. + +Comment / Questions to author +* ""but little research has been conducted towards leveraging the acquired knowledge once the agent is exposed to extrinsic reward"": I'm not sure whether I agree with this description. My understanding is that (Burda et al., 2018) studied a setting where an intrinsic reward is jointly used with extrinsic rewards. The only difference with this work is that the previous work did not study a setting with a clear separation between ""pre-training"" and ""transfer"". +* Is there a difference between ""ez-greedy with expended action set A+ (using pre-trained policy)"" vs ""the proposed transfer method (exploitation + exploration)""? +* I'm curious about the comparison between CPT vs joint training with extrinsic reward. How authors would compare CPT vs joint training? + +Recommendation +I recommend accepting this paper because this paper presented strong evidence that unsupervised pre-training and transfer may be a powerful approach to solve many challenging RL problems. I believe this observation is likely to catalyze future research of the related approaches, and the proposed method itself may be used for different domains to improve the capability of RL in general.",8,3.0,ICLR2021 +HkxGS3W9FH,2,Skl8EkSFDr,Skl8EkSFDr,Official Blind Review #1,"This paper proposes a method to compress GANs. The motivation is that the current compression methods work for other kinds of neural networks (classification, detection), but perform poorly in the GAN scenario. The authors present intuitive reasons for why this is the case. + +However, the motivation why we would like to compress GANs is unclear to us. 
The intro mentions: reducing memory requirements and improving their performance. Sure, compressing networks for object detection and classification on mobile devices is really useful. But GANs are mainly used for unsupervised density estimation, why put a GAN generator on a mobile device? But maybe we are missing something here. + +Their “self-supervised” method works by using the pre-trained discriminator network, while compressing only the generator. They show both qualitative and quantitative gains. + +The paper is clear and well-written. It presents a way of pruning GAN generator network and although of limited novelty, it might be an interesting read as it provides extensive and convincing experiments in a clear manner. It does have several parts though which require additional clarification. + +The idea of using the pre-trained discriminator network seems reasonable, but I am missing what the compression method for the generator network actually is (Section 4). From Table 2 I would assume it is pruning, in which case the paper’s contribution is very limited. + +The authors claim that the “self-supervised” method generalizes well to new tasks and models. ""Generalizes"" seems a strong word here, since the procedure compresses only the generator network. A more appropriate way of putting it might be ‘can be applied to other tasks and models.' + +In Section 4 the authors write: “Our main insight is found,” but then they describe the GAN method. What is the actual insight there? + +The qualitative results in Figure 1 suggest that their “self-supervised” method is better than the other baselines. + +Scores from Table 2 also support the claims, but the table itself is not referenced anywhere in the text. + +The analysis in Section 6 seems out of context with the rest of the paper. It is not clear how it relates to the “self-supervised” method. + +Missing related work: 1st paragraph: compressing or distilling one network into another is much older than 2015, dating back to 1991 - see references in section 2 of the overview http://people.idsia.ch/~juergen/deep-learning-miraculous-year-1990-1991.html +The GAN principle itself is also much older (1990) - see references in section 5 of the link above. + +General remarks: + +In the first read of Section 3 it is not clear what [a], [b], [c] are. + +It would be good to first refer to Table 1. + +Table 1: why is there a “?” only on the “Fixed” column? + +It would be good to have a larger font size in Figure 2, at least the size of the main text font. + +In its current form, the pdf file has 100MBs (8MBs the main paper and the rest is the appendix). One could instead move the images from the appendix to a website and provide a link. + +We might improve our rating provided the comments above were addressed in a satisfactory way in the rebuttal. + +",6,,ICLR2020 +rJeJQ-ud6m,2,rJleN20qK7,rJleN20qK7,A paper with a lot of potential but not well structured. I suggest to rewrite it for a journal track.,"The paper proposes a two-timescale framework for learning the value function and a state representation altogether with nonlinear approximators. The authors provide proof of convergence and a good empirical evaluation. + +The topic is very interesting and relevant to ICLR. However, I think that the paper is not ready for a publication. +First, although the paper is well written, the writing can be improved. For instance, I found already the abstract a bit confusing. 
There, the authors state that they ""provide a two-timescale network (TTN) architecture that enables LINEAR methods to be used to learn values [...] The approach facilitates use of algorithms developed for the LINEAR setting [...] We prove convergence for TTNs, with particular care given to ensure convergence of the fast LINEAR component."" +Yet, the title says NONLINEAR and in the remainder of the paper they use neural networks. + +The major problem of the paper is, however, its organization. The novelty of the paper (the proof of convergence) is relegated to the appendix, and too much is spent in the introduction, when actually the idea of having the V-function depending on a slowly changing network is also not novel in RL. For instance, the authors say that V depends on \theta and w, and that \theta changes at slower pace compared to w. This recalls the use of target networks in the TD error for many actor-critic algorithms. (It is not the same thing, but there is a strong connection). +Furthermore, in the introduction, the authors say that eligibility traces have been used only with linear function approximators, but GAE by Schulman et al. uses the same principle (their advantage is actually the TD(\lambda) error) to learn an advantage function estimator, and it became SOTA for learning the value function. + +I am also a bit skeptical about the use of MSBE in the experiment. First, in Eq 4 and 5 the authors state that using the MSTDE is easier than MSBE, then in the experiments they evaluate both. However, the MSBE error involves the square of an expectation, which should be biased. How do you compute it? +(Furthermore, you should spend a couple of sentences to explain the problem of this square and the double-sampling problem of Bellman residual algorithms. For someone unfamiliar with the problem, this issue could be unclear.) + +I appreciate the extensive evaluation, but its organization can also be improved, considering that some important information are, again, in the appendix. +Furthermore, results on control experiment are not significative and should be removed (at the current stage, at least). In the non-image version there is a lot of variance in your runs (one blue curve is really bad), while for the image version all runs are very unstable, going always up and down. + +In conclusion, there is a lot of interesting material in this paper. Even though the novelty is not great, the proofs, analysis and evaluation make it a solid paper. However, because there is so much do discuss, I would suggest to reorganize the paper and submit directly to a journal track (the paper is already 29 pages including the appendix).",6,4.0,ICLR2019 +B1lyj_jm5B,2,HyevIJStwH,HyevIJStwH,Official Blind Review #3,"The paper defines the quantity of ""gradient SNR"" (GSNR), shows that larger GSNR leads to better generalization, and shows that SGD training of deep networks has large GSNR. It tells a great story on why SGD-trained DNNs have good generalization. + +This topic is highly relevant to this conference. + +However, I struggle to rate this paper, since I feel swamped with math. It is hard work to read this paper, and I can honestly say that I could semi-confidently follow until about Eq. (8). To even get there, I had to scroll back and forth to remember the definitions of the various symbols. The math may be very well correct, but it is infeasible to verify (or follow) it fully. It does not make it easier that one cannot really search a PDF for greek symbols with indices etc. 
Someone who reads theoretical papers all day long might do better here. + +This is the reason I rate the paper Weak Reject. + +Some feedback points: + +Section 2.1: + +Eq. (1): It seems the common definition of SNR is the ratio of mean standard deviation. Your SNR is its square. This should be explained. + +I think it would help the reader a lot to give some intuitive meaning to the GSNR value. Can you, in Section 2.1, explain with examples what typical (or extreme) values would be? + +Assumption 2.3.1: + +This is dropped on the reader without any motivation. It is also confusing: ""we will make our derivation under the non-overfitting limit approximation"" conflicts with ""In the early training stage,..."" So is this whole derivation only true in the early stages? + +Assumption 2.3.1 seems to address a thought I had when reading this: At the end of the training, I would expect mu_q(theta) to be zero (the definition of convergence). At the start, it is arbitrary as it entirely depends on the initial values. So this paper must look at some part between the two extremes to make sense. Is it? Is this assumption related? + +What is the difference between \sigma and \rho? Seems one is on the data distribution and one on a sampled set. But then why is \mu the same in both cases (Eq. (1) vs. Eq. (5))? + +All plots: + +The plot labels are far too small to be readable.",3,,ICLR2020 +AzSntIGEXNx,2,GNv-TyWu3PY,GNv-TyWu3PY,An interesting extension of combinatorial semi-bandits,"This work introduces an interesting generalization of stochastic combinatorial semi-bandits for routing in a static graph. The main differences are: (1) the expected loss of an edge e is f_e(x^t_e) where the flow x^t_e is revealed at the beginning of each round (for each edge) and f_e is an unknown Lipschitz function (with known Lipschitz constant); (2) the regret is dynamic, computed against the sequence of optimal paths. When f_e is a constant function for each edge, then we recover a version of the stochastic combinatorial semi-bandit. + +The main contribution is a novel UCB-like algorithm with a dynamic regret bound after T steps of |E|T^{2/3} (ignoring log factors in T). This is larger than the rate |E|T^{1/2} achievable for *adversarial* combinatorial semi-bandits, but ---as we said--- the problem studied here is more general. + +The algorithm uses a hierarchical and dynamical bin structure to produce convergent estimates of f_e(x) for different values of x. This is a nice idea which is not standard in the bandit literature. The analysis of the algorithm is quite involved and apparently novel for the most part. + +The main ideas behind the analysis are well explained at an intuitive level. + +The source of the T^{2/3} dependence should be independent of the combinatorial nature of the semi-bandit problem. It would be interesting to know what happens in the simpler setting of parallel edges. Can the upper bound be improved? And if not, can a tight lower bound be proven? + +The definition of x_{max}^t on page 2 looks wrong because the max is over t.",7,3.0,ICLR2021 +ryglNsz9p7,3,H1x3SnAcYQ,H1x3SnAcYQ,"A paper on an important topic, but the contribution is not very significant","Thank you for an interesting read. + +This paper extends the recently published DiCE estimator for gradients of SCGs and proposed a control variate method for the second order gradient. The paper is well written. Experiments are a bit too toy, but the authors did show significant improvements over DiCE with no control variate. 
+ +Given that control variates are widely used in deep RL and Monte Carlo VI, the paper can be interesting to many people. I haven't read the DiCE paper, but my impression is that DiCE found a way to conveniently implement the REINFORCE rules applied infinite times. So if I were to derive a baseline control variate for the second or higher order derivatives, I would ""reverse engineer"" from the exact derivatives and figure out the corresponding DiCE formula. Therefore I would say the proposed idea is new, although fairly straightforward for people who knows REINFORCE and baseline methods. + +For me, the biggest issue of the paper is the lack of explanation on the choice of the baseline. Why using the same baseline b_w for both control variates? Is this choice optimal for the second order control variate, even when b_w is selected to be optimal for the first order control variate? The paper has no explanation on this issue, and if the answer is no, then it's important to find out an (approximately) optimal baseline for this second order control variate. + +Also the evaluation seems quite toy. As the design choice of b_w is not rigorously explained, I am not sure the better performance of the variance-reduced derivatives generalises to more complicated tasks such as MAML for few-shot learning. + +Minor: +1. In DiCE, given a set of stochastic nodes W, why did you use marginal distributions p(w, \theta) for a node w in W, instead of the joint distribution p(W, \theta)? I agree that there's no need to use p(S, \theta) that includes all stochastic nodes, but I can't see why using marginal distribution is valid when nodes in W are not independent. + +2. For the choice of b_w discussed below eq (4), you probably need to cite [1][2]. + +3. In your experiments, what does ""correlation coefficient"" mean? Normalised dot product? + +[1] Mnih and Rezende (2016). Variational inference for Monte Carlo objectives. ICML 2016. +[2] Titsias and Lázaro-Gredilla (2015). Local Expectation Gradients for Black Box Variational Inference. NIPS 2015.",5,4.0,ICLR2019 +ras1CtT7zq,2,Uu1Nw-eeTxJ,Uu1Nw-eeTxJ,extension of cross lingual pre-trained models with sentence- and word-level contrastive losses,"The paper proposes a pre-trained language model variant which extends XLM-R (multilingual masked model) with two new objectives. The main difference to most other models is that the new losses are contrastive losses (however, as pointed out by the authors, other contrastive losses had been used before in e.g. ELECTRA). The first additional loss is a sentence-level one - where a [CLS] token is trained to be close to the positive sample, the paired sentence, with other sentences as negative samples. The same is done at word level, where the bag of words constructed from two sentences becomes the set of positive samples and other vocabulary words are negative samples. +Contrastive losses are promising and the paper shows positive results when adding them to the previously proposed XLM-R model. The review of previous work is thorough and very helpful to place the work proposed in the existing literature. However I had difficulties understanding the impact of the changes proposed and disentangling the different factors that may have led to the results. At the end of reading this paper, I am not sure if implementing what the authors proposed, versus other variations of existing models, would have given the same improvements: While these improvements can be seen across many data sets, they are often modest. 
The proposal does not offer any other advantages, such as computational efficiency. +For the NMT experiments, additional experiments on En-Es and En-Ro (to follow experiments in Zhu et al 2020), and/or back-translation experiments would have made the impact of the method clearer. Given that the main contribution of the paper is empirical (none of the ideas are new), better and more comprehensive experimental results would have strengthened this work. +The following are clarification questions/comments: +- The query and key terminology used in section 2 is confusing: why not use negative/positive sample notation from the Saunshi et al, 2019 and 
Oord et al, 2018 papers? Section 3.1 introduces r_x, which is yet another notation for the query q. +- Figure 1: please clarify the notation used in the caption (e.g. the set B is defined only later, similarly n, m, V). +- The losses in equations (2) and (3) are symmetric: if the data pairs are symmetric, which seems to be the case, why distinguish between queries and keys at all and define two identical, but symmetric losses? +- First paragraph in Word-Level CTL in Section 3.1: This should be rephrased in order to clarify the motivation for the word level loss. +- I couldn't find details regarding the negative samples for the sentence loss: no of negative samples, how are they obtained, etc. +",5,4.0,ICLR2021 +ByEQXA5lM,2,HkepKG-Rb,HkepKG-Rb,"The paper proposes a new loss function that penalizes semantic features of the data, and shows some experiments; overall the writing is good and the ideas are nice, even though the contribution is relatively small.","The authors propose a new loss function that is directed to take into account Boolean constraints involving the variables of a classification problem. This is a nice idea, and certainly relevant. The authors clearly describe their problem, and overall the paper is well presented. The contributions are a loss function derived from a set of axioms, and experiments indicating that this loss function captures some valuable elements of the input. This is a valid contribution, and the paper certainly has some significant strengths. + +Concerning the loss function, I find the whole derivation a bit distracting and unnecessary. Here we have some axioms, that are not simple when taken together, and that collectively imply a loss function that makes intuitive sense by itself. Well, why not just open the paper with Definition 1, and try to justify this definition on the basis of its properties. The discussion of axioms is just something that will create debate over questionable assumptions. Also it is frustrating to see some axioms in the main text, and some axioms in the appendix (why this division?). + +After presenting the loss function, the authors consider some applications. They are nicely presented; overall the gains are promising but not that great when compared to the state of the art --- they suggest that the proposed semantic loss makes sense. However I find that the proposal is still in search of a ""killer app"". Overall, I find that the whole proposal seems a bit premature and in need of more work on applications (the work on axiomatics is fine as long as it has something to add). + +Concerning the text, a few questions/suggestions: +- Before Lemma 3, ""this allows..."" is the ""this including the other axioms in the appendix? +- In Section 4, line 3: I suppose that the constraint is just creating a problem with a class containing several labels, not really a multi-label classification problem (?). +- The beginning of Section 4.1 is not very clear. By reading it, I feel that the best way to handle the unlabeled data would be to add a direct penalty term forcing the unlabeled points to receive a label. Is this fair? +- Page 6: ""a mor methodological""... should it be ""a more methodical""? +- There are problems with capitalization in the references. Also some references miss page numbers and some do not even indicate what they are (journal papers, conference papers, arxiv, etc). 
+",5,3.0,ICLR2018 +Sylu28cFYB,1,B1gd7REFDB,B1gd7REFDB,Official Blind Review #3,"This paper introduces a context-aware neural network (conCNN) that integrates context semantics into account for object detection. The proposed approach achieves this by embedding a context-aware module into the Faster R-CNN detection framework. The context-aware module simulates the learning process of Conditional Random Fields (CRF) model using a stack of common CNN operations. Specifically, this paper employs the mean-field approach of [1]. Experiments are performed on COCO dataset. + +The paper reads well. The idea of combining the strengths of both CNNs and CRF-based graphical models in a detection framework is interesting. Though the main idea is borrowed from the mean-field approach of [1], it has been successfully reformulated as a stack of common CNN layers. However, my main concern is that the experimental results (Page 8 Table 2) does not support the merits of the proposed approach. When comparing with the existing Faster R-CNN + Relation work of [2], the improvement provided by the proposed approach is negligible. In fact the proposed approach achieves inferior results on large objects (APL), compared to [2]. [1] uses attention to model the geometric relationships among objects in the same image. In addition to geometric relationships, the proposed approach leverages the inherent co-occurence relationships between object classes. However, this additional information does not seem to provide much help by looking at the results on COCO dataset. The paper does present an experiment to justify this additional information in Table 1. However, that small experiment is only on 6 object categories and does not seem to generalize when going towards full COCO dataset with 80 categories (Table 2). Therefore, the reviewer would recommend to thoroughly evaluate the merits of the proposed approach in depth on COCO and on additional datasets (i.e., Visual Genome). + +[1] Philipp Krähenbühl, Vladlen Koltun: Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials. NIPS 2011. +[2] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, Yichen Wei: Relation Networks for Object Detection. CVPR 2018. +",3,,ICLR2020 +r1gig7Jf9r,3,BJlVeyHFwH,BJlVeyHFwH,Official Blind Review #1,"This paper points out invertible neural networks are not necessarily invertible because of bad conditioning. It shows some cases when invertible neural networks fail, including adding adversarial pertubations, solving the decorrelation task, and training without maximum likelihood objective (Flow-GAN). The paper also shows that spectral normalization improves network stability. + +I think this is a solid work. The main contribution is it points out a problem that is overlooked before, which can possibly explain some unstable behavior for training neural networks. The paper also has some study on various architectures, which sheds some light on the designing of invertible neural networks. I think this paper can be important for future researchers to design models and algorithms. + +=============== + +Update: + +After reading other reviewer's comment I agree with other reviewers that the experimental section is problematic. It seems to be unrelated with the theoretical results proposed in this paper. I think currently the experiments only make a point that invertible networks can be non-invertible in practice. But the paper has large room to improve if it has + +1. 
A complete discussion on which invertible blocks / modeling tasks are easier to be non-invertible, and why (theoretically, and combine with direct experimental evidence) +2. A remedy (using additive coupling layer is not an acceptable one since it severely limits the modeling power) + +I still think posing the problem itself is important. Thus I will still give it an accept, but lower it to a weaker score.",6,,ICLR2020 +BkgdU-e16m,2,BJgK6iA5KX,BJgK6iA5KX,official review,"The authors proposed an AutoLoss controller that can learn to take actions of updating different parameters and using different loss functions. + +Pros +1. Propose a unified framework for different loss objectives and parameters. +2. An interesting idea in meta learning for learning loss objectives/schedule. + +Cons: +1. The formulation uses REINFORCE, which is often known with high variance. Are the results averaged across different runs? Can you show the variance? It is hard to understand the results without discussing it. The sample complexity should be also higher than traditional approaches. +2. It is hard to understand what the model has learned compared to hand-crafted schedule. Are there any analysis other than the results alone? +3. Why do you set S=1 in the experiments? What’s the importance of S? +4. I think it is quite surprising the AutoLoss can resolve mode collapse in GANs. I think more analysis is needed to support this claim. +5. The evaluation metric of multi-task MT is quite weird. Normally people report BLEU, whereas the authors use PPL. +6. According to https://github.com/pfnet-research/chainer-gan-lib, I think the bested reported DCGAN results is not 6.16 on CIFAR-10 and people still found other tricks such as spectral-norm is needed to prevent mode-collapse. + +Minor: +1. The usage of footnote 2 is incorrect. +2. In references, some words should be capitalized properly such as gan->GAN. +",6,3.0,ICLR2019 +SJg5feZqnQ,1,BkMq0oRqFQ,BkMq0oRqFQ,A premature paper proposing a novel interpretation of Batch Normalisation,"The paper aims at a better understanding of the positive impacts of Batch Normalisation (BN) on network generalisation (mainly) and convergence of learning. First, the authors propose a novel interpretation of the BN re-parametrisation. They show that an affine transform of the variables with their local variance (scale) and mean (shift) can be interpreted as a decomposition of the gradient of the objective function into a regressor assuming that the gradient is parallel to the variables (up to a shift) and the residual part which is the gradient w.r.t. to the new variables. In the second part of the paper, authors review various normalisation proposals (differing mainly in the subset of variables over which the normalisation statistics is computed) as well as the known empirical findings about the dependence of BN on the batch size. The paper presents an experiment that combines two normalisation variants. A further experiment strives at regularising BN for small batch sizes. + +Unfortunately, it remains unclear what questions precisely the authors answer in the second part of the paper and, what is more important, how they are related to the novel interpretation of BN presented in the first part. This interpretation holds for any function and can be possibly seen as a gradient pre-conditioning. However, the authors do not ""extend"" it towards the gradients w.r.t. 
the network parameters and do not consider the specifics of the learning objectives (a sum of functions, each one depending on one training example only). The main presented experiment combines layer normalisation with standard batch normalisation for a convolutional network. The first one normalises using the statistics over channel and spatial dimensions, whereas the second one uses the statics over the batch and spatial dimensions. The improvements are rather marginal, but, what is more important, the authors do not explain how and why this proposal follows from their new interpretation of BN. + +Overall, in my view, this paper is premature and not appropriate for publishing at ICLR in its present form. +",3,4.0,ICLR2019 +ByeUgZT35r,3,SkxgnnNFvH,SkxgnnNFvH,Official Blind Review #4,"Summary: This work proposes a new transformer architecture for tasks that involve a query sequence and multiple candidate sequences. The proposed architecture, called poly-encoder, strikes a balance between a dual encoder which independently encodes the query and candidate and combines representations at the top, and a more expressive architecture which does full joint attention over the concatenated query and candidate sequences. Experiments on utterance retrieval tasks for dialog and an information retrieval task show that poly-encoders strike a good trade-off between the inference speed of the dual encoder model and the performance of the full attention model. + +Pros: +- Strong results compared to baselines on multiple dialog and retrieval tasks. +- Detailed discussion of hyperparameter choices and good ablations. +- Paper is well written and easy to follow. + +Cons: +- Limited novelty of methods. Ideas similar to the model variants discussed in this work have been considered in other work (Eg: [1]). It is also known that in-domain pre-training (i.e, pre-training on data close to the downstream task’s data distribution) helps (Eg: [2]). So this work can be considered as an application of existing ideas to dialog tasks. +- In terms of impact, utterance retrieval has fairly limited applicability in dialog. The dialog tasks considered in this work have a maximum of 100 candidate utterances, whereas in practice, the space of possible responses is much larger. While retrieval models are useful, I am skeptical about the practical value of the improvements shown in the paper (especially the improvements over bi-encoder, which is already a decent model). + +Suggestions: +One way to get around the inefficiency of the cross-encoder architecture is to first use an inexpensive scoring mechanism such as TFIDF or bi-encoder to identify a small number of promising candidates from all the possible candidates. We can then use the cross-encoder to do more precise scoring of only the promising candidates. I am curious where a pipelined model such as this compares against the variants discussed in the paper in terms of speed and performance. + +While the paper presents strong results on several dialog utterance retrieval tasks, the methods presented have limited novelty and impact. I am hence leaning towards borderline. + +References + +[1] Logeswaran Lajanugen, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, Jacob Devlin, and Honglak Lee. 2019. Zero-Shot Entity Linking by Reading Entity Descriptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. +[2] Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. 
In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. + +Edit: I have read the author response. Based on the rebuttal, I am more convinced about the practical impact of the approach. I am raising my score and recommending accept. ",8,,ICLR2020 +BywXN7WMG,3,S1v4N2l0-,S1v4N2l0-,intuitive but effective self-supervised method (with some lack of evaluation thoroughness),"**Paper Summary** + This paper proposes a self-supervised method, RotNet, to learn effective image feature from images by predicting the rotation, discretized into 4 rotations of 0, 90, 180, and 270 degrees. The authors claim that this task is intuitive because a model must learn to recognize and detect relevant parts of an image (object orientation, object class) in order to determine how much an image has been rotated. +They visualize attention maps from the first few conv layers and claim that the attend to parts of the image like faces or eyes or mouths. They also visualize filters from the first convolutional layer and show that these learned filters are more diverse than those from training the same model in a supervised manner. + They train RotNet to learn features of CIFAR-10 and then train, in a supervised manner, additional layers that use RotNet feature maps to perform object classification. They achieve 91.16% accuracy, outperforming other unsupervised feature learning methods. They also show that in a semi-supervised setting where only a small number of images of each category is available at training time, their method outperforms a supervised method. + They next train RotNet on ImageNet and use the learned features for image classification on ImageNet and PASCAL VOC 2007 as well as object detection on PASCAL VOC 2007. They achieve an ImageNet and PASCAL classification score as well as an object detection score higher than other baseline methods. + This task requires the ability to understand the types, the locations, and the poses of the objects presented in images and therefore provides a powerful surrogate supervision signal for representation learning. To demonstrate the effectiveness of the proposed method, the authors evaluate it under a variety of tasks with different settings. + + + +**Paper Strengths** +- The motivation of this work is well-written. +- The proposed self-supervised task is simple and intuitive. This simple idea of using image rotation to learn features, easy to implement image rotations without any artifacts +- Requiring no scale and aspect ratio image transformations, the proposed self-supervised task does not introduce any low-level visual artifacts that will lead the CNN to learn trivial features with no practical value for the visual perception tasks. +- Training the proposed model requires the same computational cost as supervised learning which is much faster than training image reconstruction based representation learning frameworks. +- The experiments show that this representation learning task can improve the performance when only a small amount of annotated examples is available (the semi-supervised settings). +- The implementation details are included, including the way of implementing image rotations, different network architectures evaluated on different datasets, optimizers, learning rates with weight decayed, batch sizes, numbers of training epochs, etc. 
+- Outperforms all baselines and achieves performance close to, but still below, fully supervised methods +- Plots rotation prediction accuracy and object recognition accuracy over time and shows that they are correlated + + + +**Paper Weaknesses** +- The proposed method considers a set of different geometric transformations as discrete and independent classes and formulates the task as a classification task. However, the inherent relationships among geometric transformations are ignored. For example, rotating an image 90 degrees and rotating an image 180 degrees should be closer compared to rotating an image 90 degrees and rotating an image 270 degrees. +- The evaluation of low-level perception vision task is missing. In particular, evaluating the learned representations on the task of image semantic segmentation is essential in my opinion. Since we are interested in assigning the label of an object class to each pixel in the image for the task, the ability to encode semantic image feature by learning from performing the self-supervised task can be demonstrated. +- The figure presenting the visualization of the first layer filters is not clear to understand nor representative of any finding. +- ImageNet Top-1 classification results produced by Split-Brain (Zhang et al., 2016b) and Counting (Noroozi et al., 2017) are missing which are shown to be effective in the paper [Representation Learning by Learning to Count](https://arxiv.org/abs/1708.06734). +- An in-depth analysis of the correlation between the rotation prediction accuracy and the object recognition accuracy is missing. Showing both the accuracies are improved over time is not informative. +- Not fully convinced on the intuition, some objects may not have a clear direction of what should be “up” or “down” (symmetric objects like balls), in Figure 2, rotated image X^3 could plausibly be believed as 0 rotation as well, do the failure cases of rotation relate to misclassified images? +- “remarkably good performance”, “extremely good performance” – vague language choices (abstract, conclusion) +- Per class breakdown on CIFAR 10 and/or PASCAL would help understand what exactly is being learned +- In Figure 3, it would be better to show attention maps on rotated images as well as activations from other unsupervised learning methods. With this figure, it is hard to tell whether the proposed model effectively focuses on high level objects. +- In Figure 4, patterns of the convolutional filters are not clear. It would be better to make the figures clear by using grayscale images and adjusting contrast. +- In Equation 2, the objective should be maximizing the sum of losses or minimizing the negative. Also, in Equation 3, the summation should be computed over y = 1 ~ K, not i = 1 ~ N. + + + +**Preliminary Evaluation** +This paper proposes a self-supervised task which allows a CNN to learn meaningful visual representations without requiring supervision signal. In particular, it proposes to train a CNN to recognize the rotation applied to an image, which requires the understanding the types, the locations, and the poses of the objects presented in images. The experiments demonstrate that the learned representations are meaningful and transferable to other vision tasks including object recognition and object detection. 
Strong quantitative results outperforming unsupervised representation learning methods, but lacking qualitative results to confirm/interpret the effectiveness of the proposed method.",6,3.0,ICLR2018 +RC8I-tnTVvO,4,iTeUSEw5rl2,iTeUSEw5rl2,Official Blind Review #2,"This paper studies the domain adaptation problem when the source data comes from multiple domains continuously and the test domain for adaptation is unknown. The assumption used for domain shift is that the domain label would change the features but not the labels. So the main idea is to learn invariant representations across all the domains and avoid spurious correlation on the domain label. The proposed method then involves a multiplayer minimax game. The adversaries are the domain discriminator for each class, which tries to maximize the domain discrepancy. The minimizer player is the representation learner. The paper introduces two discrepancy measure based on the Jensen-Shannon divergence and the one dimensional Wasserstein distance. In the experiment, the data set for continual learning is constructed using domain shift data such that it mimics the online learning setting. The results are competitive in comparison with a limited set of baselines. + + +Strong points: +1. The paper focused on an interesting and important topic. +2. The multiplayer adversarial game in terms of minimizing domain discrepancy seems to be novel. + +Weak points: +1. The online or continual learning perspective is merely solved by keeping an episodic source data buffer, which I think is overly simplified. In general, I have a question about how this adversarial method would work when there is not an online/continual learning component. Given a fixed target, this a static set of source domains, it seems the method should be still valid. So I am not sure how the domain generalization side and the online side of the method interact with each other. Investigating the online setting before the batched setting seems problematic to me + +2. The evaluation in experiments shows that the FM measure of the proposed method is not very competitive. I believe it is related to how the memory is sampled and the length of the task sequence. However, it also indicates the method is not very satisfactory for avoiding forgetting. Otherwise, it could also be the case the conditional domain shift assumption is not valid in the data. + +3. The idea to learn an invariant representation is not novel. For example, the invariant risk minimization (IRM) method is exactly dealing with the same problem. Using an episodic source data buffer, it seems you can also apply IRM to solve the problem. I think it should be included as a baseline. In general, more invariant learning approaches should be discussed in the related work. + +4. Writing can be significantly improved. + +Given the weak points, I recommend rejection for this paper. + +Here are some of my questions and additional suggestions: +1. I am curious about how the invariant feature looks like. Basically, after learning the adversarial domain predictors, how the representation learning looks like. It would be nice to have some visualization on that. + +2. I do recommend focusing on a batched problem before going to the online version. I feel even though it is not the case that the batched method will work for online cases, at least in domain generalization we cannot expect a bad learner to work online when the data is even more limited. 
Even for an online paper, showing the performance for the batched version seems to be necessary. + +3. How \alpha is chosen? And how it affects learning? In general, more ablation studies would improve paper quality. +",4,4.0,ICLR2021 +SJg2i1Pp3X,2,r1GbfhRqF7,r1GbfhRqF7,A interesting study of how to optimize kernel change-point detection algorithm. ,"A new approach to choose a kernel to maximize the test power, for the kernel change-point detection. This provides an extension to the two-sample version of the problem (Gretton et al. 2012b, Sutherland et al. 2017). The difficulty is caused by that there is very limited samples from the abnormal distribution. The idea is based on choosing a surrogate distributions using generative model. The idea makes sense although there seems to be not much detail in how to choose the surrogate distribution. There is a mechanism to study the threshold. Real-data and simulation demonstrates the good performance. I think the idea is really interesting and I am impressed by the completeness of the work. ",8,4.0,ICLR2019 +HJePWIB0Kr,1,SkgJOAEtvr,SkgJOAEtvr,Official Blind Review #3,"The paper analyzes if enforcing internal-consistency for speaker-listener setup can (i) improve the ability of the agents to refer to unseen referents (ii) generalize for different communicative roles. The paper evaluates a transformer and arecurrent model modified with various sharing strategies on a single-turn reference game. Finally, the paper claims that results with self-play suggest that internal consistency doesn’t help (i) but improves cross-role performance for (ii). + +As a reader, the paper doesn’t provide me a concrete finding which can help in designing future multi-agent systems. Most of the results for the experiments (except self-play) don’t have a uniform signal across the board to deduce whether the internal-consistency works in all of the cases. Most of the speaker-listener scenarios emerge in dialog based settings which are multi-turn and require agents to act both as speaker and listener. Though paper advocates through some of its results that self-play is helpful in generalization across roles via internal-consistency, without multi-step experiments, qualitative and quantitative analysis of what is happening and why there is so much variation, the paper is weak in its current form. Therefore, I recommend weak reject as my rating. Below I discuss some of the concerns in detail: + +Without multi-step evaluation, it is hard to gauge the extent to which self-play for internal consistency help in generalization of the roles. For e.g., task from Das et. al. (2017) [1] provides a clear signal on how well the agents are able to communicate through dialog evaluation. So in 5.2.1, the setup which requires training in both roles can provide better signal overall if it was trained to do multi-step conversation. + +Paper is missing any kind of quantitative or qualitative analysis. What are the differences between the embeddings of the agent that learned via self-play and the one which learned directly. It also be interesting to see how the shared embeddings and symmetric encoding and decoding affect these embedding and might help explain the drop and randomness. In Table 4., the results on symmetric encoding suggest that the claim of generalization through internal consistency might not hold everywhere. For Shared Embedding results, on RNN shapes, it is surprising that training in one role improves performance through internal consistency while in both roles it drops. 
These require further analysis to solidify the claim. Given the flaky results, to boost the claim, have authors tried other settings to test internal-consistency like Predator-Prey? + +Things that didn’t affect the score: + +Related work section is missing the relevant discussion on continuous communication work and discussion on why internal consistency wasn’t tested on those settings as well. (See Singh et.al [2]., Sukhbaatar et.al. [3], Das et.al. [4] etc) + +The number of pages are above eight, you should reduce the redundancy between table descriptions and text and maybe squeeze Section 2, decrease setup explanation. + +The setup for training and test sets explained at the end of page 7 isn’t very clear to me and needs to be rephrased. + +[1] Das, Abhishek, Satwik Kottur, José MF Moura, Stefan Lee, and Dhruv Batra. ""Learning cooperative visual dialog agents with deep reinforcement learning."" In Proceedings of the IEEE International Conference on Computer Vision, pp. 2951-2960. 2017. +[2] Sukhbaatar, Sainbayar, and Rob Fergus. ""Learning multiagent communication with backpropagation."" In Advances in Neural Information Processing Systems, pp. 2244-2252. 2016. +[3] Singh, Amanpreet, Tushar Jain, and Sainbayar Sukhbaatar. ""Learning when to communicate at scale in multiagent cooperative and competitive tasks."" arXiv preprint arXiv:1812.09755 (2018). +[4] Das, Abhishek, Théophile Gervet, Joshua Romoff, Dhruv Batra, Devi Parikh, Michael Rabbat, and Joelle Pineau. ""Tarmac: Targeted multi-agent communication."" arXiv preprint arXiv:1810.11187 (2018). + +========= +Post-rebuttal Comments +========= +Thanks for updating the manuscript to resolve my and R2's concerns. The new analysis section does provide good insights into what exactly is happening. + +When I was talking about actionable insights, I was talking about both negative and positive insights. Currently, the only take-away is that self-play helps in generalizing to listener roles as well. For the other negative insight that internal consistency doesn't help with generalization, as R2 suggested, it is unclear why that would be case in the first place (I read the pscyhology arguments, but I am not still not convinced). I still believe that without multi-step communication, the work is as useful as it can be in current form. In real world, no meaningful conversation is usually one step. + +For Predator-Prey setup, I was talking about OpenAI https://github.com/openai/multiagent-particle-envs in which multiple tasks can be setup. For e.g. if prey thinks of what action predator might take, does internal consistency help prey to perform better? + +I think most of what you got is correct for multi-step, see second para for more details in this response. + +Thanks for bringing the manuscript under 8 pages. [3] is still missing from references. + +Final comments: I would like to see multi-step experiments due to the reasons I explained above. The scheme of internal-consistency should be applicable beyond conversation to Predator-Prey setups also, thus, I feel experiments are not enough (only on 1 setting) to claim generalization of the hypothesis. Beyond these comments, I feel this is still a step in right direction and I would like to update my rating to weak accept while hoping that authors try to address these issues in camera-ready version if paper gets accepted. 
+ + + +",6,,ICLR2020 +7LXScpW7tW6,1,4qgEGwOtxU,4qgEGwOtxU,"Interesting work, some aspects need to be polished","The manuscript introduces an approach, based on importance and coherence, for evaluation whether a partitioning of a network exhibits modular characteristics. +Importance refers to how crucial is a neuron , or set of neurons, to the performance of a network on a given task, e.g. classification. +Coherence refers to how consistently the neuron(s) in question are related to specific features. +Experiments are conducted by considering sets of neurons identified via a spectral clustering algorithm. + +The manuscript proposes a method to verify to what extent a partitioning of a network follows modular characteristics. To a good extent the proposed method is grounded on proper theoretical foundations which is highly desirable. The only part where this cannot be fully verified is its dependence on the method from [Anonymous, 2021] which cannot be verified. + +My main concerns with the manuscripts are the following: + +- When conducting spectral clustering, the number of clusters is set to 12. Is there a procedure to set this value in a principled manner? is there an indication on the effect of this parameter? the manuscript would benefit from analyzing the effect of this parameter in the observation made on the reported experiments? + +- Visualizations discussed in Sec. 3.1.1 (Fig.1) are quite subjective. While in some cases some patterns are indeed visible, in other cases it is hard to make sense of what is being presented. Is it possible to evaluate the produced visualizations in a more objective manner? +In recent years, several methods ( Bau et al, 2016, Oramas et al. ICLR'19, Yang and Kim, arXiv:1907.09701 ) for quantitative evaluation of methods for visual interpretation and explanation have been processed. Perhaps one of these could be adopted in the manuscript with the goal of objectively evaluating the visualizations/explanations presented in Fig. 1. + +- In some cases design decision are made that seem to favor observations expected in some experiments. For instance, in order to favor clusterability small MLPs are pruned, to improve visualization MLPs are trained with dropout; and other factors relevant to the proposed method. Therefore my question by ensuring that some of these properties, e.g. cleaner cluster, clearer visualizations, don't you favor the measurement capabilities of the proposed importance/coherence metrics? + +- When analyzing the ""importance"" metric on the lesion tests (Sec. 3.2) there are new conditions that are applied to the clusters being considered in the analysis, e.g. the size of the cluster, minimum effect of the cluster on accuracy, etc. Keeping this present, my questions are: i) Were these conditions also applied when analyzing ""coherence"", and ii) why these type of condition were not applied in the experiments of Sec. 3.1? Ideally, some level of consistency is expected among the experiments. Otherwise it is hard to assess properly the origin of observations made on the results of the experiments. + +- At the end of Sec. 4, it is stated that the conducted experiments, combining spectral clustering with feature visualization, +highlight the usefulness of combining multiple interpretability methods in order to build an improved set of tools for rigorously understanding systems. 
However, from the observations made on the experiments I do not see the added value that the proposed method could bring to interpretability/explainability of the analyzed models (networks). + +- Very related, one paragraph later, it is stated that having modular networks is useful both for interpretability and for building better models. However, from the content of the manuscript it is not clear how having a modular network/representation does contribute with the two listed aspects. + +- Significant parts of the manuscript are delegated to the supplementary material. In addition, the third part of the proposed method, i.e. intrinsic partition evaluation, is part of another manuscript [Anonymous,2021] that does not seem to be published. For these reasons, to a good extent, the manuscript is not self-contained. ",5,4.0,ICLR2021 +r18pMHFgM,3,BkpXqwUTZ,BkpXqwUTZ,The paper claims to work towards a more biological version of error-backpropagation.,"The paper is incomplete and nowhere near finished, it should have been withdrawn. + +The theoretical results are presented in a bitmap figure and only referred to in the text (not explained), and the results on datasets are not explained either (and pretty bad). A waste of my time.",2,5.0,ICLR2018 +rJehYBRCFH,2,Hyx0slrFvH,Hyx0slrFvH,Official Blind Review #3,"The work studies differentiable quantization of deep neural networks with straight-through gradient (Bengio et. al., 2013). The authors find that a proper parametrization of the quantizer is critical to stable training and good quantization performance and demonstrated their findings to obtain mixed precision DNNs on two datasets, i.e., CIFAR-10 and Imagenet. + +The paper is clearly written and easy to follow. The idea proposed is fairly straight-forward. Although the argument the authors used to support the finding is not very rigorous, the finding itself is still worth noting. + +One of the arguments that the authors used to support the specific form of parametrization is that it leads to diagonal Hessian. From optimization perspective, what matters is the condition number, i.e., max/min of the eigenvalues of the Hessian. Could this explain the small difference between the three different parametrization forms with uniform quantization and the big difference for power-of-two quantization? + +The penalty method used to address the memory constraints will not necessarily lead to solutions that satisfy the constraints. The authors noted that the algorithm is not sensitive to the choice of the penalty parameters. Have the authors tried to tackle problems of hard memory constraints? + +",6,,ICLR2020 +r1-zOIFgM,2,SkERSm-0-,SkERSm-0-,"Because of the excessive length, poor presentation quality, and limited novelty, this paper is below the bar for ICLR at this time.","This paper proposes to modify how noise factors are treated when developing VAE models. For example, the original VAE work from (Kingma and Welling, 2013) applies a deep network to learn a diagonal approximation to the covariance on the decoder side. Subsequent follow-up papers have often simplified this covariance to sigma^2*I, where sigma^2 is assumed to be known or manually tuned. In contrast, this submission suggests either treating sigma^2 as a trainable parameter, or else introducing a more flexible zero-mean mixture-of-Gaussians (MoG) model for the decoder noise. These modeling adaptations are then analyzed using various performance indicators and empirical studies. 
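As a brief aside on the simpler of the two modifications summarized above, namely treating the decoder noise variance sigma^2 as a trainable parameter instead of a fixed sigma^2*I, the following plain-numpy sketch spells out the corresponding Gaussian reconstruction term; the `log_sigma2` parameterization and the function name are illustrative assumptions, not the submission's actual implementation.

```python
import numpy as np

def gaussian_decoder_nll(x, x_hat, log_sigma2):
    """Negative log-likelihood of x under N(x_hat, sigma^2 I).

    With a fixed, hand-tuned sigma^2 this reduces to a scaled squared error;
    exposing log_sigma2 as a trainable scalar lets the model itself balance
    reconstruction against the KL term.
    """
    sigma2 = np.exp(log_sigma2)                   # keeps the variance positive
    d = x.shape[-1]
    sq_err = np.sum((x - x_hat) ** 2, axis=-1)
    return 0.5 * (sq_err / sigma2 + d * (log_sigma2 + np.log(2.0 * np.pi)))

x = np.random.rand(4, 10)                         # toy batch of 10-d observations
x_hat = x + 0.1 * np.random.randn(4, 10)          # decoder means
print(gaussian_decoder_nll(x, x_hat, log_sigma2=np.log(0.1)))
```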
+ +The primary issues I have with this work are threefold: (i) The paper is not suitably organized/condensed for an ICLR submission, (ii) the presentation quality is quite low, to the extent that clarity and proper understanding are jeopardized, and (iii) the novelty is limited. Consequently my overall impression is that this work is not yet ready for acceptance to ICLR. + +First, regarding the organization, this submission is 19 pages long (*excluding* references and appendices), despite the clear suggestion in the call for papers to limit the length to 8 pages: ""There is no strict limit on paper length. However, we strongly recommend keeping the paper at 8 pages, plus 1 page for the references and as many pages as needed in an appendix section (all in a single pdf). The appropriateness of using additional pages over the recommended length will be judged by reviewers."" In the present submission, the first 8+ pages contain minimal new material, just various background topics and modified VAE update rules to account for learning noise parameters via basic EM algorithm techniques. There is almost no novelty here. In my mind, this type of well-known content is in no way appropriate justification for such a long paper submission, and it is unreasonable to expect reviewers to wade through it all during a short review cycle. + +Secondly, the presentation quality is simply too low for acceptance at a top-tier international conference (e.g., it is full of strange sentences like ""Such amelioration facilitates the VAE capable of always reducing the artificial intervention due to more proper guiding of noise learning."" While I am sympathetic to the difficulties of technical writing, and realize that at times sufficiently good ideas can transcend local grammatical hiccups, my feeling is that, at least for now, another serious pass of editing is seriously needed. This is especially true given that it can be challenging to digest so many pages of text if the presentation is not relatively smooth. + +Third and finally, I do not feel that there is sufficient novelty to overcome the issues already raised above. Simply adapting the VAE decoder noise factors via either a trainable noise parameter or an MoG model represents an incremental contribution as similar techniques are exceedingly common. Of course, the paper also invents some new evaluation metrics and then applies them on benchmark datasets, but this content only appears much later in the paper (well after the soft 8 page limit) and I admittedly did not read it all carefully. But on a superficial level, I do not believe these contributions are sufficient to salvage the paper (although I remain open to hearing arguments to the contrary).",3,4.0,ICLR2018 +XJq4_kMFABe,4,jQSBcVURlpW,jQSBcVURlpW,initial review,"Initial review (2020.11.01) + +Review: +This paper addresses a specialized solution to combine neural networks and linear models for Raven progressive matrices (RPM). The solver is composed of convolutional neural networks (CNN) as an image perception module and a reasoning module with linearly parameterized operators to represent successive operations in the RPM. The operators are trained with regularized linear regressors, which are lightweight capable to adapt on-the-fly for each instance. The overall architecture enables end-to-end training and can be used for generation. 
For performance evaluation on two kinds of automatically generated problem sets, the authors test the proposed method with respect to 3 extrapolatory settings such as systematicity, productivity, and localism, which seems to be newly defined tasks for RPM. Their method outperforms pure neural state-of-the-art methods with a large margin over all of settings. +The proposed methods solve RPMs well with available components such as CNNs and linear models, which are able to be tuned on-the-fly. Also, this paper provides systematic results to easily comprehend the role of modules with new settings (systematicity, productivity, and localism) for RPMs. On the other hands, my major concerns are two-fold: (1) one of important research questions introduced in this work, “what constitutes such an algebraic inductive bias?” seems not so clearly answered in this paper. (2) the comparative result in the paper just considers only pure connectionist methods without considering other (semi-)symbolic approaches. Since this work utilizes additional problem information (e.g., specified features) for modeling than compared neural methods, I think it would be better to justify the position of this work by comparing various approaches including search-based solver, symbolic, and the neural methods to figure out the superiority of this approach. +As a result, I vote for marginally below acceptance threshold. + +Pros: +- The authors provide a specialized solution for RPM well with CNNs and linear models. +- They introduce new 3 settings of RPMs and report systematic experimental results. + +Concerns: +- In the comparative study, the authors report only the systematicity, the productivity, and the localism, not the ""accuracy"" used in the cited related works. Due to the lack of the information, the readers can not directly compare them with respect to the previous viewpoint. +- If the result of other various approaches for RPM not only pure connectionist methods but also search-based solver, symbolic, and neuro-symbolic approaches, it would be easier to find the niche of this work and its superiority. +- As I mentioned above, the research questions, “what constitutes such an algebraic inductive bias?” seems not so clearly answered in this paper. Could you discuss this issue with the result in the paper? + +Minors: +- It would be better understandable to show examples for 3 settings of systematicity, productivity and localism on RPMs. +- fare -> far? +",5,4.0,ICLR2021 +XTz5T9hX9V5,1,rSwTMomgCz,rSwTMomgCz,Interesting and well-motivated paper with perhaps some missing exposition into non-meta exploration approaches,"Summary: This paper introduces DREAM, a meta-RL approach that decouples exploration from exploitation. An exploitation policy learns to maximize rewards that are conditioned on an encoder that learns task relevant information. Then an exploration policy learns to collect data that maximizes the mutual information between the encoder and explored states. The work is compared against multiple baselines in simple tasks. + + +Overall, I lean towards accepting the paper, though I am not as familiar with the meta-RL literature to have much of an informed opinion about what relevant benchmarks or approaches are. The paper was well-written and well-motivated, and while the experiments were simple, seemed to highlight the problems that the paper was addressing. It makes sense to separate out exploration and exploitation and I appreciated the inclusion of tasks that helped motivate this point. 
Furthermore, the paper provides a theoretical analysis of DREAM showing that the decoupled policy maximizes returns. Code and hyperparameters are provided and the paper seems to be reproducible. + + +I do think that the paper should have more discussion and evaluation over approaches that aim to explicitly address the exploration exploitation problem. The paper only considers exploration in the context of meta-learning but of course exploration is a central problem in RL and several approaches have studied it outside of Meta-RL. The paper would be improved by discussing such approaches (for example intrinsic rewards such as empowerment [1] or surprise [2]) and/or evaluating how well these approaches compare to DREAM when trained alone and combined with vanilla algorithms. + + +I also would have liked to see more empirical analysis over the exploration policy being learned by $\pi^{exp}$. + + +Questions: + + +1) How was the decay rate for epsilon chosen in Figure 3? How would a policy with a fixed decay rate perform? + + +2) I do not quite understand how trajectories from the exploration policy can be used interchangeably with the encodings $z$ when plugged into $\pi^{task}$. Could the authors provide more insights into this? + + +[1] A Unified Bellman Optimality Principle Combining Reward Maximization and Empowerment. Leibfried et al. + + +[2] Curiosity-driven Exploration by Self-supervised Prediction. Pathak et al. +",7,2.0,ICLR2021 +H1Ae8Z5eM,2,rJv4XWZA-,rJv4XWZA-,May need more details for privacy analysis,"The paper proposes a technique for differentially privately generating synthetic data using GAN, and experimentally showed that their method achieves both high utility and good privacy. +The idea of building a differentially private GAN and generating differentially private synthetic data is very interesting. However, my main concern is the privacy aspect of the technique, as it is not explained clearly enough in the paper. There is also room for improvement in the presentation and clarity of the paper. + +More details: +- About the differential privacy aspect: + The author didn't provide detailed privacy analysis of the Gaussian noise layer, and I don't find the values of the sensitivity (C = 1) provided in the answer to a public comment easy to see. Also, the paper mentioned that the batch size is 32 and the author mentioned in the comment that the std of the Gaussian noise is 0.7, and the number of epoch is 50 or 150. I think these values would lead to epsilon much larger than 8 (as in Table 1). However, in Section 5.2, it is said that ""Privacy bounds were evaluated using the moments accountant and the privacy amplification theorem (Abadi et al., 2016), and therefore, are data-dependent and are tighter than using normal composition theorems."" I don't see clearly why privacy amplification is needed here, and why using moments accountant and privacy amplification can lead to data-dependent privacy loss. + In general, I don't find the privacy analysis of this paper clear and detailed enough to convince me about the correctness of the privacy results. However, I am very happy to change my opinion if there are convincing details in the rebuttal. + +- About the presentation: + As a paper proposing a differentially private algorithm, detailed and formal analysis of the privacy guarantees is essential to convince the readers. For example, I think it would be much better if there is a formal theorem showing the sensitivity of the Gaussian noise layer. 
And it would be better to restate (in Appendix 7.4) not only the definition of moments accountant, but the composition and tail bound, as well as the moments accountant for the Gaussian mechanism, since they are all used in the privacy analysis of this paper. +",5,4.0,ICLR2018 +7ka9z27ytVg,3,MBpHUFrcG2x,MBpHUFrcG2x,A slight variation on standard stochastic EM algorithms,"Summary + +This paper proposes a stochastic expectation--maximisation (EM) algorithm. The main idea is that the target distribution is specified as a deterministic mapping, a.k.a. a normalising flow, from some simple ""base"" distribution. + + +Strengths + +The algorithm appears to be formally correct (in the sense that it is a standard stochastic EM algorithm). The method is demonstrated on a large number of examples. + +Weaknesses + +The proposed algorithm is just a standard Metropolis--Hastings (MH) update interspersed with a stochastic EM update for the parameters of the target distribution. This does not seem novel; such algorithms have been around for decades. + +The authors should explain why they need for extending the space to include $y_O$ instead of just mapping $\xi$ to $y_M$. + + +Minor comments + +- There are a number of typos in the bibliography mostly related to inconsistent use of capital letters in article titles and journal/conference names. + +- What does the semi-colon in $p_{f,\theta}(x_M; x_O)$ mean? Why not use a comma if this is meant to be a joint density? + +- In Section 3, it would be helpful to write the (extended) target distribution of the Metropolis--Hastings algorithm down explicitly and formally. + +- Within LaTeX's maths mode, you cannot just write operators $min$ and $Uniform$ (LaTeX treats this, e.g., as multiplying $m$ by $i$ by $n$ which leads to the wrong spacing).",6,3.0,ICLR2021 +3PwHp4elcTP,1,ufZN2-aehFa,ufZN2-aehFa,"A simple and effective idea, but missing an important baseline that can also address the problems of mean-aggregation in NPs","The authors present the Bayesian Aggregation (BA) mechanism in the context of Neural Processes (NPs) for aggregating the context information into the latent variable z in the form of posterior updates to z. The authors show that this improves predictive performance (in terms of likelihood) compared to mean aggregation MA that it replaces on various regression tasks with varying input-output dimensionality. + +Strengths: +1. The idea is simple and leads to a notable improvement compared to MA in terms of likelihood +2. The background and method is presented very clearly. +3. The evaluation is done on a wide variety of tasks, ranging from standard 1D regression of GP samples to pendulum trajectory prediction tasks. + +Weaknesses: +1. The evaluation is missing an important baseline model, which are (A)NP models that have self-attention in the encoder for processing the contexts (c.f. model figure in ANP paper (Kim et al., 2019b)). Contrary to the NP/CNP baselines that are compared against in the paper, the ANP with self-attention in the encoder does not give uniform weights to each context point - the self-attention allows the model to assign varying importance to the different context points (despite using mean-aggregation after the self-attention), which is presented as a key motivation for the BA mechanism introduced in the paper. Hence for the experiments, I strongly suggest comparing against CNP/NP/ANP with self-attention in the deterministic/latent/latent path of the encoder. 
For completeness, if would be nice to also compare against models that have both deterministic and latent paths, since BA can also be applied to these models. At the same time, I understand that BA would be more interpretable for showing which observations have little/high effect on z compared to the approach of using self-attention in the encoder, but it would still be very informative for the reader to be able to compare the two approaches. Also these two approaches can be combined to have self-attention in the encoder + BA, which might also yield improved performance. +2. The claim that “BA includes MA as a special case” doesn’t seem to be true. Using a “non-informative prior and uniform observation variances” leads to constant sigma_z and mu_z being linearly proportional to mean(r_n) (i.e. sum_n r_n / N), which is not quite the same as MA - MA allows sigma_z and mu_z to be non-linear functions of mean(r_n), hence is strictly more expressive than this special case. +3. In Equation (7), it seems as though the context points (x_n,y_n) only affects r_n via the variance, which seems unnecessarily limiting. Why not have the mean also depend on r_n? e.g. p(r_n|z) = N(r_n| z + mu_{r_n}, diag(sigma_{r_n}^2) where mu_{r_n} is also computed as a function of (x_n,y_n)? This will still give a closed-form posterior p(z|r_{1:N}) since the mean of p(r_n|z) is still linear in z, creating a model that’s strictly more expressive with very similar efficiency. It would be informative to see how this changes the experimental results. +4. I’m guessing the VI objective was used to train the ANP. Given the clear advantage of training with the MC objective, shouldn’t the ANP also be trained with MC? +5. The latent variable models were not evaluated on 2D image completion tasks because “architectures without deterministic paths were not able to solve this task”. Why not then add a deterministic path to these latent variable models to allow them to train? + +Other points +- In the text, it says that the model is also compared against ANP to show that BA can compete with SOTA. This is arguably incorrect since ConvCNP models are SOTA among models of the NP family, showing a significant improvement over ANP. Hence to achieve the goal mentioned in the text, it would make sense to compare with ConvCNP models as part of the evaluation against other deterministic NPs. + +Overall the paper is presented very clearly with a simple yet effective idea tested on a wide variety of tasks. However it’s missing an important baseline that uses self-attention in the encoder, along with several other baselines that would be informative to compare against. I am willing to increase my score should these results be included in the revised version of the paper. + +================= + +Score raised to 6 after inclusion of MA + SA results in rebuttal.",6,5.0,ICLR2021 +Hye3HcXoFB,1,HygqFlBtPS,HygqFlBtPS,Official Blind Review #2,"Strengths: +This work proposed two regularizers that can be used to train neural networks that yield convex relaxations with tighter bounds. +The experiments display that the proposed regularizations result in tighter certification bounds than non-regularized baselines. +The problem is interesting, and this work seems to be useful for many NLP pair-wise works. +weaknesses: +Some presentation issues. +The dataset, MNIST, is not good enough for a serious research. +More datasets need to be added to the experiments in this paper. 
+ + +Comments: +This paper proposes two regularizers to train neural networks that yield convex relaxations with tighter bounds. + +Overall, the paper solves an interesting problem. Though I did not check complete technical details, the extensive evaluation results seem promising. + +1. There are some presentation issues that can be addressed. For example, on page 8, the sentence of “the family of 10small” misses a blank space. + +2. In the experiments, the dataset is not a good one for evaluating the performance of the proposed idea. + +In conclusion, at this stage, my opinion on this paper is Weak Accept. ",6,,ICLR2020 +1pQijHm_OQ,4,pVwU-8cdjQQ,pVwU-8cdjQQ,An interesting paper with insufficient experiments,"This paper presents a new model for unsupervised video decomposition based on the multi-object scene representation of IODINE. Basically, it makes use of 2D-LSTMs to provide the IODINE framework with a better ability to capture temporal dependencies. + +1. My first concern is about the novelty of the proposed model. The most important contribution, if I understand it correctly, is the use of 2D-LSTMs in the iterative amortized inference for temporal modeling. It does extend IODINE into the spatiotemporal context, but it only provides limited new insight into this research field. + +2. Another concern is about the CLEVRER experiments. From Table 2, a counter-intuitive observation is that IODINE significantly outperforms Seq-IODINE. There might be two reasons: (a) In this dataset, the temporal dependencies between frames are relatively weak, because future frames depend not only on previous ones but also on a series of actions. (b) Seq-IODINE may not be a strong baseline model; If so, the authors might include other existing approaches specifically designed for video data, such as R-NEM and DDPAE [Hsieh et al., 2018]. +[Hsieh et al., 2018] Learning to Decompose and Disentangle Representations for Video Prediction. + +3. Although I am aware that R-NEM and IODINE were also evaluated on the synthetic datasets only, I recommend that the authors validate the model on real-world datasets of non-rigid objects and more complex structures. + +4. If the authors can provide a direct qualitative comparison with R-NEM and IODINE in Figures 3-4, it will be easier to understand the advantages of the proposed model. + +5. In Section 2, the authors could use a separate paragraph to describe the main differences between the proposed model and previous ones, in particular R-NEM and IODINE.",6,4.0,ICLR2021 +BJl-Vc9Yn7,3,BkVVOi0cFX,BkVVOi0cFX,Pre-training for QA helps,"This paper shows that a sentence selection / evidence scoring model for QA trained on SQuAD helps for QA datasets where such explicit per-evidence annotation is not available. + +Quality: +Pros: The paper is mostly well-written, and suggested models are sensible. Comparisons to the state of the art are appropriate, as is the related work description. the authors perform a sensible error analysis and ablation study. They further show that their suggested model outperforms existing models on three datasets. +Cons: The introduction and abstract over-sell the contribution of the paper. They make it sound like the authors introduce a new task and dataset for evidence scoring, but instead, they merely train on SQuAD with existing annotations. References in the method section could be added to compare how the proposed model relates to existing QA models. 
The multi-task ""requirement"" is implemented as merely a sharing of QA datasets' vocabularies, where much more involved MTL methods exist. What the authors refer to as ""semi-supervised learning"" is in fact transfer learning, unless I misunderstood something. + +Clarity: +Apart from the slightly confusing introduction (see above), the paper is written clearly. + +Originality: +Pros: The suggested model outperforms others on three QA datasets. +Cons: The way achievements are achieved is largely by using more data, making the comparison somewhat unfair. None of the suggested models are novel in themselves. The evidence scoring model is a rather straight-forward improvement that others could have come up with as well, but merely haven't tested for this particular task. + +Significance: +Other researchers within the QA community might cite this paper and build on the results. The significance of this paper to a larger representation learning audience is rather small.",4,4.0,ICLR2019 +Bku1giNxf,1,B1uvH_gC-,B1uvH_gC-,"The paper proposes a parametric manifold learning method based on deep learning and Siamese networks. Key references are missing (SAMMANN, auto-encoders); Experiments are limited.","The paper describes a manifold learning method that adapts the old ideas of multidimensional scaling, with geodesic distances in particular, to neural networks. The goal is to switch from a non-parametric to a parametric method and hence to have a straightforward out-of-sample extension. + +The paper has several major shortcomings: +* Any paper dealing with MDS and geodesic distances should test the proposed method on the Swiss roll, which has been the most emblematic benchmark since the Isomap paper in 2000. Not showing the Swiss roll would possibly let the reader think that the method does not perform well on that example. In particular, DR is one of the last fields where deep learning cannot outperform older methods like t-SNE. Please add the Swiss roll example. +* Distance preservation appears more and more like a dated DR paradigm. Simple example from 3D to 2D are easily handled but beyond the curse of dimensionality makes things more complicated, in particular due to norm computation. Computation accuracy of the geodesic distances in high-dimensional spaces can be poor. This could be discussed and some experiments on very HD data should be reported. +* Some key historical references are overlooked, like the SAMMANN. There is also an over-emphasis on spectral methods, with the necessity to compute large matrices and to factorize them, probably owing to the popularity of spectral DR metods a decade ago. Other methods might be computationally less expensive, like those relying on space-partitioning trees and fast multipole methods (subquadratic complexity). Finally, auto-encoders could be mentioned as well; they have the advantage of providing the parametric inverse of the mapping too. +* As a tool for unsupervised learning or exploratory data visualization, DR can hardly benefit from a parametric approach. The motivation in the end of page 3 seems to be computational only. +* Section 3 should be further detailed (step 2 in particular). +* The experiments are rather limited, with only a few artifcial data sets and hardly any quantitative assessment except for some monitoring of the stress. The running times are not in favor of the proposed method. The data sets sizes are, however, quite limited, with N<10000 for point cloud data and N<2000 for the image manifold. 
+* The conclusion sounds a bit vague and pompous ('by allowing a limited infusion of axiomatic computation...'). What is the take-home message of the paper?",5,5.0,ICLR2018 +HJxxyGwinm,1,BJzuKiC9KX,BJzuKiC9KX,An interesting experimental paper but more insights are expected,"Main idea: +This paper studies a problem of the importance weighted autoencoder (IWAE) pointed out by Rainforth 18, that is, tighter lower bounds arising from increasing the number of particles improve the learning of the generative model, but worsen the learning of the inference network. The authors show that the reweighted wake-sleep algorithm (RWS) doesn't suffer from this issue. Moreover, as an alternative to control variate scheme and reparameterization trick, RWS doesn't suffer from high variance gradients, thus it is particularly useful for discrete latent variable models. +To support the claim, they conduct three experiments: 1) on ATTEND, INFER, REPEAT, a generative model with both discrete and continuous latent variables; 2) on MNIST with a continuous latent variable model; 3) on a synthetic GMM. + +Clarity issues: +1. ""branching"" has been used many times, but AFAIK, this seems not a standard terminology. What do ""branching on the samples"", ""conditional branching"", ""branching paths"" mean? +2. zero-forcing failure mode and delta-WW: I find this part difficult to follow. For example, the following sentence +""the inference network q(z|x) becomes the posterior for this model which, in this model, also has support at most {0, . . . , 9} for all x"". +However, this failure mode seems an interesting finding, and since delta-WW outperforms other methods, it deserves a better introduction. + +Questions: +1. In Fig 1 (right), how do you estimate KL(q(z|x) || p(z|x))? +2. In Sec 4.2, why do you say IWAE learns a better model only up to a point (K = 128) and suffers from diminishing returns afterwards? +3. In Fig 4, why WS doesn't achieve a better performance when K increasing? + +Experiments: +1. Since the motivating story is about discrete latent variable models, better baselines should be compared, e.g. RBM, DVAE, DVAE++, VQ-VAE etc. +2. All experiments were on either on MNIST or synthetic data, at least one large scale experiment on discrete data should be made to verify the performance of RWS. +",6,4.0,ICLR2019 +r1nc91tez,1,rkZB1XbRZ,rkZB1XbRZ,Novel techniques to improve private learning with PATE,"The paper proposes novel techniques for private learning with PATE framework. Two key ideas in the paper include the use of Gaussian noise for the aggregation mechanism in PATE instead of Laplace noise and selective answering strategy by teacher ensemble. In the experiments, the efficacy of the proposed techniques has been demonstrated. I am not familiar with privacy learning but it is interesting to see that more concentrated distribution (Gaussian) and clever aggregators provide better utility-privacy tradeoff. + +1. As for noise distribution, I am wondering if the variance of the distribution also plays a role to keep good utility-privacy trade-off. It would be great to discuss and show experimental results for utility-privacy tradeoff with different variances of Laplace and Gaussian noise. + +2. It would be great to have an intuitive explanation about differential privacy and selective aggregation mechanisms with examples. + +3. It would be great if there is an explanation about the privacy cost for selective aggregation. 
Intuitively, if teacher ensemble does not answer, it seems that it would reveal the fact that teachers do not agree, and thus spend some privacy cost. + + + + + + + + + +",6,1.0,ICLR2018 +HkHX9bm-M,3,Bk6qQGWRb,Bk6qQGWRb,"Interesting approach for Thompson sampling in DQN, some concerns over baseline.","(Last minute reviewer brought in as a replacement). + +This paper proposed ""Bayesian Deep Q-Network"" as an approach for exploration via Thompson sampling in deep RL. +This algorithm maintains a Bayesian posterior over the last layer of the neural network and uses that as an approximate measure of uncertainty. +The agent then samples from this posterior for an approximate Thompson sampling. +Experimental results show that this outperforms an epsilon-greedy baseline. + +There are several things to like about this paper: +- The problem of efficient exploration with deep RL is important and under-served by practical algorithms. This seems like a good algorithm in many ways. +- The paper is mostly clear and well written. +- The experimental results are impressive in their outperformance. + +However, there are also some issues, many of which have already been raised: +- The poor performance of the DDQN baseline is concerning and does not seem to match the behavior of prior work (see Pong for example). +- There are some loose and misleading descriptions of the algorithm computing ""the posterior"" when actually this is very much an approximation method... that's OK to have approximations but it shouldn't be hidden away. +- The connection to RLSVI is definitely understated, since with a linear architecture this is precisely RLSVI. The sentiment that extending TS to larger spaces hasn't been fully explored is definitely valid... but this line of work should certainly be mentioned in the 4th paragraph. RLSVI is provably-efficient with a state-of-the-art regret bound for tabular learning - you would probably strengthen the case for this algorithm as an extension of RLSVI by building on this connection... otherwise it's a bit adhoc to justify this approximation method. +- This paper spends a lot of time re-deriving Bayesian linear regression in a really standard way... and without much discussion of how/why this method is an approximation (it is) especially when used with deep nets. + +Overall, I like this paper and the approach of extending TS-style algorithms to Deep RL by just taking the final layer of the neural network. +However, it also feels like there are some issues with the baselines + being a bit more clear about the approximations / position relative to other algorithms for approximate TS would be a better approach. +For example, in linear networks this is the same as RLSVI, bootstrapped DQN is one way to extend this idea to deep nets, but this is another one and it is much better because XYZ. (this discussion could perhaps replace the rather mundane discussion of BLR, for example). + +In it's current state I'd say marginally above, but wouldn't be surprised if these changes turned it into an even better paper quite quickly. + + +=============================================================== + +Revising my review following the rebuttal period and also the (ongoing) revisions to the paper. + +I've been disappointed by the authors have incorporated the feedback/reviews - I expected something a little more clear / honest. Given the ongoing review decisions/issues I'm putting my review slightly below accept. 
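To make the last-layer idea discussed here concrete, the following minimal numpy sketch shows Bayesian linear regression over final-layer features with Thompson sampling over the sampled weights; the prior scale, noise variance, and function names are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def last_layer_posterior(Phi, y, noise_var=1.0, prior_var=10.0):
    """Gaussian posterior over the last-layer weights for one action.

    Phi: (n, d) final-layer features of states where this action was taken.
    y:   (n,)   regression targets for those states (e.g., TD targets).
    """
    d = Phi.shape[1]
    precision = Phi.T @ Phi / noise_var + np.eye(d) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ Phi.T @ y / noise_var
    return mean, cov

def thompson_action(phi, per_action_stats, rng):
    """Draw one weight sample per action and act greedily on the sampled Q-values."""
    sampled_q = [phi @ rng.multivariate_normal(mean, cov)
                 for mean, cov in per_action_stats]
    return int(np.argmax(sampled_q))

rng = np.random.default_rng(0)
d, n_actions = 8, 4
stats = [last_layer_posterior(rng.normal(size=(50, d)), rng.normal(size=50))
         for _ in range(n_actions)]
print(thompson_action(rng.normal(size=d), stats, rng))
```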
+ +## Relation to literature on ""randomized value functions"" +It's really wrong to present BDQN as is if it's the first attempt at large-scale approximations to Thompson sampling (and then slip in a citation to RLSVI as a BDQN-like algorithm). This algorithm is a form of RLSVI (2014) where you only consider uncertainty over the last (linear) layer - I think you should present it like this. Similarly *some* of the results for Bootstrapped DQN (2016) on Atari are presented without bootstrapping (pure ensemble) but this is very far from an essential part of the algorithm! If you say something like ""they did not estimate a true posterior"" then you should quantify this and (presumably) justify the implication that taking a gaussian approximation to the final layer is a *true* posterior. In a similar vein, you should be clear about the connections to Lipton et al 2016 as another method for approximate Bayesian posteriors in DQN. + +## Quality/science of experiments +The experimental results have been updated, and the performance of the baseline now seems much more reasonable. However, the procedure for ""selecting arbitrary number of frames"" to report performance seems really unnecessary... it would be clear that BDQN is outperforming DDQN... you should run them all for the same number of frames and then either compare (final score, cumulative score, #frames to human) or something else more fair/scientific. This type of stuff smells like overfitting!",5,4.0,ICLR2018 +E46BITZp1PY,4,dyjPVUc2KB,dyjPVUc2KB,Review of 'Adapting to reward progressivity via spectral RL',"####################################################################### + +Summary: + +In this paper the authors propose a new RL method, spectral DQN, in which rewards are decomposed into different frequencies. This decomposition allow for the training loss to better balanced on certain tasks - in particular those with progressive rewards. The new method is shown to perform well on specially constructed tasks with extreme reward progressively, as well as on a selection of standard Atari tasks. + +####################################################################### + +Reasons for score: + +I think a weak accept is appropriate here. I think the authors correctly identify a class of worth while tasks - namely those with progressive rewards, and reasonably establish that standard approaches struggle here. The new method is a strong implementation of a simple idea which is demonstrated to work, and does not require significant fine-tuning. + +#######################################################################Pros: + +1. The paper is well written and clearly presented. I'd commend the authors on their exposition and balanced motivation throughout + +2. I think this is an interesting direction of work more generally - exploring the performance of different methods against different reward distributions, and having agents predict quantities more flexible than the mean return. This would seem to be a contribution to that body of work. + +3. I see no reason that the constructed domain ExponentialPong shouldn't join the benchmark for any subsequent general-purpose agent 🙂 + +####################################################################### + +Cons: + +1. While I believe the selected experiments are sufficient to demonstrate the claims in the paper, they do not fully explore the capabilities of the method and the intuitions that motivate it. 
It would have been good to see some more thinking here (even if it means experiments outside of Atari) + +(1) It would seem that this approach would work for a parametrizable class of reward functions - why not test that? Perhaps ExponentialPong with reward $b^(\alpha N)$ evaluated on a grid for $b$ and $\alpha$? + +(2) Similarly for failure modes + +(3) Frequencies needn't be geometric, how might other choices have performed + +2. I'd like to have seen a slightly richer discussion/motivation of the mixed Monte Carlo update. Clearly something is necessary here, but why this? What else was tried? + +####################################################################### + +Questions during rebuttal period: + +Q1: you note the obvious limitation in the selection of the number of reward frequencies, but as you note this is reasonably easily ameliorated in practice. A more interesting question perhaps is whether an adaptive approach for b might work - was this tried? + +####################################################################### + +Some typos: + +(1) 'Expoential Pong' → 'Exponential Pong', just above 4.1' + +(2) Fig2, last row, ""Fiilled proportion"" → 'Filed proportion'",6,3.0,ICLR2021 +cJpDuuI2QT,2,aYuZO9DIdnn,aYuZO9DIdnn,Reviewer 2 comments,"Briefing: + +This paper mainly investigates the patch-based pre-processing for image classification. The patch-based pre-processing is done by a simple convolutional kernel constructed in a data-driven manner. +The paper achieved better or comparable performance to other more heavy comparisons using a much smaller kernel size. + +Strongpoints: + +The paper shows that the simple whitening procedure (by mean and covariance from training-set) of first layer weights enhances the performance, rather than using a deep kernel, which can be much efficient. + + +Weak points: + +Analysis of the improvement: The reviewer tried to find an explainable reason for the performance improvement, although the title includes 'unreasonable'. Figure 3 seems to analyze the dictionary by covariance spectrum and intrinsic dimension, but it wasn't easy to catch the main idea. A more precise explanation of the experiments and the analysis is required. + +Comments: + +(1) Table 1.a shows that the proposed method achieved good performance despite the hard-assignment. What if we apply soft-thresholding to the proposed method? + +(2) Experiments or discussion for the ImageNet performance with a lower number of the dictionary would be meaningful. + +(3) Ablation study (data-driven or not) in ImageNet performance would strengthen the contribution of the paper (in Table 2.a?) + +(4) It might be out of scope, but the reviewer could not fully catch the dictionary encoding usage just for the classification task. Is this possible to apply it to other task s.a. retrieval or other feature matching task? Then, discussion or experiments of the tasks would be required. +Or, if not, the importance of the dictionary construction on the classification task would be required. + +(5) The proposed method only uses one or two layers of depth, one of the main contributions. Then, what if we use slightly more layers s.a. 3, 4, or 5? This addition of layers does not require much burden, and it would be better to add the layers if it can enhance performance. + +(6) It wasn't easy to catch how the paper constructs the kernel Phi, from the method. A clearer explanation of the procedure would be appreciated. 
+ +(7) Qualitative analysis of each element of the dictionary would help readers catch the paper's contribution. + +Note: The reviewer does seem to catch the main strong-points and suggestions of the paper thoroughly. The reviewer requires a clearer explanation of the contribution with the replies for the above comment, and it would be required for the precise rating. + +",6,2.0,ICLR2021 +SklK3X3ptB,2,S1eqj1SKvr,S1eqj1SKvr,Official Blind Review #3,"This paper presents an adversarial attack method, which conducts perturbations in the feature spaces, instead of the raw image space. Specifically, the proposed method firstly learns an encoder that encodes features into the latent space, where style features are learned. At the same time, a decoder is learned to reconstruct the images with the encoded features. To conduct attacks, perturbations are added into the encoded features and attack images are generated with the decoder given the perturbated features. The experiment results look promising, showing that the proposed method achieves better attack performance with realistic adversarial images. + +The general idea of perturbating the feature (latent) space is not a novel one, which has been studied in [1]. However, the proposed one is with an autoencoder framework instead of GAN used in [1]. Therefore, the proposed approach is able to construct adversarial examples for specific images. In addition, the training of the encoder is adapted from a style transfer method, which seems to learn good features that capture style features. + +It is a bit unclear on the intuition of the constructions of Eq. (5) and (6). The details may be in Huang & Belongie, 2017. But it is better to provide more intuitive explanation and discussion on why these constructions capture style variation. + +The results shown in the paper look promising. But it would be more comprehensive to compare with other pixel attacks in addition to PGD. Moreover, it is unclear whether it is a fair comparison between the proposed approach and pixel attacks, even under the same amount of perturbations. It would be good if the code will be released. + +Minor: + +Last sentence in the first paragraph of page 3: a missing reference. + +[1] Song, Yang, Rui Shu, Nate Kushman, and Stefano Ermon. ""Constructing unrestricted adversarial examples with generative models."" In Advances in Neural Information Processing Systems, pp. 8312-8323. 2018.",6,,ICLR2020 +rJf8bGuBe,3,SygvTcYee,SygvTcYee,Unclear presentation and some contributions,"UPDATE: +I looked at the arxiv version of the paper. It is much longer and appears more rigorous. Fig 3 there is indeed more insightful. +However, I am reviewing the submission and my overall assessment does not change. This is not a minor incremental contribution, and if you want to compress it into a conference submission of this type, I would recommend choosing message you want to convey, and focus on that. As you say, ""...ICLR submission focus on the ParMAC algorithm..."", I would focus on this properly - and remove or move to appendix all extensions and theoretical remarks, and have an extra page on explaining the algorithm. Additionally, make sure to clearly explain the relation of the arxiv paper, in particular that the submission was a compressed version. + +ORIGINAL REVIEW: +The submission proposes ParMAC, based on MAC (Method of Auxiliary Coordinates), formulating a distributed variant of the idea. 
+ +Related Work: In the part on convex ERM and methods, I would recommend citing general communication efficient frameworks, COCOA (Ma et al.) and AIDE (Reddi et al.). I believe these works are most related to the practical objectives authors of this paper set, while number of the papers cited are less relevant. + +Section 2, explaining MAC, is quite clearly written, but I do not find part on MAC and EM particularly useful. + +Section 3 is much less clearly written. I have trouble following notation, particularly in the speedups part, as different symbols were introduced at different places. Perhaps a quick summary or paragraph on notation in the introduction would be helpful. In paragraph 2, you write as if reader knew how data/anything is distributed, but this was not mentioned yet; it is specified later. It is not clear what is meant by ""submodel"". Perhaps a more precise example pointing back to eqs (1) & (2) would be useful. As far as I understand from what is written, there are P independent sets of submodels, that traverse the machines in circular fashion. I don't understand how are they initialized (identically?), and more importantly I don't understand what would be a single output of the algorithm (averaging? does not seem to make sense). Since this is not addressed, I suppose I get it wrong, leaving me to guess what was actually meant. +The fact that I am not able to understand what is actually happening, I see as major issue. + +I don't like the later paragraphs on extensions, model for speedup, convergence and topologies. I don't understand whether these are novel contributions or not, as the authors refer to other work for details. If these are novel, the explanation is not sufficient, particularly speedup part, which contains undefined quantities, e.g. T(P) (or I can't find it). If this is not novel, It does not provide enough explanation to understand anything more, compared with a its version compressed to 1/4 of its size and referring to the other work. The statement that we can recover the original convergence guarantees seems strong and I don't see why it should be trivial to show (but author point to other work which I did not look at). In topologies part, claiming that something does ""true SGD"", without explaining what is ""true SGD"" seems very strange. Other statements in this section seem also very vague and unjustified/unexplained. + +Experimental section seems to suggest that the method is interesting for binary autoencoders, but I don't see how would I conclude anything about any other models. ParMAC is also not compared to alternative methods, only with itself, focusing on scaling properties. + +Conclusion contains statements that are too strong or misleading based on what I saw. In particular, ""we analysed its parallel speedup and convergence"" seems ungrounded. Further, the claim ""The convergence properties of MAC remain essentially unaltered in ParMAC"" is unsupported, regardless of the meaning of ""essentially unchanged"". 
+ +In summary, the method seems relevant for particular model class, binary autoencoders, but clarity of presentation is insufficient - I wouldn't be able to recreate the algorithm used in experiments - and the paper contains a number of questionable claims.",4,2.0,ICLR2017 +WhRQiVmxos,1,O1pkU_4yWEt,O1pkU_4yWEt,NER and entity linking with distant supervision for Russian text using BERT as a classifier,"This paper introduces an end-to-end task that identifies medical +entities and links them to UMLS concepts for Russian biomedical +text. The paper uses distant supervision to identify the entities with +their corresponding concepts and uses a Russian pre-trained BERT model +to perform the task. The most frequent 10K medical concepts are +selected and the task is treated as a classification task. + +The strength of this paper is that it is probably (one of) the first +to address NER and entity linking for biomed concepts in Russian with +a straightforward BERT model. The main weakness of the paper stems +from the way the training/validating datasets are created. + +The dataset is extracted from EHR notes and the labeling is done by a +simple rule-based model that involves exact matching. I'm not +convinced that this is going to have general coverage, and, hence, the +model and the validation are biased to the entities for which exact +matching appear more often. The validation includes also a manually +annotated portion (1.5K records) labelled only for the top 15 most +common entities. I dind't follow the numbers in Table 3 too well (not +clear what column # represents). The recall once a large amount of +text is used for training seems quite high (high 90s). However, I +haven't seen a discussion on the precision, which, from my experience +with English text, seems to be more problematic. + +I didn't fully understand how the entity spans are detected. I think 1 +could include more details with a concrete example. If you think about +it, that figure is so general that it could be part of any paper +(minus the specialized version of BERT). + +I think the paper is for the most part well written. The +validation/evaluation is rather weak, with some stats missing. + +My take is that this paper does not have a high degree of novelty and +it is also rather niche topic, as such, I think it's better suited for +a more specialized venue such as the BioNLP workshops collocated with +ACL conferences.",5,5.0,ICLR2021 +ag91KYqOKw,1,5UY7aZ_h37,5UY7aZ_h37,Review of 'Transferring Inductive Biases through Knowledge Distillation',"This paper shows that knowledge distillation from a (teacher) model A with an appropriate inductive bias to a (student) model B lacking it can lead to B generalizing better than if B was trained without knowledge distillation (but not as well as A), including out-of-distribution. The authors also show that the resulting learned representations inside B, as well as the shape of the training trajectories, are more like those of A (than those of B without knowledge distillation). + +This is not very surprising but is still interesting from the point of view of the understanding of the nature of inductive biases. We already knew that inductive biases (like translation invariance) can be transferred through examples (e.g. by generating data transformations such as translated images), so this paper extends that kind of idea to knowledge distillation to provide the targets for such examples. 
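+
+(For concreteness, I read the distillation setup as the standard Hinton-style objective $L = \alpha \, CE(y, \sigma(z_B)) + (1-\alpha) \, T^2 \, KL(\sigma(z_A/T) \| \sigma(z_B/T))$, with softmax $\sigma$, teacher logits $z_A$, student logits $z_B$ and temperature $T$; if the paper uses a different variant, the comments below should carry over unchanged.)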
+ +Another nice contribution of the paper is the case study of the specific inductive biases of RNNs which transformers lack, decomposed into sequentiality, memory bottleneck and recursion. Not very surprising but the experiments confirm intuitions and expectations which is always useful. + +One concern I have is 'so what?' and 'then what?'. Have the authors thought of possible way (in say, future work) to take advantage of that observation? It is not obvious, because if you already have a teacher model A with the right inductive biases for the given task, why would you care about training a student B which is going to be worse than A anyways? Just use A. In addition, unlike for the original motivation of knowledge distillation, we normally expect that B would have MORE capacity than A (because it needs to 'learn' the inductive biases, so one would expect it would not work to choose B much smaller than A, in the sense that the gain would be much smaller, and certainly not as good a model as using A). We already knew that examples could transfer inductive biases, now we know that knowledge distillation can do it, but why would that be useful? + +Experiments where B is much smaller than A would be interesting, because in that case, it might be worthwhile to do the knowledge distillation from a larger but better biased A. Also, the outcome of such experiments would not be apriori obvious (we would expect a gain vs the regular B, but would it be sufficiently interesting to be worth it?). + +Another question I would have liked to be studied is about what happens out-of-distribution (OOD). The paper already shows the unsurprising result that distilling into B from A helps somewhat OOD. It would also be interesting to explore whether taking inputs outside of the training distribution of A as distillation examples when training B would increase the robustness of B OOD. + +Minor comments: + +- fig 4: caption is insufficient to understand the figure +- the sec 3.2 sentence with 'almost closing the gap' is too strong and needs to be weakened (there is still a significant gap, with almost twice the error with B compared with A) +- the conclusion sentence with 'demonstrate having the right inductive bias can be crucial' should be reformulated, since this is not a new demonstration (and reading it without reading the rest of the paper may give that false impression) + +",5,4.0,ICLR2021 +r4wNW3w6VD7,3,paE8yL0aKHo,paE8yL0aKHo,Interesting idea but needs more rigorous experiments,"This paper introduces an algorithm that augments SAC with Curiosity-Aware Temperature, to enable more efficient exploration. Previous versions of SAC had a fixed entropy temperature which had to be tuned or an automatic tuning mechanism that was not state-specific. The paper proposes that exploration can be improved if temperature is state-dependent, based on curiosity, or unfamiliarity of the state. + +Authors introduce curiosity to the target temperature such that the entropy is large in unfamiliar states, promoting exploration, and small in familiar states, encouraging more exploitation. To enable this, the authors introduce three components: 1) target entropy that is augmented with curiosity, 2) curiosity and hence state based entropy, 3) X-RND that adds contrastive loss to ensure more robust computation of curiosity. + +Curiosity is based on prediction error of states, using the idea from Random Distillation Network (RND) (Burda et al. 2018b). 
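+(Concretely, I understand the RND-style curiosity as the prediction error $c(s) = \| \hat{f}_\theta(s) - f(s) \|^2$ between a trained predictor network $\hat{f}_\theta$ and a fixed, randomly initialized target network $f$, evaluated per state.)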
This curiosity term is normalized such that in expectation it corresponds to the original target entropy. The instance-level temperature also uses this curiosity to map states with similar level of unfamiliarity to similar temperature value. X-RND is a technique that the authors develop in order to overcome previous difficulties that RND had on feature inputs. + +In their benchmark experiments they show that their method CAT-SAC shows superiority compared to SAC as well as other baseline methods. They also show results on a toy domain how X-RND can estimate curiosity more robustly than RND. + +The authors have done a fair job introducing curiosity to enable better exploration by varying the target entropy at state-level. Disregarding minor grammar errors I think it is structured nicely. The idea is interesting but I would recommend reject as the experiments need to be conducted more rigorously. + +Benchmark experiments are conducted only with 4 runs on Mujoco environments that are known to have high variance based on random seeds. Henderson et al. 2018 (https://arxiv.org/pdf/1709.06560.pdf) show that on HalfCheetah, the same algorithm can have significantly different performance, between two groups of 5 random seeded runs. More runs need to be conducted in order to show credible performance improvements. +Furthermore, currently CAT-SAC surpasses SAC in all four Mujoco domains at 200k steps of training, but SAC results in the original paper (Haarnoja et al. 2018b) show SAC reaching avg return of 2000 in Hopper and 3200 in Walker. This also shows that current results where SAC performs worse than CAT-SAC could have been due to variance in random seeds. + +Comparison of CAT-SAC to other baseline method performance seems less appropriate too. Results are based on SUNRISE paper (Lee et al. 2020) which has not been peer-reviewed and which have also done only 4 runs each. + +Lastly, demonstration of X-RND seems to show that it is able to remember states it has visited and keep curiosity for remaining unvisited states. It looks like after the state is visited once the curiosity drops from 0.2 to 0.02 immediately. I’m not sure if it would be desirable to have curiosity drop suddenly after visiting it only once. To me X-RND seems like a mechanism with a replay buffer that limits neural networks from generalizing and making it tabular-like to output low curiosity only for states it has visited. X-RND certainly performs better than RND in this toy domain but I’m not sure if it adds that much value to the overall paper, where it is about having instance-level entropy that encourages exploration. + +For other minor details, I think the plots and labels in Figure 2(b) (c) are confusing.I think the x-axis should be index of each state, and prediction error of each state the label for the colormap. ",4,3.0,ICLR2021 +ryx8tGiohX,3,ByfbnsA9Km,ByfbnsA9Km,A set of nice results that is insightful and clarifies some controversy ,"The paper challenges recent claims about cross-entropy loss attaining max margin when applied to linear classifier and linearly separable data. Along the road, it presents a couple of nice results that I find quite interesting and I believe they provide useful insights. Finally it presents a simple modification to the cross-entropy loss, which the authors refer to as differential training, that alleviates the problem for the case of linear model and linearly separable data. 
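+
+(Throughout, by margin I mean the usual quantity $\min_i y_i (w^\top x_i + b) / \|w\|$ of a linear classifier $(w, b)$ on linearly separable data, i.e. the quantity the hard-margin SVM maximizes; the bias term $b$ matters for the discussion of Theorem 3 below.)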
+ +CONS: +I find the paper useful and interesting mainly because of its insightful results rather than the final algorithm. The algorithm is evaluated in a very limited setting (linear model, synthetic data, binary classification); it is not clear if similar benefits would carry over to nonlinear models such as deep networks. In fact, I strongly encourage the authors to do a generalization comparison by comparing the **test accuracy** obtained by their modified cross-entropy against: 1. Vanilla cross-entropy as well as 2. A deep model large margin loss function (e.g. as in ""Large Margin Deep Networks for Classification"" by Elsayed). Of course on a realistic architecture and non-synthetic datasets (e.g. CIFAR-10). + +PROS: +Putting the algorithm aside, I find the theorems interesting. In particular, Theorem 3 shows that some earlier claims about cross-entropy's ability to attain large margin (in the linearly separable case) is misleading (due to neglecting a bias term). This is important as it changes the faith of the community in cross-entropy and more importantly creates hope for constructing new loss functions with improved margin. +I also find the connection between the dimension of the subspace that contains the points and quality of margin obtained by cross-entropy insightful. +",8,3.0,ICLR2019 +SkxX-tyrT7,4,HklSf3CqKm,HklSf3CqKm,A good paper ,"This paper studies dictionary learning problem by a non-convex constrained l1 minimization. By using subgradient descent algorithm with random initialization, they provide a non-trivial global convergence analysis for problem. The result is interesting, which does not depend on the complicated initializations used in other methods. + +The paper could be better, if the authors could provide more details and results on numerical experiments. This could be used to confirm the proved theoretical properties in practical algorithms. ",7,3.0,ICLR2019 +SJpiju-Ve,2,S1LVSrcge,S1LVSrcge,Interesting exploratory work.,"This is high novelty work, and an enjoyable read. + +My concerns about the paper more or less mirror my pre-review questions. I certainly agree that the learned variable computation mechanism is obviously doing something interesting. The empirical results really need to be grounded with respect to the state of the art, and LSTMs are still an elephant in the room. (Note that I do not consider beating LSTMs, GRUs, or any method in particular as a prerequisite for acceptance, but the comparison nevertheless should be made.) + +In pre-review responses the authors brought up that LSTMs perform more computation per timestep than Elman networks, and while that is true, this is an axis along which they can be compared, this factor controlled for (at least in expectation, by varying the number of LSTM cells), etc. A brief discussion of the proposed gating mechanism in light of the currently popular ones would strengthen the presentation. + +--- +2017/1/20: In light of my concerns being addressed I'm modifying my review to a 7, with the understanding that the manuscript will be amended to include the new comparisons posted as a comment.",7,4.0,ICLR2017 +BygCDbac37,3,rylhToC5YQ,rylhToC5YQ,"Promising unsupervised approach, but clarity issues","Overall and positives: + +The paper investigates the problem of multidocument summarization +without paired documents to summary data, thus using an unsupervised +approach. The main model is constructed using a pair of locked +autoencoders and decoders. The model is trained to optimize the +combination of 1. 
Loss between reconstructions of the original reviews +(from the encoded reviews) and original the reviews, 2. And the +average similarity of the encoded version of the docs with the encoded +representation of the summary, generated from the mean representation +of the given documents. + +By comparing with a few simple baseline models, the authors were able +to demonstrate the potential of the design against several naive +approaches (on real datasets, YELP and AMAZON reviews). +The necessity of several model components is demonstrated +through ablation studies. The paper is relatively well structured and +complete. The topic of the paper fits well with ICLR. The paper +provides decent technical contributions with some novel ideas about +multi-doc summary learning models without a (supervised) paired +data set. + +Comments / Issues + +[ issue 6 is most important ] + +1. Problem presentation. The problem was not properly introduced and +elaborated. In fact, there is not a formal and mathematical +introduction of the problem, input, output, dataset and model +parameters. The notations used are not very clearly defined and are +quite handwavy, (e.g. what is V, dimensions of inputs x_i was not +mentioned until much later in the paper). The authors should make +these more precise. Similar problem with presentations of the models, +parameters, and hyperparameters. + +3. How does non-equal weighted linear combinations of l_rec and l_sim +change the results? Other variation of the overall loss function? How +do we see the loss function interaction in the training, validation +and test data? With the proposed model, these could be interesting to +observe. + +4. In equation two, the decoder seems to be very directly affecting +the quality of the output summary. Teacher forcing was used to train +the decoder in part (1) of the model, but without ground truth, I +would expect more discussions and experiments on how the Gumbel +softmax trick affect or help the performance of the output. + +5. Baseline models and metrics + +(1) There should be more details on how the language model is trained, +some examples, and how the reviews are generated from the language +model as a base model (in supplement?). + +(2). It is difficult to get a sense of how these metrics corresponds +to the actual perceived quality of the summary from the +presentation. (see next) + +(3). It will be more relevant to evaluate the proposed design +vs. other neural models, and/or more tested and proved methods. + +6. The rating classifier (CLF) is intriguing, but it's not clearly +explained and its effect on the evaluation of the performance is not +clear: One of the key metrics used in the evaluation relies on the +output rating of a classifier, CLF, that predicts reader ratings on +reviews (eg on YELP). The classifier is said to have 72% +accuracy. First, the accuracy is not clearly defined, and the details +of the classifier and its training is not explained (what features are +its input, is the output ordinal regression). Equation 4 is not +explained clearly: what does 'comparing' in 'by comparing the +predicted rating given the summary rating..' mean? The classifier may +have good performance, but it's unclear how this accuracy should +affect the results of the model comparisons. + +The CLF is used to evaluate the rating of output +reviews from various models. There is no justification these outputs +are in the same space or generally the same type of document with the +training sample (assuming real Yelp reviews). 
That is probably +particularly true for concatenation of the reviews, and the CLF classifier +scores the concatenation very high (or eq 4 somehow leads to highest value +for the concatenation of reviews )... It's not clear whether such a classifier is +beneficial in this context. + +7. Summary vs Reviews. It seems that the model is built on an implicit +assumption that the output summary of the multi-doc should be +sufficiently similar with the individual input docs. This may be not +true in many cases, which affects whether the approach generalizes. +Doc inputs could be covering different aspects of the review subject +(heterogeneity among the input docs, including topics, sentiment etc), +or they could have very different writing styles or length compared to +a summary. The evaluation metrics may not work well in such +scenarios. Maybe some pre-classification or clustering of the inputs, +and then doing summarization for each, would help? In the conclusions section, the +authors do mention summarizing negative and positive reviews +separately. + + + + + +",5,4.0,ICLR2019 +nscZAmpeQGQ,3,9_J4DrgC_db,9_J4DrgC_db,Complex approach with little improvement,"Summary: This paper focuses on undirected exploration strategies in reinforcement learning. Following the prior work, this paper proposes an exploration method unifying the step-based and trajectory-based exploration. The authors propose to perturb only the last(linear) layer of the policy for exploration, instead of perturbing all layers of the policy network. Also, the authors use analytical and recurrent integration for policy updates. Experiments show that the proposed exploration strategy mostly helps A2C, PPO and SAC in three Mujoco environments. + +Clarity: +This paper is generally written clearly. Some details need more clarification as pointed out in 'Cons'. + +Originality: +As far as I know, the proposed technique is novel in the literature of undirected exploration. But for the three bullet points in section 1, the first point of ""Generalizing Step-based and Trajectory-based Exploration"" should not be one of the main contributions of this paper, because this paper follows the formulation of policy in van Hoof et al. (2017) and the latter proposed the generalized exploration connecting step-based and trajectory-based exploration. The work can be viewed as an extension of van Hoof et al. (2017) with a deep policy network. + +Significance: +The proposed method is mathematically solid, but the main concern lies in empirical performance. Nowadays SAC is the state-of-the-art and generally used method for continuous control tasks and it is more advanced than A2C and PPO. But the proposed method does not obviously improve the performance of SAC while inducing much more complexity in policy learning. Therefore the significance of the proposed approach in practice might be limited. + +Pros: +*The authors provide detailed mathematical derivation (in the main text and the appendix) to support the proposed method. +*The proposed method significantly outperforms the baselines when investigating the on-policy methods A2C and PPO. +*The authors provide ablative studies about hyper-parameter values and components of the proposed method with A2C. + +Cons: +*In section 4.2, ""we maintain and adapt a single magnitude σ for the parameter noise"". What's the motivation of this setting different from the formulation in section 4.1? +*In section 5, why the advantage of the proposed method is poor with SAC? What's the value of hyper-parameters α and δ? 
Is the proposed method sensitive to these hyper-parameter choices? +*In section 5, apart from the comparison of the performance of the learned policy, the comparison of the complexity (which might be measured by wall time to learn the policy?) of different exploration strategies can also be interesting. +*In the first two rows of Figure 1, why the baseline methods NoisyNet-A2C(PPO) and PSNE-A2C(PPO) even significantly underperform the vanilla A2C(PPO)? The intuition is that introducing exploration strategies will mostly help the agent learns more quickly. Is it possible that the baselines are not tuned well? +*The experiments on a single domain (Mujoco) seems not convincing enough. It will be better if there are experiments on other more complicated domains.",4,3.0,ICLR2021 +SJlybsupKH,1,ByeDl1BYvH,ByeDl1BYvH,Official Blind Review #2,"Summary: +This paper is about curvature as a general concept for embeddings and ML applications. The origin of this idea is that various researchers have studied embedding graphs into non-Euclidean spaces. Euclidean space is flat (it has zero curvature), while non-Euclidean spaces have different curvature; e.g., hyperbolic space has a constant negative curvature. It was noted that trees don't embed well into flat space, while they embed arbitrarily well into hyperbolic space. + +All of the notions of curvature, however, are defined for continuous spaces, and have to be matched in some sense to a discrete notion that applies to graphs beyond a particular class like trees. The authors study this setting, consider a variety of existing notions of curvature for graphs, introduce a notion of global curvature for the entire graph, and now to efficiently compute it. They also consider allowing these concepts to vary with the downstream task, as represented by the loss function. + + +Pros, Cons, and Recommendation + +The study of the various proposed distances is fairly interesting, although it's hard to say what the takeaway here is. I think the part I'm struggling with the most is the motivation. Why do we care about using a space of constant curvature? True, we do so when it's appropriate---we embed trees into hyperbolic space. But when we have a more complicated and less regular graph, then compressing all of that curvature information into a scalar doesn't seem like a good idea, and indeed that's the point of the Gu et al work that's being built on here: it mixes and matches various component spaces that each have constant curvature, but altogether have varying curvature. + +At the same time, I like the idea of studying a bunch of proposed measures and attempting to gain new insights. This is a pretty unusual paper for ICLR, since the experimental section is really barely there, and what's most interesting are really these atomic insights. If the authors work on the motivation I would consider accepting it---for now I gave it weak accept. + + + +Comments: +- The distinction between what the authors think of as ""global"" and ""local"" curvatures is confusing and should be explained further. From what I can see, the authors think of global as being a scalar, and local as being defined at each point; intuitively these seem like pretty bad labels. I would think of the ""global"" one as being coarse, and the ""local"" as being more refined, since it contains a lot more information. This is also related to the motivation: why stuff all of this information into one single scalar curvature? 
It forces you to take averages, while Gu et al defined a distribution over the local curvatures. + +If the idea is to simply use one space and not a product, there are in fact various spaces with non-constant curvature, e.g., the complex manifold CH^n. + +- Why do your need your graph to be unweighted at the very beginning of Section 2? On the other hand, you may want to define your graph to be connected for the distortion function to be well-defined. + +- The statement ""1, graph distances are hard to preserve:..."" isn't really meaningful, since for the example in 4.3.1, it is possible to embed that graph arbitrarily well. That is, even if the distortion isn't 0, it can be made as small as we desire. There are indeed graphs that are hard to embed (i.e., have lower bounds that do not go to 0) in reasonably tractable spaces, and the authors actually prove such a result, but the star graph is not one of these. + +- There's various tricks that actually make some of these graphs very easy to embed. One example is K_n in 4.33. Instead of just embedding K_n, embed the star graph on n+1 nodes, and place a weight of 1/2 on each edge. Now every pair of (non-central) nodes is at distance exactly. 1, and this thing is embeddable into hyperbolic space, etc. Interestingly, this is actually predicted by the Gromov hyperbolicity (for K_n_ that the authors briefly mention. + +The reason I bring this up is that even if the authors' project is successful, simple graph transformations may induce much better embeddings. That's fine, though, although it should be mentioned. + +- Can the authors write out what's going on for the hyperbolic lower bound on D_min in the proof of Thm. 4.1?",6,,ICLR2020 +S1teFU6gG,3,rJ5C67-C-,rJ5C67-C-,"The methods presented in the paper are direct adaptation of existing techniques and rely on heursitcs to work. These methods need to be more thoroughly evaluated (among themselves, to know which method suits for a given problem) as well as against against a simple baseline. ","This paper addresses the problem of embedding sets into a finite dimensional vector space where the sets have the structure that they are hyper-edges of a hyper graph. It presents a collection of methods for solving this problem and most of these methods are only adaptation of existing techniques to the hypergraph setting. The only novelty I find is in applying node2vec (an existing technique) on the dual of the hypergraph to get an embedding for hyperedges. + +For several methods proposed, they have to rely on unexplained heuristics (or graph approximations) for the adaptation to work. For example, why taking average line 9 Algorithm 1 solves problem (5) with an additional constraint that \mathbf{U}s are same? Problem 5 is also not clearly defined: why is there superscript $k$ on the optimization variable when the objective is sum over all degrees $k$? + +It is not clear why it makes sense to adapt sen2vec (where sequence matters) for the problem of embedding hyperedges (which is just a set). To get a sequence independent embedding, they again have to rely on heuristics. + +Overall, the paper only tries to use all the techniques developed for learning on hypergraphs (e.g., tensor decomposition for k-uniform hypergraphs, approximating a hypergraph with a clique graph etc.) to develop the embedding methods for hyperedges. It also does not show/discuss which method is more suitable to a given setting. In the experiments, they show very similar results for all methods. 
Comparison of proposed methods against a baseline is missing. + + +",5,4.0,ICLR2018 +Hyxi-Xp6qS,2,SJeLO34KwS,SJeLO34KwS,Official Blind Review #1,"The paper is out of my research area. I could only provide little recommendation. I have tried to read this paper, but it was rather tedious with heavy notations. It would be more friendly to represent the models in visible way for example using diagrams as I can see that the model is a sequence matrix operators with non-linear transformations after that. The paper states that the proposed DrGCNs can improve the stability of GCN models via mean field theory. The experiments were conducted on benchmark datasets and the proposed method was compared to several GCN variations. ",6,,ICLR2020 +Ske9ubVNYS,1,S1lukyrKPr,S1lukyrKPr,Official Blind Review #1,"In the paper, authors proposed a generative adversarial network-based rumor detection model that can label short text like Twitter posts as rumor or not. The model can further highlight the words that are responsible for the rumor accusation. + +Proposed model consists of 4 sub models: a G_Where model finds the word to replace so to create an artificial rumor; a G_replace model decides what the replacement word should be; a D_classify model detects if a sequence is a rumor; a final D_explain model pinpoints the word of concern. D_ models and G_ models are trained in an adversarial competing way. + +Experiments showed that the LEX-GAN model outperforms other non-GAN models by a large margin on a previously published rumor dataset (PHEME) and in a gene classification task. + +My questions: + +1) The task modeled is essentially a word replacement detection problem. Is this equivalent to rumor detection? Even if it performs really well on a static dataset, it could be very vulnerable to attackers. Various previous works mentioned in the paper, including the PHEME paper by Kochkina et al, used supporting evidence for detection, which sounds like a more robust approach. + +2) Authors didn't explain the rationale behind the choice of model structure, e.g. GRU vs LSTM vs Conv. The different structures have been used in mix in the paper. Are those choices irrelevant or critical? + +3) I would like to see more discussion on the nature of errors from those models, but it's lacking in the paper. This could be critical to understand the model’s ability and limitation, esp given that it’s not looking at supporting evidences from other sequences. + +Small errors noticed: The citation for PHEME paper (Kochkina et al) points to a preprint version, while an ACL Anthology published version exists.",3,,ICLR2020 +rJeyiVOptH,1,HylZIT4Yvr,HylZIT4Yvr,Official Blind Review #2,"The paper proposes a model to address the Any-Code Generation (AnyGen) task, which basically to fill missing code from a given program. The model makes use of partial Abstract Syntax Tree (AST) as input. The model learns representation for partial AST paths and use the learnt representation to generate AST node at masked steps. The conducted experiments show that using AST paths from root and leaves are good for AST node generation, but whether those inputs are robust and sufficient should be further explored. + +There are some restrictions to the method, for example, the input is only a single function, and the missing expression is not that complex. Nevertheless this work presents a novel method towards code generation. The paper also introduces a new metric to evaluate the prediction accuracy for generated expressions. Writing is clear. 
Evaluation is fairly comprehensive. + +Questions: +1. Did the author test the method without the camel notation assumption, i.e. the data contains non-camel notation or mixed notations? +2. In the Restrict Code Generation test, it seems that the author filters out non-primitive types and user-defined functions. Therefore, does the experiment on Java-small dataset fully show the proposed model’s strength? +3. Can the author explain why the SLM model fail on Figure 6? Is it because of dividing token into sub tokens? +4. How big is the token vocabulary? How does the vocab size affect the performance? +",6,,ICLR2020 +Syx34y_sYB,2,BkgF4kSFPB,BkgF4kSFPB,Official Blind Review #2,"The paper propose a novel visual planning approach which constructs explicit plans from ""hallucinated"" states of the environment. To hallucinate states, it uses a Conditional Variational Autoencoder (which is conditioned on a context image of the domain). To plan, it trains a Contrastive Predictive Coding (CPC) model for judging similarities between states, then applies this model to hallucinated states + start/end states, then runs Dijkstra on the edges weighted by similarities. + +I vote for accepting this paper as it tackles two important problems: where to get subgoals for visual planning and what similarity function to use for zero-shot planning. Furthermore, the paper is clearly written, the experiments are well-conducted and analyzed. + +Detailed arguments: +1. Where to get subgoals for visual planning is an important question persistently arising in control tasks. SPTM-style solution is indeed limited because it relies on an exploration sequence as a source of subgoals. Every time the environment changes, data would need to be re-collected. Getting subgoals from a conditional generative model is a neat solution. +2. Benchmarking similarity functions is crucial. One productive way to approach zero-shot problems is to employ similarity functions, but the question arises: what algorithm to use for training them? The paper compares two popular choices: CPC and Temporal Distance Classification (in particular, R-network). It thus provides guidance that CPC might be a better algorithm for training similarity functions. +3. The paper is well-positioned in the related work and points to the correct deficiencies of the existing methods. It also features nice experimental design with controlled complexity of the tasks, ablation studies and two relevant baselines. + +I would encourage the authors to discuss the following questions: +1) Fidelity in Table 3 - why is it lower for SPTM compared to HTM if both methods rely on the same generated samples? Is it because HTM selects betters samples than SPTM for its plans? +2) Why is fidelity larger for SPTM in a more complex task 2? +3) Same question about fidelity/feasibility for HTM1/2? +4) Are there any plans to open-source the code?",8,,ICLR2020 +BkeQx84T2X,3,ryGiYoAqt7,ryGiYoAqt7,Barely any novelty,"The paper proposes an augmentation of the DDPG algorithm with prioritized experience replay plus parameter noise. Empirical evaluations of the proposed algorithm are conducted on Mujoco benchmarks while the results are mixed. + +As far as I can see, the paper contains almost no novelty as it crudely puts together three existing algorithms without presenting enough motivation. 
This can be clearly seen even from the structuring of the paper, since before the experimental section, only a short two-paragraph subsection (4.1) and an algorithm chart are devoted to the description of the main ideas. Furthermore, the algorithm itself is a just simple addition of well-known techniques (DDPG + prioritized experience replay + parameter noise) none of which is proposed in the current paper. Finally, as shown in the experimental sections, I don't see a evidence that the proposed algorithm consistently outperform the baseline. + +To sum up, I believe the submission is below the novelty threshold for a publication at ICLR.",3,4.0,ICLR2019 +HylKKUQB3m,2,SJeXSo09FQ,SJeXSo09FQ,might be an interesting idea; the writing quality is not great; clearly insufficient evaluation,"The paper proposes a version of GANs specifically designed for generating point clouds. The core contribution of the work is the upsampling operation: in short, it takes as an input N points, and produces N more points (one per input) by applying a graph convolution-like operation. + +Pros: ++ The problem of making scalable generative models for point clouds is clearly important, and using local operations in that context makes a lot of sense. + +Cons: +- The paper is not particularly well-written, is often hard to follow, and contains a couple of confusing statements (see a non-exhaustive list of remarks below). +- The experimental evaluation seems insufficient: clearly it is possible to come up with more baselines. Even a comparison to other types of generative models would be useful (e.g. variants of VAEs, other types of GANs). There also alternative local graph-convolution-like operations (e.g. tangent convolutions) that are designed for point clouds. In addition, it is quite strange that results are reported not for all the classes in the dataset. + +Various remarks: +p.1, “whereby it learns to exploit a self-similarity prior to sample the data distribution”: this is a confusing statement. +p.2, “(GANs) have been shown on images to provide better approximations of the data distribution than other generative models”: This statement is earthier too strong (all other models) or does not say much (some other models) +p.2, “However, this means that they are unable to learn localized features or exploit weight sharing.”: I see the point about no weight sharing in the generator, but feature learning +p.3, “the key difference with the work in this paper is that PointNet and PointNet++ are not +generative models, but are used in supervised problems such as classification or segmentation.”: Yet, the kind of operation that is used in the pointnet++ is quite similar to what you propose? +p.4: “because the high dimensionality of the feature vectors makes the gridding approach unfeasible.”: but you are actually dealing with the point clouds where each point is 3D? +",6,4.0,ICLR2019 +U7vgvB00KNk,1,Te1aZ2myPIu,Te1aZ2myPIu,Motivation and explanations of methodology could be improved,"This paper proposes an improved sample-wise randomized smoothing technique, where the noise level is tuned for different samples, for certification of robustness. Further, it also proposes a pretrain-to-finetune methodology for training networks which are then certified via sample-wise randomized smoothing. 
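+(As context for what follows: in the Cohen et al. formulation the certified $\ell_2$ radius of a smoothed classifier is $R = \frac{\sigma}{2}(\Phi^{-1}(p_A) - \Phi^{-1}(p_B))$, where $p_A, p_B$ are the top-two class probabilities under Gaussian noise of level $\sigma$ and $\Phi$ is the standard normal cdf, so the per-sample choice of $\sigma$ directly scales the radius; this is my shorthand for the setting, not necessarily the paper's exact expression.)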
The authors show in experiments on CIFAR and MNIST that combining their training methodology and certification methodology can sometimes improve the average certified when compared to state-of-the-art randomized smoothing techniques Smooth-Adv (Salman et. al, 2019). + +I recommend a rejection because the key takeaways of the paper should be clarified and the pretrain-to-finetune framework and the allocation of regions must be explained and justified better. + +The key idea of using different noise levels for different samples is intuitive and explained well in the motivation section (4.1). Furthermore, the authors show that their methodology does indeed lead to minor improvements in average certified l2-radius on Smooth-Adv for the CIFAR dataset, which is a more interesting dataset than MNIST, where the proposed technique performs similarly or slightly worse than Smooth-Adv. + +However, the paper does have shortcomings in its clarity and organization. First, I think the sample-wise certification is a clear and well-motivated idea, and should be discussed as the major contribution, rather than the pretrain-to-finetune framework. Furthermore, I was confused about the allocation of regions in the prediction step of the sample-wise certification; explaining why it is necessary, and why it is better than allocating a region for every single test datapoint (which is what I thought the motivation section in 4.1 explained) would improve the paper significantly. Finally, the amount of notation in the paper should be simplified significantly, and the notation often makes the paper more confusing (and sometimes, I could not understand due to either incorrect or unclear notation). For example, the pseudocode in Algorithm 1 would have been better if the notation was simplified, and in Algorithm 2, I did not know what B_{i_j} referred to at all. + +Specifically regarding the allocation of regions, I did not understand why it was necessary or led to improvements over choosing a new region for each test datapoint. Explaining it clearly, and showing an ablation study that compares using region-allocation and not using region-allocation would provide good motivation for its use. + +Specifically regarding the pretrain-to-finetune framework, I have the following questions: +I saw that in Appendix C that the pretrain-to-finetune framework is necessary for the sample-wise randomized smoothing to show an improvement. Are there explanations for why sample-wise randomized smoothing does not well work by itself? + +Why does it make sense to do this 2 step procedure? Why does the pre-training have to involve varying noise levels if the fine-tuning procedure already finds the optimal noise level for each sample to train with? Could the pre-training just be the same as Smooth-Adv? + +How much does it matter which noise levels we choose during the pre-training phase? I noticed that the authors usually chose noise from 0.12 up to the amount that they compare to with SmoothAdv, but the reasons for this are not discussed. + + +Overall, I feel that the paper has a well-motivated idea (sample-wise randomized smoothing) and shows some minor improvements in terms of results, but that clarity for all other parts of the paper must be improved significantly. + + +Post Rebuttal Update: + +I appreciate the author response, but I will maintain my score after reading the rebuttal and discussion with other reviewers. 
It still appears to me that the motivation and clarity can be improved, and so I would recommend focusing on those aspects in future revisions. Additionally, baselines such as ""allocating a region for every single test point"" should be compared to in a clear way (as opposed to being in the appendix), as such baselines seem natural to compare to.",4,4.0,ICLR2021 +BylDXuqCnQ,3,ByezgnA5tm,ByezgnA5tm,interesting approach with inconclusive results,"This paper presents an DFA-based approach to constrain certain behavior of RL agents, where ""behavior"" is defined by a sequence of actions. This approach assumes that the developer has knowledge of what are good/bad behavior for a specific task and that the behavior can be checked by hand-coded DFAs or PDAs. During training, whenever such behavior is detected, the agent is given a negative reward, and the RL state is augmented with the DFA state. The authors experimented with different state augmentation methods (e.g. one-hot encoding, learned embedding) on 3 Atari tasks. + +The paper is clearly written. I also like the general direction of biasing the agent's exploration away from undesirable regions (or conversely, towards desired regions) with prior knowledge. However, I find the results hard to read. + +1. Goal. The goal of this work is unclear. Is it to avoid disastrous states during exploration / training, or to inject prior knowledge into the agent to speed up learning, or to balance trade-offs between constraint violation and reward optimization? It seems the authors are trying to do a bit of everything, but then the evaluation is insufficient. For example, when there are trade-offs between violation and rewards, we expect to see trade-off curves instead of single points for comparison. Without the trade-off, I suppose adding the constraint should speed up learning, in which case learning curves should be shown. + +2. Interpreting the results. 1) What is the reward function used? I suppose the penalty should have a large effect on the results, which can be tuned to generate a trade-off curve. 2) Why not try to add the enforcer during training? A slightly more complex baseline would be to enforce with probability (1-\epsilon) to control the trade-off. 3) Except for Fig 3 right and Fig 4 left, the constraints doesn't seem to affect the results much (judging from the results of vanilla DQN and DQN+enforcer) - are these the best settings to test the approach? + +Overall, an interesting and novel idea, but results are a bit lacking.",5,4.0,ICLR2019 +BC8jzrPf4Xh,3,lQdXeXDoWtI,lQdXeXDoWtI,A testbed good for domain generalization research,"In this paper, the authors implement a test bed to evaluate domain generalization methods in a unified way. The works is important because current methods use different model selection approaches, which may not reflect the inherent properties of the DG algorithms. + +Model selection is a fundamentally difficult problem in the presence of distribution shift. However, it was significantly ignored in previous works. It is nice to see that the authors provide three kinds of model selection methods. From the results, it seems that existing DG methods do not have a clear advantage over ERM even when test-domain validation test is used. Does this mean existing methods themselves are not good? Or the dataset might not be appropriate for DG? It seems hard even for human to generalize to new domains when given a small number of domains with many changing factors. + +I have some questions regarding the test bed details. 
+1) Did the authors implement the existing methods or use the source codes provided by the authors? +2) The authors carefully implemented and tuned ERM, did the authors also tuned the other methods carefully? This may require a significant amount of work, because different methods may need different engineering tricks. + +",7,5.0,ICLR2021 +kzvrrP8ob61,2,j0yLJ-MsgJ,j0yLJ-MsgJ,This paper systematically investigated the class imabalance problem in few-shot learning from multiple aspects. ,"The authors present a detailed study of few-shot class-imbalance along three axes: dataset vs. support set imbalance, effect of different imbalance distributions (linear, step, random), and effect of rebalancing techniques. The authors extensively compare over 10 state-of-the-art few-shot learning methods using backbones of different depths on multiple datasets. The analysis reveals that 1) compared to the balanced task, the performances of their class-imbalance counterparts always drop, by up to 18.0% for optimization-based methods, although feature-transfer and metric-based methods generally suffer less, 2) strategies used to mitigate imbalance in supervised learning can be adapted to the few-shot case resulting in better performances, 3) the effects of imbalance at the dataset level are less significant than the effects at the support set level. + + +Pros: +1) the paper covers the state-of-the-art few-shot learning methods, over 10 methods are compared in the paper; +2) the work reveals some interesting insights in few-shot learning, such as the three analysis summarized in Abstract. +3) the experiments are reasonable. There are a number of comparisons between different methods on different data sets. The codes to reproduce the experiments is released under an open-source license. + +Cons: +1) the paper does not provide a new model and the contribution is marginal. +2) the experiments does not introduce new datasets as benchmark, all the datasets are heavily manipulated during testing. Is there any new data sets provides to test the assumptions of class-imbalance few-shot learning? +3) the paper does not fully discuss new possible research directions in the field of class imbalance few learning. Although the authors discuss some insight into the previously unaddressed CI problem in the (meta-) training dataset and conclude that the effects of imbalance at the dataset level are less significant than the effects at the support set level, the future work along this direction seems still unclear.",5,3.0,ICLR2021 +5e4BHdUJndU,2,eoQBpdMy81m,eoQBpdMy81m,The paper has a novel idea but needs more clarification. ," +The paper tries to use EM to explain the optimization procedure of the training method in a federated setting, and then treats the local model as a hidden variable of EM. Then, the authors propose a new FedSparse method to reduce communication efficiency by training a sparse model. + +Pros: + +It is a novel idea to treat local mode’s parameter \phi as a hidden variable in an EM framework. In particular, the following FedSparse algorithm is designed based on this setting. + +This proposed sparsity-based model compression method is linked to federated learning in a comprehensive way. + +As compared to FedAvg, the proposed method can reduce about 50% communication cost without scarifying accuracy. + +Cons: + +The FedAvg is an SGD-based optimization, and treating the local model’s parameters as hidden variable phi_s is a little bit confusing. 
In particular, the FedAvg has no regularization term to prevent \phi moving too far from w, and the FedAvg’s variant version involves such a mode divergence-based regularization term – FedProx (Tian Li et. al. 2018). Therefore, the definition of p(phi | w) in Equation 3 need more discussion. + +The paper uses a classic sparsity method, spike, and slab (Mitchell & Beauchamp, 1988). The authors need to introduce a more model sparsity method for compression purposes and then justify choosing the spike and slab. + +In the experiment part, the proposed method and baseline methods have no significant difference in terms of accuracy. From a communication efficiency perspective, the proposed method should be also compared to FedDrop. Moreover, reduced communication cost in FedSparse is not very high. + + + + ",5,4.0,ICLR2021 +5Ltv7Oj3GSa,2,sAzh_FTFDxz,sAzh_FTFDxz,Review AnonReviewer1," +**UPDATE** + +I acknowledge that I have read the author responses as well as the other reviews. I appreciate the clarifications and improvements made during the rebuttal phase, which I think have further strengthened this work. + +I find the key contributions of this work to be (i) demonstrating that recent methods that include labeled anomalies into training can suffer from unfavorable biases, and (ii) providing a framework for a theoretical analysis of this setting. + +Though I see that the presented results are somewhat what one would expect, to my knowledge such an analysis hasn't been carried out in the existing literature. + +Since weak forms of supervision (here few labeled anomalies) appears to be a promising research direction for anomaly detection, I find this critical and rigorous analysis to be worth circulating the community. + +For these reasons, I would keep my recommendation to accept this work (score: 7) + +##### + + +**Summary** + +This paper studies biases in semi-supervised anomaly detection which is the setting where in addition to mostly nominal data a few labeled anomalies are also available for training. A theoretical framework for the semi-supervised setting is introduced that is based on binary classification which formulates the objective of a scorer as seeking to maximize the detection recall (true positive rate (TPR)) at a given target false positive rate (FPR). Using this framework, a relative scoring bias is derived that enables to assess the relative performance difference between unsupervised and semi-supervised detectors. Furthermore, finite sample rates are derived for this relative scoring bias, which subsequently are also validated empirically via synthetic simulations. Finally, an empirical evaluation that includes six recent state-of-the-art deep anomaly detection methods (Deep SVDD, Deep SAD, HSC, AE, SAE, and ABC) is presented on Fashion-MNIST, Statlog (Landsat Satellite), and ImageNet that demonstrates and highlights scenarios where the bias of a labeled (unrepresentative) anomaly set can be useful, but also harmful for anomaly detection performance. + + +**Pros** ++ The paper presents a novel theoretical PAC framework for analyzing and understanding bias in semi-supervised anomaly detection. The framework extends a previous classification-based view on anomaly detection [4] to the semi-supervised setting. ++ Deep semi-supervised anomaly detection methods [2, 5, 7] that aim to include and learn from labeled anomalies is a timely topic of high practical relevance. 
++ The experimental evaluation demonstrates that including labeled anomalies might introduce an unfavorable bias that can decrease detection performance, which is an important insight. ++ The paper has a clear structure and is easy to follow. + +**Cons** +- Some related work is missing, especially previous classification-based views on anomaly detection [8]. +- There are some questions left open (see below). +- The current manuscript includes some (minor) typos that should be fixed. + + +**Recommendation** + +I recommend to accept this paper. + +The paper presents a well-motivated and useful theoretical framework for the timely and relevant semi-supervised anomaly detection setting. The arguments and derivations are technically correct. To my knowledge, this is also the first instance of a finite sample complexity bound on the scoring bias for this setting. The theoretical claims are validated through simulations and tested on real-world datasets in a scientifically rigorous manner. An important message of the analysis is that including labeled anomalies can introduce a bias that can be harmful for anomaly detection performance. In this regard, I think the paper also covers important ground for future analysis and towards building semi-supervised models that are unbiased. + + +**Questions** + +(1) How does the presented view compare to well-known previous classification-based views [8]? + +(2) ‘for $\xi$, it also converges to a certain level.’ Specifically the level predicted by the bound of Theorem 3? + +(3) How did you stabilize maximizing the reconstruction error for labeled anomalies in SAE? I suspect optimizing this objective is +unstable and prone to blow up. + +(4) Scenario 2 would make a compelling case for using Outlier Exposure [1]. Did you conduct such experiments similar to [6]? + +(5) At the end of Section 3, the empirical TPR estimate should have $s(x_j) > \tau$ in the indicator function, correct? + +(6) In the infinite sample case in Section 4.1, do you refer to the Glivenko–Cantelli theorem when citing Parzen (1980)? + + +**Additional feedback and ideas for improvement** +- Include Outlier Exposure [1] in the experimental analysis. +- There exists further recent related work on biases in anomaly detection that observes that detectors may correctly detect anomalies, but based on wrong (spurious) features [3]. This should be added to the list of works studying biases in anomaly detection. + + +**Minor Comments** +1. The last paragraph in the Introduction, in which the contributions are listed, is a bit repetitive after the preceding paragraphs. +2. In Section 4.1, $F_0(t)$ and $F_a(t)$ should have $s(x) \leq t$ in their definition to be consistent, right? +3. The notation for the number of anomalous training samples mixes $m$ and $n_1$. +4. Proposition 1: ‘[...], the relative scoring bias *is* [...]’ +5. In Section 4.1, after Proposition 1, the $\text{TPR}(s', \tau')$ function is missing parentheses. +6. Corollary 2: ‘Let $q$ be a fixed target FPR. [...] Then, the relative scoring bias is [...]’ +7. Note that $\Phi$ denotes the cdf of the standard Gaussian. +8. Finite sample case: ‘[...], where we follow the convention to assume *that the anomaly data* amounts to [...]’ +9. Theorem 3: The ‘-’ in the cdf superscripts should be ‘-1’. 10. There are spaces missing after ‘i.i.d.’ in the text. + + +##### + +**References** + +[1] D. Hendrycks, M. Mazeika, and T. G. Dietterich. Deep anomaly detection with outlier exposure. In ICLR, 2019. + +[2] D. Hendrycks, M. Mazeika, S. 
Kadavath, and D. Song. Using self-supervised learning can improve model robustness and uncertainty. In NeurIPS, pages 15637–15648, 2019. + +[3] J. Kauffmann, L. Ruff, G. Montavon, and K.-R. Müller. The Clever Hans effect in anomaly detection. arXiv preprint arXiv:2006.10609, 2020. + +[4] S. Liu, R. Garrepalli, T. Dietterich, A. Fern, and D. Hendrycks. Open category detection with PAC guarantees. In ICML, volume 80, pages 3169–3178, 2018. + +[5] G. Pang, C. Shen, and A. van den Hengel. Deep anomaly detection with deviation networks. In KDD, pages 353–362, 2019. + +[6] L. Ruff, R. A. Vandermeulen, B. J. Franks, K.-R. Müller, and M. Kloft. Rethinking assumptions in deep anomaly detection. arXiv preprint arXiv:2006.00339, 2020. + +[7] L. Ruff, R. A. Vandermeulen, N. Görnitz, A. Binder, E. Müller, K.-R. Müller, and M. Kloft. Deep semi-supervised anomaly detection. In ICLR, 2020. + +[8] I. Steinwart, D. Hush, and C. Scovel. A classification framework for anomaly detection. Journal of Machine Learning Research, 6(Feb):211–232, 2005. +",7,4.0,ICLR2021 +SyeE1XoaiX,1,BkgYIiAcFQ,BkgYIiAcFQ,motivation and experiment are not convincing enough,"This paper provide a modification on the classical LSTM structure. Specifically, it reformulate the forget gate with a monotonically decreasing manner, using sinusoidal function as the activation function. + +However, both the motivation and experimental results on such modification are not convincing enough. + +1. While there are many heuristic guesses in sec3, important supports of these guesses are missed. For example, Figure 2 is designed to provide supports for the claim that we need controlled forget gates. However, all the values of forget gates and input gates in Figure 2 are manually set as *conceptual observations*, which provides limited insight on what will happen in the real cases. While the reformulation in sec4 is based on the observations in Figure 2, it is important to plot the real cell propagation after the reformulation, and see whether the real observation meets the conceptual observations in Figure 2. +BTW, Plots in Figure 2 only account for LSTMs' propagation within 3 steps, but in real cases there are way more steps. + +2. The authors claim monotonic propagation in the constant forget gates is more interpretable than those of the vanilla-LSTM, as no abrupt shrinkage and sudden growth are observed. But it isn't straightforward to get the relations between abrupt shrinkage and sudden growth on forget gates and the expressive power of the vanilla-LSTM. Also, it's hard to say the monotonic propagation is more interpretable because we don't know what's the meaning of such propagation on the behaviors of LSTMs in applications. + +3. The reformulation in sec 4, especially the formula for the forget-polar input p_k, looks heavily hand-crafted, without experimental supports but statements such as ""we ran numerous simulations"", which is not convincing enough. + +4. Experiments are applied on MNIST and Fashion-MNIST. While both datasets are not designed in nature for sequential models like LSTMs. There are better datasets and tasks for testing the proposed reformulation. e.g. sentence classification, text generation, etc. No explanation on the choice of datasets. In addition, the difference between vanilla-LSTM and DecayNet-LSTM is small and it's hard to say it isn't marginal. Maybe larger-scale datasets are needed. + +5. Lacking of explanation on specific experimental settings. E.g. 
training all methods for *only one epoch*, which is very different from the standard practice. + +6. More qualitative interpretations for real cell states in both vanilla LSTM and DecayNet-LSTM are needed. Only conceptual demonstration is included in Figure 2. ",4,4.0,ICLR2019 +SJ1aaWHNx,2,rJbbOLcex,rJbbOLcex,review,"This paper presents TopicRNN, a combination of LDA and RNN that augments traditional RNN with latent topics by having a switching variable that includes/excludes additive effects from latent topics when generating a word. +Experiments on two tasks are performed: language modeling on PTB, and sentiment analysis on IMBD. +The authors show that TopicRNN outperforms vanilla RNN on PTB and achieves SOTA result on IMDB. + +Some questions and comments: +- In Table 2, how do you use LDA features for RNN (RNN LDA features)? +- I would like to see results from LSTM included here, even though it is lower perplexity than TopicRNN. I think it's still useful to see how much adding latent topics close the gap between RNN and LSTM. +- The generated text in Table 3 are not meaningful to me. What is this supposed to highlight? Is this generated text for topic ""trading""? What about the IMDB one? +- How scalable is the proposed method for large vocabulary size (>10K)? +- What is the accuracy on IMDB if the extracted features is used directly to perform classification? (instead of being passed to a neural network with one hidden state). I think this is a fairer comparison to BoW, LDA, and SVM methods presented as baselines. ",7,4.0,ICLR2017 +SkkxKWJbM,3,Sy3XxCx0Z,Sy3XxCx0Z,This paper marginally improves performance on SNLI using a limited set of features indicating WordNet relations. The result is nice but predictable and the methods are not obviously applicable to other external forms of information. This contribution is not sufficient for ICLR.,"This paper adds WordNet word pair relations to an existing natural language inference model. Synonyms, antonyms, and non-synonymous sister terms in the ontology are represented using indicator features. Hyponymy and hypernymy are represented using path length features. These features are used to modify inter sentence attention, the final post-attention word representations, and the pooling operation used to aggregate the final sentence representations prior to inference. All of these three additions help, especially in the low data learning scenario. When all of the SNLI training data is used this approach adds 0.6% accuracy on the SNLI 3-way classification task. + +I think that the integration of structured knowledge representations into neural models is a great avenue of investigation. And I'm glad to see that WordNet helps. But very little was done to investigate different ways in which these data can be integrated. The authors mention work on knowledge base embeddings and there has been plenty of work on learning WordNet embeddings. An obvious avenue of exploration would compare the use of these to the use of the indicator features in this paper. Another avenue of exploration is the integration of more resources such as VerbNet, propbank, WikiData etc. An approach that works with all of these would be much more impressive as it would need to handle a much more diverse feature space than the 4 inter-dependent features introduced here. + +Questions for authors: + +Is the WordNet hierarchy bounded at a depth of 8? 
If so please state this and if not, what is the motivation of your hypernymy and hyponymy features?",3,5.0,ICLR2018 +Byl7mCRpFr,1,HkgH0TEYwH,HkgH0TEYwH,Official Blind Review #854," +Summary of the work +- The work proposes a new method two find anomaly (out of distribution) data when some labeled anomalies are given. +- The authors apply information theory-derived loss based on that the normal (in distribution) data usually have lower entropy compared to that of the abnormal data. +- The paper conducts extensive experiments on MNIST, Fashion-MNIST, and CIFAR 10, with varying the number of labeled anomlies. + +I think the paper is well written and the experiment seems to support the authors argument. Unfortunately, this field is not overlapped to my research field, and it is hard for me to judge this paper.",6,,ICLR2020 +NvD3b4cPkZ5,4,QnzSSoqmAvB,QnzSSoqmAvB,Insufficient contribution and experiments.,"This paper proposes NDMZ, which extends the previous MuZero algorithm to stochastic two-layer zero-sum games of perfect information. NDMZ formalize chance as a player (chance player) and introduces two additional quantities: the player identity policy and the chance player policy. NDMZ also introduce new node classes to MCTS, which allows it to accommodate chance. + +One major weakness of the paper is there is its lack of novelty/contribution. The core idea of interleaving chance nodes with choice nodes to model stochastic environment in tree search is not brand new. The main contribution of this paper is the integration of such idea into the specific MuZero tree search framework. This might be OK if the authors could present strong enough experimental results. + +However, this brings up a second main weakness of the paper, which is lack of sufficient experimental justification. The proposed NDMZ is only evaluated on Nannon, a simplified version of backgammon, which is not sufficient. I would suggest the authors to evaluate the algorithm on at least 5~10 Atari games. Although the Atari environment is known to be a deterministic, it becomes stochastic when there is only a limited horizon of past observed frames. In addition, just like MuZero, the proposed NDMZ should also be adapted to single-player tasks like Atari games. Therefore, under this setting, Atari could be used to simulate the stochastic environment for the current purpose. In addition, it would also be helpful to further evaluate it on the more complicated tasks such as Chinese Dark Chess, as suggested by the authors. + +Finally, the dynamic evaluation (including both top-move dynamic test and uniform dynamics test) only evaluate the accuracy of the learned dynamics by only examining the chance of selecting illegal move. This seems to be just one very restricted perspective of dynamics evaluation. I’m wondering whether there should be other more comprehensive ways to do this?",4,4.0,ICLR2021 +fmdR4Q9luZZ,4,TJSOfuZEd1B,TJSOfuZEd1B,The paper is written well,"The paper proposed a method —- GeDi — to generate guided and controlled text from a large language model (LM). The method utilizes smaller LMs as generative discriminators to guide generation from large LMs to make them safer and more controllable. By safer and controllable they emphasis on the toxicity, hate, bias, and negativity contains in the training of the large LM. The proposed method guides generation at each time step by computing classification probabilities for all possible next tokens via Bayes rule by normalizing over two class-conditional distributions (i.e. 
contrastive discrimination); one conditioned on the desired attribute, or control code, and another conditioned on the undesired attribute (i.e. contrastive attribute), or anti-control code. + +The paper explores ways to increase generation speed and claimed that with the proposed techniques the generation speeds more than 30 times faster compared to PPLM model. The paper explores different heuristics to impose the guided generation including bias parameter, weighted decoding and filtering heuristics. The findings are that GeDi gives stronger controllability than the state of the art method (i.e.PPLM, CC-LM, CTRL). + +Experiments show that, training GeDi on four topics (i.e. Business, Science/Tech, Sports, World ) allows the controlled generation of new topics zero-shot from just a keyword. They also demonstrate that GeDi can make GPT-2 (1.5B parameters) significantly less toxic without sacrificing linguistic quality. + + +Re: “so long as the LM and GeDi share the same tokenization”: can you please elaborate the constraint on ‘same tokenization’? + +Re: “If the GeDi was trained on movie reviews for sentiment control, its direct class-conditional predictions will be biased towards predicting movie review words (illustrated by next word prediction of “cinematic”). However, by contrasting the predictions of opposing control codes via Bayes rule, the bias towards movie reviews can be cancelled out.”: The word cinematic can reveal a neutral/negative sentiment, is there any possibility that pushing the sentiment towards positive might degrade the accuracy of the overall generation? + +Re: GeDi training (λ < 1 in Equation (10)) and standard generative training(λ = 1 in Equation (10)). : How the value for λ = 0.6 was chosen? What is the impact of other values for this hyper-parameter? + +Re: “In order to have prompts that are more likely to trigger aggressive generations but less likely to be explicitly toxic, we pass candidate prompts through a RoBERTa (Liu et al., 2019) model trained to classify toxicity, and only kept prompts where RoBERTa was less confident about the toxicity label.“: how did you measure model confidence about the toxicity label?",6,4.0,ICLR2021 +SJl9pTb827,2,r1xQQhAqKX,r1xQQhAqKX,Great paper! Could use more uncertainty-measuring application / experiments,"pros: The paper is well-written and well-motivated. It seems like uncertain-embeddings will be a valuable tool as we continue to extend deep learning to Bayesian applications, and the model proposed here seems to work well, qualitatively. Additionally the paper is well-written, in that every step used to construct the loss function and training seem well motivated and generally intuitive, and the simplistic CNN and evaluations give confidence that this is not a random result. + +cons: I think the quantitative results are not as impressive as I would have expected, and I think it is because the wrong thing is being evaluated. It would make the results more impressive to try to use these embeddings in some active learning framework, to see if proper understanding of uncertainty helps in a task where a good uncertainty measure actually affects the downstream task in a known manner. Additionally, I don't think Fig 5 makes sense, since you are using the embeddings for the KNN task, then measuring correlation between the embedding uncertainty and KNN, which might be a high correlation without the embedding being good. + +Minor comments: + - Typo above (5) on page 3. 
+ - Appendix line under (12), I think dz1 and dz2 should be after the KL terms. + +Reviewer uncertainty: I am not familiar enough with the recent literature on this topic to judge novelty. ",7,3.0,ICLR2019 +BklmK1PpKS,2,SkgscaNYPS,SkgscaNYPS,Official Blind Review #3,"This paper uses NTK and techniques from deriving NTK to study asymptotic spectrum of Hessian both at initialization and during training. For understanding neural network’s optimization and generalization property, understanding Hessian spectrum is quite important. +This paper gives explicit formula for limiting moments of Hessian of wide neural networks throughout training. + +In detail, Hessian of neural networks can be decomposed into two components, denoted by H = I + S. In the infinite width I is totally described by NTK, and authors show that I and S are asymptotically orthogonal (both at initialization and during training). Residual contribution is described by S, which captures evolution of Hessian by its first moments Tr (S) since Tr (S^2) remains constant and Tr (S^k ) for k>=3 vanishes. + +Corollary 1 has analytic dynamics of moment of Hessian in the case of MSE loss demonstrating power of this paper’s main Theorem. This is also supported by experiments in Figure 1. + +few comments: +Authors should follow format given by ICLR style file. The paper is more dense than typical submission and may have violated page limit (10 pages max) if the style guide line was followed. +Similar to the prior comment, the reference section should be cleaned and formatted better. The reference doesn’t count towards page limit and I don’t understand the reason for them to be formatted badly and become eligible. +It would be useful if the Figure axes are more legible. +There have been many variations of NTK beyond vanilla FC networks(Arora et al. 2019, Yang 2019). Is there a major block for the analysis given in the paper to extend beyond FC networks? +",8,,ICLR2020 +ry8UxQ6gM,3,H1Dy---0Z,H1Dy---0Z,A somewhat trivial extension of Prioritized Experience Replay by adding parallelization in actor algorithm,"This paper proposes a distributed architecture for deep reinforcement learning at scale, specifically, focusing on adding parallelization in actor algorithm in Prioritized Experience Replay framework. It has a very nice introduction and literature review of Prioritized experience replay and also suggested to parallelize the actor algorithm by simply adding more actors to execute in parallel, so that the experience replay can obtain more data for the learner to sample and learn. Not surprisingly, as this framework is able to learn from way more data (e.g. in Atari), it outperforms the baselines, and Figure 4 clearly shows the more actors we have the better performance we will have. + +While the strength of this paper is clearly the good writing as well as rigorous experimentation, the main concern I have with this paper is novelty. It is in my opinion a somewhat trivial extension of the previous work of Prioritized experience replay in literature; hence the challenge of the work is not quite clear. Hence, I feel adding some practical learnings of setting up such infrastructure might add more flavor to this paper, for example. ",6,3.0,ICLR2018 +H1gsi8U0hX,2,B1e4wo09K7,B1e4wo09K7,"Technically Sound, Well Written, but The Key Idea is Not Very New","This paper is well written, and the quality of the figures is good. In this paper, the authors propose an invariant-covariant idea, which should be dated back at least to the bilinear models. 
The general direction is important and should be pursued further. + +However, the literature is not well addressed. Eslami et al. 2018 have been cited, but some very important and related earlier works like: +[1] Kulkarni et al. 2015, Deep Convolutional Inverse Graphics Network +[2] Cheung et al. 2015, Discovering Hidden Factors of Variation in Deep Networks +were not discussed at all. The authors should certainly make an effort to discuss the connections and new developments beyond these works. At the end of section 1, the authors argue that the covariant vector could be more general, but in fact, these earlier works can achieve further equivalence, which is much stronger than the proposed covariance. + +There is also an effort to compare this work to Sabour et al. 2017 and the general capsule idea. I would like to point out, the capsule concept is a much more fine-grained what & where separation rather than a coarse-grained class & pose separation in one shot. In a hierarchical representation, what & where can appear at any level as one class can consist of several parts each with a geometrical configuration space. So the comparison of this work to the generic capsule network is only superficial if the authors can not make the proposed architecture into a hierarchical separation. Besides different capsule network papers, I found another potentially useful reference on a fine-grained separation: +[3]Goroshin et al., Learning to Linearize Under Uncertainty + +In the paper, it is argued several times that the latent vector r_y contains a rich set of global properties of class y, rather than just its label and the aim is that it can learn what the elements of the class manifold have in common. But this point is not supported well since we can always make a label and this latent vector r_y equivalent by a template. I think this point could be meaningful if we look at r_y's for different y, where each of the dimension may have some semantic meaning. Additional interpretation is certainly needed. + +Under equation (3), ""Note that v is inferred from r_y"" should be ""inferred from both r_y and x"", which is pretty clear from the fig 5. Related to this, I could imagine some encoder can extract the 'style' directly from x, but here both r_y and x are used. I couldn't find any guarantee that v only contains the 'style' information based on the architecture with even this additional complication, could the authors comment on this? + +Equation (5) is not really a marginalization and further equation (6) may not be a lower bound anymore. This is probably a relatively minor thing and a little extra care is probably enough. + +The numbers in table 2 seems a little outdated. + +To conclude, I like the general direction of separating the identity and configurations. The natural signals have hierarchical structures and the class manifold concept is not general enough to describe the regularities and provide a transparent representation. Rather, it's a good starting point. If the authors could carefully address the related prior works and help us understand the unique and original contributions of this work, this paper could be considered for publication.",5,5.0,ICLR2019 +ftdzTUyETGf,3,cT0jK5VvFuS,cT0jK5VvFuS,This paper proposes to improve neural processes by changing the aggregation operator and using a hierarchical latent structure. 
These decisions are aimed at improving the performance in the low-data regime.,"The paper aims at increasing the sample diversity of neural processes when the condition set is small, while maintaining visual fidelity. The low-data regime is arguably where neural processes are most interesting, and in that regard the paper is right to turn to this setting. The discussion on how different aggregation functions affect the predictive uncertainty of the neural process is also appreciated, as is the experiment on regressing the size of the condition set based on the latent embedding. + +Unfortunately, the experiments section does not paint a clear enough picture. While the experiments show us that the proposed modifications have some benefits, it is not so clear how much each part contributes. Especially the contribution of the SIVI bound is hard to judge. As it stands the paper feels a bit incomplete in this regard. For that reason I cannot recommend accepting the paper at this stage, though I am willing to revise my score based on the authors' response. Specifically, I'd appreciate if the paper could make a clear case for adopting the hierarchical latent variable structure and the SIVI bound, as these add complexity to the method (while the max-aggregator does not). + +### Pros + +* The paper deals with a relevant issue. Neural processes are most interesting when the condition set is small, and this scenario has so far been largely ignored. +* The discussion on the choice of aggregator is useful, as are the experiments on the variational posterior entropy and the prediction of the context set size. + +### Cons + +* It is unclear how much each modification (SIVI and max-pooling) contribute. The experimental results compare SIVI+max pooling with NP+max pooling, but SIVI+mean is omitted. It's also notable that NP+max seems to work better than SIVI+max in the CelebA dataset. Some discussion would be helpful here, as I don't see any reason why a hierarchical latent structure should hurt in any case, barring optimization difficulties. +* I am not familiar with SIVI and I don't expect the average reader to be either. I'd appreciate some discussion on the choice of using it. + +### Other comments + +* The inception score sounds like the mutual information between the class label and the generated image. I expect that stating this would help some readers. +* Perhaps it would help to look at each component in isolation and in different settings, e.g. outside the image domain. I can see how max-pooling might be good for images, while other aggregation methods might have an edge in, e.g. a dataset of robot joint trajectories. Other people have investigated the choice of aggregation method, and reading this work reminded me of work by [Soelch et al.](https://arxiv.org/abs/1903.07348), which might be interesting to the authors.",5,3.0,ICLR2021 +BaEccdWKLSt,4,YmA86Zo-P_t,YmA86Zo-P_t,Why is this important?,"The paper introduces a series of new datasets and task and investigates the inductive bias of seq2seq models. For each dataset, (at least) two hidden hypothesis could explain the data. The tasks investigated are count-vs-memorization, add-or-multiply, hierarchical-or-linear, composition-or-memorization. The datasets consists of one sample with varying length (amount of input/output pairs), which is denoted as description length. The models are evaluated on accuracy and a logloss. An LSTM, CNN, and Transformer are all trained on these datasets. Multiple seeds are used for significance testing. 
The results suggests that LSTM is better at counting when provided with a longer sequence, while the CNN and Transformer memorizes the data, but are better at handling hierarchical data. What this paper excels at is a thorough description of their experimental section and their approach to design datasets specifically for testing inductive bias, which I have not previously seen and must thus assume is a novel contribution. + +However, I lean to reject this paper for the following reasons +- The paper tries to fit into the emerging field of formal language datasets for evaluating the capacity of deep learning methods. However, they do not build on any of the recent papers in the field. A new dataset, especially a synthetic one, should be well motivated by shortcomings of previous datasets and tasks in the field. I find the motivation and related works section lacking in that sense. +- We already know that LSTMs can count https://arxiv.org/abs/1906.03648 and that transformer cannot https://www.mitpressjournals.org/doi/full/10.1162/tacl_a_00306 +- It is not clear to me why these results are important? Who will benefit from this analysis? Why are the current AnBnCn and DYCK languages that formal language people work with insufficient? +- LSTMs do not have the capacity to perform multiplication. I don’t know why your results suggest otherwise. You would need to incorporate special units that can handle multiplications in the LSTM, such as https://arxiv.org/abs/2001.05016 + +Update + +First I'd like to thank the authors for their detailed rebuttal. I have upgraded my recommendation from 3 to 4. As mentioned in my review I believe this approach is interesting. However, as pointed by reviewer2, the experimental section lacks completeness. I think this experimental section would be suitable for a workshop, but not a conference. I am excited to hear you are considering to use this method as an inspiration for real problems. I'd like to see the paper resubmitted when you have obtained such results.",4,3.0,ICLR2021 +B1xwlWqiKr,1,SJlYqRNKDS,SJlYqRNKDS,Official Blind Review #1,"This paper proposes blockwise adaptivity. We divide the parameters into blocks, for example in a linear threshold unit the bias term is in a bias term block while the input weights are in an input weight block. We then average the square norm of the gradients over each block and use the same adaptation based on this average square norm for all parameters in the block. theoretical and experimental results are given. + +The idea of assigning different learning rates to different types of parameters is old. Extending this idea to blockwise adaptation is natural and intuitively I would expect this to be an improvement on Adam. However, I did not find this paper very compelling. First, I believe that the theoretical results are a straightforward adaptation of know methods. Second, and more significantly, comparing optimizers empirically is very tricky and I am not convinced that the experiments described here are convincing. In particular the performance of optimizers is very sensitive to the tuning of hyper-parameters. I would need to be convinced that the hyper-parameter tuning is sufficient. Grid search is very inefficient compared to random or quasi-random methods. Adam has four hyper-parameters and gird search over four parameters is very difficult. For vision applications of Adam joint tuning of the learning rate and the epsilon parameter is critical --- these parameters are coupled. 
It seems extremely likely to me that the move to blockwise adaptation has a profound effect on the optimal value of epsilon. A thorough investigation of epsilon tuning is needed to demonstrate the value of blockwise adaptation in vision applications. + +Postscript: + +I still feel that the theoretical analysis provides little to no evidence of in-practice value of the method. In general I find that theorems guaranteeing getting stuck on a flat plateau to be not very exciting. What about the exploration goal of local search or mcmc? In the absence of meaningful theory, the empirical results are what matter. + +I have read the response and am not convinced by the comments on optimizing epsilon. I still believe that empirical claims about optimizers require extraordinary do-diligence in hyper-parameter optimization of both the proposed method and the baselines.",3,,ICLR2020 +SJgs3cCaFH,1,SkgS2lBFPS,SkgS2lBFPS,Official Blind Review #2,"This paper presents a bilingual generative model for sentence embedding based variational probabilistic framework. By separating a common latent variable from language-specific latent variables, the model is able to capture what's in common between parallel bilingual sentences and language-specific semantics. Experimental results show that the proposed model is able to produce sentence embeddings that reach higher correlation scores with human judgments on Semantic Textual Similarity tasks than previous models such as BERT. + +Strength: 1) the idea of separating common semantics and language-specific semantics in the latent space is pretty neat; 2) the writing is very clear and easy to follow; 3) the authors explore four approaches to use the latent vectors and four approaches to merge semantic vectors, makes the final choices reasonable. + +Weakness: + +1) Experiments: + +My major concern is the fairness of the experiments. The authors compare their model with many state-of-the-art models that could produce sentence embeddings. However, how they produce the sentence embeddings with existing models is not convincing. For example, why using the hidden states of the last four layers of BERT? Moreover, the proposed model is trained with parallel bilingual data, while the BERT model in comparison is monolingual. Also, the proposed deep variational model is close to an auto-encoder framework. You can also train a bilingual encoder-decoder transformer model (perhaps with pre-trained BERT parameters) with auto-encoder objective using the same parallel data set. It seems to be a more comparable model to me. + +Although the proposed model is based on variational framework, there's no comparison with previous neural variational models that learn encodings of texts as well such as https://arxiv.org/abs/1511.06038. + +2) Ablation study and analysis + +I really like the idea of separating common semantic latent variables with language-specific latent variables. However, I expected to see more analysis or experimental results to show why it is better than a monolingual variational sentence embedding framework. ",3,,ICLR2020 +Syxli5rPnm,2,SJz1x20cFQ,SJz1x20cFQ,"Good results, major assumption","Brief summary: +HRL method which uses a 2 level hierarchy for sparse reward tasks. The low level policies are only provided access to proprioceptive parts of the observation, and are trained to maximize change in the non-proprioceptive part of the state as reward. The higher level policy is trained as usual by commanding lower level policies. 
+ +Overall impression: +I think the paper makes a major assumption about the separation of internal and external state, which in turn sets the form of the low-level primitives. This may not be fully general, but it is particularly useful for the classes of tasks shown here, as seen from the strong results. I would like to see the method applied more generally to other robotic tasks, along with a comparison to Florensa et al., and perhaps the addition of a video showing the learned behaviors. + +Introduction: +the difficulty of learning a high-level controller when the low-level policies shift -> look at “data efficient hierarchical reinforcement learning” (Nachum et al.) + +The basic assumption that we can separate observations into proprioceptive and non-proprioceptive parts can often be difficult to satisfy. For example, with visual inputs or entangled state representations, this separation might be very challenging to extract. The idea relies heavily on what is “internal” and what is “external” to the agent, which may be quite challenging to separate. + +The introduction of phase functions also seems to be very specific to locomotion?
Related work: +The connection to Florensa et al. on learning diverse policies should be discussed, since they also do something similar with their mutual information term. DeepMimic and DeepLoco (Peng et al.) also use phase information in the state and are worth citing. + +Section 3.1: +The pros and cons of assuming that the representation is disentangled enough to allow this separation should be discussed. + +Also, the internal and external state should be illustrated with a concrete example, e.g., for the ant. + +Section 3.2: +The objective for learning diverse policies is in some sense more general than that of Florensa et al., but it is in the same vein of thinking. What are the pros and cons of this approach over theirs?
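For concreteness, my reading of the low-level objective described in the summary (rewarding the change in the non-proprioceptive part of the state) is a per-step quantity along the lines of the sketch below. This is my own illustration under that reading, not the authors' code, and the observation layout (internal coordinates first) is an assumption:

```python
import numpy as np

def external_change_reward(obs_t, obs_tp1, internal_dim):
    # Assumed observation layout: the first `internal_dim` entries are the
    # proprioceptive (internal) part; the remaining entries are the external state.
    ext_t = np.asarray(obs_t)[internal_dim:]
    ext_tp1 = np.asarray(obs_tp1)[internal_dim:]
    # Per-step intrinsic reward: how far the external state moved in one step.
    return float(np.linalg.norm(ext_tp1 - ext_t))
```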

The objective is greedy in the change of external state. Wouldn't we instead want something that maximizes the change over the whole trajectory? + +Section 3.3:
How well would these cyclic objectives work in a non-locomotion setting, for example manipulation? + +Section 3.4:
This formulation is really quite standard in many HRL methods such as options framework. The details can be significantly cut down, and not presented as a novel contribution. + +Experiments: +It is quite cool that Figure 2 shows very significant movement, but in some sense this is already supervised to say “move the CoM a lot”. This should be compared with explicitly optimizing for such an objective, as in Florensa et al. I’m not sure that this would qualify as “unsupervised” per se. As in it too is using a particular set of pre-training tasks, just decided by the form of choosing internal and external state. + +all of the baselines fail to get close to the goal locations.-> this is a bit surprising? Why are all the methods performing this poorly even when rewarded for moving the agent as much as possible. + +Overall, the results are pretty impressive. A video would be a great addition to the paper. + +Comparison to Eysenbach et al isn’t quite fair since that method receives less information. If given the extra information, the HRL method performs much better (as indicated by the ant waypoint plot in that paper).",7,5.0,ICLR2019 +SklZzltOqS,3,BJlOcR4KwS,BJlOcR4KwS,Official Blind Review #5,"This paper studies the channel-collapsed problem in CNNs using 'BN+ReLU' . The Channel Equilibrium block which consists of batch decorrelation branch and adaptive instance inverse branch are proposed to reduce the channel-level sparsity. Experiments on ImageNet and COCO demonstrate that the proposed CE block can achieve higher performance than the conventional CNNs by introducing little computational complexity. The author also discuss the relationship between the proposed method and Nash Equilibrium. + +Pros: + ++ The experimental results are impressive, the proposed block can improve the accuracy of CNNs while requires little additional computation cost. ++ This paper is well-written and easy to follow. The authors give a explicit explanation as well as prove of the proposed scheme. + +Cons: + +- The motivation of this paper seems to be weak. The author argues that popular CNNs with 'BN+ReLU' have certain channels which would always output 0 for any input. Why not directly remove this channel to achieve speed-up? +- Moreover, the author argues that 'BN+ReLU' block would lead to channel-level sparsity according to [1]. However, [1] says that this sparsity relies on weight decay. Figure 3 (d) also proves that the sparsity ratios of BN and CE are all 0 when weight decay is set as 0 (also notes that they achieve best accuracy when weight decay is 0). The results demonstrate that the higher accuracy of CE does not rely on its lower sparsity ratio. + +In conclusion, the proposed CE is effective for achieving higher accuracy. However, the motivation and argument of the proposed method seems to be invalid, which prevents this work to be accepted. + +[1] On implicit filter level sparsity in convolutional neural networks. CVPR, 2019.",3,,ICLR2020 +S1zLl6mlz,1,Sy1f0e-R-,Sy1f0e-R-,A good survey of GAN evaluation metrics with exhaustive experimental evaluations.,"In the paper, the authors discuss several GAN evaluation metrics. +Specifically, the authors pointed out some desirable properties that GANS evaluation metrics should satisfy. +For those properties raised, the authors experimentally evaluated whether existing metrics satisfy those properties or not. +Section 4 summarizes the results, which concluded that the Kernel MMD and 1-NN classifier in the feature space are so far recommended metrics to be used. 
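To make the recommended criterion concrete, the following is a minimal sketch of an unbiased estimate of the squared kernel MMD between real and generated samples in a feature space. This is my own illustration rather than code from the paper, and the RBF kernel with a fixed bandwidth is an assumption (in practice one would choose the bandwidth, e.g., by the median heuristic, and compute the features with a pretrained network):

```python
import numpy as np

def rbf_mmd2_unbiased(X, Y, sigma=1.0):
    # X: (n, d) features of real samples; Y: (m, d) features of generated samples.
    X, Y = np.asarray(X), np.asarray(Y)

    def rbf(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-d2 / (2.0 * sigma**2))

    Kxx, Kyy, Kxy = rbf(X, X), rbf(Y, Y), rbf(X, Y)
    n, m = len(X), len(Y)
    # Unbiased estimator: drop the diagonal (self-similarity) terms.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * Kxy.mean()
```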
+ +I think this paper tackles an interesting and important problem, what metrics are preferred for evaluating GANs. +In particular, the authors showed that Inception Score, which is one of the most popular metric, is actually not preferred for several reasons. +The result, comparing data distributions and the distribution of the generator would be the preferred choice (that can be attained by Kernel MMD and 1-NN classifier), seems to be reasonable. +This would not be a surprising result as the ultimate goal of GAN is mimicking the data distribution. +However, the result is supported by exhaustive experiments making the result highly convincing. + +Overall, I think this paper is worthy for acceptance as several GAN methods are proposed and good evaluation metrics are needed for further improvements of the research field. +",8,3.0,ICLR2018 +SylwTdykpQ,3,SyNvti09KQ,SyNvti09KQ,"Interesting approach, but analysis of the results should be improved","Summary: +This submission proposes a reinforcement learning framework based on human emotional reaction in the context of autonomous driving. This relies on defining a reward function as the convex combination of an extrinsic (goal oriented) reward, and an intrinsic reward. This later reward is learnt from experiments with humans performing the task in a virtual environment, for which emotional response is quantified as blood volume pulse wave (BVP). The authors show that including this intrinsic reward lead to a better performance of a deep Q networks, with respect to using the extrinsic reward only. +Evaluation: +Overall the proposed idea is interesting, and the use of human experiments to improve a reinforcement learning algorithm offers interesting perspectives. The weakness of the paper in my opinion is the statistical analysis of the results, the lack of in depth evaluation of the extrinsic reward prediction and the rather poor baseline comparison. +Detailed comments: +1. Statistical analysis +The significance of the results should be assessed with statistical methods in the following results: +Section 4.1: Please provide and assessment of the significance of the testing loss of the prediction. For example, one could repetitively shuffle blocks of the target time series and quantify the RMSE obtained by the trained algorithm to build an H0 statistic of random prediction. +Section 4.2: the sentence “improves significantly when lambda is either non-zero or not equal to 1” does not seem valid to me and such claim should in any case be properly evaluated statistically (including correction for multiple comparison etc…). +Error bars: please provide a clear description in the figure caption of what the error bars represent. Ideally in case of small samples, box plots would be more appropriate. +2. Time lags in BVP +It would be interesting to know (from the literature) the typical latency of BVP responses to averse stimuli (and possible the latency of the various mechanisms, e.g. brain response, in the chain from stimuli to BVP). Moreover, as latency is likely a critical factor in anticipating danger before it is too late, it would important to know how the prediction accuracy evolves when learning to predict at different time lags forward in time, and how such level of anticipation influence the performance of the Q-network. +3. Poor baseline comparison +The comparison to reward shaping in section 4.4 is not very convincing. 
One can imagine that what counts is not the absolute distance to a wall, but the distance to a wall in the driving direction, within a given solid angle. As a consequence, a better heuristic baseline could be used. +Moreover, it is unclear whether the approaches should be compared with the same lambda: the authors need to provide evidence that the statistics (mean and possibly variance) of the chosen heuristic is match to the original intrinsic reward, otherwise it is obvious that the lambda should be adapted. +4. Better analysis of figure 5-6(Minor) +I find figure 5-6 very interesting and I would suggest that the authors fully comment on these results. E.g. : (1) why the middle plot of Fig. 6 mostly flat, and why such differences between each curve from the beginning of the training. (2) Why the goal oriented task leads to different optimal lambda, is this just a normalization issue? +",6,4.0,ICLR2019 +qVW5tR_lv3P,1,4xzY5yod28y,4xzY5yod28y,"I tend to accept this paper due to its impressive experimental results. However, there are some theoretical issues need to be addressed.","This paper proposes to restart the momentum parameter in SGD (with Nesterov's momentum) according to some carefully chosen schedules in training deep neural network, which is named as SRSGD. Two different restarting schedules are proposed: linear schedule and exponential schedule. The strong point of this paper is its extensive experimental evaluations, which justify that SRSGD significantly improves the convergence speed and generalization over standard momentum SGD. The empirical analysis also sheds some light on the parameter tuning and interpretation of SRSGD. + +This paper is well-written and easy to follow. Overall, I like this work and see the following strengths of this paper: +- The proposed approach significantly outperforms standard momentum SGD in several DNN tasks and the advantage grows with the depth of DNN, which is appealing to practitioners. +- The experiments are comprehensive and constructive. Various image classification benchmarks were tested and the confidence intervals are also provided. I appreciate the effort the authors put into that. +- Although the proposed restarting schedules seem to be hard to tune in practice, the authors provide empirical analysis on the impacts of different parameters, which gives some guidance. + +On the other side, I have some concerns on the theoretical aspects: + +- Theorem 1 (and 2) assumes a bounded variance of the stochastic gradient, which may not be true for ERM (e.g., for least squares ""Jain, P., et al. (2018). Accelerating Stochastic Gradient Descent for Least Squares Regression. In COLT, pages 545–604.""). Thus, the ""mini-batch stochastic gradient"" is not precise in Theorem 1 (and 2) and should be replaced with certain assumptions. +- Under the classic bounded variance assumption, there exists a nice workaround of the non-convergence issue of NASGD in Theorem 1: the AC-SA approach proposed in ""Lan, G. (2012). An optimal method for stochastic composite optimization. Math. Program., 133(1-2):365–397."". AC-SA is basically NASGD with some parameter constraints. It can be shown that AC-SA maps to scheme (5) with $\frac{t_k - 1}{t_{k+1}} = \frac{\beta_k - 1}{\beta_{k+1}}$ and $s_k=\frac{\gamma_k}{\beta_k}$. 
Corollary 1 in (Lan, G., 2012) states that if $\frac{t_k - 1}{t_{k+1}} = \frac{k-1}{k+2}$ and $s_k= \min\{\frac{1}{2L}, \frac{C}{N^{1.5}}\}$, where $N$ is the total number of iterations, NASGD converges at the optimal $O(\frac{L}{N^2} + \frac{\sigma}{\sqrt{N}})$ rate. Thus, the non-convergence issue can be solved by simply fixing $N$ in advance and setting $s_k$ accordingly. + +Due to these points, the discussion in Section 3.1 is not very insightful to me. There are some recent works studying the non-acceleration issues of SGD with HB and NAG momentum for least squares: +- Kidambi, R., et al. (2018).On the insufficiency of existing momentum schemes for Stochastic Optimization. In ICLR. +- Liu, C. and Belkin, M. (2020). Accelerating SGD with momentum for over-parameterized learning. In ICLR. + +Their setting (stochastic unbounded variance) is somewhat closer to ERM. Since this line of work is quite relevant, a proper literature review and adding relevant bibliographical entries might be necessary. + +Other comments: +- In fact, both the constant scheme (3) and the variable-pararmeter one are referred to as NAG in optimization community. In the strongly convex case, NAG uses a constant $\mu$ to achieve acceleration and is a non-monotone method (this relates to the claim under ARNAG in Section 2). See his book ""Nesterov, Y. (2018). Lectures on convex optimization, volume 137. Springer"". +- ""Constant momentum achieves state-of-the-art result in DL (Sutskever et al., 2013)"". Actually, Sutskever et al. (2013) tried a variable momentum similar to the one in the paper. +- The second sentence in Related Work. ""SGD with scheduled momentum"" -> ""These works all use constant momentum""? +- The discussion about NAG with $\delta$-inexact oracle does not seem to be quite connected with other parts of the paper since this inexactness is deterministic. +- ""dom(J)"" is not defined in Appendix A.",6,4.0,ICLR2021 +RydFGH04m0f,2,KBWK5Y92BRh,KBWK5Y92BRh,"An interesting idea for NAS, where the neighborhood of the model is considered. ","** Summary +The authors proposed neighborhood-aware neural architecture search, where during the evaluation phase during search, the neighborhood of an architecture is considered. Specifically, when an architecture $\alpha$ is picked, its neighbors $\mathcal{N}(\alpha)$ all contribute to the performance validation. This is built upon the assumption that `` flat minima generalize better than sharp minima’’ and the authors verify it in Appendix C. +The authors conducted experiments on CIFAR-10/100 and ImageNet, and obtained promising improvements over the standard baselines. + +** Clarify +1. Towards ""Due to the property of the total variation distance, when $d$ is an integer, the neighborhood contains all the cells that have at most $d$ edges associated with different operations from $\alpha$"". As you reported, $\alpha$ is a collection of $\alpha^{(i,j)}$, where each $\alpha^{(i,j)}$ is a one hot vector “ But in DARTS, $\alpha^{i,j}$ is a distribution of all candidate operations. In this case, how to select the $\mathcal{N}(\alpha)$? + +** Significance +Overall, I think the results are solid. +1. There is a possible “ensemble’” baseline for your method. Let us take DARTS as an example. First, we independently train $n_{nbr}$ one-shot models. Each one-shot model has an individual $\alpha$. Then, we sample an architecture from the average of all $\alpha$’s. This could be another way to leverage neighborhood information and should be compared. +2. 
Improvement not significant: In Table 4, compared to DARTS+, the improvement is 0.1/0.4 on CIFAR10/100. Compared to PDARTS, on the three datasets, the results between your method and PDARTS are almost the same. Therefore, you should apply your method to more recent advantaged methods to show that it is orthogonal or others. +3. As you pointed in Section 2, (Zela et al 2020) also observed a strong correlation between the generalization error of the architecture found by DARTS. It is good to see the exploration in Table 6. I think the ""NA-DARTS-ES"" should be implemented to see whether your method and (Zela et al 2020) are complementary to each other. +4. The code should be released to reproduce your work. + +== Post Rebuttal == + +I am satisfied with the response ""ensemble baseline"" and ""NA-DARTS-ES"". But my concern about ""Improvement not significant"" is not addressed, which is also mentioned by R1 and R4. I will remain my score as 6. +",6,4.0,ICLR2021 +H1ep3mjh2Q,3,HJgeEh09KQ,HJgeEh09KQ,Interesting ideas but not persuasive enough,"This paper proposed a mixed strategy to obtain better precision on robustness verifications of feed-forward neural networks with piecewise linear activation functions. + +The topic of robustness verification is important. The paper is well-written and the overview example is nice and helpful. + +The central idea of this paper is simple and the results can be expected: the authors combine several verification methods (the complete verifier MILP, the incomplete verifier LP and AI2) and thus achieve better precision compared with imcomplete verifiers while being more scalable than the complete verifiers. However, the verified networks are fairly small (1800 neurons) and it is not clear how good the performance is compared to other state-of-the-art complete/incomplete verifiers. + +About experiments questions: +1. The experiments compare verified robustness with AI2 and show that RefineAI can verify more than AI2 at the expense of much more computation time (Figure 3). However, the problem here is how is RefineAI or AI2 compare with other complete and incomplete verifiers as described in the second paragraph of introduction? The AI2 does not seem to have public available codes that readers can try out but for some complete and incomplete verifiers papers mentioned in the introductions, I do find some public codes available: +* complete verifiers +1. Tjeng & Tedrake (2017): github.com/vtjeng/MIPVerify.jl +2. SMT Katz etal (2017): https://github.com/guykatzz/ReluplexCav2017 + +* incomplete verifiers +3. Weng etal (2018) : https://github.com/huanzhang12/CertifiedReLURobustness +4. Wong & Kolter (2018): http://github.com/locuslab/convex_adversarial + +How does Refine AI proposed in this paper compare with the above four papers in terms of the verified robustness percentage on test set, the robustness bound (the epsilon in the paragraph Abstract Interpretation p.4) and the run time? The verified robustness percentage of Tjeng & Tedrake is reported but the robustness bound is not reported. Also, can Refine AI scale to other datasets? + +About other questions: +1. Can RefineAI handle only piece-wise linear activation functions? How about other activation functions, such as sigmoid? If so, what are the modifications to be made to handle other non-piece-wise linear activation functions? + +2. In Sec 4, the Robustness properties paragraph. ""The adversarial attack considered here is untargeted and therefore stronger than ..."". 
The approaches in Weng etal (2018) and Tjeng & Tedrake (2017) seem to be able to handle the untargeted robustness as well? + +3. In Sec 4, the Effect of neural selection heuristic paragraph. ""Although the number of images verified change by only 3 %... produces tighter output bounds..."". How tight the output bounds improved by the neuron selection heuristics? +",5,4.0,ICLR2019 +NatCqAo6gVT,4,6puUoArESGp,6puUoArESGp,Interesting direction but not convincing enough,"Summary: The paper studies the causal nature of concept explanations. Specifically, the authors treat labels as instrumental variables to then debias explanations and improve predictive performance as well. + +Strengths +- Figure 2 was helpful for understanding the contributions of this work. However, it is not clear if \hat_d captures the same concepts that c alone would. If you could motivate using \hat_d with a pictorial example of where c, the concepts alone, fail that would be helpful. +- The use of ROAR for concepts is clever and does show the utility of the method, but additional experimentation to show the concepts captured align with human intuition would have been nice. At the minimum, showing how the concepts recovered by the proposed method and CBM differ would be helpful. + +Weaknesses +- While both experiments (synthetic and BIRDs) show the method's utility, it would have been nice to see experiments on other datasets. OAI perhaps like in Koh et al. +- The connection to Yeh et al. is not clear to me. What is the notion of completeness in the proposed method? + +Question +- Can you extract uncertainty estimates for concepts from Equation 5? +- Can you please explicitly the utility of your method in the linear Gaussian case? It seems as though using \hat_d for the concepts simply recovers the independent concept case from Koh et al.",4,3.0,ICLR2021 +H1xn-w1Op7,3,ryekdoCqF7,ryekdoCqF7,Incremental training of GANs,"The paper introduces an incremental training method for GAN's for capturing the diversity of the input space. The paper demonstrate that the proposed method allows smaller distances between the true and generated distribution. I find the idea interesting, but fear that the 60-100 small ensemble models could be replaced by a larger model. + +I am curious about why we need incremental training when it seems like we could directly train all the networks jointly. The corresponding generative model is simply stronger so all the convergence arguments would still hold. Is the statistical distance a reasonable estimate for you to determine whether you need an additional generator for incremental training? + +Also what are the generator architectures for the experiments? How can you put 60-100 generators within the GPU memory? The latent variable dimension seem to be only 1 for each of your generator? That seems to be seriously handicapping the capacity of each individual generator (to just some data points), so the ensemble distribution might be obtained simply by using a larger dimension z? + +There are also other measurements that are used by the GAN community, such as inception score, FID score and samples. It seems also reasonable to verify the effectiveness of this method on CIFAR or LSUN datasets, where the method would have a greater improvement because the data distributions are more complex. + +Minor points: +- How do you measure the ""Wasserstein distance"" for high-dimensional distributions? +- What not set $\omega_i$ to be always 1? 
The subsampling process introduced in Algorithm 2 seem to enforce this, and you do this for all the experiments. +- Fix citation typos. +- Fix \mathbf for vector quantities, such as x and z. +- Since the generative models have the same architecture, does the non-convex argument becomes moot when you have a mixture of 2 generators?",5,3.0,ICLR2019 +rygK6cLo3m,3,HJlfAo09KX,HJlfAo09KX,Lack of practicality and theoretical depth.,"The paper presents theoretical analysis for recovering one-hidden-layer neural networks using logistic loss function. I have the following major concerns: + +(1.a) The paper does not mention identifiability at all. As has been known, neural networks with even only one hidden layer are not identifiable. The authors need to either prove the identifiability or cite existing references on the identifiability. Otherwise, the parameter recovery does not make sense. + +Example: The linear network takes f(x) = 1'Wx/k, where 1 is a vector with every entry equal to one. Then two models with parameters W and V are identical as long 1'W = 1'V. + +(1.b) If the equivalent parameters are not isolated, the local strong convexity is impossible to hold. The authors need to carefully justify their claim. + +(2) When using Sigmoid or Tanh activation functions, the output is bounded between [0,1] or [-1,+1]. This is unrealistic for logistic regression: The output of [0,1] means that the posterior probability has to be bounded between 1/2 and e/(1+e); The output of [-1,1] means that the posterior probability has to be bounded between 1/(1+e) and e/(1+e). + +(3) The most challenging part of the logistic loss is the lack of curvature, when neural networks have large magnitude outputs. Since this paper assumes that the neural networks takes very small magnitude outputs, the extension from Zhong et al. 2017b to the logistic loss is very straightforward. + +(4) Spectral initialization is very impractical. Nobody is using it in practice. The spectral initialization avoids the challenging global convergence analysis. + +(5) Theorem 3 needs clarification. Please explicitly write the RHS of (7). The result would become meaningless, if under the scaling of Theorem 2, is the RHS of (7) smaller than RHS of (5). + +I also have the following minor concerns on some unrealistic assumptions, but these concerns do not affect my rating. These assumptions have been widely used in many other papers, due to the lack of theoretical understanding of neural networks in the machine learning community. + +(6) The neural networks take independent Gaussian input. +(7) The model is assumed to be correct. +(8) Only gradient descent is considered.",3,4.0,ICLR2019 +Skek9hvCKH,2,Syee1pVtDS,Syee1pVtDS,Official Blind Review #3,"The authors study distributed online convex optimization where the distributed system consists of various computing units connected by a time varying graph. The authors prove optimal regret bounds for a proposed decentralized algorithm and experimentally evaluate the performance of their algorithms on distributed online regularized linear regression problems. + +The paper seems well written and well researched and places itself well in context of current literature. The authors also improve the state of the art in the field. The main weakness of the paper is the limited experimental evaluation and applicability of the assumptions and the theoretical setting that underpins this work. + +[Edit: After going through the other reviews, I have downgraded my score. 
The revised version of the paper the authors uploaded is 23 pages long with the main paper body being 10 pages. The CFP instructs reviewers to apply a higher standard to judge such long papers. I am not convinced that the paper is solving an important problem that merits such a long paper.]",3,,ICLR2020 +r12m-yJVx,1,Hy6b4Pqee,Hy6b4Pqee,A significant development to include the flexibility of inference to PPL,"Thank you for an interesting read. + +I found this paper very interesting. Since I don't think (deterministic) approximate inference is separated from the modelling procedure (cf. exact inference), it is important to allow the users to select the inference method to suit their needs and constraints. I'm not an expert of PPL, but to my knowledge this is the first package that I've seen which put more focus on compositional inference. Leveraging tensorflow is also a plus, which allows flexible computation graph design as well as parallel computation using GPUs. + +The only question I have is about the design of flexible objective functions to learn hyper-parameters (or in the paper those variables associated with delta q distributions). It seems hyper-parameter learning is also specified as inference, which makes sense if using MAP. However the authors also demonstrated other objective functions such as Renyi divergences, does that mean the user need to define a new class of inference method whenever they want to test an alternative loss function?",7,4.0,ICLR2017 +SkgWFpePqH,3,HyezBa4tPB,HyezBa4tPB,Official Blind Review #3,"This manuscript proposes to train a wrapper to assess the confidence of a black box classifier decision on new samples. The resulting uncertainty prediction is used to reject decision with low confidence, such that the accuracy of the prediction on retained samples remains high. The idea, although a bit incremental, is potentially useful to practitioners, as argued in the introduction, and some empirical result tend to suggest that the method can be useful. However, I am not convinced the approach and its implementation are appropriate. +Main comments: +Writing: the model seems strongly based on references such as Kendall & Gal (2017), however lack of introduction of notations and explanations prevent the paper to be self contained. For example: +(1) Here is a list of quantities for which I could not find a definition: y_m, \omega, W. +(2) \hat{y} appears with several indexing styles (up to three indices), sometime bold and sometimes not: the meaning of the indexing (which I could not find) is difficult to infer from the text +Objective: +I was unable to understand the rational behind the objective to be minimized, introduced in page 4. This is introduced as a cross-entropy loss, however, taking the expression of the Dirichlet distribution, it is not obvious to me how to reach this very simple expression. Is it possible that the authors simply mimic the expression of Kendall and Gal (2017, eq. (5)), that was designed for the Gaussian case, for which the cross-entropy expression is correct? +As a consequence, I do not see how this objective is supposed to learn properly the correct beta parameter. + +Results: +Figures 4-8 shows convincing evidence that the procedure improves with respect to a baseline that consist (as far as I understood) in ranking decision based on the entropy of the output probability vector of the classifier. 
However, given that I am unsure about what the proposed optimization does, it remains unclear to me whether these results reflect a true achievement. For example, one can argue that the chosen baseline is unlikely to be a good estimate of the entropy of the decision due to the fluctuations of the output probability for the unlikely classes. Those low probability values are the classical source of variance and bias in entropy estimates, and a classifier is not designed to get these low probabilities right (as they are low anyways). As a consequence, an already better baseline might be achieved for a given reference class by cropping the probability vector to keep only classes that non-negligible probability over the training set (here they can be many alternative approaches to test). +A trivial explanation for the proposed approached to work better than the currently chosen baseline is that the noise introduced by the sampling from the Dirichlet distribution leads to larger probabilities for the cases where the probability given by the classifier is small, which would reduce the variance of the entropy estimator based on the formula of page 5 (top). Overall, extensively checking many simpler baselines (which do not require training!) is a first step to see if the achieved result is not easy to get. +Minor comments: +(1) Please check for typos. +(2) Please avoid remove the multiple parenthesis for successive citations (use a single “\citep”). +(3) In Fig. 1, subtitles are inconsistent. +",1,,ICLR2020 +vX0hAEBSwq0,3,p3_z68kKrus,p3_z68kKrus,The paper establish the average leave-one-out stability bound for the interpolation solutions. The results are interesting and novel. But I have some concerns which need authors' clarification. ,"The paper establish the average leave-one-out stability bound for the interpolation solutions, and show the above bound depends on condition number and spectral norm of kernel matrices. The authors establishes a nice connection between numerical and statistical stability. A nice property is that among all interpolation solutions, the upper bound on stability achieves the minimum at the solution with the minimal norm. The authors then comment that the interpolation solution with minimal norm may generalize better than other interpolation solutions. The paper is clearly and well written. + +Comments: + +1. In Theorem 7, the authors show that the stability can be bounded by the spectral norm of $K,K^\dag$, the condition number of $K$ and the norm of $y$. It seems that this upper bound would diverge as we increase the dimension or sample size. This means that the upper bound is vacuous and may not explain the true generalization behavior of the interpolation solutions. If the upper bound is loose, then even if the interpolation solution with the minimal norm achieves the minimal upper bound, this may not convincingly show that it outperforms other interpolation solutions. + +2. I have doubts on eq (10). I think the left-hand side and right-hand side of the third identity differ by the term $2K^\dag Kv_i$. If $K^\dag Kv_i\neq 0$, then this identity would not hold. Since $v_i$ can be any vector, the deduction is not convincing. I would suggest the authors to take a close look at it. + +3. Lemma 5 is standard in the literature. 
I would suggest the authors to indicate its connection with existing results, e.g., Lemma 11 in ""Learnability, Stability and Uniform Convergence""",6,4.0,ICLR2021 +A_QYWYiX3Kw,3,TlPHO_duLv,TlPHO_duLv,Solid approach but with gaps in the experiments,"This paper addresses the task of training object detectors from noisy labels. Unlike other methods that operate in the weakly- or semi-supervised regime, this method operates on the full set of bounding box annotations but assumes that all of them have been synthetically corrupted: +- Each bounding box coordinate is drawn from a uniform distribution centered around the ground truth coordinate and with a range that is some pre-specified fraction of the bounding box height/width. +- Each class label is either retained or flipped to some random class, with the decision to flip drawn from a Bernoulli distribution. + +To nonetheless reliably train an object detector from this data, the following procedure is proposed: +- preparation: A Faster R-CNN detector with a fully differentiable feature pooling layer (PrRoi) is extended with a second detection head. This detector is trained for a warm-up period on the corrupted labels in the usual fashion. +- for each mini-batch: + - two-stage noise correction: + 1. class-agnostic bounding box correction: Both detection heads classify the incoming bounding box proposal from the RPN. If both agree that it is background, the proposal is discarded. Otherwise, with the help of the differentiable pooling layer the bounding box coordinates are optimised to maximise agreement of softmax scores between both detection heads. The optimisation runs for a single step to save time. + 2. label update and second bounding box update: given the corrected bbox from the last step, the class predictions of both heads are averaged and ""sharpened"" to reduce entropy, and the regressed bounding boxes are averaged together with the ground truth box + - updating the network parameters: + - Given the corrected labels and (twice-corrected) bounding boxes, the network parameters are updated in the usual manner. + +This approach is evaluated against other weakly-supervised object detection approaches and performs quite favourably across multiple noise levels. As far as I can tell, the authors re-implemented the baselines for a fairer comparison, which is a plus. So overall, the results are quite good and the approach is sensible and seems to work. I don't have any major deal-breaking complaints, but a series of smaller ones mainly regarding gaps in the experiments: + +The introduction motivates this approach by citing a number from a 2012 paper that is very much outdated and should be replaced. According to Su et al., the median time for a single bounding box annotation is 42 seconds. In the much more recent OpenImagesV4 paper (Kuznetsova et al., 2018), they find that they need 7s on average per bounding box with (among other things) a much more efficient box drawing procedure, which is well below the average of 88s from Su et al. 2012. This more recent number also tracks with my own annotation experience. + +Given this emphasis on wanting to save annotation time, I would have expected the ""machine-generated"" annotations to figure much more prominently in the experiments (i.e. train a network on 10% of the ground truth and use detections on the rest of the training images). These only appear once in the sota comparison. 
It is still interesting to see that the proposed method handles high synthetic noise significantly better than competing methods, but this is the less realistic setting as the synthetic noise is applied to the entire set of ground truth annotations. I would thus also dispute the description of this setting as being ""more challenging and practical"" than other settings considered for weakly-supervised training (e.g. mostly relying on image-level labels). + +The warm-up phase seems like a critical part of this approach and it's only mentioned in passing without any discussion. How much does the length of the warm-up phase matter? Did you experiment with this? Is there a trade-off between (a) getting some amount of training for the correction to work, vs. (b) over-training on the noisy labels? Is the recommended warm-up length dependent on the amount of noise? + +The first noise correction stage (CA-BBC) is explicitly based on the assumption that if a bounding box tightly covers an object, the two classifiers will agree. This is a testable assumption and I would have been curious to see to what degree it actually holds. + +Given that the proposed method focuses on correcting the bounding box annotations, I think it would have been important to report results on the corrected training boxes, especially since you have access to the un-corrupted annotations. Based on the final results, the method obviously must be doing something right but some distribution of IoU values before and after the correction/training process would have been interesting w/ some more qualitative examples of boxes that weren't successfully corrected. + +I may have overlooked this, but since training involves multiple epochs (and with it multiple corrections to the same bounding box) are the corrections retained or discarded? That is: Are individual boxes cumulatively corrected throughout the training? + +Alpha is the hyperparameter for the step size for CA-BBC and in section 4.1 you specify three different values for it {0, 100, 200}. Why three values? + +The first result table is inconsistent with others, only reporting a subset of the results. What happens in the 0% label noise case? + +The optimisation in CA-BBC is only run for one step for efficiency reasons. What about the effect on performance? Do additional steps bring any improvements? Currently, the noise correction only adds 25% computation time to training so I assume that running the optimisation for a few more steps won't make the approach prohibitively slow. + +Corrections/editing suggestions: +- A correction needs to be made to the Chadwick & Neuman reference, as the paper was published at IEEE IV 2019. +- p2, par 2: ""distil"" -> ""distill"" +- 4.3 title: ""state-of-the-arts"" -> ""state-of-the-art"" +- I would maybe re-organise the experimental section a little, as 4.2 focuses on the CA-BBC part on its own, the 4.3 compares the full method against the state-of-the-art, and 4.4 returns to a more complete ablation study. I would switch the 4.3 and 4.4, i.e. first get all the ablations out of the way then compare against the state-of-the-art. +- Fig. 1 shows one part of the method in detail and Fig. 2 an overview of full method. I would either switch the order here, or merge these somehow. + +Conclusion: +------------------- + +Overall, I think this is a sensible aproach to training with noisy bounding box annotations, which compares favourably against competing methods. 
I have several reservations with regards to the experiments, especially the emphasis on synthetic noise as opposed to the more realistic setting where only a small set of hand-annotated labels are available. I would have liked to see more analysis (esp. quantitative) of the corrections of the training data, as this is what the method does. The warm-up phase appears to be critical and is also given short shrift. These are all things that are not fundamental problems, hence the (slightly) positive rating. +",6,4.0,ICLR2021 +pmn_AxLb2Fz,3,sSjqmfsk95O,sSjqmfsk95O,Literature research is questionable,"The paper proposes multiple contributions. + +The paper identifies the following problem: current inpainting methods are suitable for small missing regions, but do not do well for large missing regions. I think the exposition is outdated and does not consider new work published at CVPR 2020, ECCV 2020, and possibly other venues. +The paper ""Recurrent Feature Reasoning for Image Inpainting"" explicitely points out ""However, filling in large continuous holes remains difficult due to the lack of constraints for the hole center."" and the results seem to show many examples of inpaiting for large missing regions. +""Image2StyleGAN++ ..."" show the inpainting of large regions as application and specifically mention the large-scale inpainting problem. +""Image Fine-grained Inpainting"" states that ""Benefited from the property of this network, we can more easily recover large regions in an incomplete image"" +""DeepGIN: Deep Generative Inpainting Network for Extreme Image Inpainting"" even mentions the problem of large missing regions in the title of the paper. +There are also other inpainting papers that one should look at. I didn't check the publication date or the relationship in detail at this point in time: ""Rethinking Image Inpainting via a Mutual Encoder-Decoder with Feature Equalizations"", ""Deep Generative Model for Image Inpainting with Local Binary Pattern Learning and Spatial Attention"", ""Deep Generative Model for Image Inpainting with Local Binary Pattern Learning and Spatial Attention"", ""Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation"". +Main concern for this submission is the literature research and I would like to see this addressed in the rebuttal. + +The paper proposes an architecture modification the authors call co-modulation. The idea is to have the normalization of the generator layers not only controlled by either a random vector, or an input image, but by both. I think this overall idea is clear, but the details (including Fig. 3) are not that clear. I would say it's a nice, but smaller idea, that is suitable for publication if it leads to good results. + +The paper also proposes a new way to evaluate GANs using the proposed P-IDS and U-IDS score. The main idea here is to use a pre-trained feature transformation (the inception network) on real and fake images and then to evaluate the images using a linear classifier. +I do not think the evaluation strategy of masking a random square of size wxw is that convincing. A square of 1x1 is a single pixel and this is not a meaningful image manipulation. Even changing a square of 8x8 pixels is not especially meaningful. The fact that your metric can distinguish between these manipulations is not an indication that the metric can distinguish between high quality and low quality results. +The user study is more meaningful and gives some indication that the new metric is better. 
This looks promising and I like this result.",4,3.0,ICLR2021 +HyuhIWYez,1,SyrGJYlRZ,SyrGJYlRZ,YellowFin and the Art of Momentum Tuning,"This paper proposes a method to automatically tuning the momentum parameter in momentum SGD methods, which achieves better results and fast convergence speed than state-of-the-art Adam algorithm. + +Although the results are promising, I found the presentation of this paper almost inaccessible to me. + +First, though a minor point, but where does the name *YellowFin* come from? + +For the presentation, the motivation in introduction is fine, but the following section about momentum operator is hard to follow. There are a lot of undefined notation. For example, what does the *convergence rate* mean (what is the measurement for convergence)? And is the *optimal accelerated rate* the same as *convergence rate* mentioned above? Also, what do you mean by *all directions* in the sentence below eq.2? + +Then the paper talks about robustness properties of the momentum operator. But: first, I am not sure why the derivative of f(x) is defined as in eq.3, how is that related to the original definition of derivative? + +In the following paragraph, what is *contraction*? Does it have anything to do with the paper as I didn't see it in the remaining text? + +Lemma 2 seems to use the spectral radius of the momentum operator as the *robustness*. But how can it describe the robustness? More details are needed to understand this. + +What it comes to Section 3, it seems to me that the authors try to use a local quadratic approximation for the original function f(x), and use the results in last section to find the optimal momentum parameter. I got confused in this section because eq.9 defines f(x) as a quadratic function. Is this f(x) the original function (non quadratic) or just the local quadratic approximation? If it is the local quadratic approximation, how is it correlated to the original function? It seems to me that the authors try to say if h and C are calculated from the original function, then this f(x) is a local quadratic approximation? If what I think is correct, I think it would be important to show this. + +Also, the objective function in SingleStep algorithm seems to come from eq.13, but I failed to get the exact reasoning. + +Overall, I think this is an interesting paper, but the presentation is too fuzzy to get it evaluated.",4,3.0,ICLR2018 +Sklm0w_6tr,3,Hkx6hANtwH,Hkx6hANtwH,Official Blind Review #1,"This paper proposed to use Graph Neural Networks (GNN) to do type inference for dynamically typed languages. The key technique is to construct a type dependency graph and infer the type on top of it. The type dependency graph contains edges specifying hard constraints derived from the static analysis, as well as soft relationships specified by humans. Experiments on type predictions for TypeScript have shown better performance than the previous methods, with or without user specified types. + +Overall this paper tackles a nice application of GNN, which is the type prediction problem that utilizes structural information of the code. Also the proposed type dependency graph seems interesting to me. Also the pointer mechanism used for predicting user specified types is a good strategy that advances the previous method. However, I have several concerns below: + +About formulation: +1) I’m not sure if the predicted types for individual variable would be very helpful in general. 
Since the work only cares about individual predictions while no global consistency is enforced, it is somewhat limited. For example, in order to (partially) compile a program, does it require all the variable types to be correct in that part? If so, then the predicted types here might not be that helpful. I’m not sure about this, so any discussion would be appreciated. + + +About type dependency graph: +1) Comparing to previous work (e.g, Allamanis et.al, ICLR 18), it seems the construction of the task specific graph is the major contribution, where the novelty is a bit limited. +2) The construction of the dependency graph is heuristic. For example, why the three contextual constraints are good? Would there be other good ones? Also why only include such limited set of logical constraints. For example, would expression like (x + y) induce some interesting relationships? Because such hand-crafted graph is lossy (unlike raw source code), all the questions here lead to the concern of such design choices. +3) The usage of graph is somewhat straightforward to me. For example, although the hard-constraints are there, there’s no such constraints reflected in the prediction. Adding the constraints on the predictions would be more interesting. + +About experiments: +1) I think one ablation study I’m most interested in is to simply run GNN on the AST (or simply use Allamanis et.al’s method). This is to verify and support the usage of proposed type dependency graph. +2) As the authors claimed in Introduction, ‘plenty of training data is available’. However in experiment only 300 projects are involved. Also it seems that these are not fully annotated, and the ‘forward type inference functionality from TypeScript’ is required to obtain labels. It would be good to explain such discrepancy. +3) Continue with 2), as the experiment results shown in Table 2, TS compiler performs poorly. So how would it be possible to train with poor annotations, while generalize much better? Some explanations would be helpful here. +4) I think only predicting non-polymorphic types is another limitation. Would it be possible to predict structured types? like nested list, or function types with arguments? +",6,,ICLR2020 +S1gzAdmp_r,1,H1xSOTVtvH,H1xSOTVtvH,Official Blind Review #3,"The paper introduces the high variance policies challenge in domain randomization for reinforcement learning. The paper gives a new bound for the expected return of the policy when the policy is Lipschitz continuous. Then the paper proposes a new method to minimize the Lipschitz constant for policies of all randomization. Experiment results prove the efficacy of the proposed domain randomization method for various reinforcement learning approaches. + +The paper mainly focuses on the problem of visual randomization, where the different randomized domains differ only in state space and the underlying rewards and dynamics are the same. The paper also assumes that there is a mapping from the states in one domain to another domain. Are there any constraints on the mapping? Will some randomization introduces a larger state space than others? + +The paper demonstrates that the expected return of the policy is bounded by the largest difference in state space and the Lipschitz constant of the policies, which is a new perspective of domain randomization for reinforcement learning. 
+ +The proposed method minimizes the expected variations between states of two randomizations but the Lipschitz constant is by the largest difference of policy outputs of a state pair between domains. Should minimizing the maximum difference be more proper? + +The center part of Figure 2 is confusing, could the authors clarify it? + +In the Grid World environment, how does the random parameter influence the states? + +The baselines are a little weak. The paper only compares the proposed with training reinforcement learning algorithm on randomized environments. Could the authors compare with other domain randomization methods in reinforcement learning or naively adapt domain randomization methods from other areas to reinforcement learning? + +Overall, the paper is well-written and the ideas are novel. However, some parts are not clearly clarified and the experiments are a little weak with too weak baselines. I will consider raising my score according to the rebuttal. + +Post-feedback: +I have read the rebuttal. The authors have addressed some of my concerns but why minimizing the expected difference is not convincing. I think the paper should receive a borderline score between 3 and 6. ",3,,ICLR2020 +ByEoHz5ez,2,Bk_fs6gA-,Bk_fs6gA-,Review,"# Summary +This paper proposes a neural network framework for solving binary linear programs (Binary LP). The idea is to present a sequence of input-output examples to the network and train the network to remember input-output examples to solve a new example (binary LP). In order to store such information, the paper proposes an external memory with non-differentiable reading/writing operations. This network is trained through supervised learning for the output and reinforcement learning for discrete operations. The results show that the proposed network outperforms the baseline (handcrafted) solver and the seq-to-seq network baseline. + +[Pros] +- The idea of approximating a binary linear program solver using neural network is new. + +[Cons] +- The paper is not clearly written (e.g., problem statement, notations, architecture description). So, it is hard to understand the core idea of this paper. +- The proposed method and problem setting are not well-justified. +- The results are not very convincing. + +# Novelty and Significance +- The problem considered in this paper is new, but it is unclear why the problem should be formulated in such a way. To my understanding, the network is given a set of input (problem) and output (solution) pairs and should predict the solution given a new problem. I do not see why this should be formulated as a ""sequential"" decision problem. Instead, we can just give access to all input/output examples (in a non-sequential way) and allow the network to predict the solution given the new input like Q&A tasks. This does not require any ""memory"" because all necessary information is available to the network. +- The proposed method seems to require a set of input/output examples even during evaluation (if my understanding is correct), which has limited practical applications. + +# Quality +- The proposed reward function for training the memory controller sounds a bit arbitrary. The entire problem is a supervised learning problem, and the memory controller is just a non-differentiable decision within the neural network. In this case, the reward function is usually defined as the sum of log-likelihood of the future predictions (see [Kelvin Xu et al.] 
for training hard-attention) because this matches the supervised learning objective. It would be good to justify (empirically) the proposed reward function. +- The results are not fully-convincing. If my understanding is correct, the LTMN is trained to predict the baseline solver's output. But, the LTMN significantly outperforms the baseline solver even in the training set. Can you explain why this is possible? + +# Clarity +- The problem statement and model description are not described well. +1) Is the network given a sequence of program/solution input? If yes, is it given during evaluation as well? +2) Many notations are not formally defined. What is the output (o_t) of the network? Is it the optimal solution (x_t)? +3) There is no mathematical definition of memory addressing mechanism used in this paper. +- The overall objective function is missing. + +[Reference] +- Kelvin Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention",4,2.0,ICLR2018 +BkxgvOhitH,2,ryg8WJSKPr,ryg8WJSKPr,Official Blind Review #2,"This paper presents a solution to tackling the problem of delusional bias in Deep Q-learning, building upon Lu et.al. (NeuRIPS 2018). Delusional bias arises because independently choosing maximizing actions at a state may be inconsistent as the backed-up values may not be realizable by any policy. They encourage non-delusional Q-functions by adding a penalty term that enforces that the max_a in Q-learning chooses actions that do not give rise to actions outside the realizable policy class. Further, in order to keep track of all consistent assignments, they pose a search problem and propose heuristics to approximately perform this search. The heuristics are based on sampling using exponentiated Q-values and scoring possible children using scores like Bellman error, and returns of the greedy policy. Their final algorithm is evaluated on a DQN and DDQN, where they observe some improvement from both components (consistency penalty and approximate search). + +I would lean towards being slightly negative towards accepting this paper. However, I am not sure if the paper provides enough evidence that delusional bias is a very relevant problem with DQNs, when using high-capacity neural net approximators. Further, would the problem go away, if we perform policy iteration, in the sense of performing policy iteration instead of max Q-learning (atleast in practice)? Maybe, the paper benefits with some evidence answering this question. To summarize, I am mainly concerned about the marginal benefit at the cost of added complexity and computation for this paper. I would appreciate more evidence justifying the significance of this problem in practice. + +Another comment about experiments is that the paper uses pre-trained DQN for the ConQur results, where only the last linear layer of the Q-network is trained with ConQur. I think this setting might hide some properties which arise through the learning process without initial pre-training, which might be more interesting. Also, how would other auxilliary losses compare in practice, for example, losses explored in the Reinforcement Learning with Auxilliary Tasks (Jaderberg et.al.) paper? ",3,,ICLR2020 +S1e0ZJj35B,3,BkePneStwH,BkePneStwH,Official Blind Review #1,"What is the task? +Multilingual natural language inference (NLI) + +What has been done before? 
+Current state-of-the-art results in multilingual natural language inference (NLI) are based on tuning XLM (a pre-trained polyglot language model) separately for each language involved, resulting in multiple models. + +What are the main contributions of the paper? +[Not novel] Significantly higher average XNLI accuracy with a single model for all 15 languages. +[Moderately novel] Cross-lingual knowledge distillation approach that uses one and the same XLM model to serve both as teacher (for English sentences) and student (for their translations into other languages). The approach does not require end-task labels and can be applied in an unsupervised setting + +What are the main results? + A single model trained for all 15 languages in the XNLI dataset can achieve better results than 15 individually trained models, and get even better when unrelated poorly-translated languages are removed from the multilingual tuning scheme. + Using XD they outperformed the previous methods that also do not use target languages labels. + +Weaknesses : +1. Combining XD with multilingual tuning is not effective in improving average results or even in case of target languages +2. Final system is adhoc as experiments on a particular set of languages have been used to support claims. For example, Urdu was excluded to get the best MLT model. Only 4 languages were used while combining XD and MLT +3. Findings, methods and experiments are not strongly novel. +4. Paper was not an easy read. + +Strengths: + Using XD they outperformed the previous methods that also do not use target languages labels. + +",6,,ICLR2020 +HklviNintr,1,r1egIyBFPS,r1egIyBFPS,Official Blind Review #3,"This paper provides a novel approach to the problem of simplifying symbolic expressions without relying on human input and information. To achieve this, they apply a REINFORCE framework with a reward function involving the number of symbols in the final output together with a probabilistic testing scheme to determine equivalence. The model itself consists of a tree-LSTM-based encoder-decoder module with attention, together with a sub tree selector. Their main contribution is that this framework works entirely independent of human-labelled data. In their experiments, they show that this deep learning approach outperforms their provided human-independent baselines, while sharing similar performance with human-dependent ones. +Overall, the work in this paper has the potential to be a contribution to ICLR but lacks completeness and clarity. A number of experimental details are missing, making it difficult to understand the setup under which results were obtained. Moreover, the paper does not seem to have been revised with many grammatical issues that make it hard to read. +The following are major issues within the paper and should be addressed: +• The paper does not mention the amount of compute given to their model, nor the amount of time taken to train. As the REINFORCE framework is generally quite computation-heavy, these are significant details. Without assessing the amount of compute and time allotted for training HISS, the comparisons to previous baselines lose a fair amount of meaning. The paper alludes to processes being ‘extremely time consuming’, but then does not provide any numbers. +• They do not mention the data used to train the model weights. In the comparisons sections, some details on datasets are given, but these seem to refer to data for inference. 
+• There are many grammatical errors that likely could have been detected with revision. A handful of such errors would not affect the score, but they are so numerous as to make the paper much more difficult to understand. +Additionally, these are comments that slightly detract from the quality of the paper: +• It’s unclear what to glean from Section 5.1, as the dataset and baselines seem to be fairly trivial. If their claim is to have the first nontrivial human-independent approach to simplifying symbolic expressions, there is no need to compare to baselines that can only handle small expressions. +• Sections 5.3 and 5.4 contribute little to the paper. For 5.3, the model was trained to embed equivalent expressions close together, using L2 regularization. It is therefore unsurprising that equivalent expressions are then closer together than non-equivalent ones. The paper also does not provide a comparison to the method without this regularization, and so it’s unclear if this embedding similarity helps in any way. For 5.4, the section is extremely short and contains very little content. Moreover, just as many of the variables in their provided examples oppose their conjectures as support them. +• The most interesting figure provided is the rewrite rules discovered by the model. It would be even better if an additional column containing the rules discovered by Halide (the main baseline) were provided. +Overall, in my understanding, the primary point in favor of the paper is in being the first nontrivial human-independent approach to simplifying symbolic expression. That said, this is not my area of expertise, so I cannot judge novelty or importance as well as other reviewers. +",6,,ICLR2020 +NAuWcCxCfTK,2,NeRdBeTionN,NeRdBeTionN,Interesting empirical evaluation,"This paper provides an interesting empirical study of self-supervised image embeddings as an alternative to pre-trained ImageNet classification models, for the purpose of evaluating the quality and diversity of GAN generative image models. I always found it a little odd that a model trained on ImageNet with cross-entropy loss should somehow be a magical universal quality metric, and I am happy to see that this paper provides good evidence that this is not the case. The authors select 2 self-supervised models and compare them against a number of supervised models. The metrics used are FID and Precision/Recall. I am curious why Inception Score was not also compared? + +The paper does quite a thorough job of selecting and comparing models, by normalizing for architecture and changing dataset or loss function. It shows clearly that self-supervised methods outperform the supervised methods for ranking various GAN models. + +It would have been interesting to train the self-supervised model on the dataset itself e.g. LSUN or CelebA to see whether that provides an even more useful signal. Given that deep networks find it hard to generalize across datasets, I would expect that directly training an embedding on the target dataset would do better. Did the authors try something along these lines? + +A minor comment is that the layout of the results and comments is a bit confusing: due to the very long number of points that refer to a particular figure and needing lots of scrolling back and forth. Some better way to organize the information and comments would be appreciated. + +I would also find it insightful to better understand *why* self-supervision works better for evaluating representations? Any comments to this regard would be interesting. 
Lastly, I am curious why the authors did not consider self-supervised methods such as SimCLR? + +I have read the rebuttals and other comments and maintain my rating of the paper. +",7,4.0,ICLR2021 +KIWKldKHUA,2,HajQFbx_yB,HajQFbx_yB,Official Blind Review #4,"Nonsymmetric determinantal point processes (NDPPs) received some attention recently because they allow modeling of both negative and positive correlations between items. This paper developed scalable learning and MAP inference algorithms with space and time complexity linear in ground set size, which is a huge improvement compared to previous approaches. Experimental results show that the algorithms scale significantly better, and can roughly match the predictive performance of prior work. + +This is a well written paper and I recommend its acceptance. Scalable learning and MAP inference algorithms are important for the application of the NDPPs model, which seems promising compared with its symmetric counterpart in experiments. + +I have some (minor) comments listed below. + +1. In Lemma 1, the result is only proved for skew-symmetric matrices with even rank. Does it hold for odd rank matrices? This is important to support the claim that the new decomposition covers the $P_0^+$ space. + +2. Equation (3) uses notation $\lambda_i$, which is already used in Lemma 1. This could cause confusion. + +3. In the paragraph after Theorem 1, it is proposed to set $B$ = $V$ and relax $C$. Is this used in Section 4? If not, I would suggest moving it to the experiments section, and adding some comparison in Table 2 to show the impact of this simplification. + +4. I cannot quite understand the last sentence before Lemma 2. What can be computed in $O(K^2)$ time? + +5. The footnote in Table 1 might cause confusion because it can be mis-interpreted as a square. + +6. In G.1, the first sentence after equation (13), do you mean when $M$ is odd or when $\ell$ is odd? + +7. In equation (24), $X$ should be $B^TXB$ + +8. The inverse of $C$ appears in the gradient of $Z$. Is $C$ guaranteed to be invertible in the learning algorithm? And how are $V$, $B$, $C$ initialized in the algorithm? + +9. In equation (31), please double check if we need the reciprocal in the denominator. +",7,4.0,ICLR2021 +Hke1_V2qnm,1,SJzqpj09YQ,SJzqpj09YQ,"Good work, but bad presentation","In this paper, the authors proposed a unified framework which computes spectral decompositions by stochastic gradient descent. This allows learning eigenfunctions over high-dimensional spaces and generating to new data without Nystrom approximation. From technical perspective, the paper is good. Nevertheless, I feel the paper is quite weak from the perspective of presentation. There are a couple of aspects the presentation can be improved from. + +(1) I feel the authors should formally define what a Spectral inference network is, especially what the network is composed of, what are the nodes, what are the edges, and the semantics of the network and what's motivation of this type of network. + +(2) In Section 3, the paper derives a sequence of formulas, and many of the relevant results were given without being proven or a reference. Although I know the results are most likely to be correct, it does not hurt to make them rigorous. There are also places in the paper, the claim or statement is inclusive. For example, in the end of Section 2.3, ""if the distribution p(x) is unknown, then constructing an explicitly orthonormal function basis may not be possible"". 
I feel the authors should avoid this type of handwaving claims. + +(3) The authors may consider summarize all the technical contribution in the paper. + +One specific question: + +What's Omega above formula (6)? Is it the support of x? Is it continuous or discrete? Above formula (8), the authors said ""If omega is a graph"". It is a little bit confusing there. ",5,3.0,ICLR2019 +ARfspQ3zWtj,2,S0UdquAnr9k,S0UdquAnr9k,Official Blind Review #2,"Most existing methods follow a manually fixed weight sharing pattern, leading to the difficulty that estimates the performance of networks with different widths. To address this issue, this paper proposes a locally free weight sharing strategy (CafeNet) to share weights more freely. Moreover, this paper further proposes FLOPs-sensitive bins to reduces the size of the search space. Specifically, this paper divides channels into several groups/bins that have the same FLOPs-sensitivity and searches for promising architectures based on the divided groups. Extensive experiments on several benchmark datasets demonstrate superiority over the considered methods. However, some important details regarding the proposed method are missing. My detailed comments are as follows. + +Positive points: +1. Compared with the manually fixed weight sharing pattern, this paper proposes a locally free weight sharing strategy (CafeNet), which allows more freedom in the channel assignment of a sub-network. + +2. To reduce the size of the search space, this paper proposes to divide channels into several groups/bins (also called minimum searching unit) that have the same FLOPs-sensitivity. + +3. The experimental results on image classification and object detection tasks show that the proposed method outperforms the existing methods by a large margin. + +Negative points: +1. When training the super network, why the authors optimize the sub-network with the smallest training loss? More explanations are required. + +2. Why the sensitivity of a layer should be calculated as Eqn. (7)? It would be better to provide more details about that. + +3. Given a FLOPs constraint in Eq. (2), how to select a suitable width for each layer? Please discuss more and make it clearer. + +4. Is it possible to find a sub-network with zero width ($c=0$) for a layer? If so, how to deal with this case when evaluating the sub-network? + +5. The experimental results are inconsistent with the descriptions. In Figure 2(b), the performance of the proposed method goes worse with the decreasing of the $\lambda$. However, the authors state that “MobileNetV2 on CIFAR-10 improves 0.92% accuracy from $\lambda$ =0 to $\lambda$=1”. + +6. The experimental comparisons in Table 1 are unfair. Compared with other methods (e.g. AutoSlim), the proposed method trains the models on ImageNet for more epochs (100 v.s. 300). More experiments under the same settings are required. + +Minor issues: +1. In appendix A.13, “… and the bin evolving speed α in Section 3.4” should be “… and the bin evolving speed α in Section 3.3”.",6,5.0,ICLR2021 +HyIm4t7xz,2,H135uzZ0-,H135uzZ0-,Mixed Precision Training,"This paper is about low-precision training for ConvNets. It proposed a ""dynamic fixed point"" scheme that shares the exponent part for a tensor, and developed procedures to do NN computing with this format. The proposed method is shown to achieve matching performance against their FP32 counter-parts with the same number of training iterations on several state-of-the-art ConvNets architectures on Imagenet-1K. 
According to the paper, this is the first time such kind of performance are demonstrated for limited precision training. + +Potential improvements: + + - Please define the terms like FPROP and WTGRAD at the first occurance. + - For reference, please include wallclock time and actual overall memory consumption comparisons of the proposed methods and other methods as well as the baseline (default FP32 training).",7,3.0,ICLR2018 +SkeuG_BAYB,2,Skeh1krtvH,Skeh1krtvH,Official Blind Review #1,"This submission belongs to the field of text-to-speech synthesis. In particular it looks at a novel way of formulating a normalising flow using 2D rather than conventional 1D representation. Such reformulation enables to provide interpretations to several existing approaches as well as formulate a new one with quite interesting properties. This submission would benefit from a discussion of limitations of your approach. + +I believe there is a great deal of interest in the use of normalising flows in the text-to-speech area. I believe this submission could be a good contribution to the area. The test log-likelihoods look comparable to existing approaches with significantly worse inference times. The mean opinion scores (MOS) seem to approach one of the standard baselines with significantly worse inference times though at the expense of increasing the number of model parameters from 6M to 86M parameters whilst gaining only 0.2 in MOS. The submission would have benefited from discussion about model complexity/expressivity and it's impact on MOS for WaveFlow, WaveNet and other approaches. + +The largest issues with this submission are: + +1) lack of proper technical description of your model in sections 1 and 2 making reading sections 1,2,3,etc in order awkward. It seems the order should be 3,4,(5),1,2,(5). +2) complete omission of conditioning on text to be synthesised; anyone not familiar deeply with speech synthesis will wonder where does the text come in +3) explicit statement of complexity for the operations involved using proper big-O notation; helps to avoid confusion about what do you mean by ""parallel"" (autoregressive WaveNet followed by parallel computation != parallel computation) +",6,,ICLR2020 +BJ8VRRhgM,3,H1T2hmZAb,H1T2hmZAb,"Using complex numbers for neural networks , but why?","Authors present complex valued analogues of real-valued convolution, ReLU and batch normalization functions. Their ""related work section"" brings up uses of complex valued computation such as discrete Fourier transforms and Holographic Reduced Representations. However their application don't seem to connect to any of those uses and simply reimplement existing real-valued networks as complex valued. + +Their contributions are: + +1. Formulate complex valued convolution +2. Formulate two complex-valued alternatives to ReLU and compare them +3. Formulate complex batch normalization as a ""whitening"" operation on complex domain +4. Formulate complex analogue of Glorot weight normalization scheme + +Since any complex valued computation can be done with a real-valued arithmetic, switching to complex arithmetic needs a compelling use-case. For instance, some existing algorithm may be formulated in terms of complex values, and reformulating it in terms of real-valued computation may be awkward. However, cases the authors address, which are training batch-norm ReLU networks on standard datasets, are already formulated in terms of real valued arithmetic. 
Switching these networks to complex values doesn't seem to bring any benefit, either in simplicity, or in classification performance.",4,4.0,ICLR2018 +AVCMFSvCEIr,2,H92-E4kFwbR,H92-E4kFwbR,Nice contribution to adversarial training literature but with mixed results,"The authors propose a method for dealing with *composite* adversarial attacks, which are defined as a sequence of perturbation operators each applying some constrained perturbation to the output of the previous operator. Their method models the composed adversarial examples $x^*$ as the sum of the unperturbed example with a series of perturbations $\delta_i$ which maximize the estimator's loss. They compare their results to other existing adversarial training methods against multiple types of adversarial attacks. + +Pros: +- Interesting idea, seems like a very natural continuation of existing work +- Good experimental design, results are reasonably thorough +- Some results are encouraging + +Cons: +- Explanation of method (CAT) is somewhat lacking. It's not clear to me exactly what their method does differently than the baselines explained in the background. +- Results are mixed with discussion focusing almost entirely on the positive parts. For example, CAT consistently performs significantly worse than baselines on ""clean accuracy"" and worse than one or more baselines on other singular attacks (see Tables 1,2,3,4). +- Results in section 5.2 lack explanation (i.e. what do the table columns/rows actually mean) +- Minor formatting issues + +Overall, I think the central problem that the authors are trying to solve is important and their work makes a reasonable contribution towards the solution. Despite the apparent mixed results, this paper should be a candidate for acceptance. + +Additional comments for the authors: +- It would be helpful to provide references for the definitions of ""robust accuracy"" and ""clean accuracy""; I'm sure these are metrics that have been defined and used in prior work but this can sometimes make it difficult for outside readers to find where they are rigorously defined. +- As mentioned in the Cons, you should make it more clear what the reader should be looking for in the tables. Reading just by the accuracy scores, it seems like CAT often performs worse or about the same as baselines in multiple experiments. +- Table captions should be above, not below, the table. This particularly problematic with Table 4/Figure 4 where the Table caption looks like the title of Figure 4. +- As mentioned before, equation 8 does not (for me) satisfactorily explain what CAT actually does. +- In equation 8, $\delta_i$ appears in the constraint but not in the expression; perhaps you meant to write: +$$ +x' = \underset{x^{(m)}}{\arg\max} \ell (f_{\theta}(x^{(m)} + \delta_i,y) +$$ +- The distinction between the different indexing notations $x_i$ and $x^{(i)}$ is not always clear +- It's not clear what the notation means in Tables 5, 6, and 7 and how it relates to ""ordering"" of perturbations.",5,3.0,ICLR2021 +D_HrFrGCAQX,1,2G9u-wu2tXP,2G9u-wu2tXP,The paper has values in the exploration of data hashing for neural networks. At the same time it needs more evidence make sense the motivation.,"########################################################################## + +Summary: + +This paper studies the problem of continual learning and proposes a new learning framework named hash-routed convolutional neural networks (HRN). HRN has a set of convolutional units and hashes similar data to the same unit. 
With this design, the paper claims three key contributions. (1) HRN provides excellent plasticity and more stable features. (2) HRN achieves excellent performance on a variety of benchmarks. (3) HRN can be used for unsupervised or reinforcement learning. + + +########################################################################## + +Reasons for score: + +Overall, I like the scope of this paper that studies continual learning. The experiments in Figure 2 verify that the proposed HRN outperforms a number of baselines on incremental-Cifar100 dataset. However, I am not sure whether the dataset makes sense for benchmarking continual learning. It seems the authors split the 100 classes into 10 groups, with each group having 10 classes. The 10 groups and corresponding labels serve as 10 distinct tasks for evaluating continual learning. Such formulation is odd that leads to decreasing accuracy scores as shown in Figure 2. I cannot find a real-world application that can benefit from such formulation due to decreased accuracy scores. It will be good if the paper can clarity the above points to strengthen its motivation. + +########################################################################## + +Pros: + +1. The paper proposes a novel neural network HRN for continual learning. HRN leverages multiple units of CNN and hashes similar data to the same unit for training. + +2. The proposed HRN achieves impressive performance when compared with a selected set of baselines. HRN consistently shows more robust accuracy scores on three datasets, as shown in Figure 2 and Table 1. + + +########################################################################## + +Cons: + +The most important point relates to the motivation of problem formulation. The paper splits a 100-classes dataset into 10 10-classes datasets. Any two datasets have no overlapping in the classes. Under such formulation, a model suffers from decreased accuracy score as shown in Figure 2. It is necessary to explain why decreased scores are appealing in practice given they are the base of the concerned continual learning. + +According to Figure 2 and Table 1, all models suffer from decreased accuracy scores. Besides, the accuracy scores decrease monotonously regarding the task ids. Task id should not be a factor of performance. It will be good if the paper can provide some explanations. All existing machine learning researchers seek better accuracy scores so the decreased accuracy scores seem unusual. + +It may not be fair to compare baselines with HRN that exhibits a larger model size. HRN leverages multiple (e.g. 6 in figure 2) units of CNN. To be fair, the paper should use the same number of units (or equivalently ensembles) for a baseline method. It will be good if the paper can run some experiments or show a comparison on model sizes. + + +########################################################################## + +Questions during rebuttal: + +It will be nice if the authors can add experiments or discussions related to decreased accuracy scores. It will be perfect if the authors can show real applications of the proposed continual learning with decreased accuracy scores. + +",6,4.0,ICLR2021 +RHN50cYZFEb,1,Ig53hpHxS4,Ig53hpHxS4,Accept. The idea is good and experiments are good. 
There are some concerns about the clarity of the paper but those can be worked on.,"### Summary + +This paper introduces a novel application of normalizing flows to speech synthesis, allowing direct optimization of spectrogram log-likelihoods which results in more natural variation at inference compared to L1/L2 losses that model the mean. This setup also allows more control over non-textual information and interpolation between samples and styles. + + +### Recommendation + +**Accept** +The idea is good and experiments are good. There are some concerns about the clarity of the paper but those can be worked on. + +### Positives + +1. The paper introduces a novel architecture and demonstrates improved output variation and more controllability, which is an important current issue for TTS research. +1. Many experiments investigating controllability are described, as well as some ablation studies for the model architecture in the appendix. + + +### Negatives + +1. **Figure 1**: Please provide more informative captions. In the text, timbre and F0 are mentioned but it is not clear how that is relevant in the images. It is not clear what the colors mean in 1b, and if 1b is supposed to show a separation between male and female speakers the colors make it worse. Finally, are the points in 1b cluster centers or a single random sample per speaker? + +1. **Figure 2**: Would it be possible to make it clearer that there is an autoregressive dependency in the Attention and Decoder blocks due to the LSTM cell memory? The way the figure is currently drawn makes it seem as if each attention/decoder block can be computed in parallel from only the inputs from the previous flow iteration. + +1. In section 2.2 NN and f are described as acting on the latent variable frames $z_t$, but Figure 2 applies the NN and flow on the mel spectrogram frames x in order to produce z. Similarly, there is text saying ""we take the mel-spectrograms and pass them through the inverse steps of flow"" and ""Inference, [...], is simply a matter of sampling z values [...] and running them through the network"" but Figure 2 marks the block as ""Step of Flow"" rather than ""Inverse Step of Flow"". It would be good to smooth out the consistency. + +1. More discussion about the end of sequence prediction would be appreciated. As the entire sequence of z must be used for inference, I assume there is some constant max inference length z used to obtain a final x, and the end of sequence prediction only happens to the final x rather than at each step of the flow? How well does this model adapt to inference samples that are much longer than any of the inputs seen during training? + +1. **3.3.2 Interpolation Between Samples**: I don't understand why there was a need to sample z to find z_h and z_s. Is it not possible to take z_h and z_s from a random training example for the speaker? Or does that mean there is a correlation between the latent space z and the __content__ that is being spoken? If the latter, I would like to see more discussion on that. + +1. **Table 2**: I assume bold means closest to ground truth? It's really hard to know how to interpret this table and I don't think it supports saying that FTA Posterior is more effective. FTA did not capture the increase in pitch mean from expressive -> high pitch, nor the decrease in std from expressive -> surprised. 
A better visualization may be to express this table as a bar graph (3 separate groups of 3 bars each) and conclude FTA Posterior is able to produce much more variation than Tacotron 2 GST? + +### Misc + +#### Abstract + +1. ""varation"" -> ""variation"" +1. spell out IAF +1. ""We provide results on speech variation"" etc. sounds weak. eg. ""Flowtron produces output with far more natural variation compared with Tacotron 2 and enables interpolation over time between samples and style transfer between seen and unseen speakers in ways that are either impossible or inefficient to achieve with prior works."" + +#### 1 Introduction + +1. ""Their assumption is that variable-length embeddings are not robust to text and speaker perturbations"" citation needed +1. ""Flowtron learns an invertible function that maps a distribution over mel-spectrograms to a latent z-space parameterized by a spherical Gaussian."" mention IAF and cite Kingma et al., 2016 here instead of the previous paragraph. +1. ""Finally, although VAEs and GANs provide a latent embedding that can be manipulated, they may be difficult to train, are limited to approximate latent variable prediction"" While the IAF approach allows for direct optimization of log-likelihood, the latent variable encoding part is still approximate, just like in a VAE. The original Kingma paper even states that it is an approximate posterior. + +#### 2.2 Invertible Transformations + +1. $f^{-1}$ wouldn't be applied to $z_t$. Maybe $f^{-1}(x_t)$ or $f^{-1}\left(f(z_t)\right)$. + +#### 3.1. Training Setup + +1. ""progressively adding steps of flow on the last step of flow has learned to attend to text"" on->once? + +#### 3.4.2 Seen Speaker with Unseen Style + +1. ""Flowtron succeeds in transferring not only the somber timbre, the low F0 and the long pauses +associated with the narrative style"" -> ""not only the somber timbre, but also"" + +#### Appendix + +1. IAF is known to be quite inefficient. Is there any noticeable impacts on training or inference time? Or is this not a problem due to only using 2 layers of flow? It would be great if the ablation study in the appendix regarding layers of flow also covers this, but this is quite optional.",9,5.0,ICLR2021 +Bk2K_F9lz,2,HJjvxl-Cb,HJjvxl-Cb,Impressive empirical results,"The paper presents an off-policy actor-critic method for learning a stochastic policy with entropy regularization. It is a direct extension of maximum entropy reinforcement learning for Q-learning (recently called soft-Q learning), and named soft actor-critic (SAC). Empirically SAC is shown to outperform DDPG significantly in terms of stability and sample efficiency, and can solve relatively difficult tasks that previously only on-policy (or hybrid on-policy/off-policy) method such as TRPO/PPO can solve stably. Besides entropy regularization, it also introduces multi-modal policy parameterization through mixture of Gaussians that enables diverse, on-policy exploration. + +The main appeal of the paper is the strong empirical performance of this new off-policy method in continuous action benchmarks. Several design choices could be the key, so it is encouraged to provide more ablation studies on these, which would be highly valuable for the community. 
In particular,
+
+- Amortization of Q and \pi through fitting a state value function
+
+- On-policy exploration vs OU-process-based off-policy exploration
+
+- Mixture vs non-mixture-based stochastic policy
+
+- SAC vs soft Q-learning
+
+Another valuable discussion to be had is the stability of off-policy algorithms, comparing Q-learning versus actor-critic methods.
+
+Pros:
+
+- Simple off-policy algorithm that achieves significantly better performance than existing off-policy baseline algorithms
+
+- It allows on-policy exploration in off-policy learning, partially thanks to entropy regularization that prevents the variance from shrinking to 0. It could be considered a major success of the off-policy algorithm that it removes heuristic exploration noise.
+
+Cons:
+
+- The method is a relatively simple extension of existing work in maximum entropy reinforcement learning. It is unclear which aspects lead to the significant improvements in performance, due to insufficient ablation studies.
+
+
+Other question:
+
+- Above Eq. 7, the paper discusses that fitting a state value function wrt Q and \pi is shown to improve the stability significantly. Is this a comparison with directly estimating the state value using finite samples? If so, is the primary instability due to the variance of the estimate, which can be avoided by drawing a lot of samples or doing full integration (still reasonably tractable for a finite mixture model)? Or, is the instability from elsewhere? By having SGD-based fitting of the state value function, it appears to simulate slowly changing target values (a similar role to target networks). If so, could a similar technique be used with DDPG to get more stable performance?
+
+",7,4.0,ICLR2018
+kFngTXTHjD,3,MmCRswl1UYl,MmCRswl1UYl,new dataset to support research on open QA over text and table; two techniques to retrieve and aggregate evidence; good empirical results,"##########################################################################
+
+Summary:
+
+The paper provides an interesting direction in open question answering. In particular, it proposes an open QA problem over both tabular and textual data, and presents a new large-scale dataset, Open Table-and-Text Question Answering (OTT-QA), to evaluate performance on this task. Two techniques are introduced to address the challenge of retrieving and aggregating evidence for OTT-QA. Results show that the newly introduced techniques bring improvements.
+
+##########################################################################
+
+Reasons for score:
+
+Overall, I vote for accepting. I like the idea of open question answering with various types of evidence. The major contribution of this work, in my personal opinion, is the creation of the dataset, which would foster research on open question answering over text and tables. The techniques introduced are sound, but the novelty in terms of methodology is limited.
+
+##########################################################################
+
+Pros:
+Comments:
+
+1. The paper formulates an interesting open QA problem over both tabular and textual data.
+
+2. The creation of the dataset (OTT-QA) is a great contribution to the community. The authors claim they will release the data to the public. Would the test set be kept blind so as to make it a challenge like SQuAD?
+
+3. The method is sound. The experimental study is convincing. The two introduced techniques bring improvements. 
+
+
+##########################################################################",7,4.0,ICLR2021
+AwdH4drr0Rs,4,SP5RHi-rdlJ,SP5RHi-rdlJ,A clear rejection,"This paper compresses neural networks via so-called Sparse Binary Neural Network designs. The proposed idea is naïve, directly using a slightly modified sign function to quantize network weights into 0 and 1 instead of the commonly defined -1 and 1. Experiments on the small MNIST and CIFAR-10 datasets with two shallow and old neural networks are provided.
+
+This paper has obvious weaknesses.
+
+--- Limited novelty
+
+The proposed method is naïve. The authors merely replace binary weights {-1, 1} by {0, 1}, using common quantization tricks for binary neural networks, such as the straight-through estimator (STE) and a slightly modified sign function. The authors claim that such a modification can bring significantly improved compression. However, it is problematic, as it will force all quantized network weights to be non-negative, leading to serious accuracy drops. For example, in Table 3, a shallow VGG-like network on the CIFAR-10 dataset shows about a 10% absolute accuracy drop compared to the binary weight counterpart. Furthermore, in the optimization, the authors add two non-negative constraints, which makes the training with STE even more challenging. I believe experiments on a large-scale image classification dataset such as ImageNet with modern CNNs will lead to more serious accuracy drops.
+
+Actually, more impressive neural network compression, yet with good accuracy, can be achieved via combining quantization and pruning. There exist numerous works in this field, e.g., ""Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding"" in ICLR 2016, ""Clip-q: Deep network compression learning by in-parallel pruning-quantization"" in CVPR 2018 and ""APQ: Joint Search for Network Architecture, Pruning and Quantization Policy"" in CVPR 2020. Unfortunately, they are missed by the authors.
+
+--- Poor writing
+
+The paper is poorly written, including the introduction (messy), related works (poor), proposed method (tedious) and experiments presentation (weak).
+
+--- Weak experiments
+
+There is no comparison with state-of-the-art CNN compression methods that combine quantization and pruning for improved compression.
+
+The authors only conduct toy experiments: a 3-layer fully connected LeNet on the MNIST dataset and a shallow VGG-like network on the CIFAR-10 dataset.
+
+Even with toy experiments, results are very weak, showing a serious accuracy drop even for a shallow VGG-like network on the CIFAR-10 dataset compared to the binary weight counterpart.
+
+ ",3,5.0,ICLR2021
+BG6ospN2E1o,4,u8APpiJX3u,u8APpiJX3u,Interesting idea; need more experimental justification,"=========
+Summary:
+
+This paper proposes a homogeneous network structure for semantic segmentation, which optimizes for prediction accuracy, latency as well as memory footprint. The paper studies the anytime prediction setting and designs a re-usable single building block to reduce the memory footprint. Experimental results on CamVid data show that it's possible to use a homogeneous network architecture to achieve competitive mIoU compared to previous work at the cost of increased MACs. Experimental evaluation on larger datasets such as Cityscapes is prohibited due to memory constraints of the available GPUs.
+
+=========
+Pros:
+* The idea of using a homogeneous network is interesting and re-usable blocks are reasonable for the anytime prediction setting. 
+* The paper is easy to follow. + +========= +Concerns: + +1. Is the parameters of the re-usable building block updated iteratively during inference? The paper described two scenarios where the parameters of the re-usable building block can be updated or shared across multiple iterations. My first question is, is this for training time only or for both training and inference? + + a) At inference time if the parameter updates are allowed at each iteration, what is the mechanism to decide the new weights? If an independent block was used and only the structure of the building block is reused, there would not be any benefit for latency or memory footprint. + + b) If the parameters of the building blocks are fixed during inference, then it falls into the weight-sharing scenario where the model performance is much lower according to Figure 1 c). + +2. Lack of anytime prediction experiments. The paper claims it studies anytime prediction setting, however, I couldn't find explicit descriptions about it in the experiment section and descriptions about how the model choose the number of iterations during inference. + +3. Missing Cityscapes experiments. The paper claims in the abstract that the method was evaluated on cityscapes but in the discussion section, it says they couldn't provide cityscapes experiments due to memory issues. The inconsistency in the paper should be fixed. + +4. What is the exact benefit of the homogenous network? The paper shows this structure requires at least 3 times larger MACs in order to achieve competitive performance. It also says this network has potential for novel, massively-parallel hardware accelerators. This is a bit vague to me, could you provide a concrete example? + +5. Related work on image pyramid and feature pyramid. In this paper, the data block preprocesses the images into multiple scales and the re-usable building block concatenates the features from multiple scales. A discussion/revisit on the literature about the image pyramid and feature pyramid is recommended here and it might give some inspirations to address the memory issues of training the model on Cityscapes. + +======== + +Overall, I think the idea of building a homogenous network is reasonable to me and may lead to a better trade-off between latency, accuracy, and memory. However, the current manuscript still needs much improvement. + + + + + + + + + + + + + + + + + + + + + + + +",4,4.0,ICLR2021 +SkWqUPoKH,1,Bkxv90EKPB,Bkxv90EKPB,Official Blind Review #3,"Thank you for an interesting read. + +As far as I understand, this paper presents DAMS which is a MAML-like algorithm but applied to posterior sampling. The idea is the following: +1. Construct a meta-sampler that generates good proposals/initial samples for task-specific samplers; +2. Train the meta-sampler so that the task-specific posterior sampling converges faster to the target distribution. + +The meta-sampler is designed as an inverse version of the neural auto-regressive flow (NIAF), and the task-specific sampler is based on the Wasserstein gradient flow (WGF). + +======= novelty ====== +The method is novel: +1. Probabilistic/Bayesian understanding of few-shot learning/meta learning has been proposed, but to date variational inference is the main inference engine used in the literature. This paper provides a nice complement by considering fast adaptation of posterior sampling. +2. 
The meta-level parameterisation method is indeed different from (probabilistic) MAML, in the sense that the initial parameters/samples z is generated from a neural network conditioned on a task, instead of using a shared initialisation across tasks. This is more inline with approaches such as hyper-networks, and I haven't seen many MAML-like approaches doing that. The meta-sampler architecture is new but improved upon NAF so I consider the architectural novelty to be minor. + + +======= significance ====== +The experimental section contains many results, with the two main category as (a) comparisons between DAMS and existing sampling methods on Bayesian inference tasks; and (b) comparisons to MAML on few-shot learning tasks. Compared with the baselines, DAMS achieves significantly better results, which is a good sign. + +However, DAMS as a whole pipeline has a lot of components, and it is not clear to me which part is the main driving force. So I think the following ablation studies will be very helpful: +1. To see whether it is necessary to use NIAF, one can replace NIAF with a set of learned particles shared across tasks (i.e. make \psi = {z^(1), ..., z^(N)} which is in similar spirit as MAML). +2. To see whether WGF brings in significant improvement in DAMS, one can replace WGF in sample adapter with SVGD or SG-MCMC method such as SGHMC/preconditioned SGLD. A comparison between e.g. SGHMC vs DAMS-SGHMC vs DAMS-WGF will be helpful for this ablation. +3. It seems to me the meta-sampler requires running a few step of NAIF updates (\Gamma info contains previous sample and the gradients). How many steps are required here? A detailed analysis will be useful. If a lot of steps are required, the ""fast adaptation"" in meta-testing is not really the case, as both meta-sampling and sample adaptation requires evaluating gradients. +4. The multiplicative normalising flow for BNN approximate posterior adds in another layer of complexity, as the method also perform meta-learning on this variational distribution. So a baseline which removes the NIAF part (i.e. also learn particles for masks z, see point 1) on few-shot learning tasks would be useful. As far as I understand this baseline approach is different from ABML/PMAML that has been reported in the paper. + +======= clarity ======= +The presentation needs to be improved. +To me, section 3 seems to be overwhelmed by details, e.g. the long discussion of task networks and WGF. For example, to what extent do the WGF construction details matter for understanding the whole DAMS pipeline? I think the general idea works for any valid posterior sampler in the fast adaptation step. +I would suggest the following structure for section 3 instead: +1. write down the whole pipeline in a more abstract way, e.g. say the meta-sampler is any generator conditioned on task information, and the sample adaptation as generic posterior sampling; +2. discuss the training algorithm, what are the loss functions, etc; +3. discuss in meta-testing how DAMS is deployed; +3. discuss the detail implementation of the meta-sampler and sample adaptation method.",6,,ICLR2020 +GIZEyVWt38o,4,Kz42iQirPJI,Kz42iQirPJI,Review of the Paper ,"-Summary- +The paper proposes a method for the sequential meta-learning problem. The author meta learn not only model parameters but also learning rate vectors for parameter blocks. To this end, the meta-learn model finds appropriate model parameters and adaptive learning rate vectors that capture task-general information. 
Overall experiments are performed on few-shot meta-learning settings with sequential domains (datasets). + +-Pros- +- Optimizing fine-grained learnable learning rate vectors for manually grouped parameter blocks is reasonable. +- The performance of the proposed model significantly outperforms baselines which naively combined existing few-shot meta-learning and continual learning approaches. + + +-Cons- +- The approach is too simple (Adding learnable strength vector weights for gradient update of conv. blocks) and heuristic. And core hyperparameters are manually decided, like # of blocks per Conv. layer and the size of memory. +- Lack of analysis. There is a lack of concrete insight into how does adaptive learning rate weights mitigating forgetting. +- The method only considers multi-head continual learning problems for task-incremental learning. The majority of recent impressing CL works considers further realistic and applicable to the broader areas, called class-incremental learning problem that task oracle isn't given during training/inference of the model. +- The method is only performed on simple CNN architectures. It needs to be validated on further modern deep neural network architectures. And, while the construction of CNN in this paper, most of the well-known CNN architectures have a different number of filters per layer. In this case, the strategy to split blocks can be important for pursuing a better model. However, there is no discussion/analysis of the problem. +- Meta-learning with bilevel optimization might require an additional computational cost. + +-Comments- +- Experiments on domain ordering are interesting. I see that recent CL works consider evaluations on multiple domains(datasets) like ""Hard Attention on Task"" (HAT) paper. It would be interesting to perform analysis of the task(domain)-order sensitivity like 'order-normalized performance disparity' (OPD) in [1], which can be beneficial for understanding backward-forward transfer during continual learning under the domain shift. +- Citation of the XtarNet is duplicated. Please combine them. + +[1] Yoon, Jaehong, et al. ""Scalable and Order-robust Continual Learning with Additive Parameter Decomposition."", ICLR 2020.",4,5.0,ICLR2021 +2R26oAjJTID,4,RmcPm9m3tnk,RmcPm9m3tnk,A convincing proof-of-concept solution for a difficult new task,"== Update == + +Thank you for your response and clarifications. I have left my score as is. + +== Original Review == + +The paper presents an unsupervised method for inferring scene graphs from images. Building upon +scene-attention methods such as AIR and SPACE, it hierarchically decomposes a scene into objects and +those objects into parts, giving rise to a tree structure. It is shown that this model successfully +recovers the hierarchies underlying the data on two newly proposed hierarchical variants of the +Sprites and CLEVR datasets. + +Strengths: + 1. The paper is well written, and despite the considerable complexity of the method, its + presentation is relatively easy to follow. + 2. The task of interest is well-defined, and has clearly been effectivly solved on the datasets + considered. Both quantitative and qualitative evaluations make it very clear that the model has + learned to infer the correct scene graphs as desired. Its ability to infer the appearance of + occluded parts is especially impressive. + 3. 
While there are no direct competitors on this newly defined task, the paper does a decent job of + comparing to the closest available baseline, showing how the additional structure can be + beneficial. + +Weaknesses: + 1. One may argue that the datasets have been deliberately constructed to showcase the model. While + that is probably true, I think this is a valid approach given the novel nature of the task and + the lack of supervision. Despite the clearly helpful structure (limited number of objects and + parts), the datasets still appear sufficiently challenging. + 2. In the experiments, scene graphs are limited to trees of height 2 and degree 4. This is a + significant constraint, however, each additional level of hierarchy introduces ambiguities and + makes it harder to learn the graph in an unsupervised manner. More complicated structures would + likely require supervision. + 3. As far as I can tell, the object types were chosen once to generate the datasets, and then kept + fixed across the experiments. Reporting results for multiple different datasets with randomly + chosen object types would be somewhat more convincing. + 4. As is common for unsupervised scene models, the proposed method likely only works on synthetic + images in its current state. However, due to the additional structural assumptions on the data, + it seems especially challenging to find suitable real-world use-cases. + +Overall, the paper presents an effective new method for the task it sets out to solve. While it is +questionable how it would work on real-world data, I believe the paper is of sufficient interest as +a proof of concept, and am therefore leaning towards acceptance. + +Questions: + 1. It is stated that auxiliary KL terms are added, with the sparseness constraint on $z^{pres}$ + being one of them. But is not clear if there are others. This would be important to know in + order to evaluate how strong the model's inductive biases are. + 2. The downstream task used for Fig. 5 is not clear to me. If the number of parts is computed for + each object, and these numbers are then summed, isn't the result equal to the total number of + parts in the scene, which SPACE-P can also infer? If only distinct parts are counted, + how is equality of parts defined on the dataset?",6,4.0,ICLR2021 +eKWxtq5UJF,1,6isfR3JCbi,6isfR3JCbi,A qualitative comparison with private generative models on MNIST is needed,"Summary: This paper studies the differential private synthetic dataset generation. Unlike previous DP based GAN models, this paper aims to boost the sample quality of after the training stage. In particular, the final synthetic dataset is sampled from the sequence of generators obtained during GAN training. The distribution is obtained by a private two-player game between the privately selected discriminator and a sampler from the mixture of generators. The results are demonstrated on gaussian data and tabular data. + + +Pros: 1. The sample quality of private generative models is known to be not as good as non-private models. This paper provides a practical private post gan boosting algorithm to improve the sample quality. + +Cons:1. My main concern is on experiments. It is known that private generative models have bad sample quality on image data. Prior works on private synthetic generation papers usually show results on MNIST. It would be better if the authors could compare private PGB on MNIST dataset. + +2. It would be better to have an ablation study of the proposed PGB and discriminator rejection sampling. 
For example, in Figure~1, the baselines for both non-private gan and private gan are too bad. I am wondering whether the gain is from rejection sampling or the proposed PGB algorithm. + + +Questions: + +1. I am curious about how to split epsilon for gan training and the post gan boosting. Are there any principled reasons for the split? + +2. How do you calculate the sensitivity for the exponential mechanism?",6,2.0,ICLR2021 +SJcOWb5gf,3,ByED-X-0W,ByED-X-0W,"Interesting line of work, but the burning questions are not yet answered: would recommend for workshop publication at this stage.","# Paper overview: +This paper views the learning process for stochastic feedforward networks through the lens of an +iterative information bottleneck process; at each layer an attempt is made to minimise the mutual +information (MI) with the feed-in layer while maximising the MI between that layer and the presumed-endogenous variable, 'Y'. + +Two propositions are made, (although I would argue that their derivations are trivially the consequence +of the model structure and inference scheme defined), and experiments are run which compare the approach to maximum likelihood estimation for 'Y' using an equivalent stochastic network architecture. + +# Paper discussion: +In general I like the idea of looking further into the effect of adding network structure on the original +information bottleneck results (empirical and theoretical). I would be interested to see if layerwise +input skip connections (i.e. between each network layer L_i and the original input variable 'X') hastened the 'compression' stage of learning e.g. (i.e. the time during which the intermediate layers minimise MI with 'X'). I'm also interested that clear examples of the information bottleneck principle in practice (e.g. CCA) are rarely mentioned. + +On the other hand, I think this paper is not quite ready: it reads like work written in a hurry, and is at times hard to follow as a result. There are several places where I think the terminology does not quite reflect what the authors perhaps hoped to express, or was otherwise slightly clumsy e.g: + +* ""...self-consistent equations are highly non-linear and still too abstract to be used for many..."", presumably what was implied was that the original solution to the information bottleneck as expressed by Tishby et al is non-analytic for most practical cases of interest? + +* ""Furthermore, we exploit the existing network architecture as variational decoders rather than resort to variational decoders that are not part of the neural network architecture."" -> The existing network architecture is used to provide a variational inference framework for I(Z,Y). + +* ""On average, 2H(X|Z) elements of X are mapped to the same code in Z."" In an ideal world I would like the assumptions required for this to hold true to be a fleshed out a little here. + +* ""The generated bottleneck samples are then used to estimate mutual information"" -> an empirical estimation of I(Z,X) would seem a very high variance estimator; the dimensionality of X is typically large in modern deep-learning problems---do you have any thoughts on how the learning process fares as this varies? Further on you cite that L_PIB is intractable due to the high dimensionality of the bottleneck variables, I imagine that this still yields a high var MC estimator in your approximation (in practice)? Was the performance significantly worse without the Raiko estimator? 
+ +* ""In this experiment, we compare PIBs with ...."" -> I find this whole section hard to read, the description of how the models relate to each other is a little difficult to follow at first sight. + +* Information dynamics of learning process (Figures 3, 6, 7, 8) -> I am curious as to why you did not run the PIB for the same number of epochs as the SFNN? I would also argue that you did not run either method as long as you should have (both approaches lack the longer term 'compression' stage whereby layers near the input reduce I(X,Z_i) as compared to their starting condition)? This property is visible in I(Z_2,X) for PIB in Figure 3, but otherwise absent. + +# Conclusion: +In conclusion, while interesting, for me the paper is not yet ready for publication. I would recommend this work for a workshop presentation at this stage. +",4,4.0,ICLR2018 +rkDLp95lG,1,rJGY8GbR-,rJGY8GbR-,Nice addition to the mean-field-theory subfield,"This paper further develops the research program using mean field theory to predict generalization performance of deep neural networks. As with all recent mean-field papers, the main query here is to what extent the assumptions (Axioms 1+2, which basically define the asymptotic parameters of interest to be the quantities defined in Sec. 2.; and also the fully connected residual structure of the network) apply in practice. This is answered using the same empirical standard as in [Yang and Schoenholz, Schoenholz et al.], i.e. showing that the dynamics of initialization predict generalization behavior on MNIST according to theory. + +As with the earlier papers in this recent program, the paper is notation-heavy but generally written well, though there is some overreliance on the readers' knowledge of previous work, for instance in presenting the evidence as above. Try as I might, I cannot find a detailed explanation of the color scale for the important Fig. 4. A small notation issue: the current Hebrew letter for the gradient quantity does not go with the other Greek letters and is typographically poor choice because of underlining, etc.). Also, several of the citations should be fixed to reflect peer-reviewed publication of Arxiv papers. I was not able to review all the proofs, but what I checked was sound. Finally, the techniques of WV and VV would be more applicable if it were not for the very tenuous relationship between gradient explosion and performance, which should be mentioned more than the one time it appears in the paper.",7,3.0,ICLR2018 +wyrzhkUc5Px,4,ztMLindFLWR,ztMLindFLWR,"Limited context, but useful theoretical framing of distinguishing power of GNN aggregators","Summary: + This paper explores the representation power of graph neural networks. Unlike recent work on choosing among simple aggregation functions or combinations thereof, the authors here recognize that these aggregators are the bottleneck in the representation power and generalize simple aggregator functions commonly used in literature to an aggregation coefficient matrix. The paper supports this construction theoretically and also proposes two aggregators that satisfy the rank-preservation requirement for more expressive (distinguishing) GNNs. + +Strengths: +* The theoretical results are strong in proving the bottleneck of aggregators (Lemma 1) and clearly contextualize popular existing methods into this result. 
+* Formulating aggregation in terms of a product with coefficients (indexed by a permutation) allows for framing existing methods and connecting the representation power to the rank of the matrix of coefficients.
+
+Weaknesses:
+* The study of the expressiveness of GNNs is a very popular topic right now, and not enough context is provided about related work on this topic and other approaches; the development mainly focuses on GIN and GAT, and while a few other GNNs are considered in the experimental results, they are not discussed or explained enough.
+
+Recommendation:
+By framing the aggregation in terms of coefficients, the paper provides interesting connections between the rank of these coefficient matrices and the distinguishing power of GNNs. While the analysis and explanation of the experiments are extremely limited, the theoretical developments are interesting and novel enough to narrowly recommend publication. Meanwhile, I do think the paper should undergo a reorganization to add more details about related work on the expressiveness of GNNs and analyze the experiments more thoroughly.
+
+Other comments and clarification needed:
+* It's not quite clear what it means for aggregators to be incomparable (top of page 3), i.e., what does it mean for the relative strength to ""not exist""? This point could be clarified with an additional sentence in the ""Distinguishing strength"" paragraph.
+* The text on page 4 before Proposition 2 states that ""… different M corresponds to different local structures. Therefore, the aggregation results of different aggregators must be different. However, it is not satisfied by existing GNNs."" This statement should be explained more and supported.
+* The details of the ExpandingConv layer overwhelm the paper. The details of equation 4 and related text can be moved to an Appendix and explained in the main text at a higher level. This would free up space to add context for the paper's contribution and to more properly address the experiments. In the end, the ExpandingConv formulation is a refinement of multi-head GAT.",6,3.0,ICLR2021
+SklWpVwT2m,3,S1z9ehAqYX,S1z9ehAqYX,Direct Application of a Well Established Statistical Method without Theoretical Guarantees and Very Limited Empirical Support,"The paper claims that a combination of policy gradients calculated by different RL algorithms would provide better objective values. The main focus of the work is to devise an adaptive combination scheme for policy gradient estimators. The authors claim that using statistical shrinkage estimators to combine different gradients that have different bias-variance trade-offs would provide better mean-squared error than each of those individual gradient estimators. The key observations made by the authors are that gradients computed by on-policy methods provide nearly unbiased estimators with very high variance, while the gradients obtained by off-policy methods, in particular model-based approaches, provide highly biased estimators with low variance. The proposed statistical tool to combine gradients is the James-Stein shrinkage estimator. The JS estimator provides strong theoretical guarantees for Gaussian cases but only some practical heuristics for more complex non-Gaussian cases. The authors do not discuss whether the JS estimator is actually suitable for this task, given that the strong assumptions of the underlying statistical approach are violated. They also do not go into any discussion about theoretical guarantees, nor do they provide any exposition or intuition about that. 
The scope of the experiments is very limited. Given the fact that there is no theory behind the claims and the lack of strong evidence I believe this paper does not cut the requirements for publication. + +To improve please add significantly more empirical evidence, provide more discussion about theoretical ground work and discussion about the suitability of the JS estimators when its required assumptions are not satisfied.",4,2.0,ICLR2019 +S1eTSCxTtr,2,SylL0krYPS,SylL0krYPS,Official Blind Review #3," *Synopsis*: + This paper looks at a new framework for adversarial attacks on deep reinforcement learning agents under continuous action spaces. They propose a model based approach which adds noise to either the observation or actions of the agent to push the agent to predefined target states. They then report results against several model-free/unlearned baselines on MuJoCo tasks using a policy learned through D4PG. + + Main contributions: + - Adversarial attacks for Deep RL in continuous action spaces. + + *Review* + The paper is well written, and has some interesting discussion/insight into attacking deep RL agents in continuous actions spaces. I think the authors are headed in the right direction, but compared to prior work in adversarial attacks for deep RL agents (i.e. the Huang and Lin) I have a few concerns that I feel the authors need to better explain/motivate in their paper. I am recommending this paper be rejected based on the following concerns. I am willing to raise my score if some of these are addressed by the authors in subsequent revisions + + 1. This algorithm requires the pre-trained policy to plan attacks (which may be a high bar for such an adversarial attack). It would be a nice addition to include similar results with ""black-box"" adversarial attacks, as mentioned in the Huang. + + 2. Another issue, addressed in the Lin paper, is this attack seems to require perturbation on every time step in a proposed trajectory. As mentioned by Lin, this is probably unrealistic and would cause the attacker to be detected. It would be another nice contribution to include variants that don't require perturbations on each transition. + + 3. Another unfortunate requirement is a learned model (or a way to simulate trajectories). From the Model Based RL literature, we know learning such a model is quite difficult and often unrealistic given our current approaches. While this is problematic, I think the paper could systematically test this looking at what happens as the model becomes less accurate over time. This could provide some nice results showing an accurate model isn't necessarily needed and anneal concerns over having to learn such a model. + + 4. It is unclear if the baselines measured against are meaningful in this setting, and I'm also a bit unclear how they are generated/implemented. Specifically, the random trajectories require you to return the generated trajectory with the smallest loss/reward. It is unclear how the adversary knows this information. Is it known through a model or some other simulation? Also the flip baseline could use a bit more explanation. I think these details can be safely placed in the appendix, but should appear somewhere in the final version. + + 5. I'm not sure the comparison to sample efficiency to the Gleave or Uesato papers are meaningful. For Gleave, the threat model explored is much different where they do not have access to the agent's observation or action streams and instead learn policies to affect the other agent in game scenarios. 
This is very different. Also, the Uesato is not adversarially attacking the agent, but attempting to find failure cases for the agent, which I again feel is very different from what you are trying to accomplish. I would remove this discussion and the claim at the end of the conclusion. + + + Other suggestions: + + S1. It would be helpful to include the score of the learned policy without any attacks, to see how well the baselines are performing (this will help readers understand if these are reasonable/meaningful baselines). + + S2. I'm unclear what figure three is adding to the paper, and am actually uncertain what the y-axis means. I don't think this is a wise use of the 9th page, and this plot could probably be relegated to the appendix. + + S3. As in prior work, it would be useful to see how well this line of attack works for multiple learning algorithms. Some potential candidates could be: PPO, TRPO, SAC, etc... +",3,,ICLR2020 +rkeZXiHs9r,2,BygzbyHFvB,BygzbyHFvB,Official Blind Review #3,"In this paper, the authors present a new adversarial training algorithm and apply it to the fintuning stage large scale language models BERT and RoBERTa. They find that with FreeLB applied to finetuning, both BERT and RoBERTa see small boosts in performance on GLUE, ARC, and CommonsenseQA. The gains they see on GLUE are quite small (0.3 on the GLUE test score for RoBERTa) but the gains are more substantial on ARC and CommonsenseQA. The paper also presents some ablation studies on the use of the same dropout mask across each ascent step of FreeLB, empirically seeing gains by using the same mask. They also present some analysis on robustness in the embedding space, showing that FreeLB leads to greater robustness than other adversarial training methods + +This paper is clearly presented and the algorithm shows gains over other methods. I would recommend that the authors try testing their method on SuperGLUE because it's possible they're hitting ceiling issues with GLUE, suppressing any gains the algorithm may yield. + +Questions, +- In tables 4 and 5, why are only results on RTE, CoLA, and MRPC presented? If this is because there was not noticeable difference on the other GLUE datasets, please mention it in the text. +- I realize that this method is meant to increase robustness in the embedding space, but did you do any error analysis on the models? Did they make different types of errors than models fine-tuned the vanilla way? + +Couple typos, +- Section 2.2, line 1: many -> much +- Section 4.2, GLUE paragraph: 88 -> 88.8 ",8,,ICLR2020 +twnaLNPV8dt,1,5g5x0eVdRg,5g5x0eVdRg,"Novel paper, nice results","In this paper, the authors study the problem of unsupervised representation learning from data augmentations. Specifically, the authors claim that existing methods are prone to getting stuck at local minima owing to easy-to-learn local representations that optimise the commonly used MI objectives, and then propose a hierarchical method that tackles optimisation at multiple layers of the feature hierarchy. + +Strengths: +- A highly novel solution to an important problem in representation learning. +- Strong results obtained with respect to the state of the art. +- Extensive analysis and comparisons. 
+ +Weaknesses: +- ""We demonstrate that current methods do not effectively maximise the MI objective"" + footnote: ""We show this by finding higher mutual information solutions using DHOG, rather than by any analysis of the solutions themselves."" => This claim calls for an analysis of the solutions and since this is missing in the paper, I would rephrase the claim. + +- It would have been nicer to have experimental analysis on different features (e.g. colour) being more prone to local optima, though this is intuitive. This could have significantly increased the impact of the paper. + +Minor comments: +- ""a reasonable mapping need only compute colour information"" => ""a reasonable mapping needs only compute colour information"". +- ""Learning a set of representations by encouraging them to have low MI,"" => This should be high MI? +- ""CIFAR-10, CIFAR-100-20 (a 20-way"" => The parenthesis is not closed. +- ""A network was learned to associate"" => ""A network was trained to associate"". + +**AFTER AUTHOR RESPONSE** + +I have read the other reviewers carefully and the feedback provided by the authors. The reviewers had two major concerns: (i) Theoretical or empirical justification/proof for the following claim (and the motivation) of the paper: “the current methods do not effectively maximize the MI objective because greedy SGD typically results in suboptimal local optima”. (ii) Lack of comparisons with newer methods from e.g. ECCV2020 etc. + +For the second, I feel empathy with the authors: In such a rapidly progressing field, it is difficult to integrate comparisons with new methods that are published while writing/submitting a paper. I am sure all of us have experienced similar problems. + +For the first issue, I disagree with the authors and agree with the other reviewers: This is an important claim that needs to be justified and I disagree with the authors's comment on the current results of the paper being a sufficient empirical evidence for the claim. An ICLR paper should have provided the necessary evidences and justifications.",4,2.0,ICLR2021 +S1zuJcxOC_M,2,XZzriKGEj0_,XZzriKGEj0_,Incorporating negative constraints into Gaussian process regression,"Summary: +This paper incorporates information of obstacles to avoid (e.g robot navigation trajectory in the room where the robot has to avoid items such as furniture) into Gaussian process regression fit. They call the obstacles, negative datapairs and the rest of data, positive datapairs. The aim is to have a GP where the probability of passing through the negative datapairs is low. The proposed method is called the Gaussian process with negative constraints (GP-NC). + +To be able to fit the GP regression to positive datapairs and avoid negative datapairs, they maximize the KL-divergence between the distributions of GP learned from positive datpairs $(p(y|\theta. X))$ and negative datapiars $(q(\hat{y}|\hat{X}))$ which will have a bound between $[0, \inf]$. +For being able to maximize the KL-divergence with the marginal log-likelihood they change the scale of KL-divergence to the log scale and a parameter $\lambda$. $\lambda$ is a tradeoff between curve fitting and avoidance of the negative datapoints. KL term has an analytical solution since both distributions are Gaussians. +They compare their method against SVGP (Hensman et al., 2013) and PPGPR (Jankowiak et al., 2019) and the exact GP. + +Comments and questions: +The paper is well-written and easy to follow. 
This paper is also technically sound and to the best of my knowledge is novel and relevant to the community. + +In figure two you have mentioned you sampled the inducing inputs randomly from the whole range of training inputs. Did you choose the same inducing inputs for the SVGP and SVGP-NC? + +In the error calculation, standardized mean squared error (SMSE) could be a better choice than RMSR since the former incorporates variance information as well. + +In the experiment section, all negative datapairs are synthetically made. I was wondering if you could apply your method to a dataset with real negative datapiars (like a robot in the room avoiding obstacles)? + +In figure 3 (3DRoad) the approximate results are better than the exact GP. Is this because of the scale of the data and not being able to converge? + +Miscellaneous comments: +In Table 1 row 2 should be wine quality - white + +################### After the rebuttal ################ + +I thank the authors for addressing the issues that were raised. The paper is indeed addressing a very practical issue in the ML community. +After reading other reviewers' comments and concerns I decreased my score. +",6,2.0,ICLR2021 +BJWjA85xz,2,S1680_1Rb,S1680_1Rb,"Propose new filters based on Cayley transform -- interesting filter, but unconvincing theory / experiments","Summary: This paper proposes a new graph-convolution architecture, based on Cayley transform of the matrix. Succinctly, if L denotes the Laplacian of a graph, this filter corresponds to an operator that is a low degree polynomial of C(L) = (hL - i)/(hL+i), where h is a scalar and i denotes sqrt(-1). The authors contend that such filters are interesting because they can 'zoom' into a part of the spectrum, depending on the choice of h, and that C(L) is always a rotation matrix with eigenvalues with magnitude 1. The authors propose to compute them using Jacobi iteration (using the diagonal as a preconditioner), and present experimental results. + +Opinion: Though the Cayley filters seem to have interesting properties, I find the authors theoretical and experimental justification insufficient to conclude that they offer sufficient advantage over existing methods. I list my major criticisms below: +1. The comparison to Chebyshev filters (small degree polynomials in the Chebyshev basis) at several places is unconvincing. The results on CORA (Fig 5a) compare filters with the same order, though Cayley filters have twice the number of variables for the same order as Chebyshev filters. Similarly for Fig 1, order 3 Cayley should be compared to Order 6 Chebyshev (roughly). + +2. Since Chebyshev polynomials blow up exponentially when applied to values larger than 1, applying Chebyshev filters to unnormalized Laplacians (Fig 5b) is an unfair comparison. + +3. The authors basically apply Jacobi iteration (gradient descent using a diagonal preconditioner) to estimate the Cayley filters, and contend that a constant number of iterations of Jacobi are sufficient. This ignores the fact that their convergence rate scales quadratically in h and the max-degree of the graph. Moreover, this means that the Filter is effectively a low degree polynomial in (D^(-1)A)^K, where A is the adjacency matrix of the graph, and K is the number of Jacobi iterations. It's unclear how (or why) a choice of K might be good, or why does it make sense to throw away all powers of D^(-1)Af, even though we're computing all of them. 
+Also, note that this means a K-fold increase in the runtime for each evaluation of the network, compared to the Chebyshev filter. + +Among the other experimental results, the synthetic results do clearly convey a significant advantage at least over Chebyshev filters with the same number of parameters. The CORA results (table 2) do convey a small but clear advantage. The MNIST result seems a tie, and the comparison for MovieLens doesn't make it obvious that the number of parameters is the same. + +Overall, this leads me to conclude that the paper presents insufficient justification to conclude that Cayley filters offer a significant advantage over existing work.",4,3.0,ICLR2018 +SJlkRGil9B,3,H1gpET4YDB,H1gpET4YDB,Official Blind Review #1,"This paper introduces a optimisation for BERT models based on using block matrices for the attention layers. This allows to reduce the memory footprint and the processing time during training while reaching state-of-the-art results on 5 datasets. An interesting study on memory consumption in BERT is conducted. No results are given at test time : is there also a memory and processing time reduction ? + +Even if the proposition is interesting, the impact of the paper is limited to the (flourishing) scope optimising Bert models (""Bertology""). The authors do not mention if their code is available. + + +Table 3 : Humam -> Human ",3,,ICLR2020 +BygVwssiYS,1,Skg9jnVFvH,Skg9jnVFvH,Official Blind Review #2,"Summary: + +This work is a follow-up of WaveGAN. It uses the first few layers of the original WaveGAN to synthesize low resolution waveform (4kHz), and applies several bandwidth extension modules to progressively output the higher resolution raw audios. + +pros: +- The proposed PUGAN has significantly smaller number of parameters than WaveGAN (e.g., 20x smaller). + +cons: +- WaveGAN was a preliminary and encouraging trial for raw audio synthesis with GAN. Note that, its audio fidelity is far away from the state-of-the-art results and it was only tested on simple dataset (sounds of ten-digit commands). In contrast, the state-of-the-art autoregressive models (e.g., WaveNet) and parallel flow-based models (e.g., Parallel WaveNet) have been tested on challenging high-fidelity speech synthesis. As a result, one may focus on improving the audio fidelity of GAN on more challenging tasks. However, the proposed PUGAN was still tested on very simple dataset (sounds of ten-digit commands), and the quality of generated samples are only comparable to WaveGAN. + +Detailed comment: + +-- The attached samples are pretty noisy (e.g., noticeable artifacts on posted spectrograms). One may introduce the feature matching (e.g., STFT loss in ClariNet) as an auxiliary loss to improve the audio fidelity. + +-- Did the authors try conditional generation, e.g., conditioned on the digit label? The posted failure cases and some samples tend to have overlapped sounds from different digits. ",1,,ICLR2020 +JFo4ZbXHLly,1,mQPBmvyAuk,mQPBmvyAuk,Straightforward approach for evaluating model robustness to subpopulation shift,"This paper addresses the problem of model robustness to subpopulation shift. Authors propose building large-scale subpopulation shift benchmarks wherein the data subpopulations present during model training and evaluation differ. 
In this regard, their approach is based on leveraging existing dataset labels and use them to identify superclasses to construct classification tasks over such superclasses and repurpose the original dataset classes to be the subpopulations of interest. They train some learning models over the generated benchmarks to evaluate model robustness to subpopulation shift and, finally, they try various learning interventions (from the literature) to decrease model sensitivity to this sort of data perturbations. + + +Strengths: + +- Paper is very well-written. +- The problem addressed is very important (learning model generalisation to data shift) and of interest for the majority of ML/AI research community. +- The methodology followed is well defined and correct. +- The authors have performed an excellent work with the comprehensive experimental setting proposed in the paper. + +Weaknesses: + +- It is difficult to characterize what new scientific understanding or knowledge was presented in this paper. The presented approach for identifying superclasses and subpopulations of interests is somehow straightforward. Also, I doubt that manual procedure for hierarchy creation / restructuring process is a trivial task (with relatively little effort) for most benchmarks with no structured organisation. Results in sections 4.3 and 5.1 appear entirely unsurprising (though admittedly, this could be hindsight bias). Results in 5.2 appear to present more interesting insights (e.g. little effect of train-time interventions on model robustness to subpopulation shift). However, one is left wondering whether this insight generalizes beyond the specifics of this experiments/dataset or whether this will create an isolated ImageNet sub-community for addressing image-classification robustness tasks. + + +Comments after author rebuttal: + +Looking at the author's comments (as well as the other reviewer's feedback), I think that the authors have made a good job with the responses and I'm now more convinced about the usefulness of this work. I'm increasing my original recommendation to 6 ""Marginally above threshold"". + + + +",6,3.0,ICLR2021 +r1NvXZ9ez,2,SyMvJrdaW,SyMvJrdaW,Creative investigation of Resnets,"Paper proposes a shallow model for approximating stacks of Resnet layers, based on mathematical approximations to the Resnet equations and experimental insights, and uses this technique to train Resnet-like models in half the time on CIFAR-10 and CIFAR-100. While the experiments are not particularly impressive, I liked the originality of this paper. ",7,3.0,ICLR2018 +Cn2xWFqe3hJ,1,vujTf_I8Kmc,vujTf_I8Kmc,Review,"The proposed contellation module is an improved version of non-local block [a], which contains a cell feature clustering module and a self-attention module for modeling pixel-wise (cell-wise) relationships. Inserting this block to the backbones could improve the performance for few-shot learning setting. + +Concerns: +1. In my mind, the most important difference between the proposed constellation module and non-local block is the cell feature clustering module. (Multi-head design is widely-used in the extensions of transformer model) Hence, it is important to prove that the newly inserted cell feature clustering module is crucial in the constellation module at least in the few-shot learning setting. It would be bonus to do comparison in other important benchmarks. + +2. The corresponding intuition that why the cell feature clustering module works well is also needed. 
For example, why should we need a shared part patterns (cluster centers) for all different images. Is 128 shared centers enough for all data points? It would be great to visualize the 128 learnt part patterns, e.g. visualizing nearest cell features in the all images. + +3. For few-shot learning, there are several papers published recently with state-of-the-art results [b,c], which should be compared in the literature. + +[a] Non-local Neural Networks, CVPR 2018 +[b] Negative Margin Matters: Understanding Margin in Few-shot Classification, ECCV 2020 +[c] Boosting Few-Shot Learning With Adaptive Margin Loss, CVPR 2020",5,5.0,ICLR2021 +H1gqaqHe5H,2,SJgs8TVtvr,SJgs8TVtvr,Official Blind Review #3,"The authors present an extension of variational autoencoders (VAEs), where Gaussian distribution of the latent variable is replaced by a mixture of Gaussians. The approach can be used for clustering and generation. The authors carry out experiments to evaluate the performance of the method in these tasks and compare it to competing methods. + +The paper is well written and easy to read and understand. Specialized related work is discussed. I find the extension of VAEs to GMMs interesting for the ICLR community, although it is somewhat straight forward in terms of its technical difficulty. However, the technical novelty together with the fine empirical evaluation are just good enough for ICLR, in my opinion. +",6,,ICLR2020 +B1xF4md62m,3,BJesDsA9t7,BJesDsA9t7,Nice idea. Need better experiments.,"Privacy concerns arise when data is shared with third parties, a common occurrence. This paper proposes a privacy-preserving classification framework that consists of an encoder that extracts features from data, a classifier that performs the actual classification, and a decoder that tries to reconstruct the original data. In a mobile computing setting, the encoder is deployed at the client side and the classification is performed on the server side which accesses only the output features of the encoder. The adversarial training process guarantees good accuracy of the classifier while there is no decoder being able to reconstruct the original input sample accurately. Experimental results are provided to confirm the usefulness of the algorithm. + +The problem of privacy-preserving learning is an important topic and the paper proposes an interesting framework for that. However, I think it needs to provide more solid evaluations of the proposed algorithm, and presentation also need to be improved a bit. + +Detailed comments: +I don’t see a significant difference between RAN and DNN in Figure 5. Maybe more explanation or better visualization would help. +The decoder used to measure privacy is very important. Can you provide more detail about the decoders used in all the four cases? If possible, evaluating the privacy with different decoders may provide a stronger evidence for the proposed method. +It seems that DNN(resized) is a generalization of DNN. If so, by changing the magnitude of noise and projection dimensions for PCA should give a DNN(resized) result (in Figure 3) that is close to DNN. If the two NNs used in DNN and DNN(resized) are different, I believe it’s still possible to apply the algorithm in DNN(resized) to the NN used in DNN, and get a full trace in the figure as noise and projection changes, which would lead to more fair comparison. 
+The abstract mentioned that the proposed algorithm works as an “implicit regularization leading to better classification accuracy than the original model which completely ignores privacy”. But I don’t see clearly from the experimental results how the accuracy compares to a non-private classifier. +Section 2.2 mentioned how different kind of layers would help with the encoder’s utility and privacy. It would be better to back up the argument with some experiments. +I think it needs to be made clearer how reconstruction error works as a measure of privacy. For example, an image which is totally unreadable for human eye might still leak sensitive information when fed into a machine learning model. +In term of reference, it’s better to cite more articles with different kind of privacy attacks for how raw data can cause privacy risks. For the “Noisy Data” method, it’s better to cite more articles on differential privacy and local differential privacy. +Some figures, like Figure 3 and 4, are hard to read. The author may consider making the figures larger (maybe with a 2 by 2 layout), adjusting the position of the legend & scale of x-axis for Figure 3, and using markers with different colors for Figure 4. +",4,4.0,ICLR2019 +H1xHdzo0Fr,1,Hkl_sAVtwr,Hkl_sAVtwr,Official Blind Review #2,"This paper proposes use of the deep image prior (DIP) in compressed sensing. The proposed method, termed CS-DIP, solves the nonlinear regularized least square regression (equation (3)). It is especially beneficial in that it does not require training using a large-scale dataset if the learned regularization is not used. Results of numerical experiments demonstrate empirical superiority of the proposed method on the reconstruction of chest x-ray images as well as on that of the MNIST handwritten digit images. + +The demonstrated empirical efficiency of the proposal is itself interesting. At the same time, the proposal can be regarded as a straightforward combination of compressed sensing and the DIP, so that the main contribution of this paper should be considered rather marginal. I would thus like to recommend ""weak accept"" of this paper. + +In my view the concept of DIP provides a very stimulating working hypothesis which claims that what is important in image reconstruction is not really representation learning but rather an appropriate network architecture. The results of this paper can be regarded as providing an additional empirical support for this working hypothesis. On the other hand, what we have understood in this regard seem quite limited; for example, on the basis of the contents of this paper one cannot determine what network architecture should be used in the proposed CS-DIP framework applied to a specific task. I think that the fact that DIP is still a working hypothesis should be stressed more in this paper. It does not reduce the value of this paper as one providing an empirical support for it. + +I think that the theoretical result of this paper, summarized as Theorem 4.1, does not tell us much about CS-DIP. The theorem shows that overfitting occurs even if one uses a single hidden-layer ReLU network. As the authors argue, it would suggest necessity of early stopping in the proposal. On the other hand, I could not find any discussion on early stopping in the experiments: On page 14, lines 22-23, it seems that the theoretical result is not taken into account in deciding the stopping criterion, which would make the significance of the theoretical contribution quite obscure. 
+ +Page 4, line 10: a phenomen(a -> on) +Page 13: The paper by Ulyanov, Vedaldi, and Lempitsky on DIP has been published as a conference paper in the proceedings of CVPR 2018, so that appropriate bibliographic information should be provided. +Page 16, line 26: On the right-hand side of the equation, the outermost \phi should not be there. +Page 17, line 3: 3828 should perhaps read 3528. +",6,,ICLR2020 +ByeZrqUZ5S,3,rkx-wA4YPS,rkx-wA4YPS,Official Blind Review #2,"This paper builds upon recent work on detecting and correcting for label shift. +They explore both the BBSE algorithm analyzed in Detecting and Correcting for Label Shift (2018) +and another approach based on EM where the predictive posteriors and test set label distributions +are iteratively computed, each an update based on the estimate of the other. + +Crucially, while the former method requires only that the confusion matrix be invertible, +the latter method only appears valid under strong assumptions including the calibration fo the classifier. +Thus the authors propose an approach for “bias-corrected calibration” +and shows that bias-corrected calibration can improve the performance of BBSE and EM. +The method is crucial for EM and with it, the results seem to show that EM, +in the large sample (8000 examples) regime and with good initial classifiers + (on the relatively easy CIFAR10 task with a strong baseline) +that EM outperforms BBSE. + +The paper is easy to follow an the authors should also be credited for releasing +code anonymously with which we could reproduce their results. + +I have as few specific concerns/questions about the paper that I would like the authors to address: + + * They consider JS divergence as a metric for evaluation. But they don’t consider other metrics + like the error in weight estimates which is considered in most of the prior work + * They don’t compare their results with regularizations suggested on top of BBSE, particularly Azizzadenesheli et. al. https://arxiv.org/abs/1903.09734. + * They compare methods for particularly limited ranges of Dirichlet shift (\alpha=0.1,1.0). + * What happens when the \alpha increases to have less severe shifts? +Optimizing ELBO with EM can lead to local convergence to the likelihood function when the likelihood is not unimodal. + * Is this likelihood function unimodal? Does the EM approach converges to MLE under some appropriate initialization and assumptions? + +A small presentation note: many of the papers are reporting the same metric and ought to be grouped as a large table, not as many tables. Also every table should state clearly what it is reporting in the caption, not just referring to earlier tables. + +=======Update +I have read the rebuttal and appreciate that the authors took the time to establish the concavity of the likelihood function for EM. Overall this paper makes an interesting contribution in establishing the usefulness of the likelihood formulation (here optimized by EM) of label shift estimation and its apparent benefits over BBSE in some settings. I am happy to keep my score despite apparent disagreement from the other reviewers. I must say that some other reviews were disappointingly lacking in thoroughness. + +The paper still leaves open some serious questions, e.g. --- why is this bias correction heuristic so effective vis-a-vis EM and is this explained by its performance at the calibration task itself? The original temperature scaling paper reported a similar heuristic yet didn't see such a benefit wrt their metrics. 
Why is it so useful here? + +Still, while this paper can be improved in some key ways, it does make an interesting contribution.",6,,ICLR2020 +pCNcqICiHhO,3,MmcywoW7PbJ,MmcywoW7PbJ,"Like the Extensive Experiments, But The Paper Requires Plenty of Clarification","Summary: The paper proposes a novel method for learning goal-conditioned policy with images/text goals. + +Quality: The overall quality of the paper is good. + Strong side: Extensive evaluation on various environments and tasks showcases the advantage and generalizability of the method. The authors uses figures and algorithm boxes to make their method very clear. + Weak side: Several clarification on the motivation and details of the method is needed. See below. + +Clarity: (1) The main argument is not strong: 1. Image/text goals is not intractable for current goal-conditioned policies [1] 2. What do you mean by ‘extrinsic’ reward? If you mean task-specific reward, many methods learn goal-conditioned policy without extrinsic reward, like hindsight experience replay. +(2) Therefore, I guess the authors wanted to claim they use ‘intrinsic’ rewards, which is a mutual-information-based reward. Now I have two questions for Section 3.2: 1. Why do we use this loss function Eq. (1)? It comes out of nowhere without intuition. (2) Why do you decompose the optimization into ($\mu, \Phi$) and $\theta$? I feel like decomposing into $\Phi$ and ($\mu, \theta$) is more reasonable in that one optimizes the reward first then policies. +Another question on algorithm box: In step two, $g_t$ is changing with time steps. This is kind of strange because in typical goal-conditioned policy learning, in one epoch, people use a fixed goal to train their policy, then change the goal in the next epoch. What is the fundamental difference? +(3) The whole disentanglement thing needs more clarification. I don’t understand the reason why you disentangle your policy. In computer vision, disentanglement has clear physical meanings like disentangling shape and color, but here I don’t have such intuition. In experiments, the effect of disentanglement is only demonstrated in a simple 2D-task, which seems not enough. +(4) Details: 1. How is p(w) defined? It seems super important, but I was not able to find its details in the paper. Maybe I overlooked something. + +Originality: As far as I know, the method is new. + +Significance: Learning goal-conditioned policy with high-dimensional goals is an important problem. I think if the authors could clarify the above questions, the solid experiments will make the paper a good contribution. However, I don’t think its current version passes the bar of ICLR. + +Reference: [1] https://papers.nips.cc/paper/9623-planning-with-goal-conditioned-policies.pdf",6,3.0,ICLR2021 +r1l6IbIc2X,1,rJg6ssC5Y7,rJg6ssC5Y7,An important and useful tool for the field.,"The authors propose a benchmark for optimization algorithms specific to deep learning called DeepOBS. They provide code to evaluate an optimizer against a suite of standard tasks in deep learning, and provide well tuned baselines for a comparison. The authors discuss important considerations when comparing optimizers, including how to measure speed and tunability of an optimizer, what metric(s) to compare against, and how to deal with stochasticity. + +A clear, standardized optimization benchmark suite would be very valuable for the field. 
As the others clearly state in the introduction, there have been many proposed optimization algorithms, but it is hard to compare many of these due to differences in how the optimizers were evaluated in the original papers. In general, people have different requirements for what the expect from an optimizer. However, this paper does a good job of discussing most of the factors that people should consider when choosing or comparing optimizers. Providing a set of well tuned baselines would save people a lot of time in making comparisons with a new optimizer, as well as providing a canonical set of tasks to evaluate against. I particularly appreciated the breadth and diversity of the included tasks. + +I am a little worried that people will still find minor quibbles with particular choices or tasks in this suite, and therefore continue to use bespoke comparisons, but I think this benchmark would be a valuable resource for the community. + +Some minor comments: +- In section 2.3, there is a recommendation for how to estimate per-iteration cost. I would mention in this section that this procedure is automated and part of the benchmark suite. +- I wanted to see how the baselines performed on all of the tasks in the suite (not just on the 8 tasks in the benchmark sets). Perhaps those figures could be included in an appendix. +- The authors might want to consider including an automated way of generating performance profiles (https://arxiv.org/abs/cs/0102001) across tasks as part of DeepOBS, as a way of getting a sense of how optimizers performed generally across all tasks.",7,4.0,ICLR2019 +bEiOLmRcWok,4,DEa4JdMWRHp,DEa4JdMWRHp,Marginally above acceptance threshold,"Summary: +The paper proposes an interpretable method to detect Granger causality (GC) under nonlinear dynamics which can detect signs of Granger-causal effects (positive and negative) and inspect their variability over time. The novelty of this paper is 1 and 2 in the Pros below, and the methods utilized the existing stable frameworks such as a heuristic stability-based procedure that relies on time-reversed Granger causality (TRGC) (Winkler et al., 2016). + +Reasons for score: +Although the motivation and the experiments were clear and the code was reproducible, the presentations such as in the results were sometimes unclear (below). +I think this would be a valuable paper in this community, the presentation and investigation may be required to obtain a higher rating. + +Pros: +1. The method called generalized vector autoregression (GVAR) generalizes the VAR which is the basis of linear GC methods. The method is interpretable based on a self-explaining neural network (SENN) framework and allows exploring signs of Granger-causal effects and their variability through time. +2. The method utilizes a framework for inferring nonlinear multivariate GC that relies on a GVAR model with sparsity-inducing and time-smoothing penalties. +3. The method outperformed the baseline methods on a range of synthetic Lorenz 96 and Lotka–Volterra datasets in inferring the ground truth GC structure and effect signs. + +Cons: +1. The detailed explanation and the difference from the previous TRGC work were unclear (below). +2. There was little discussion about the reason why TCDF outperformed the proposed methods on the simulated fMRI dataset. +3. There were no results in cost function (e.g., prediction error) even in the appendix. This detecting Granger causality would be unsupervised learning, thus the learning results based on the cost function are unknown. 
The information might help us understand the reason for 2. + +Other comments: + +Introduction: +“varying causal structures (Lowe et al., 2020)”: Did this paper focus on the varying causal structures? + +Method: +The alpha in Alg.1 (threshold) is confusing with an elastic net regularization parameter. + +There is no explanation of Q of Alg. 1 in the main text. Is this a variable threshold in the analogous way of the ROC curve? How was the Q determined? +Is the Alg. 1 the proposed method? The title “Stability-based thresholding” may be one of the components of the proposed method. + +Experiments: + +The number of sequences to evaluate the methods in each dataset was unknown. In my understanding, the proposed methods learn each model for each sequence like the methods of Tank et al. (2018) and Khanna & Tan (2020). In the shared code, for example, Lorenz 96 used 5 sequences (in the code, called datasets). This may be critical information to evaluate the method.",6,4.0,ICLR2021 +Syx91gcotr,1,S1xLuRVFvr,S1xLuRVFvr,Official Blind Review #1,"This work proposes a novel approach for visualizing the predictions of neural network models on pairwise tasks, e.g. predicting whether two images are similar. The authors show that to see similarity between two images down on the pixel/region level, it is not sufficient to apply methods like Grad-CAM which do not aim for decomposition. Instead, the authors' approach namely targets decomposition, and shows the benefit of decomposition through intuitive examples and qualitative results. The authors also quantitatively show the benefit of their method. First, they measure the performance of their method vs CAM and Grad-CAM, on the weakly supervised localization (WSL) task. They also show how their method reveals the disadvantages of standard triplet loss compared to a recent metric learning loss. + +My concerns: +1) While showing performance on WSL is appealing as it allows for a way to quantitatively evaluate the method, I wonder if there are other ways to evaluate, that explicitly measure the quality of the proposed technique for allowing interpretability and understanding of the base model's performance. +2) The Triplet vs MS experiment is interesting, but I'm not sure this is the most convincing way to show that this proposed visualization technique is better than something else. Just because something shows one method is worse than another, doesn't mean that this better/worse assessment is accurate. Further, how would Grad-CAM do on the same task? +3) The retrieval experiment only shows qualitative results, and again, no baseline is compared. +4) Similarity is a relative judgement; it's hard to say if two items are similar, but easier to say if A and B are more similar than A and C. It seems the proposed method doesn't consider negatives, which is perhaps a limitation.",3,,ICLR2020 +lVwYVd9lbOV,2,Yj4mmVB_l6,Yj4mmVB_l6,"Report on ""Two steps at a time --- taking GAN training in stride with Tseng's method"" ","Summary: This paper introduces the forward-backward-forward splitting for variational inequalities. The main results are an asymptotic convergence results and a non-asymptotic convergence results using a restricted merit function. A new method, FBFp, is introduced and studied. Complete proofs are given. Preliminary numerical results obtained by training GANs are reported. + +Pros: ++ complete proofs ++ A new stochastic operator splitting method based on Tseng's FBF is introduced in which the operator needs to be evaluated once per iteration. 
This splitting, called FBFp, is indeed new, and has the potentially of being of practical relevance. ++ Preliminary numerical results on standard GAN architectures. + +Cons: +- The paper takes a long time until it becomes clear what actually the monotone inclusion looks like. It seems that the problem of interest is formulation in eq. (9), preceded by a long and unnecessary discussion about existing solvers. It would have been much more accurate to simply start with the problem formulation, then propose your solution method, followed by a critical explanation of the contribution. +- p.3 claim that FBF has not been rigorously analyzed for saddle point problems. This is of course not true. Even the original paper by Tseng (A MODIFIED FORWARD-BACKWARD SPLITTING METHOD FOR MAXIMAL MONOTONE MAPPINGS, SICON 2000) discusses the application to saddle point problems. See Example 5 in that paper. +- The stochastic FBF has been studied in Bot et al. Mini-batch Forward-Backward-Forward Methods for solving Stochastic Variational inequalities, forthcoming in Stochastic Systems. Note that the Arxive version of that paper is available since 2019. Overall, the paper contains only marginal contributions to the state-of-the-art. +- Only convergence rates for the ergodic average is provided. It is known that the ergodic average might destroy important features of the true solution, such as sparsity. For SFBF we know non-asymptotic convergence rates of the last iterate. This is not mentioned at all. +- I have some doubts that the restricted merit function is the appropriate one here. Note if the aim is to solve the VI over an unconstrained domain, then FBF coincides with EG, and there is nothing to be analyzed. The interesting case is thus only the constrained case. These constraints are usually encoded in the non-smooth part of of eq. (8), so there is no need to write this explicitly. In my opinion it would therefore be cleaner to assume at the outset that the domain of $F+\partial r$ is bounded. The gap function used can in fact be traced back to Facchinei & Pang (2003) and is most likely even longer in use than that. +",4,5.0,ICLR2021 +SkgvijzLtB,2,Ske9VANKDH,Ske9VANKDH,Official Blind Review #2,"The paper proposes a new condition: $\gamma$-optimization principle. +The principle states that the inner product between the difference between the current iterate and a fixed global optimum and the stochastic gradient (or more general) is larger than the squared the gradient norm, plus the product of squared norm of the difference between the initialization and the global minimum, and the loss of the current iterate. + +Under this condition, the paper shows sublinear convergence. + +Main Comments: +The proposed conditions are similar to many previous works, as pointed out by authors. With these kinds of conditions, proving global convergence is trivial. +One question is that the condition holds uniformly for all models and every sampled data point. There is no randomness in the condition. I would expect a condition that has some ""randomness"" in it, e.g., the condition holds in expectation over random sampling over the data. +The condition also requires a specific global minimizer. Because of the randomness in initialization and stochastic training, I expect the target global minimizer can change from iteration to iteration, but the current condition does not reflect that. 
+ + +-------------------------------------------------------------------------------------------- +I have read the rebuttal and I maintain the score. +Note in the student-teacher setting even though teacher is unique, there can be multiple optimal students. I don't think this resolves my concern.",1,,ICLR2020 +r1xJuNzPYH,1,SJe_D1SYvr,SJe_D1SYvr,Official Blind Review #3,"* Paper summary. +The paper considers an IL problem where partial knowledge about the transition probability of the MDP is available. To use of this knowledge, the paper proposes an expert induced MDP (eMDP) model where the unknown part of transition probability is modeled as-is from demonstrations. Based on eMDP, the paper proposes a reward function based on an integral probability metric between a state distribution of expert demonstrations under the target MDP and a state distribution of the agent under the eMDP. Using this reward function, the paper proposes an IL method that maximizes this reward function by RL. The main theoretical result of the paper is that the error between value function of the target MDP and that of the eMDP can be upper-bounded. Empirical comparisons against behavior cloning on discrete control tasks show that the proposed method performs better. + +* Rating. +The main contribution of the paper is the eMDP model, which enables utilizing prior knowledge about the target MDP for IL. While this idea is interesting, eMDP is too restrictive and its practical usefulness is unclear. Moreover, there are other issues that should be addressed such as clarity and experiments. I vote for weak rejection. + +* Major comments: +- Limited practicality due to an assumption on the unresponsive transition kernel. +In eMDP, the unresponsive transition kernel is modeled from demonstrations by directive using the observed next states (in demonstrations) as a next state of the agent. This modeling implicitly assumes that the agent cannot influence the unresponsive part of the state space. This is too restrictive, since actions of the agent usually influence other parts of the state space such as opponents and objects. For instance, in pong, actions of the agent indeed influence trajectories of the ball. Due to this restrictive assumption, I think the practicality of eMDP is too limited. + +Also, it is unclear what happens when the transition of unresponsive state space is stochastic. In this case, the unresponsive transitions of eMDP would be incorrect since eMDP assumes deterministic transitions as-is from demonstrations. Though I am not sure about this. + +- The equality in Eq. (3) should be an upper-bound. The IPM is defined by the absolute difference between summations (expectations), but the right-hand side of Eq. (3) is summations (expectations) of the absolute difference. These two are not equal, and the right-hand side should be an upper-bound. + +- The paper is difficult to follow. There are many skipping contents in the paper. Especially, in the description of eMDP and its solution, where the paper describes the optimal solution of eMDP (Section 2) before describing the eMDP model (Section 4). Also, it is unclear from my first reading pass what is the actual IL procedure. Including a pseudo-code would help. + +- The state-value function’s error bound ignores a policy. The proof utilizes the state distribution P(s’|s). However, this distribution should depend on a policy. It is unclear to me what is the policy function used in this error bound. 
Does this error-bound hold for any policy or only for the optimal policy? In the case that it only holds for the optimal policy, this result still does not provide useful guarantees when the optimal policy is not learned, similarly to the result in Proposition 2.1. + +- Empirical evaluation lacks strong baseline methods. The paper compares the proposed method against behavior cloning, which is known to perform poorly under limited data. The paper requires stronger baseline methods such as those mentioned in Section 3. Also, it is strange that the vertical axis in Figure 2 represents the reward function of eMDP, which is an artificial quantity in Eq. (5). The results should present the reward function of the target MDP, which is the ground-truth reward of the task. + +* Besides the major comments, I have few minor comments and questions: +- Does the reward function in Eq. (5) correspond to the Wasserstein distance for any metric d_s? If it is not, then the value function’s error-bound for the Wasserstein distance is not applicable to this reward function. +- What is the reward function r(s) in the value function’s error-bound that we can control the Lipschitz constant? We do not know the ground-truth reward r(s,a) so it cannot be controlled. The reward r(s,s’) of eMDP is given by the metric d_s of the state space so again we cannot control it. +- What is the metric d_s used in the experiments? +- How do you perform BC without expert actions in the experiments? +- Confusing notations. E.g., the function r is used for r(s,a) (MDP’s reward), r(s, s’) (eMDP’s reward), and r(s) (reward in the error bound’s proof); these r’s have different meaning and should be denoted differently. +- Typos. E.g., “Lemma 2.1” in Section 5 should be “Proposition 2.1”, “Lemma A.1” should be “Proposition 2.1”. ""tildeF"". etc. + +*** After authors' response. +I read the other reviews and the response. My major concern was the impracticality of eMDP. While I can see that the eMDP model provides advantages over BC, I still think that it is impractical since it assumes the agent cannot influence the unresponsive part. Moreover, extracting next-unresponsive states as-is from demonstrations is problematic with stochastic transitions, since actual next-unresponsive states can be different from next-unresponsive states in demonstrations even when the sames state and actions are provided. Given that the major concern remains, I still vote for weak rejection. +",3,,ICLR2020 +HJlZFxF_3m,1,rkeqCoA5tX,rkeqCoA5tX,"contribution, but experiments lacking","Quality is good, just a handful of typos. +Claritys above average in explaining the problem setting. +Originality: scan refs... +Significance: medium +Pros: the authors develop a novel GAN-based approach to denoising, demixing, and in the process train generators for the various components (not just inference). Further, for inference, the authors propose an explicit procedure. It seems like a noveel approach to demixing which is exciting. +Cons: The experiments do not push the limits of their method. It's difficult to judge the demixing 'power' of the method because it's difficult to tell how hard the problem is. Their method seems to easily solve it (super low MSE). The classification measure is clearly improved by denoising, which is totally unsurprising-- There should definitely be comparison with other denoising methods. + +In general, they don't compare to any other methods. 
Actually in the appendix, comparisons are provided for a basic compressive sensing problem, but their only comparator is ""LASSO"" with a ""fixed regularization parameter"", and vanilla GAN. Since the authors ""main contribution"" (their words) is demixing, I'm surprised that they did not compare with other demixing approaches, or try on a harder problem. Could you give some more details about the LASSO approach? How did you choose the L1 parameter? + +I have another problem with the demixing experimental setting. On one hand, both the sinusoids and MNIST have ""similar characteristics"" in the sense that they are both pretty sparse, basically simple combinations of primary curves. This actually makes the problem harder for a dictionary learning approach like MCA (referenced in your paper). On the other hand, both signals are very simple to reconstruct. For example, what if you superimposed the grid of digits onto a natural image? Would you be able to train the higher resolution GAN to handle a more difficult setting? The other demixing setting of adding 1's and 2's has a similar problem. + +The authors need to provide (R)MSE results that show how well the method can reconstruct mixture components on average over the dataset. The only comparison is visual, and no comparators are provided. + +Conclusions: +I'm actually torn on this paper. On one hand this paper seems novel and clearly contributes to the field. On the other hand, HOW MUCH contribution is not addressed experimentally, i.e. the method is not properly compared with other denoising or demixing methods, and definitely not pushed to its limits. It's hard to assess the difficulty of the denoising problem because their method does so well, and it's hard to assess the difficulty of demixing because of the lack of comparators. + +Caveats: +I am knowledgeable about iterative optimization approaches to denoising and demixing, especially MCA (morphological component analysis), but *not knowledgeable about GAN-based approaches*, though I have familiarity with GANs. + +********************* +Update after author response: +I think the Fashion-MNIST experiments and comparisons with ICA are many times more compelling than the original experiments. I think this is an exciting contribution to dually learning component manifolds for demixing.",7,4.0,ICLR2019 +SJec_a682X,2,r1g5b2RcKm,r1g5b2RcKm,Marginally above acceptance threshold,"The paper proposes a multi-layer pruning method called MLPrune for neural networks, which can automatically decide appropriate compression ratios for all the layers. It firstly pre-trains a network. Then it utilizes K-FAC to approximate the Fisher matrix, which in turn approximates the exact Hessian matrix of training loss w.r.t model weights. The approximated Hessian matrix is then used to estimate the increment of loss after pruning a connection. The connections from all layers with the smallest loss increments are pruned and the network is re-trained to the final model. + +Strength: +1. The paper is well-written and clear. +2. The method is theoretically sound and outperforms state-of-the-art by a large margin in terms of compression ratio. +3. The analysis of the pruning is interesting. + +Weakness: +*Method complexity and efficiency are missing, either theoretically or empirically.* +The main contribution claimed in the paper is that they avoid the time-consuming search for the compression ratio for each layers. However, there are no evidences that the proposed method can save time. 
As the authors mention, AlexNet contains roughly 61M parameters. On the other hand, the two matrices A_{l-1} and DS_l needed in the method for a fully-connected layer already have size 81M and 16M respectively. Is this only a minor overhead, especially when the model goes deeper? + +Overall, it is a good paper. I am inclined to accept, and I hope that the authors can show the complexity and efficiency of their method. +",6,4.0,ICLR2019 +HJVMLcbVe,2,H1Fk2Iqex,H1Fk2Iqex,A work to try to structure a deep network,"Pros: +- Introduction of a nice filter banks and its implementation +- Good numerical results +- Refinement of the representation via back propagation, and a demonstration that it speeds up learning + +Cons: +- The algorithms (section 3.1) are not necessary, and they even affect the presentation of the paper. However, a source code would be great! +- The link with a scattering transform is not clear +- Sometimes (as mentionned in some of my comments), the writing could be improved. + +From a personal point of view, I also believe the negative points I mention can be easily removed.",6,5.0,ICLR2017 +DLpuw85e9mh,3,Qe_de8HpWK,Qe_de8HpWK,GenQu: A Hybrid Framework for Learning Classical Data in Quantum States,"The paper claims to introduce a new quantum machine learning framework called GenQu. However, the description of the framework very vague (using classical computers to optimize the parameters of a fixed quantum circuit), and hardly novel. In fact, the same basic ideas are so well-known in the community that they are described in detail as usage examples for popular quantum computing platforms such as Qiskit and IBM Q. +The only remotely nontrivial part of the paper is contained in Section 4.2 about Quantum Deep Learning, where the authors consider the MNIST data set. Upon closer inspection it turns out that they use PCA to reduce the dataset to 4 dimensions, which is in turn used to train a ""quantum neural network"" to perform binary classification (i.e. to discriminate between '0'-instances and '5'-instances). The authors claim that such a quantum classifier provides an advantage versus a convolutional neural network in terms of + 1. the number of training epochs (while ignoring the time needed to perform PCA), and + 2. the number of parameters (while ignoring the parameters needed to describe the principal components). +Additionally, no confidence intervals are visible on Fig. 7, which suggests that the data might have been obtained from a single experimental run. Finally, there are several instances of sloppy writing, such as the inconsistent usage of math mode for variables, the statement P(|\phi>) = |0>, the typo ""iWs"" instead of is, etc.",2,5.0,ICLR2021 +H1lDy33g6m,3,HJGkisCcKm,HJGkisCcKm,"promising results, well-written","A method is presented to modify a music recording so that it sounds like it was performed by a different (set of) instrument(s). This task is referred to as ""music translation"". To this end, an autoencoder model is constructed, where the decoder is autoregressive (WaveNet-style) and domain-specific, and the encoder is shared across all domains and trained with an adversarial ""domain confusion loss"". The latter helps the encoder to produce a domain-agnostic intermediate representation of the audio. + +Based on the provided samples, the translation is often imperfect: the original timbre often ""leaks"" into the output. 
This is most clearly audible when translating piano to strings: the percussive onsets of the piano (due to the hammers hitting the strings) are also present in the translated audio, even though instruments like the violin and the cello are not supposed to produce percussive onsets. This gives the result an unusual sound, which can be interesting from an artistic point of view, but it is undesirable in the context of the original goal of the paper. + +Nevertheless, the results are quite impressive and for some combinations of instruments/styles it works surprisingly well. The question of whether the approach is equivalent to pitch estimation followed by rendering with a different instrument is also addressed in the paper, which I appreciate. + +The paper is well written and the related work section is comprehensive. The experimental evaluation is thorough and extensive as well (although a few potentially interesting experiments seemingly didn't make the cut, see other comments). I also like that the authors went through the trouble of doing some experiments on a publicly available dataset, to facilitate reproduction and future comparison experiments. + + +Other comments: + +* ""autoregressive"" should be one word everywhere + +* In section 2 it is stated that attempts to use a unified decoder with style/instrument conditioning all failed. I'm curious about what was tried specifically, it would be nice to discuss this. + +* The same goes for experiments based on VQ-VAE, the paper simply states that they were not able to get this working, but not what experiments were run to come to this conclusion. + +* The authors went through the trouble to modify the nv-wavenet inference kernels to support their modified architecture, which I appreciate -- will the modified kernels be made available as well? + +* The audio augmentation by pitch shifting is a surprising ingredient (but according to the authors it is also crucial). Some more insight as to why this is so important (rather than simply stating that it is important) would be a welcome addition. + +* Section 3.2: ""out off tune"" should read ""out of tune"". + +* The formulation on p.7, 2nd paragraph is a bit confusing: ""AMT freelancers tended to choose the same domain as the source, regardless of the real source and the presentation order."" Does that mean they got it right every time? I suspect that is not what it means, but that is how I read it initially. + +* I don't quite understand the point of the semantic blending experiments. As a baseline, the same kind of blending in the raw audio space should be done, I suspect it would probably be hard to hear the difference. This is how cross-fading is already done in practice, and it isn't clear to me why this method would yield better results in that respect. The paper is strong enough without them so these could probably be left out.",8,4.0,ICLR2019 +prdz98BzGD4,4,NfZ6g2OmXEk,NfZ6g2OmXEk,Review [Updated],"**SUMMARY** + +The present work considers the problem of learning in procedurally generated environments. This is a class of simulation environments in which each individual environment is created algorithmically where certain environmental factors are varied in each instance (referred to as levels in this work). Learning algorithms in this setting typically use a fixed set of training and evaluation environments. The present work proposes to sample the training environments such that the learning progress of the agent is optimized. 
This is achieved by proposing an algorithm for level prioritization during training. The performance of the approach is demonstrated on the Procgen Benchmark and two MiniGrid benchmarks and the authors argue that their approach induces an implicit curriculum in sparse reward settings. + +**STRENGTHS** +- The general idea of prioritization for level sampling makes a lot of sense and is demonstrated to improve sample-efficiency for skill learning in procedurally generated environments. +- I also liked that the authors compared with a big variety of different scoring metrics. + +**WEAKNESSES** +- The intuition of ""greater discrepancy between expected and actual +returns, making $\delta_t$ a useful measure of the learning potential"" makes sense. The heuristic score also works well in practice. One limitation I see is that there is no theoretical justification for why the TD-error is a good predictor for learnability. +- This is maybe more an avenue for future work than an actual weakness but it seems to me that the algorithm is not making use of all potentially useful information. In each timestep, it only considers the last score achieved in a level. Maybe it would also be interesting to consider the full history of scores. My intuition is that levels in which agents were historically very slow to learn are maybe not as useful (or at least not useful at the moment). I.e., maybe in order to learn competing at such levels it is better to compete on other levels first? +- Is there, at least from a qualitative perspective, an explanation for why certain environments do not benefit as much from the proposed level sampling approach? + +**REPRODUCIBILITY** + +The work seems reproducible. Most of the information relevant for reproducibility is given in Appendices A & B. It would be great if the authors would also make the source code available. + +**CLARITY** + +Overall, I found the work to be very clearly written and have only minor questions/remarks: +- To what extent does the use of TD-errors potentially limit the type of learning algorithms that can be used in the context of the proposed framework. Computing the TD-error requires a value function. As I understand it, some RL algorithms never compute a value function. +- If I haven't overlooked it, there is no explanation of $c$ after eq. (4) while $C_i$ is explained earlier. Is $c$ simply the current episode? + + +**EVALUATIONS** + +The work is compared with several scoring function baselines using PPO. While the authors claim that the method is applicable to other RL agents, the evaluations do not show any results with other agent types. The authors mention several different benchmarks in that space. It would be interesting to know why particularly Procgen Benchmark and MiniGrid environments were chosen. + +It is also not clearm to me why PPO is used as the base agent. Was this for ease of implementation / its popularity? Wouldn't it make sense to use more recent agents to see the added benefit of the proposed approach. E.g., would V-MPO be applicable here? + +**NOVELTY / RELEVANCE** + +The work is very interesting and the authors make a compelling case that procedurally generated environments can benefit from a conscious sampling of the levels with regard to usefulness for learnability. + +I am not sure whether the claim ""Prioritized Level Replay induces an implicit curriculum, taking the agent gradually from easier to harder levels."" is fully valid. As I understand it, the hardest levels are also the most likely to be sampled. 
The force counteracting this to some extent is the staleness-based sampling term $P_C$. For a gradual curriculum, I would expect $P_S$ to be designed such that it does not choose the hardest level but the one promising the best learning outcome. Particularly in the early stages of the training, the hard levels might be less useful than levels of medium difficulty. + +**SUMMARY** + +I found that paper very interesting. While I am not working in the particular subfield of the work and cannot sufficiently judge relation with prior works, I can confidently say that the idea and implementation details were conveyed very well. My main concerns are regarding the understanding of the ""failure cases"" and to what extent the graduality claim applies. That being said, I believe this line of work to be really interesting and to have a lot of potential for improved sample-efficiency when training RL agents in algorithmically generated simulation environments. + +**POST-DISCUSSION UPDATE** + +I want to thank the authors for correcting my misunderstandings, answering my questions, and providing additional material. As a consequence of this, I have raised my score to ""Accept"". To answer your question about what would be needed for a higher score: For a strong accept recommendation, I would have expected a mix of several additional things such as a clear impact outside of own subfield, code availability at time of submission (to evaluate how easy it is to reproduce the results and re-use the code), or more additional theoretical justification (in the sense of new formal guarantees for at least certain aspects of the proposed method). While not directly working in this subfield, I still think this work is solid and worthy of publication.",7,3.0,ICLR2021 +rygRU5E5nm,2,HyGEM3C9KQ,HyGEM3C9KQ,"Promising modifications to the Differentiable Neural Computer (DNC) architecture, but needs stronger empirical evidence "," +Overview: +This paper proposes modifications to the original Differentiable Neural Computer architecture in three ways. First by introducing a masked content-based addressing which dynamically induces a key-value separation. Second, by modifying the de-allocation system by also multiplying the memory contents by a retention vector before an update. Finally, the authors propose a modification in the link distribution, through renormalization. They provide some theoretical motivation and empirical evidence that it helps avoiding memory aliasing. +The authors test their approach in the some algorithm task from the DNC paper (Copy, Associative Recall and Key-Value Retrieval), and also in the bAbi dataset. + + +Strengths: Overall I think the paper is well-written, and proposes simple adaptions to the DNC architecture which are theoretically grounded and could be effective for improving general performance. Although the experimental results seem promising when comparing the modified architecture to the original DNC, in my opinion there are a few fundamental problems in the empirical session (see weakness discussion bellow). + +Weaknesses: Not all model modifications are studied in all the algorithmic tasks. For example, in the associative recall and key-value retrieval only DNC and DNC + masking are studied. + +For the bAbi task, although there is a significant improvement (43%) in the mean error rate compared to the original DNC, it's important to note that performance in this task has improved a lot since the DNC paper was release. 
Since this is the only non-toy task in the paper, in my opinion, the authors have to discuss current SOTA on it, and have to cite, for example the universal transformer[1], entnet[2], relational nets [3], among others architectures that shown recent advances on this benchmark. +Moreover, the sparse DNC (Rae el at., 2016) is already a much better performant in this task. (mean error DNC: 16.7 \pm 7.6, DNC-MD (this paper) 9.5 \pm 1.6, sparse DNC 6.4 \pm 2.5). Although the authors mention in the conclusion that it's future work to merge their proposed changes into the sparse DNC, it is hard to know how relevant the improvements are, knowing that there are much better baselines for this task. +It would also be good if besides the mean error rates, they reported best runs chosen by performance on the validation task, and number of the tasks solve (with < 5% error) as it is standard in this dataset. + + +Smaller Notes. +1) In the abstract, I find the message for motivating the masking from the sentence ""content based look-up results... which is not present in the key and need to be retrieved."" hard to understand by itself. When I first read the abstract, I couldn't understand what the authors wanted to communicate with it. Later in 3.1 it became clear. + +2) page 3, beta in that equation is not defined + +3) First paragraph in page 5 uses definition of acronyms DNC-MS and DNC-MDS before they are defined. + +4) Table 1 difference between DNC and DNC (DM) is not clear. I am assuming it's the numbers reported in the paper, vs the author's implementation? + +5)In session 3.1-3.3, for completeness. I think it would be helpful to explicitly compare the equations from the original DNC paper with the new proposed ones. + +-------------- + +Post rebuttal update: I think the authors have addressed my main concern points and I am updating my score accordingly. ",7,5.0,ICLR2019 +SkYyvPyWf,3,BkCV_W-AZ,BkCV_W-AZ,"The paper provides a game-theoretic inspired variant of policy-gradient algorithm based on the idea of counter-factual regret minimization. The paper claims that the approach can deal with the partial observable domain better than the standard methods. However the results only show that the algorithm converges, in some cases, faster than the previous work.","Quality and clarity: + +The paper provides a game-theoretic inspired variant of policy-gradient algorithm based on the idea of counter-factual regret minimization. The paper claims that the approach can deal with the partial observable domain better than the standard methods. However the results only show that the algorithm converges, in some cases, faster than the previous work reaching asymptotically to a same or worse performance. Whereas one would expect that the algorithm achieve a better asymptotic performance in compare to methods which are designed for fully observable domains and thus performs sub-optimally in the POMDPs. + +The paper dives into the literature of counter-factual regret minimization without providing much intuition on why this type of ideas should provide improvement in the case of partial observable domain. To me it is not clear at all why this idea should help in the partial observable domains beside the argument that this method is designed in the game-theoretic settings which makes no Markov assumption . 
The way that I interpret this algorithm is that by adding A+ to the return the algorithm introduces some bias for actions which are likely to be optimal so it is in some sense implements the optimism in the face of uncertainty principle. This may explains why this algorithm converges faster than the baseline as it produces better exploration strategy. To me it is not clear that the boost comes from the fact that the algorithm deals with partial observability more efficiently. + + +Originality and Significance: + +The proposed algorithm seems original. However, as it is acknowledged by the authors this type of optimistic policy gradient algorithms have been previously used in RL (though maybe not with the game theoretic justification). I believe the algorithm introduced in this paper, if it is presented well, can be an interesting addition to the literature of Deep RL, e.g., in terms of improving the rate of convergence. However, the current version of paper does not provide conclusive evidence for that as in most of the domains the algorithm only converge marginally faster than the standard ones. Given the fact that algorithms like dueling DQN and DDPG are for the best asymptotic results and not for the best convergence rate, this improvement can be due to the choice of hyper parameter such as step size or epsilon decay scheduling. More experiments over a range of hyper parameter is needed before one can conclude that this algorithm improves the rate of convergence. + ",5,5.0,ICLR2018 +dknl2hx5_CD,1,SOVSJZ9PTO7,SOVSJZ9PTO7,A joint KG and language pre-training model,"This work proposes a method for joint pre-training of knowledge graph and text data which embeds KG entities and relations into shared latent semantic space as entity embeddings from text. The proposed model JAKET consists of two main parts: a language module and a knowledge module. The model is pre-trained on a collection of tasks: entity category prediction, relation type prediction, masked token prediction and masked entity prediction. The proposed framework enables fine-tuning on knowledge graphs which are unseen during pre-training. + + +Overall, I believe that the work on knowledge-enhanced language models to be an interesting and important area of research. However, I believe the paper is not ready for publication in its current form as (i) the demonstrated improvements obtained by pre-training with the added KG module seem minor compared to the computational overhead of having to compute entity and relation embeddings using GNNs; and (ii) experimental comparison to some relevant prior work is missing. + +Questions/comments for the authors: + +1. One of the drawbacks of the proposed method is that it assumes entity descriptions to always be available, which might be the case for Wikidata, but it is not usually the case with e.g. standard knowledge graph completion datasets WN18 and FB15k. How would fine-tuning work on knowledge graphs that do not have entity descriptions? +2. What does M in RoBERTa+GNN+M stand for? Is it memory? +3. The improvements over a pure language model on few-shot relation classification and KGQA are minor, especially given the computational overhead that adding a KG module entails. The authors should include a discussion on computational overhead of having a KG module vs a pure language model. +4. The experimental comparison to existing knowledge-enhanced language models from Section 2 is missing. 
+",5,4.0,ICLR2021 +LHA7j68nHef,1,CR1XOQ0UTh-,CR1XOQ0UTh-,Good Submission,"Summary: + +This paper investigated how to sample informative/hard negative examples for self-supervised contrastive learning without label information. To tackle this challenge, this paper proposed an efficient tunable sampling distribution to select negative samples that are similar to the query when the true label or similarity information is not accessible. Positive-unlabeled learning is used to address the challenge of no label, while the importance sampling technique is used for efficient sampling. + +My main concerns are (1) how to distinguish hard negative examples and same-class samples as Fig.1 depicts, (2) how to sample $v$ in Eq.(4) to estimate the expectation. + +Pros: + +\+ This paper is the first to propose a hard negative sampling for unsupervised contrastive representation learning. + +\+ This paper provides a theoretical analysis of the proposed method that can tightly cluster similar inputs while departing the clusters from each other. + +\+ The proposed method can be easily implemented with few additional lines of code. + +\+ Experiments are conducted on several datasets (STL10, CIFAR100, CIFAR10). The proposed method works well even with a small number of negative samples. + +\+ This paper has a high writing quality. The paper is clearly written and well organized. The motivation, challenges, and contributions are clearly stated. + +\+ This paper shows plenty of visualized results to intuitively and detailedly show how the proposed method works. + +Cons: + +\- The experiments are conducted only on small datasets. Experiments on larger datasets, including Imagenet-1k and Imagenet-100 are not provided, especially the latter. The most related work, debiased [1], has conducted experiments on Imagenet-100. + + +Questions: + +It is not clear to me how to distinguish hard negative samples and same-class samples. Taking Figure 1 for example, how to distinguish 'oak' from other types of trees without labels? + +In Sec. 3.1, $\beta$ is used to meet Principle 2. Since $\beta $ is like the temperature in softmax for scaling and does not change the order/rank of $f(x)^Tf(x^-)$, it is not clear how using $\beta$ satisfies Principle 2. + +As shown in Eq.(4), estimating $E_{v\sim q_{\beta}^+}$ requires sampling $v$ from $q_{\beta}^+$. Are there any comments or ablations on the number of samples like Fig. 4(c) in [1]? + +The paper focuses on hard negative samples mining with a relatively small number of negative samples (i.e., less than 512). Since memory-bank/queue-based methods like MoCo [2] has a relatively large number of negative samples (i.e., 65536), is it possible to improve the performance of MoCo, or reduce the number of negative samples? + +References: + +[1] Ching-Yao Chuang, Joshua Robinson, Lin Yen-Chen, Antonio Torralba, and Stefanie Jegelka. De- biased Contrastive Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2020. +[2] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9729–9738, 2020. + + + + + + + + + + +******************************************* + +Final decision: + +I would keep my score unchanged. + +As for Principle 1, the authors said that upholding Principle 1 is impossible with no supervision, and they proposed to uphold Principle 1 approximately. 
This is acceptable to me as they build on ideas from positive-unlabeled learning. + +This paper is clearly written and well organized. Both empirical and theoretical analysis is provided. The feedback addressed my concerns well.",7,4.0,ICLR2021 +rJ_jlsdlf,3,ByaQIGg0-,ByaQIGg0-,This is a good deep learning application paper without sufficient algorithmic/theoretical novelty suitable for ICLR.,"1. This is a good application paper, can be quite interesting in a workshop related to Deep Learning applications to physical sciences and engineering +2. Lacks in sufficient machine learning related novelty required to be relevant in the main conference +3. Design, solving inverse problem using Deep Learning are not quite novel, see +Stoecklein et al. Deep Learning for Flow Sculpting: Insights into Efficient Learning using Scientific Simulation Data. Scientific Reports 7, Article number: 46368 (2017). +4. However, this paper introduces two different types of networks for ""parametrization"" and ""physical behavior"" mapping, which is interesting, can be very useful as surrogate models for CFD simulations +5. It will be interesting to see the impacts of physics based knowledge on choice of network architecture, hyper-parameters and other training considerations +6. Just claiming the generalization capability of deep networks is not enough, need to show how much the model can interpolate or extrapolate? what are the effects of regulariazations in this regard? ",4,5.0,ICLR2018 +U0DfeVsJ1e,3,zElset1Klrp,zElset1Klrp,"Weak accept: interesting idea, could have benchmarked against stronger baselines, needs revision for better reproducibility","### Update during review period + +- The reproducibility of the paper is now much better. It's great that the authors promised to release the LTA code. I hope that this includes the code for the experiments. +- Based on the above, I changed my review score to 7. + +### Summary + +The paper presents a novel activation function (Leaky Tiling Activation - LTA) to produce sparse activations, which have been found to stabilize learning in continual learning and RL settings. The new nonlinearity and its theoretical properties are described well, and the authors present convincing experiments demonstrating that the method yields practical benefits on synthetic datasets and RL games (e.g. Atari). + +### Reasons for score:  +  +The paper should certainly be published somewhere, but maybe in a workshop that focuses on continual learning or RL. + +While the proposed new activation function may be useful in some settings, there is not enough evidence in the paper that it would become a go-to solution, or significantly change the way we think about interference in continual learning and RL. In particular, the authors compare LTA variants of DQN, but it might have been useful to compare with e.g. Rainbow too. While I agree that interference is still an issue in modern RL with function approximation, it would be useful for potential users of LTA to know whether LTA provides benefits when used in conjunction with an existing state-of-the-art RL algorithm. Adding an experiment to this effect to the appendix would make this paper stronger. +  +### Pros +  +1. The paper addresses a key problem in continuous learning and RL: interference and catastrophic forgetting. It presents a novel method to combat this problem, and demonstrates its usefulness in a number of experiments. +2. The paper is generally well-written. +3. 
The experiments on synthetic data were compelling, and made for a very nice controlled experiment.   +  + +### Cons +  +1. The authors could have benchmarked against stronger baselines. The authors might be able to do this during the review process. +2. The precise set-ups that the authors used for their experiments should be described more clearly. As it is, the paper's work is not reproducible. See my the section ""Questions during rebuttal period"" below for details. + +### Questions during rebuttal period:  +  +Have the authors considered to benchmark using stronger (state-of-the-art) baselines? Someone who considers using LTAs would likely use a state-of-the-art method already, not DQN, which the authors benchmarked against. The question is then whether LTA yields benefits when used in conjunction with state-of-the-art methods, not when used in conjunction with DQN. + +I did not fully understand what architecture the authors used in their experiments. The architecture appears to be described mostly in this single sentence: “All the algorithms use a two-layer neural network, with the primary difference being the activation used on the last layer. ” +- Between what and what is this difference? Between baselines and the LTA experiments? +- Later on the page from which I quoted above, the authors mention that they experimented with DQN-like networks. The original DQN paper used 3 layers (https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf). Did the authors mean to write ““All the algorithms use a two-layer neural network”? +- Do the authors insert the LTA before the third layer? + +More generally, it would be helpful if the authors could describe the architectures used in more detail and maybe ask a colleague who is not yet familiar with the paper to review for clarity and reproducibility. +  +### Some suggestions + +I would rephrase the first sentence of the abstract in order to introduce “interference” in a gentler way. While interference is an active research topic in the RL community, many members of the ICLR community might wonder “what kind of interference”? A half-sentence like “where updates for some inputs degrade accuracy for others“ (copied from the paper’s introduction) could suffice here. + +In section 5.1, the authors have lines beginning with bolded “DQN”, “DQN-LTA”, “DQN-Large” et cetera. For readability’s sake, it might be useful to format these as bulleted lists. + +A small grammatical issue: “This issue is usually called gradient vanish”. Maybe rephrase this as “This issue is known as the vanishing gradient problem”. + +The authors list ReLUs as an example of an activation function with vanishing gradients. As vanishing gradient problems go, ReLU is a bit different from tanh and other nonlinearities that saturate, so I would avoid listing it here to avoid unnecessary debates. + +The authors write “Mnist” in a number of places. The correct spelling is “MNIST”: this is an acronym for “Modified National Institute of Standards and Technology”. +",7,4.0,ICLR2021 +yabqxVDP67L,1,9hgEG-k57Zj,9hgEG-k57Zj,An interesting work,"This paper considers the problem of policy learning in Markov Decision Process (MDP) from the combination of online and offline samples. The offline samples are generated by a behavior policy in the same MDP model, i.e., the behavior agent and the learning agent share the same state-action space. The learning procedure goes as follows. One first trains a MDP policy from the offline data; the online samples are then used to fine-tune the learned policy. 
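(To fix notation for the comments below, my reading of the setting is: the learner receives a static dataset $D_{\mathrm{off}}$ of transitions $(s, a, r, s')$ logged by the behavior policy, plus a limited budget of online interaction with the same MDP, and the quantity of interest is the expected discounted return $J(\pi) = \mathbb{E}_{\pi}[\sum_{t} \gamma^{t} r_{t}]$ of the fine-tuned policy.)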
+ +The authors propose a simple yet effective approach. First, the authors keep separate offline and online replay buffers, and carefully balance the number of samples from each buffer during updates. Then, multiple actor-critic offline RL policies are trained, and a single policy is distilled from these policies using ensemble methods. Experiment results show that the proposed method consistently outperforms state-of-art algorithms. + +This paper is clearly written and well organized. I am not sure about the novelty of the proposed method, since it seems to follow the line of carefully reweight online and offline samples. However, the experimental results show a significant improvement over existing methods. + +Question for the authors: +1. In practice, what is a good heuristic for selecting the initial fraction p0 of online samples? How sensitive is the learned policy w.r.t. the initial fraction p0?",6,2.0,ICLR2021 +r1xdde8nKH,1,B1e3OlStPB,B1e3OlStPB,Official Blind Review #1,"In this paper, CNNs specialized for spherical data are studied. The proposed architecture is a combination of existing frameworks based on the discretization of a sphere as a graph. As a main result, the paper shows a convergence result, which is related to the rotation equivalence on a sphere. The experiments show the proposed model achieves a good tradeoff between the prediction performance and the computational cost. + +Although the theoretical result is not strong enough, the empirical results show the proposed approach is promising. Therefore I vote for acceptance. + +The paper is overall clearly written. It is nice that the authors try to mitigate from overclaiming of the analysis. + +As a non-expert of spherical CNN, I don't understand clearly the gap between the result Theorem 3.1 and showing the rotation equivalence. It would be nice to add some counterexample (i.e., in what situation the proposed approach does not have rotational equivalence while Theorem 3.1 holds).",6,,ICLR2020 +h8KTGBTvXjz,1,SQ7EHTDyn9Y,SQ7EHTDyn9Y,"An interesting experimental analysis about neural network instability, but that which is not completely convincing","Summary of the paper: +The authors analyze the effect of sources of uncertainty on neural network performance. In particular, they consider the effect of parameter initialization, data shuffling, data augmentation, regularization, and the choice of deep learning libraries on network performance and show that all of these aspects have similar effects. Furthermore, the authors claim that these sources of uncertainty are all dependent on network weights, with even small changes in network weights drastically affecting the network’s performance. The authors use statistical measures such correlation between model predictions, change in performance with and without ensemble game models, and a state of the art method to characterize the functional behavior. Results are reported for image classification and language modeling . + +Positives: +1. The paper offers some interesting revelations such as : all sources of uncertainty have similar effects, which is surprising as the authors note, and hence a valuable insight. +2. The problem is well motivated, and the presentation is mostly clear. +3. Experiments have been conducted on diverse domains (image and language) to demonstrate the effectiveness of the proposed method. +Concerns: +1. Technical sophistication: +-As I understand, the goal is to be able to quantify the effect of various sources of non determinism on performance. 
Fundamentally, this seems like a causal attribution problem. While correlation based metrics can offer insight, it is not sufficient enough to establish causal claims. +-Moreover, the different sources of non determinism may be influencing each other. The authors in one of their protocols study the effect of each of these in a rather independent fashion, which makes it hard to estimate the influence of one source on another if any. It is therefore necessary to analyze all possible combinations of source variations for the conclusions made to hold true. +-Also, if one thinks of the problem as that of causal inference and imagines a DAG whose nodes are various sources of non determinism, then the sources which need to be controlled for will be provided by adjustment formulas. This is more concrete than just controlling for few sources as the authors propose because it is not guaranteed to remove all spurious correlations. +-As one of the ways to address the problem, the authors suggest leveraging snapshot ensembles. It is not clear how the non determinism that can arise in this model itself (e.g. choice of samples in the ensembles) does not affect the performance. + +2. Novelty +- One of the methods in the protocols is a state of the art method for functional analysis of neural networks, and the other two are are common measures. So, the contribution from a protocol perspective is not significantly novel. + + Minor Comment: +It would help to define non determinism formally early on in the paper. While the paper provides sufficient motivation and later on describes the various sources, it still helps to define the term in one sentence or two. + +Question to authors: +- Has the ordering of changing sources sequentially and simultaneously been analyzed? +- How are the sources of non determinism in the snapper ensembles overcome? Or are there no such sources? + +Overall comments: +The paper offers some interesting insights via experiments. Some aspects about the protocol metrics are intuitive as the authors explain, despite this, a theoretical analysis to back the experimental findings would have made the paper stronger. This is because it is hard to be convinced that the process of attribution can be established entirely based on statistical measures. The relationships between various sources and the graphs describing their dependencies have to be analyzed in determining the sources that need to be adjusted for in determining causal effects. ",5,3.0,ICLR2021 +BJlBFC6EjQ,1,SkMON20ctX,SkMON20ctX,Results of questionable value,"The paper tries to describe SGD from the point of view of the distribution p(y',y) where y is (a possibly corrupted) true class-label and y' a model prediction. Assuming TV metric of probabilities, a trajectory is defined which fits to general learning behaviour of distributions. + +The issue is that the paper abstracts the actual algorithm, model and data away and the only thing that remains are marginal distributions p(y) and conditional p(y'|y). At this point one can already argue that the result is either not describing real behavior, or is trivial. The proposed trajectory starts with a model that only predicts one-class (low entropy H(y') and high conditional entropy) and ends with the optimal model. the trajectory is linear in distribution space, therefore one obtains initially a stage where H(y') and H(y'|y) increase a lot followed by a stage where H(y'|y) decrease. 
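(For concreteness, the quantities I refer to below are the prediction entropy $H(y')$ and the conditional entropy $H(y'|y) = -\sum_{y} p(y) \sum_{y'} p(y'|y) \log p(y'|y)$ of the joint $p(y', y)$; as far as I can tell, the proposed trajectory is essentially a straight-line interpolation, with respect to total variation, between the initial predictor and the optimal one.)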
+ +This is known to happen, because almost all models include a bias on the output, thus the easiest way to initially decrease the error is to obtain the correct marginal distribution by tuning the bias. Learning the actual class-label, depending on the observed image is much harder and thus takes longer. Therefore no matter what algorithm is used, one would expect this kind of trajectory with a model that has a bias. + +It also means that the interesting part of an analysis only begins after the marginal distribution is learned sufficiently well. and here the experimental results deviate a lot from the theoretical prediction. while showing some parabola like shape, there are big differences in how the shapes are looking like. + +I don't see how this paper is improving the state of the art, most of the theoretical contributions are well known or easy to derive. There is no actual connection to SGD left, therefore it is even hard to argue that the predicted shape will be observed, independent of dataset or model(one could think about a model which can not model a bias and the inputs are mean-free thus it is hard to learn the marginal distribution, which might change the trajectory) + + Therefore, I vote for a strong reject.",2,4.0,ICLR2019 +rylUZPZVl,1,BJYwwY9ll,BJYwwY9ll,Review,"The work presented in this paper proposes a method to get an ensemble of neural networks at no extra training cost (i.e., at the cost of training a single network), by saving snapshots of the network during training. Network is trained using a cyclic (cosine) learning rate schedule; the snapshots are obtained when the learning rate is at the lowest points of the cycles. Using these snapshot ensembles, they show gains in performance over a single network on the image classification task on a variety of datasets. + + +Positives: + +1. The work should be easy to adopt and re-produce, given the simple techinque and the experimental details in the paper. +2. Well written paper, with clear description of the method and thorough experiments. + + +Suggestions for improvement / other comments: + +1. While it is fair to compare against other techniques assuming a fixed computational budget, for a clear perspective, thorough comaprisons with ""true ensembles"" (i.e., ensembles of networks trained independently) should be provided. +Specificially, Table 4 should be augmented with results from ""true ensembles"". + +2. Comparison with true ensembles is only provided for DenseNet-40 on CIFAR100 in Figure 4. The proposed snapshot ensemble achieves approximately 66% of the improvement of ""true ensemble"" over the single baseline model. This is not reflected accurately in the authors' claim in the abstract: ""[snapshot ensembles] **almost match[es]** the results of far more expensive independently trained [true ensembles]."" + +3. As mentioned before: to understand the diversity of snapshot ensembles, it would help to the diversity against different ensembling technique, e.g. (1) ""true ensembles"", (2) ensembles from dropout as described by Gal et. al, 2016 (Dropout as a Bayesian Approximation).",8,3.0,ICLR2017 +DfVvXj7CZtn,2,iqmOTi9J7E8,iqmOTi9J7E8,feasibility of the proposed method in deep neural networks,"This paper tackles a timely problem of privacy leakage on the edge devices when applying deep neural networks. Instead of mitigating the leakage of a set of private attributes, the proposed method tries to remove the information irrelevant to the primary task. 
The proposed method does not need to identify the private attributes. The main contribution of this paper is the two proposed approaches for removing “null content” and “signal content.” The evaluations of the proposed approach are conducted on four image datasets. + +Pros: +1. The idea of removing irrelevant information instead of private attributes is an interesting idea. +2. The paper is well organized and well written. +3. The experimental evaluation is comprehensive. Feature pruning and adversarial training are included in the evaluation. + +Cons: +1. The key concern about the paper is the feasibility of the proposed methods in deep neural networks. Both proposed feature-removing methods are derived from a single linear layer. However, in many cases and even shown in the evaluation, the device side may process more than one layer of neural networks. In addition, the convolution layer is often deployed as the first layer in neural networks. It would be great if the proposed methods can be extended to multiple layers and multiple types of neural networks. +2. The adversary uses the same architecture in the paper. However, the adversary can choose to use a more complex model to extract the privacy attributes in the evaluation. The different architecture may cause the failure of the proposed methods. It would be nice if more adversarial models can be evaluated in the paper. +3. In Figure 4, the proposed methods do not perform well in balancing the utilities and privacy achieved. It is hard to tell if the better tradeoffs are due to the deeper layers or fully connected layers. From Figure 4, it seems the proposed methods do not perform well on the convolutional layers. +4. The experiments only evaluate on a six-layer neural network, which is not a “deep network” claimed in the title. It would be great if the paper can evaluate the performance on other architectures and deeper models. +5. The algorithms in Figure 2 are hard to understand. + + +Minor comments: +1. In Equation 2 it should be “M2 * M1” instead of “M1 * M2” +2. Page 6 Figure 4 shows that the information leakage can be controlled using the following factors “factors” +",5,5.0,ICLR2021 +#NAME?,3,K5YasWXZT3O,K5YasWXZT3O,This paper provides a unified framework for solving a bunch of issues in ERM. ,"This paper considers a unified framework named TERM for addressing a bunch of problems arising in the simple averaged empirical minimization. By leveraging the key hyper-parameter t in the TERM loss, it can recover the original average loss and approximate robust loss, min/max loss, and the superquantile loss, etc. The authors also propose gradient-based optimization algorithms for solving the TERM problem. + +One thing that I do not understand very well is the paragraph under Lemma 1. Why is it necessary that outliers can cause a large (positive t) or small (negative t) losses? Note that outliers can be arbitrary, say adversarial. + +Also, do you have numerical issues for large enough t? + +Is it possible to show certain convergence results of the algorithms for solving the TERM? Especially, the TERM has the nice property that it is always smooth (depending on the value of t). + +Overall, the TERM seems to be a good unification of different losses used in machine learning society for different purposes. The theoretical justifications also look reasonable and informative. 
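For readers less familiar with the framework, my understanding is that the tilted objective in question is $\tilde{R}(t;\theta) = \frac{1}{t}\log\big(\frac{1}{N}\sum_{i=1}^{N} e^{t\,\ell(x_i;\theta)}\big)$, which recovers the standard average loss as $t \to 0$ and approaches the max-loss and min-loss as $t \to +\infty$ and $t \to -\infty$ respectively, so a single hyper-parameter indeed interpolates between the objectives mentioned above.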
In addition, the authors conduct a series of experiments to show the good performance of TERM for different tasks such as robustness to outliers, handling imbalance, and improving generalization, etc. +",6,3.0,ICLR2021 +VaqLW4BhWcR,2,vLaHRtHvfFp,vLaHRtHvfFp,Another generative model for videos with a slow feature loss,"The authors present a generative model for videos where the latent trajectories have two components - a term without a slowness loss that represents ""content"" and a term with a slowness loss that represents ""style"". They present results on a dataset simulating the wave equation and on videos of moving MNIST digits and 3D chairs. The results are generally good, especially for long roll-outs, and they demonstrate something like disentangling by showing that the identities of the digits can be swapped in the moving MNIST data. + +My main objection with the paper is that it has nothing to do with PDEs or separation of variables. The actual latent trajectories are simulated as *ODEs*, not PDEs, which are then used to generate images. The justification in terms of separation of variables is also a bait-and-switch...a slowness penalty is added to the loss for one of the latent trajectories, that is all. Factored latent trajectories are a well-established modeling technique for time series already. The use of ODEs parameterized by neural networks rather than, say, LSTMs or other RNN architectures is also more a difference in degree than in kind from other sequence models. Much like how ResNets become neural ODEs in the limit of very deep networks, simply using Runge-Kutta updates parameterized by a neural network is still technically a kind of RNN (if you define RNN very loosely as any nonlinear iterated function learnable by gradient descent). So I'm not sure that this paper actually does most of what it claims to be doing in the motivation.",5,4.0,ICLR2021 +SJgepihG9H,2,HyxehhNtvS,HyxehhNtvS,Official Blind Review #2,"This paper looks at the neural net training problem in a ""canonical space"" which is parameterized by the Fourier coefficients of the function. This canonical space is a bijection of the function space L^1([0, 1]^K), and if we allow an epsilon approximation of the function we can truncate the Fourier coefficients so that the canonical space is finite-dimensional. The paper shows that in the canonical space, the training problem is always convex. Going back to the literal space (original parameter space for a neural network), it is shown that as long as a ""disparity matrix"" remains full rank, gradient descent will converge to a global minimum. + +I don't think this paper has anything new or non-trivial. I also don't think it's helpful to look at the canonical space proposed in the paper. In particular, it just transfers the difficulty of the problem into a disparity matrix, which we actually don't have control over. The paper claims that the matrix can be made full rank. This is not correct. Maybe one can prove it's full rank at random initialization, but I don't see how to prove this throughout training. The authors would need to provide a rigorous proof in order to claim this. + +In fact, in non-convex optimization it's easy to arrive at a scenario where you ""only"" need some matrix to remain full rank in order to prove convergence to global minimum. One such example is the recent series of work on neural tangent kernel (NTK). 
There, as long as the NTK matrix stays full rank (actually, one needs eigenvalues bounded away from 0), one can show convergence to global minimum. However, to actually show this, one needs to apply stringent assumptions on the neural network architecture and to devote dozens of pages to the proof.",1,,ICLR2020 +HkG_DfE0r24,3,SUyxNGzUsH,SUyxNGzUsH,"This paper present a novel NMN approach to solving video-grounding lanuguage tasks, which decompses all language into entity references and detect corresponding action-based visual feature, then instantiate NMN with those inputs to get the final response.","The author present a novel neural moudalar network for video grounding tasks, which can provide interpretable intermediate reasoing outcomes and show the model robustness. +This model achieves competitive results on AVSD datasets and state-of-the-art performance on TGIF-QA datasets, which demonstrates the effectiveness of the model design. + +Detailed comments are listed in the following +• The novelity of the NMN is limited in this paper. The similari idea have been used in many previous literatures. I am wondering that how you define the modular space? Is there any prinpicle guidelines to design module like ""find, summarize, when, describe""? +• The reasoning struture in this papar is simple. The module ""find, when, where"" are more like signal detectors. There is no reasoning structure for how to get the final response. (in this paper, just fuse the detected information to get the final answer by a response decoder). So this methods cannot reveal the inner correlation between final response and detected visual/language entities. +• [Question] How do you train the program generation tasks from language? Is there any groundtruth program structure annotation to supervise this? How do you determine the hyper-parameters \alpha and \beta? +• The paper is written pooly and some expressions are confusing, like ""Different from..., our model are trained to fully genrate the parameters of components in the text"". The parameter here refer the input of each module, which is different from model parameters.",4,4.0,ICLR2021 +WczXyHM2dzC,3,OqtLIabPTit,OqtLIabPTit,Interesting Set of Experiments but Insufficient Clarity and Evaluations,"**Overview:** The paper presents experiments showing that the contrastive learning losses produce better embeddings or feature spaces than those produced by using binary cross-entropy losses. The experiments show that embeddings learned using contrastive learning losses seem to favor long-tailed learning tasks, out-of-distribution tasks, and object detection. The paper also presents an extension of the contrastive loss to improve the embeddings. The experiments in the paper use common and recent long-tail datasets as well as datasets for object detection and out-of-distribution tasks. + +**Pros**: +*Interesting problem and approach*. I think the paper tackles a hard and important problem, i.e., learning from a long-tailed dataset. Overall, I think that learning a feature space improving the learning from these imbalanced datasets is an interesting idea. + +*Clarity of the paper*. Overall the clarity of the paper is good. The motivation is clear and the narrative is clear overall. However, I think the clarity in the experiments is insufficient, see below. + +**Cons**: +*Insufficient clarity in the experiments*. I have several concerns with the experiments: +1. The *balancedness* metric in Eq.(3) may not be a robust metric for measure performance. 
The reason I am not convinced about this metric is that if the accuracies of the classifier are low but equal, then the metric will say that the *balancedness* is good. I think a good metric for a classifier learning from an imbalanced dataset is one that indicates if the overall accuracy is high, maintains the many-shot the classification accuracy high, and increases the accuracy of the classes in the tail. I think this metric does not indicate if the overall accuracy is high. + +2. I am not convinced about how classifiers are trained in experiments in Sec. 3.2. The paper trains and tests using a balanced set after learning a feature space. To my understanding, the challenge of learning from a long-tailed dataset is to test whether a classifier can generalize well for classes with few training samples while maintaining a good performance on classes with more training examples. Thus, by training a linear classifier with a balanced set using the learned feature space does not really comply with learning from an imbalanced dataset. I think if the experiments would've been stronger if they included results of a trained linear classifier on the learned embedding and still showing good results, then I would be more convinced about the impact of a contrastive loss. From the practical point of view, what matters is the classifier performance. In practice, it is challenging to have a balanced dataset as the paper used. The main question is about the performance when training a linear classifier from a long-tailed dataset using the learned representation. + +3. Datasets derived from ImageNet-LT. While I value the goal of using different datasets varying the imbalance in a dataset, I am not convinced that ImageNet-LT is the dataset to use. The reason is that ImageNet-LT is a synthetic long-tailed dataset. In fact, while the dataset shows imbalance, it does not necessarily follow a power-law distribution. I think the generation of these datasets in the experiments should be done using a power-law distribution. From the text, it is unclear how the datasets from ImageNet-LT were generated for the experiments. + + +Minor concerns: +*Plots lack information*. What is the y-scale in Fig. 2? The figure is missing y-scale information and it is hard to interpret the gap in accuracy between CE and CL in the left plot. Same comment for Fig 4, what is the scale in y-axis?",5,5.0,ICLR2021 +1FjLPpFmBl_,4,CNA6ZrpNDar,CNA6ZrpNDar,"Review of ""On the Decision Boundaries of Neural Networks. A Tropical Geometry Perspective""","Summary: This work studies the decision boundaries of neural networks (NN) with piecewise linear (ReLU) activation functions from a tropical geometry perspective. Leveraging the work of [1], the authors show that NN decision boundaries form subsets of tropical hypersurfaces. This geometric characterization of NN decision boundaries is then leveraged to better understand the lottery ticket hypothesis, and prune deep NNs. The authors also allude to the use of tropical geometric perspectives on NN decision boundaries for the generation of adversarial samples, but do not explicitly discuss it in any detail within the main text of the paper. + +Strengths: ++ The paper is insightful and novel. ++ Tropical geometry (TG) promises to be a particularly convenient language to study (ReLU) DNNs, and this work does a good job of showcasing the versatility afforded by a principled, geometric approach in improving our understanding of DNNs. 
++ The efforts put into making this paper accessible to readers unfamiliar with TG are also worth appreciating. + +Weaknesses: +- The paper perhaps bites off a little more than it can chew. It might be best if the authors focused on their theoretical contributions in this paper, added more text and intuition about the extensions of their current bias-free NNs, fleshed out their analyses of the lottery ticket hypothesis and stopped at that. + +- The exposition and experiments done with tropical pruning need more work. Its extension to convolutional layers is a non-trivial but important aspect that the authors are strongly encouraged to address. This work could possibly be written up into another paper. Similarly, the work done towards generating adversarial samples could definitely do with more detailed explanations and experiments. Probably best left to another paper. + +Contributions: The theoretical contributions of the work are significant and interesting. The fact that the authors have been able to take their framework and apply it to multiple interesting problems in the ML landscape speaks to the promise of their theory and its resultant perspectives. The manner in which the tropical geometric framework is applied to empirical problems however, requires more work. + +Readability: The general organization and technical writing of the paper are quite strong, in that concepts are laid out in a manner that make the paper approachable despite the unfamiliarity of the topic for the general ML researcher. The language of the paper however, could do with some improvement; Certain statements are written such that they are not the easiest to follow, and could therefore be misinterpreted. + +Detailed comments: +- While there are relatively few works that have explicitly used tropical geometry to study NN decision boundaries, there are others such as [2] which are similar in spirit, and it would be interesting to see exactly how they relate to each other. + +- Abstract: It gets a little hard to follow what the authors are trying to say when they talk about how they use the new perspectives provided by the geometric characterizations of the NN decision boundaries. It would be helpful if the tasks were clearly enumerated. + +- Introduction: “For instance, and in an attempt to…” Typo – delete “and”. Similar typos found in the rest of the section too, addressing which would improve the readability of the paper a fair bit. + +- Preliminaries to tropical geometry: The preliminaries provided by the authors are much appreciated, and it would be incredibly helpful to have a slightly more detailed discussion of the same with some examples in the appendix. To that end, it would be a lot more insightful to discuss ex. 2 in Fig. 1, in addition to ex. 1. What exactly do the authors mean by the “upper faces” of the convex hull? The dual subdivision and projection $\pi$ need to be explained better. + +- Decision boundaries of neural networks: The variable ‘p’ is not explicitly defined. This is rather problematic since it has been used extensively throughout the rest of the paper. It would make sense to move def. 6 to the section discussing preliminaries. + +- Digesting Thm. 2: This section is much appreciated and greatly improves the accessibility of the paper. It would however be important, to provide some intuition about how one would study decision boundaries when the network is not bias-free, in the main text. In particular, how would the geometry of the dual subdivision $\delta(R({\bf x}))$ change? 
On a similar note, how do things change in practice when studying deep networks that are not bias free, given that, “Although the number of vertices of a zonotope is polynomial in the number of its generating line segments, fast algorithms for enumerating these vertices are still restricted to zonotopes with line segments starting at the origin”? Can Prop. 1 and Cor. 1 be extended to this case trivially? + +- Tropical perspective to the lottery ticket hypothesis: It would be nice to quantify the (dis)similarity in the shape of the decision boundaries polytopes across initializations and pruning using something like the Wasserstein metric. + +- Tropical network pruning: How are $\lambda_1, \lambda_2$ chosen? Any experiments conducted to decide on the values of the hyper-parameters should be mentioned in the main text and included in the appendix. To that end, is there an intuitive way to weight the two hyper-parameters relative to each other? + +- Extension to deeper networks: Does the order in which the pruning is applied to different layers really make a difference? It would also be interesting to see whether this pruning can be parallelized in some way. A little more discussion and intuition regarding this extension would be much appreciated. + +- Experiments: +- The descriptions of the methods used as comparisons are a little confusing – in particular, what do the authors mean when they say “pruning for all parameters for each node in a layer” Wouldn’t these just be the weights in the layer? +- “…we demonstrate experimentally that our approach can outperform all other methods even when all parameters or when only the biases are fine-tuned after pruning” – it is not immediately obvious why one would only want to fine-tune the biases of the network post pruning and a little more intuition on this front might help the reader better appreciate the proposed work and its contributions. +- Additionally, it might be an unfair comparison to make with other methods, since the objective of the tropical geometry-based pruning is preservation of decision boundaries while that of most other methods is agnostic of any other properties of the NN’s representational space. +- Going by the results shown in Fig. 5, it would perhaps be better to say that the tropical pruning method is competitive with other pruning methods, rather than outperforming them (e.g., other methods seem to do better with the VGG16 on SVHN and CIFAR100) +- “Since fully connected layers in DNNs tend to have much higher memory complexity than convolutional layers, we restrict our focus to pruning fully connected layers.” While it is true that fully connected layers tend to have higher memory requirements than convolutional ones, the bulk of the parameters in modern CNNs still belong to convolutional layers. Moreover, the most popular CNNs are now fully convolutional (e.g., ResNet, UNet) which would mean that the proposed methods in their current form would simply not apply to them. +- Comparison against tropical geometry approaches for network pruning – why are the accuracies for the two methods different when 100% of the neurons are kept and the base architecture used is the same? 
The numbers reported are à (100, 98.6, 98.84) +Tropical adversarial attacks: Given that this topic is not at all elaborated upon in the main text (and none of the figures showcase any relevant results either), it is strongly recommended that the authors either figure out a way to allocate significantly more space to this section, or not include it in this paper. (The idea itself though seems interesting and could perhaps make for another paper in its own right.) + +- References: He et al. 2018a and 2018b seem to be the same. + +[1] Zhang L. et al., “Tropical Geometry of Deep Neural Networks”, ICML 2018. +[2] Balestriero R. and Baraniuk R., “A Spline Theory of Deep Networks”, ICML 2018. +",7,3.0,ICLR2021 +NXPecoAyuyU,3,MuSYkd1hxRP,MuSYkd1hxRP,"This paper argues for using a single level objective on weight-sharing NAS, and proposes GAEA, which uses exponentiated gradient to update architecture parameters, to accelerate the convergence. The paper gives a proof to guarantee finite-time convergence. The experiment results show this method is efficient and can slightly improve the performance.","Pros: +1. This paper gives a proof of finite-time convergence, which is the first paper working on this. Besides, the paper gives corresponding analysis of ENAS and DARTS. This is a new perspective of NAS methods. +2. This work uses EG method to update architecture parameter, which takes the advantage of EG method and is reasonable to be applied on NAS problem. +3. The experiment results show the efficiency and effectiveness of the proposed method. + +Cons: +1. Using EG to update architecture parameters can only accelerate convergence, which has nothing to do with improving the NAS performance. It is still confusing that why single-level optimization can resolve rank disorder and poor performance. It is not clear that the slight improvement on the performance is due to your algorithm or accident. +2. Efficiency is claimed as an important point in this paper. However, only the results in Table 2 shows GAEA shorten the time cost. In Table 1, it cannot be detected that your method is more efficient. Is that because updating architecture parameters does not cost too much time in these experiment? If so, the contribution may be less. +3. In Figure 2, it is not easy to detect the performance difference between your method and the baseline. You should also explain the meaning of lines with deeper colors. +4. It is better to add discussion or conclusion at the end of your paper, which can help readers to better understand your work. + +Overall Review: +This paper gives a theory about the convergence time of NAS methods, which provides new perspectives on NAS problem. The paper find that EG method is appropriate on updating architecture parameters and this method can improve the efficiency of NAS problem. There are also some questions mentioned above in this paper. With some modifications, this paper could be an excellent paper. +",6,4.0,ICLR2021 +Bkxtl24c5r,3,Hke0K1HKwr,Hke0K1HKwr,Official Blind Review #4,"The paper looks at the problem of knowledge selection for open-domain dialogue. The motivation is that selecting relevant knowledge is critical for downstream response generation. +The paper highlights the one-to-many relations when selecting knowledge which makes the problem even more challenging. It tries to address this by taking into account the history of knowledge selected at previous turns. 
+The paper proposes a Sequential Latent Model which represents the knowledge history as some latent representation. From this methodology they select a piece of knowledge at the current turn and use it to decode an utterance. The model is trained in a joint fashion to learn which knowledge to select and on generating the response. As the two are strongly correlated. Additionally there is an auxiliary loss to help correctly identify if the knowledge was correctly selected. Additionally a copy mechanism is introduced to try to copy words from the knowledge during decoding. +The experiments are run on the Wizard of Wikipedia dataset where there are annotations for which knowledge sentence is selected and on Holl-E, where they transform the dataset to have a single sentence tied to a response. +For automatic metrics there is significant improvement over baselines for correctly selecting a piece of knowledge and generating a response. Additionally there is human evaluation that also shows significant improvement. Their model also seems to generalize well to domains that were not seen during training time over baselines models. + +The contribution of the paper is the novel approach to selecting knowledge for open-domain dialogue. This work is significant in that by improving knowledge selection we see a subsequent improvement in response generation quality which is the overall downstream task within this problem space. +I believe this paper should be accepted because of the significant and novel approach of modeling previous knowledge sentences selected. The linking of this knowledge selection model to topic tracking as stated in the paper is of clear importance, as ensuring topical depth and topical transition are two key aspects for open-domain dialog. + +Feedback on the paper +In Figure 3, please provide the knowledge sentence that was selected. +Please provide the inter-annotator agreement for human evaluation. +I think it would be interesting to see what is the copy mechanism actually adding in terms of integration of knowledge vs the WoW MemNet approach. Are those two truely comparable because one does not have copy? +For Related Work, also cite Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations + +Small grammatical errors +""Recently, Dinan et al. (2019) propose to tackle"" -> ""Recently, Dinan et al. (2019) proposed to tackle"" +""which subsequently improves the knowledge-grounded chit-chat."" -> ""which subsequently improves knowledge-grounded chit-chat."" + + +Some questions for the authors in terms of future direction +How is the performance of the model impacted with longer dialog context vs shorter? + +The Holl-E dataset was transformed from spans of knowledge to a single knowledge sentence. It would be interesting to see what happens when the knowledge selected is over multiple sentences. + +The knowledge pool currently consists of 67.57 sentences on average. How will this method scale as the amount of knowledge sentences grows? + + + + +",8,,ICLR2020 +Bkxc7hICKr,2,ryx1wRNFvB,ryx1wRNFvB,Official Blind Review #3,"The focus of this paper is on exploring non-normal initializations for training vanilla RNN for sequential tasks. They show on 3 different tasks, and a real-world LM task that non-normal initializations of vanilla RNNs outperform their orthogonal counter-parts when particular forms of initialization are considered. 
+ + +Although the results for sequence task do not outperform the gated counterparts, the authors present an interesting exploration of initializing non-normal RNNs that outperform the orthogonal counterparts. It is good to see this line of work being explored as an alternative to exploring more complex architectures with many more parameters than necessary for the task. + +Strengths: + 1. The paper explores non-normal RNNs and demonstrates on 3 synthetic tasks - copy, addition and pMNIST - how with careful initialization the proposed approach outperforms their orthogonal initialization counterpart. This line of experimentation is interesting as it potentially opens the door for more expressive modeling for sequential tasks by expanding the solution space of the weight matrices being learnt i.e orthogonal matrices are a special case. + 2. The authors do a great job in motivating the paper, and the explanation is clear and easily understandable. The toy simulations in Section2.2 really helps drive the reasoning behind why chain initialization improves over orthogonal initialization. + 3. Based on the insight from trained RNNs where the trained weights exhibit a chain like structure, the authors attempt to modify the LSTM gate initializations well. However, they do not see any specific gain by doing so, and moreover they show analysis that demonstrate that the LSTM gates do not learn these chain like structures. However, they do have insight into the regularities of these learnt weights which could potentially open the door for more interesting initialization methods for training such gated architectures. + + +Issues to be addressed in the paper: +1. The plots are quite small and hard to follow. Can the authors enlarge these so they span the entire page? Also, for pMNIST it would be good to provide accuracy scores as well as a function of the training epochs.Finally, it would be good to include a comparison against LSTMs (and even Transformer networks) so it is easier for the reader to see where these approaches stack against architecture changes. +2. The authors are missing a reference to this work - http://proceedings.mlr.press/v48/henaff16.pdf - which provides empirical analysis for the 3 synthetic tasks to test the ability of vanilla RNNs for solving long span sequential tasks. +3. What about stability of these non-normal RNNs? For example, if we perturb the inputs to the training for the LM task how much variance do we see in the performance of these models? +",8,,ICLR2020 +KA7OlaY4_q_,4,VVdmjgu7pKM,VVdmjgu7pKM,Genaralizing RIMs and GRUs by splitting schema memory vs. per-frame feature learning. Good idea. Needs work. ,"The motivation and the proposal for splitting the schema from the procedural (representational) block makes sense. This is a good idea. A the authors build on top of RIMs, which have shown reasonable ways to model dynamical systems. However the paper itself needs to be improved and we need to evaluate the model more before publication. + +Firstly, the proposal that SCOFF is a direct alternative for LSTM or GRU and showing that it beats them is not entirely correct. SCOFF comprises of a GRU with a sequence of CNNs operations i.e., its doing more than what a GRU does? What exactly is that? And so when proposing evaluations it is expected that compared to GRU, SCOFF does better (Fig 5). A more valid comparison would be GRU+some standard CNN-style feature learning vs. SCOFF. Its not entirely clear how to do this -- and needs thinking. 
Secondly, while on Fig 5, the errors cannot be reported as a ratio with respect to GRU because this would miss the true error values; and we cannot know if there is significant difference here (especially with RIMs.vs SCOFF). +Next, how is question 3 evaluated here? What downstream task is being considered? Am I missing something? +Next, how doe we interpret the error change here i.e., what does 0.1 change in error mean here for better understanding where things are changing? + +Presentation of the paper needs a lot of improvement: +Algorithm 1 in the Table needs better clarity. Firstly, a bulk of the notation in the SCOFF presentation is very confusing. Its hard to parse what is going on in each stage. The description in the steps is helpful but the motivation and sequence of operations within each step needs better explanation. ",5,3.0,ICLR2021 +5frSzxKhDB8,3,RLRXCV6DbEJ,RLRXCV6DbEJ,"Very good paper improving deep VAE performance beyond autoregressive models, ablation studies could further strengthen it","**summary** +the paper puts forward an idea that deep-enough VAE should perform at least as well as autoregressive models. Authors explore this in the context of image generation, and construct VAE model that is a generalisation of typical autoregressive architectures. They use several tricks to ensure stable training of very deep VAEs and show that final performance exceeds all autoregressive models. This experimentally supports their claim that very deep VAEs encompass autoregressive models. + +**pros** +The idea of perceiving VAE architectures as strictly more powerful and potentially efficient is very appealing. Given the recent work on improving deep VAE training(like Vahdat & Kautz (2020)) this paper takes another step in this direction by effectively, as it seems from the text, removing the depth limitation for training such VAEs. The tricks used to stabilise training are pretty ad hoc, but their effectiveness, showed experimentally, is important in advancing the field. + + +**cons** +* The main criticism I have is around ablation studies that justify the proposed architecture choices and training stabilisation tricks, as well as comparison to other tricks in the literature (e.g. Vahdat & Kautz (2020)). Of course the positive result speaks for itself, but the paper would be even more convincing with some details on the exploration that led to the final model. + + + +**questions** +* it would be good to clarify in the text how exactly sampled latent variables from lower layers are decoded into the images to produce Fig. 4: is the idea to pass those latents down the top-bottom path and just not add new latents in the node ""+"" within the topdown block? +* In Section 5.2.1, it is unclear why models with 32x32 and 1024x1024 resolutions have equal number of parameters: is this because ResNet blocks used at different resolutions share parameters? +* Did the authors experiment with methods of slowing down the training of the prior, other then stopping it for the first half of training? It seems that exponentially averaging prior parameters might be another way of doing it, although the exponent will become another hyperparameter. + +**comments** +* Further investigating the relation between using NN interpolation in upsampling and having active latents in all layers would be very useful. +* I particularly enjoyed the perceptional shift that the paper advocates for, i.e. 
that VAE and autoregressive models are not competing approaches, but rather VAE is a more general one and it encompasses the latter.",8,4.0,ICLR2021 +SylAnad4om,1,H1fU8iAqKX,H1fU8iAqKX,Interesting work on matching CNN filters to Neurons,"The paper analyses the data collected from 6005 neurons in a mouse brain. Visual stimuli are presented and the responses of the neurons recorded. In the next step, a rotational equivariant neural network architecture together with a sparse coding read-out layer is trained to predict the neuron responses from the stimuli. Results show a decent correlation between neuron responses and trained network. Moreover, the rotational equivariant architecture beats a standard CNN with similar number of feature maps. The analysis and discussion of the results is interesting. Overall, the methodological approach is good. + +I have trouble understanding the plot in Figure 4, it also does not print well and is barely readable on paper. + +I have a small problem Figure 6 where ""optimal"" response-maps are presented. From my understanding, many of those feature maps are not looking similar to feature maps that are usually considered. Given the limited data available and the non-perfect modeling of neurons, the computed optimal response-map might include features that are not present in the dataset. Therefore, it would be interesting to compare those results with the stimuli used to gather the data. E.g. for a subset of neurons, one could pick the stimulus that created the maximum response and compare that to what the stimulus with the maximum response of the trained neuron was. It might be useful to include the average correlation of the neurons belong to each of the 16 groups(if there are any meaningful differences), especially as the cut-off of ""correlation 0.2 on the validation set"" is rather low. + +Note: I am not an expert in the neural-computation literature, I am adapting the confidence rating accordingly.",8,3.0,ICLR2019 +ZEA6s9titLI,4,JHx9ZDCQEA,JHx9ZDCQEA,Recommendation to accept,"########################################################################## + +Summary: + +This paper proposes a method for the retrosynthesis prediction of polymers. A challenge in this problem is the lack of synthetic data for polymers. The method attempts to leverage models for small molecule retrosynthesis predictions (where there is more abundant data), as well as domain specific constraints derived from the chemistry of a particular class of polymerization reactions. The method is shown to outperform some baselines that are commonly used in small molecule retrosynthesis. + +########################################################################## + +Reasons for score: + +Overall, I vote for acceptance. I think the paper proposes a novel approach for polymer retrosynthesis that performs better than some of the baseline methods. However, I have some concerns, especially about the overall problem formulation of polymer retrosynthesis. + +########################################################################## + +Strengths: + +*Paper is written clearly + +*Interesting multistep method and applied constraints to convert a polymer repeat unit to monomers + +*Some useful ablation studies + +Weaknesses: + +*Some concerns about the relevance of the overall problem formulation + +*The method focuses on a particular type of polymerization reaction called condensation polymerization. 
Not sure how generalizable this method is to other common industrially relevant polymerization reactions, such as chain-growth type reactions + +*The evaluation dataset is very small (52 examples) + +########################################################################## + +Questions and other comments: + +*I wonder how practically relevant the overall polymer retrosynthesis problem is, at least in the way that it is currently presented. In small molecule synthesis (eg in medicinal chemistry or natural product synthesis) there is a well-defined target molecule containing the properties that we want, so it is useful to perform retrosynthesis on the target molecular structure to obtain the step by step synthesis procedure that describes how to create the target molecule from precursor starting materials. However, in polymer synthesis, there is a much less defined target polymer structure (we have a distribution of different polymer structures), and the synthetic procedure required to create the target polymer in condensation polymerization is typically a single step mixing of the monomer building blocks. In my opinion, the problem of converting the unit polymer to monomers and subsequently to precursor starting materials makes sense (and this is the aspect of the work that is similar to typical small molecule retrosynthesis because the unit polymer and monomers are small molecules). But the polymer induction part of the modeling, where we start with a given polymer repeat unit and convert it to the unit polymer, makes much less sense because in a real use case why wouldn’t you just perform the analysis (eg designing a new polymer or performing retrosynthetic analysis) directly on the unit polymer or monomer building blocks, with the knowledge that the target polymer structure is essentially just a repeated form of the unit polymer structure. + +*How important is the polymer induction step to find the possible end groups? My intuition is that there are not that many possible end groups for condensation polymerization. It would be interesting to see the distribution of the average number of unit polymer candidates for each of the 52 examples. + +*The additional seq2seq baseline for monomer proposal in Figure 7, where the repeat unit is converted to the unit polymer using a very simple heuristic (by adding the most common end group pairs) shows pretty competitive performance compared to the proposed PolyRetro model, and I think this simple heuristic is actually a very reasonable addition to the baselines. + +*The PolyRetro model seems to have very marginal improvements over the PolyRetro-USPTO model, which seems to suggest a simple one step retrosynthesis model is already pretty good for the polymer induction step? + +*Any thoughts about how to obtain more data for future developments in this area? + + +",6,3.0,ICLR2021 +HJelNGkaKr,2,B1lnbRNtwr,B1lnbRNtwr,Official Blind Review #1,"Strength: +-- Interesting problem +--The paper is well written and easy to follow +-- The proposed approach seems very effective + +Weakness: +-- the novelty of the proposed is marginal +-- Some of the claims are not right in the paper + +This paper studied learning the representations of source codes by combining sequential-based approaches (RNN, Transformers) and graph neural network to model both the local and global dependency between the tokens. Experimental results on both synthetic and real-world datasets prove the effectiveness of the proposed approach. 
+ +Overall, the paper is well written and easy to follow. However, the novelty of the proposed technique seems to be marginal to me. Some of the claims in the paper are not right. In the abstract, the authors said the graph neural network is more local-based while the transformer is more global-based. The essential difference between the two approaches lie in the way of constructing the graphs since transformer used the fully-connected graph (more local dependency) while graph neural networks usually capture the long-range dependency. + +And there are actually some existing work that have already explored this idea in the context of natural language understanding, e.g., +Contextualized Non-local Neural Networks for Sequence Learning. https://arxiv.org/abs/1811.08600 +The authors should clarify the difference between these approaches. + +",3,,ICLR2020 +r1bD3ZGBl,3,H1eLE8qlx,H1eLE8qlx,"Interesting ideas, but the paper requires more work","The paper presents an approach to constructing hierarchical RL representations which relies on assuming agents that need to spend cognitive effort in order to choose their actions. The paper p[roposes a specific way of formulating option construction via what they call a ""Bi-POMDP"". This idea is potentially very interesting, plausible form a cognitive science point of view, and definitely deserves attention. However, there are some problems which do not make the paper acceptable in its current form. I am listing them here in order of importance. +1. It is not clear from the description why a Bi-POMDP is not a POMDP. POMDPs allow for vector-based observations. Suppose the observation vector is (x_t, \sigma_t * y_t). This seems like it would result in a POMDP which is identical to the proposed model. The paper should include an example of a Bi-POMDP which is *not* a POMDP, or be revised to use specific POMDP terminology (see eg the use of augmented MDPs in hierarchical RL, which *are MDPs* but do not work in the original state space) +2. The paper make some specific assumptions about the abstractions (eg determinism in certain places). It is not clear why these are needed at all. Similarly, there are some very specific assumptions regarding the form of the approximations used (Relu, GRUs etc). Are these necessary? In principle one could implement the ideas in the paper with other, simpler architectures. Was this the first set of choices, or was it arrived at after some experimentation? It is important to understand how much of the performance achieved is due to the specific (fairly powerful) architectures and what one could get through simpler means (eg, feedforward nets) +3. The paper seems quite similar in spirit to Bacon & Precup, 2015b; in fact, it seems that the use of a value function or model that they discuss is a way to provide a y_t. However, there is no direct comparison to that approach. Since it is very related, it would be useful to perform some of those same experiments. Also, their paper works entirely in the MDP, not POMDP framework, so some clarification is needed here regarding the use of POMDPs instead. +4. The choice of domains is somewhat limited to simple tasks, while some of the recent approaches in hierarchical RL use more complex domains (Atari, Minecraft etc). Ideally, the experiments should be extended to some of these more complex tasks. +5. What are the theoretical properties of the proposed approach? Eg, is the proposed algorithm convergent? 
If Bi-POMDP is a POMDP, then one should be able to leverage POMDP results to build some theory here. If it is not a POMDP, then we need some understanding of how easy/hard a Bi-POMDP is to solve +6. The paper contains many grammar problems and some broken references, and should be proof-read thoroughly +In summary, while the proposed approach is quite interesting and definitely worth exploring, the paper is not ready for publication in its current form.",4,5.0,ICLR2017 +S1FGCYFef,1,HJ_aoCyRZ,HJ_aoCyRZ,SpectralNet: Spectral Clustering using Deep Neural Networks,"The authors study deep neural networks for spectral clustering in combination with stochastic optimization for large datasets. They apply VC theory to find a lower bound on the size of the network. + +Overall it is an interesting study, though the connections with the existing literature could be strengthened: + +- The out-of-sample extension aspects and scalability is stressed in the abstract and introduction to motivate the work. +On the other hand in Table 1 there is only compared with methods that do not possess these properties. +In the literature also kernel spectral clustering has been proposed, possessing out-of-sample properties +and applicable to large data sets, see + +``Multiway Spectral Clustering with Out-of-Sample Extensions through Weighted Kernel PCA, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 2, pp. 335-347, 2010 + +Sparse Kernel Spectral Clustering Models for Large-Scale Data Analysis, Neurocomputing, vol. 74, no. 9, pp. 1382-1390, 2011 + +The latter also discussed incomplete Cholesky decomposition which seems related to section 3.1 on p.4. + +- related to the neural networks aspects, it would be good to comment on the reproducability of the results with respect to the training results (local minima) and the model selection aspects. How is the number of clusters and number of neurons selected? + +",6,3.0,ICLR2018 +QlbFmvUC7zQ,1,MP0LhG4YiiC,MP0LhG4YiiC,"A straightforward task with a complex architectural approach, unclear contribution scope, and missing baselines.","This paper presents a model that takes in a keyframe from a video and emits the noun and verb best matching what is being done in the frame. +At inference time, the noun and verb have never been seen in combination with one another, but have each been seen paired with other nouns/verbs at training time. +The paper presents a complex, three part model (ArtNet) to tackle this challenge, as well as unimodal linguistic baselines. +Notably, the evaluation does not include vision-only baselines or pretrained model toplines, making it difficult to assess exactly what ArtNet is learning and where its advantage lies. + +Questions: + +- There are multiple references to ""creation"", e.g., ""create novel compositions"" which makes the method sound like it's performing generation. From my understanding, though, ArtNet is purely discriminative, taking in a keyframe and predicting a noun and verb. However, there is one line in the paper that says ""We also learn visual reconstruction via a regression task."", which makes it sound like there's a formulation of ArtNet that maybe takes in a noun and verb and produces a keyframe image (using a GAN, maybe? Or a nearest neighbor lookup?), and so does ""create novel compositions"". If that's the case, it isn't described, and this image reconstruction task is never mentioned again in the paper or described in any equations. + +- Were pretrained ViLBERT/UNITER run alone as a topline? 
Establishing how much is lost in performance due to lack of pretraining + how this method addresses that with sparse data would make a much stronger argument, I feel. In particular, we would want to see pretrained ViLBERT/UNITER and then pretrained + ArtNET to give a sense of how performance will change as models have seen huge amounts of aligned data. +- [Related] In Table 1 what's the intuition for language pretrained mBERT/UNITER/ViLBERT falling behind from scratch? Why use language pretraining and not vision pretraining (e.g., topline)? + + +- In Eq (1), why use cosine similarity for visual embeddings but then back off to surface forms for words? Was cosine similarity for word representations computed by UNITER tried? What is the intuition for this not working, if it did not? + +- Not sure in Eq (4) what the sequence input to the LSTM is; does c range over some sequence? Doesn't the sum already collapse that? + +- ""To ensure the focus is on new compositions, rather than new words, we removed new compositions that contain new words not seen in the train set."" All new words or just nouns/verbs? It's a big advantage/relaxation on the test set to have no OOV tokens. + +- Why no vision-only baseline? Strip word contexts at training time except noun/verb, then predict only noun/verb at test. A lot of this could be basically object recognition followed by activity recognition or strong priors on p(activity | object) (e.g., always ""open"" or ""close"" for cabinets). + +- ""outperforms Multimodal BERT/BERT with 1.23%/4.42% improvements, which is significant""; what statistical significance test was used? How many random initializations were tried to establish the average performance numbers for comparison between performance populations? + +Suggestions for Improvement: + +- ""We call attention to a challenging problem, compositional generalization, in the context of machine language acquisition, which has seldom been studied."" This is poorly worded, since compositional generalization is well and commonly studied, to the point that even in this paper there is a section in the related work about it. Major workshops also list compositionality as a topic of interest, so I don't think it's fair at all to say that this has ""seldom been studied"" [ https://sites.google.com/view/repl4nlp2020/home ]. In the context of language acquisition specifically, emergent communication work focuses heavily on composition [ https://sites.google.com/view/emecom2019/home ]. + +- ViLBERT in intro, UNITER in description of method, ""Multimodal BERT"" (mBERT?) in Table 1. What was used? Needs to be consistent in presentation. + +- ""We discard the object labels due to strict constraints in the language acquisition scenario."" Because Faster RCNN is trained on ImageNet, which is based on WordNet, the object categories still exist in the form of supervision. The model has a linguistically-motivated notion of what an ""object"" is that can be traced to the WordNet. This should be acknowledged; you can't actually ""get rid"" of linguistic information inherent in a pretrained Faster RCNN. + +- ""But there are few works addressing this emerging and valuable challenge for language acquisition in a multimodal reasoning view."" This paper does not really tackle language acquisition, though? There's a restriction so that the test set has no OOV words, even. I think the claims and presentation of the paper need to be carefully re-scoped. 
+ +- There is a lot of focus on ""learned arithmetic operations"" but no analysis as to what exactly this component ends up doing or learning. + +Nits: +- Typo Introduction ""a language model that generate"" S/V agreement. +- Figure 1 doesn't feel like it communicates anything about the method, and does not seem tied to the caption. Are boxes (1, 2, 3) meant to represent the association, reasoning, and inference steps? What's happening in each? +- ""The results show that ARTNET achieves significant performance improvements in terms of new composition accuracy, over a large-scale +video dataset."" strange wording makes it sound like ARTNet is outperforming a dataset, not a method. +- ""We train the model to acquire words by directly predicting them."" this sounds like the model will be predicting words unseen at training time, which is not so. In particular, ""acquire"" here sounds like the model will be exposed to the word at most once (at inference time) and then be able to memorize that exposure in sequence. +- Typo? In 3.1 ""by running faster R-CNN too"" what is the ""too"" pointing to? Do you run Faster-RCNN somewhere? +- Typo 3.3 ""stringest baseline"" strictest? +- Figures 4 and 5 are so close together their captions bleed together and are really difficult to disentangle. +- Typo ""than our baselines 86.5"" makes it sound like the baselines achieved 86.5. +- Typo 4 ""the goal of learned model"" missing ""the""",3,3.0,ICLR2021 +rk2M7kzNe,3,S1OufnIlx,S1OufnIlx,Experimental paper with some interesting observations. Overall its contribution is incremental,"The paper is well motivated and well written. The setting of the experiments is to investigate a particular case. While the results of experiments are interesting, such investigation is not likely to systematically improve our understanding of the adversarial example phenomenon. Overall, the contribution of the paper seems incremental. + +Pros: +1. This paper proposes the iterative LL method, which is efficient in both computation and success rate in generating adversarial examples. This method could be useful when the number of classes in the dataset is huge. +2. Some observations of the experiments are interesting. For example, overall photo transformation does not affect much the accuracy on clean image, but could destroy some adversarial methods. + +Cons: +1. As noticed by the authors, some similar works exist in the literature. According to the authors, what differs this work from other existing works is that this paper tend to fool NN by making very small perturbations of the input. But based on the experiments and the demonstration (the real pictures), it is arguable that the perturbations in the experiments are still small. +2. Some hypotheses proposed in the paper based on one-shot experiments seems too rushy. +3. As mentioned above, the results of this paper seems not really improving the understanding of the adversarial example phenomenon.",5,4.0,ICLR2017 +Syxu1oN9nX,3,Hkes0iR9KX,Hkes0iR9KX,"A paper addressing an interesting problem, but lacks clarity and hard to understand, tech novelty is unknown","This paper proposes a deep GNN network for graph classification problems using their adaptive graph pooling layer. It turns the graph down-sampling problem into a column sampling problem. The approach is applied to several benchmark datasets and achieves good results. + +Weakness + +1. This paper is poorly written and hard to follow. There are lots of typos even in the abstract. 
It should be at least proofread by an English-proficient person before being submitted. For example, in the last paragraph before Section 3: “In Ying et al. ……. In Ying et al.”
+2. In paragraph 1 of Section 3, there should be 9 pixels around the center pixel including itself in regular 3x3 convolution layers.
+3. The definition of W in Eq(2) is vague. Is this W shared across all nodes? If so, what’s the difference between this and regular GNN layers except for replacing summation with a max function?
+4. The network proposed in this paper is just a simple CNN. GNNs can adopt such kinds of architectures as well. And I didn’t get the numbers of the first block in Figure 1. The input d is 64?
+5. The algorithm described in Algorithm 1 is hard to follow. There are some LaTeX tools for typesetting algorithms.
+6. The authors claim that improvements on several datasets are strong. But I think the improvement is not that big. The network without pooling layers even performs better on one dataset. The authors didn’t provide enough analysis on these parts.
+
+Strength:
+1. The idea used in this paper for graph node sampling is interesting. But it needs more experimental studies to support this idea.
+",4,4.0,ICLR2019
+Syga-qC2FS,1,Syg6fxrKDB,Syg6fxrKDB,Official Blind Review #3,"In this paper, the authors introduce a new Monte Carlo Tree Search-based (MCTS) algorithm for computing approximate solutions to the Traveling Salesman Problem (TSP). Yet since the TSP is NP-complete, a learned heuristic is used to guide the search process. For this learned heuristic, the authors propose a Graph Neural Network-derived approach, in which an additional term is added to the network definition that explicitly adds the metric distance between neighboring nodes during each iteration. They perform favorably compared to other TSP approaches, demonstrating improved performance on relatively small TSP problems and performing quite well on larger problems out of reach for other deep learning strategies.
+
+I believe that the paper is built around some good ideas that tackle an interesting problem; the Traveling Salesman Problem and variants are popular and having learning-based approaches to replace heuristics is important. In particular, choosing to use an MCTS to tackle this problem feels like a natural approach, and using a GNN as a learning backend feels like a natural way to encourage better performance with fewer training samples. However, there are too many questions raised by decisions the authors have made to warrant acceptance in the current state; I would be willing to revise my score if some more detailed analysis of these points were included.
+
+First, the heuristic value function: this value function h(s) is defined in the appendix but should be motivated and described (in detail) in the text body. As written, this information is not included in the main body of the paper yet is critical for the implementation. Also, though it is intuitively clear why a random policy is unlikely to result in a poor result, it is never compared against; how does the performance degrade if the heuristic value function is not used? Finally, the parameter 'beam width' is used in the evaluation of the value function but is only set to 1 in all experiments. Some experiments should be included to show how increasing beam width impacts performance (or the authors should provide a reason these experiments were not run).
Finally, it seems as if there already exist heuristic methods (against which the paper compares performance); could these be used instead of this value function?
+
+Additionally, how is the set of Neighbors defined? It is suggested in the text that it is not all nodes, but not using all nodes is a limiting assumption. Relatedly, it would be helpful if the authors could better motivate their additional term in Eq. (2); at the moment, though using the Euclidean distance to weight the edges, it is unclear why this function is a better choice than something else, for instance a Gaussian kernel or a kernel with finite support. In addition, the authors motivate that the distance between nodes is very important for the performance of the system, yet the coordinates of each vertex are included as part of the input vector so that (in principle) the network could learn to use this information. A comparison against a network implemented using the basic GNN model, defined in Eq. (1), should be included to compare performance.
+
+In summary, there are a few choices that would need to be better justified for me to really support acceptance. However, there are some quite interesting ideas underpinning this paper, and I hope to see it published.
+
+Minor comments:
+- Overall, I like the structure of the paper. At the beginning of all major sections there is an overview of what the remainder of the section will contain. This helps readability. I also like the comparison between the proposed work and AlphaGo, which popularized using deep learning in combination with MCTS; this enhances the clarity of the paper.
+- The related work section would be more instructive if it also gave some information about the limitations of the alternative deep learning approaches and how the proposed technique overcomes these. My assumption is that all approaches discussed in the second paragraph are ""greedy"" and suffer from the limitations mentioned in the introduction. However, I am not sufficiently familiar with the literature to be certain. A sentence or two mentioning this or relating that work to the proposed MCTS approach would be informative.
+- The last paragraph of the Related Work section, discussing the work of Nowak et al. 2017 and Dai et al. 2017, introduces some numbers with no context: e.g., ""optimality gap of 2.7%"". It is unclear at this stage if this number is good or bad. Some more context and discussion of this work might be helpful for clarity, particularly since the Nowak work seems to be the only other technique using GNN.
+- Some general proofreading for language should be performed, as there are occasionally typos or missing words throughout the paper. Some examples: ""compute the prior probability that indicates how likely each vertex [being->is] in the tour sequence""; ""Similar to the [implement->implementation], in Silver...""; ""[Rondom->Random]"" in tables.
+- In Sec. 4.1, it is unclear what is meant by ""improved probability \hat{P} of selecting the next vertex"".
+- I believe there is an inconsistency in the description of the MCTS strategy. Though the action value is set to the 'max' during the Back-Propagation Strategy, the value of Q is initialized to infinity.
+
+Suggestions for improvement (no impact on review):
+- Clarity: the language in the 3rd and 4th paragraphs of the introduction [begins with ""In this paper, ...""] could be made clearer.
+ - The language ""part of the tour sequence"" is not quite clear, since, when the process is complete, all points will be in the tour.
It should be made clearer that the algorithm is referring to a ""partial tour"" as opposed to the final tour. This clarity issue also appears later in Sec. 4. + - ""Similar to above-learned heuristic approaches..."" It might be clearer if you began the sentence with ""Yet,"" or ""However,"" so that it is more obvious to the reader that you intend to introduce a solution to this problem. +- Equation formatting: Please use '\left(' and '\right)' for putting parenthesis around taller symbols, like \sum. +- When describing the MCTS procedure, I have seen the word ""rollouts"" used much more frequently than ""playouts"". Consider changing this language (though the meaning is clear).",6,,ICLR2020 +B1l3vxgLcS,3,H1gNOeHKPS,H1gNOeHKPS,Official Blind Review #1,"This paper aims to address several issues shown in the Neural Arithmetic Logic Unit, including the unstability in training, speed of convergence and interpretability. The paper proposes a simiplification of the paramter matrix to produce a better gradient signal, a sparsity regularizer to create a better inductive bias, and a multiplication unit that can be optimally initialized and supports both negative and small numbers. + +As a non-expert in this area, I find the paper interesting but a little bit incremental. The improvement for the NAC-addition is based on the analysis of the gradients in NALU. The modification is simple. The proposed neural addition unit uses a linear weight design and an additional sparsity regularizer. However, I will need more intuitions to see whether this is a good design or not. From the experimental perspective, it seems to work well. +Compared to NAU-multiplication, the Neural Multiplication Unit can represent input of both negative and positive values, although it does not support multiplication by design. The experiments show some gain from the proposed NAU and NMU. + +I think the paper can be made more self-contained. I have to go through the NALU paper over and over again to understand some claims of this paper. Overall, I think the paper makes an useful improvement over the NALU, but the intuition and motivation behind is not very clear to me. I think the authors can strengthen the paper by giving more intuitive examples to validate the superiority of the NAU and NMU.",3,,ICLR2020 +4UIsjvTZfEt,1,KOtxfjpQsq,KOtxfjpQsq,Fair paper - Straightforward extension of Janner et al. (2019) to POMDPs,"=== Summary === + +The paper concerns model-based meta-RL. It exploits the fact that meta-RL can be formulated as POMDP in which the task indicator is part of the (unobserved) hidden state. Thus, the paper effectively analyzes and proposes model-based algorithms for POMDPs. The paper bounds the gap between the expected reward of a policy in the actual POMDP and the estimated model and then theoretically shows that this gap can be reduced when using dyna-style / branched rollouts instead of full rollouts under the learned model. Motivated by this finding, the paper proposes a Dyna-like algorithm for POMDPs. In the experimental evaluation, the paper compares its proposed method, M3PO, to two recent meta-RL approaches in a range of meta-RL environments for continuous control. + +=== Merits of the paper === + +Overall, the paper contributes a new algorithm for model-based meta-RL which is neatly motivated by the paper’s theoretical analysis. Both the theoretical analysis and the proposed algorithm are sound. 
The experimental evaluations demonstrate that the algorithm performs similarly to or better than previous model-based meta-RL algorithms and thus could be a relevant contribution to the field of meta-RL.
+
+=== Strengths ===
+- The experiments include relevant meta-learning environments and algorithms to compare to.
+- The proposed algorithm is sound and motivated by theory.
+- I have done a very simple simulation study and was able to confirm that the gap in Theorem 2 is indeed (much) smaller than the one in Theorem 1.
+
+=== Weaknesses / Concerns ===
+- Both Theorem 1 and 2 seem to be a straightforward combination of the results in [1] and the fact that POMDPs can be cast as MDPs with history-states. Thus the theoretical contribution/novelty is quite limited.
+- As expected, from my simulation study it seems that the bounds are vacuous and their usefulness beyond motivating branched rollouts is questionable.
+- The only difference to original dyna-like approaches is the fact that a recurrent model with internal/hidden state is used – thus the algorithmic contribution is small as well.
+- Overall, the clarity of the paper could probably be improved a lot – Many paragraphs and sentences are hard to read and lack vital explanations. For examples, see below.
+
+=== Overall Assessment ===
+
+In my opinion, the paper is a borderline case. Currently, I see the paper slightly below the acceptance threshold. Both the theoretical and the algorithmic contribution of the paper are small. Overall, the paper is a straightforward extension of [1] to POMDPs with meta-RL experiments. Yet, due to a lack of clarity, it is hard to read. Nonetheless, the proposed algorithm seems to be practically relevant for meta-RL - thus I am happy to hear other opinions and open to being convinced to increase my score.
+
+=== Questions & tips for improvement ===
+
+Appendix A.1 briefly discusses the differences to [1], and claims to derive similar Theorems under more appropriate assumptions / premises in Appendix A.7. How do the theorems in A.7. differ from the ones in Section 5? If the theorems build on more appropriate assumptions, why not use them in the main part of the paper?
+
+Adding a simple simulation study that shows how much smaller the gap/discrepancy is, when branched instead of full rollouts are used, would require little effort and strengthen the claim that branched rollouts are preferable.
+
+I found section 5.1. to be very confusing – It is stated that the gap can be “expressed as the function of two error quantities of the meta-model: generalization error due to sampling and distribution shift due to the updated meta-policy”. Indeed Def. 2 is probably the generalization error due to sampling. However, Def. 1 is the expected TV distance of the estimated model from the true model. Then, how should this be the “distribution shift due to the updated meta-policy”?
+
+=== Minor comments ===
+
+- Section 4, 3rd paragraph: It should probably be ""$r_{t}$ and $o_{t+1}$ are assumed to be conditionally independent
+- Section 5.1.: It would be good to have a proper definition of $\pi_{\mathcal{D}}$, i.e. the data-collection policy
+- Theorem 1: The definition of $\epsilon_m$ is duplicated whereas $\epsilon_\pi$ is not included
+- Section 7, 4th paragraph: Should be “In Humanoid-direct, the performance”
+
+
+[1] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization.
NeurIPS 2019 +",5,3.0,ICLR2021 +Skx4JPuvYH,1,rkxxA24FDr,rkxxA24FDr,Official Blind Review #2,"= Summary +A variation of Neural Turing Machines (and derived models) storing the configuration of the controller in a separate memory, which is then ""softly"" read during evaluation of the NTM. Experiments show moderate improvements on some simple multi-task problems. + += Strong/Weak Points ++ The idea of generalising NTMs to ""universal"" TMs is interesting in itself ... +- ... however, the presented solution seems to be only half-way there, as the memory used for the ""program"" is still separate from the memory the NUTM operates on. Hence, modifying the program itself is not possible, which UTMs can do (even though it's never useful in practice...) +- The core novelty relative to standard NTMs is that in principle, several separate programs can be stored, and that at each timestep, the ""correct"" one can be read. However this read mechanism is weak, and requires extra tuning with a specialized loss (Eq. (6)) +~ It remains unclear where this is leading - clearly NTMs and NUTMs (or their DNC siblings) are currently not useful for interesting tasks, and it remains unclear what is missing to get there. The current paper does not try show the way there. +- The writing is oddly inconsistent, and important technical details (such as the memory read/write mechanism) are not documented. I would prefer the paper to be self-contained, to make it easier to understand the differences and commonalities between NTM memory reads and the proposed NSM mechanism. + += Recommendation +Overall, I don't see clear, actionable insights in this submission, and thus believe that it will not provide great value to the ICLR audience; hence I would recommend rejecting the paper to allow the authors to clarify their writing and provide more experimental evidence of the usefulness of their contribution. + += Minor Comments ++ Page 6: ""As NUTM requires fewer training samples to converge, it generalizes better to unseen sequences that are longer than training sequences."" - I don't understand the connecting between the first and second part of the sentence. This seems pure speculation, not a fact.",3,,ICLR2020 +ByUZ0g5lG,2,Hkn7CBaTW,Hkn7CBaTW,"Important framework, tools, and criterion for understanding deep neural networks","summary of article: +This paper organizes existing methods for understanding and explaining deep neural networks into three categories based on what they reveal about a network: functions, signals, or attribution. “The function extracts the signal from the data by removing the distractor. The attribution of output values to input dimensions shows how much an individual component of the signal contributes to the output…” (p. 5). The authors propose a novel quality criterion for signal estimators, inspired by the analysis of linear models. They also propose two new explanatory methods, PatternNet (for signal estimation) and PatternAttribution (for relevance attribution), based on optimizing their new quality criterion. They present quantitative and qualitative analyses comparing PatternNet and PatternAttribution to several existing explanation methods on VGG-19. + +* Quality: The claims of the paper are well supported by quantitative results and qualitative visualizations. +* Clarity: Overall the paper is clear and well organized. There are a few points that could benefit from clarification. +* Originality: The paper puts forth an original framing of the problem of explaining deep neural networks. 
Related work is appropriately cited and compared. The authors's quality criterion for signal estimators allows them to do a quantitative analysis for a problem that is often hard to quantify. +* Significance: This paper justifies PatternNet and PatternAttribution as good methods to explain predictions made by neural networks. These methods may now serve as an important tool for future work which may lead to new insights about how neural networks work. + +Pros: +* Helps to organize existing methods for understanding neural networks in terms of the types of descriptions they provide: functions, signals or attribution. +* Creative quantitative analyses that evaluate their signal estimator at the level of single units and entire networks. + +Cons: +* Experiments consider only the pre-trained VGG-19 model trained on ImageNet. Results may not generalize to other architectures/datasets. +* Limited visualizations are provided. + +Comments: +* Most of the paper is dedicated to explaining these signal estimators and quality criterion in case of a linear model. Only one paragraph is given to explain how they are used to estimate the signal at each layer in VGG-19. On first reading, there are some ambiguities about how the estimators scale up to deep networks. It would help to clarify if you included the expression for the two-component estimator and maybe your quality criterion for an arbitrary hidden unit. +* The concept of signal is somewhat unclear. Is the signal + * (a) the part of the input image that led to a particular classification, as described in the introduction and suggested by the visualizations, in which case there is one signal per image for a given trained network? + * (b) the part of the input that led to activation of a particular unit, as your unit wise signal estimators are applied, in which case there is one signal for every unit of a trained network? You might benefit from two terms to separate the unit-level signal (what caused the activation of a particular unit?) from the total signal (what caused all activations in this network?). +* Assuming definition (b) I think the visualizations would be more convincing if you showed the signal for several output units. One would like to see that the signal estimation is doing more than separating foreground from background but is actually semantically specific. For instance, for the mailbox image, what does the signal look like if you propagate back from only the output unit for umbrella compared to the output unit for mailbox? +* Do you have any intuition about why your two-component estimator doesn’t seem to be working as well in the convolutional layers? Do you think it is related to the fact that you are averaging within feature maps? Is it strictly necessary to do this averaging? Can you imagine a signal estimator more specifically designed for convolutional layers? + +Minor issues: +* The label ""Figure 4"" is missing. Only subcaptions (a) and (b) are present. +* Color scheme of figures: Why two oranges? It’s hard to see the difference.",8,3.0,ICLR2018 +_WpKDUDjXtO,1,5IqTrksw9S,5IqTrksw9S,GLUECode: A Benchmark for Source Code Machine Learning Models,"This paper presents GLUECode, a benchmark for evaluating machine learning models of source code. GLUECode considers both global and local contexts of source code, and aims to help researchers experiment with multiple source code representations and evaluate their models. The authors also presented results of several baselines on the benchmark. 
+ +Machine learning for source code has attracted a lot of interests in recent years. It is good to see a benchmark consists of 5000+ projects, which could help advance this area of research. The authors also performed some GLUECode tasks and presented results for several baselines, which show that there is ample room for progress on GLUECode. Overall, the paper is well written. + +Concerns: + +The proposed work considers both global and local contexts of code (the benchmark’s name is Global and Local Understanding Evaluation of Code). Section 2.1 also dedicates to this. However, it is not clear what global context is considered and how it is incorporated by the benchmark. In a ML for SE work, researchers may use various global contexts such as UML diagrams, library/API dependency, inter-procedural data/control flow, commit data, etc. It is not clear how these global context information can be satisfied by the benchmark. + +The authors can also describe more about the unique advantages of using the proposed benchmark. Currently, they are already many public datasets released by various papers in this field (thanks to the open science policy). Also, it is easy for researchers to download a large amount of source code from open source websites (such as Github) themselves. They can also process the source code using existing static analysis tools to obtain the data they need and share the data. + +Currently, GLUECode only provides a few types of source code representations. In recent years, researchers have proposed many different ways of representing source code tokens and ASTs. As an example, the following works use different AST-based source code representations (and it is not clear if the benchmark could provide necessary information to support these representations): +Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, and Philip S. Yu. Improving automatic source code summarization via deep reinforcement learning. In ASE, pages 397–407. ACM, 2018. + +J. Zhang, et al., A Novel Neural Source Code Representation based on Abstract Syntax Tree, In Proc. the 41th International Conference on Software Engineering (ICSE 2019), Montreal, Canada, 2019. + +The data quality should be discussed in detail, as low quality data will bias the analysis results. This is particularly important for a public benchmark. For example, if the benchmark contains a lot of duplicated code, the follow-up analysis will be misleading. Furthermore, software evolves. Very soon, new versions/commits will emerge. It is not clear if the evolution will degrade the data quality and the validity of the benchmark. + +The proposed benchmark data and code are not available for replication purpose. + +In Table 2, the baseline result for Transformer-based method completion is missing. + +The paper is generally well-written. There are a few typos. For example: + +In page 3, ”?? provides details and examples...” +",4,4.0,ICLR2021 +G7_fyB26TDe,4,RkqYJw5TMD7,RkqYJw5TMD7,Interesting topic and solid analysis,"The paper explores adversarial robustness in a new setting of test-time adaptation. It shows this new problem of “test-time-adapted adversarial robustness” is strictly weaker than the “traditional adversarial robustness” when assuming the training data is available for the “test-time-adapted adversarial robustness”. The gap between the two problems is demonstrated by the simple DANN solution which has good “test-time-adapted adversarial robustness” but bad “traditional adversarial robustness”. 
The paper also explores the subcase of “test-time-adapted adversarial robustness” when assuming the training data is not available and provide some initial result. + +The paper has clear strong points. It aims to tackle an important problem, the “test time adaptation allowed” extension for the “adversarial robustness”. The paper has a nice global picture and a clear position for this piece of work. I particularly like the way the author approaches the problem. They start from the most abstract and fundamental question, “test-time-adapted adversarial robustness” v.s. “Classic adversarial robustness”, or transductive learning v.s. Inductive learning in the setting of adversarial robustness. To proceed the thinking, they develop a good theoretical framework (the two threat models from definition 1 and definition 2) to formulate the two problems. And then consider a middle setting (definition 3) between the classic minimax and new maximin threat model. Such frameworks help to develop theoretical understanding like one setting the strictly weaker than another (proposition 1). + +The weak points of the paper are mainly about the presentation. The paper currently is very dense. Sometimes I feel the author may assume the reader has certain domain knowledge without explanations. For example, dataset “CIFAR10c-fog” appears very early in the introduction, but it is never clearly explained what it is, what is the main difference between it and CIFAR 10. After reading the paper the only impression I get is they not homogeneous. There also some places in the method, I can not understand, +For the maximin threat model, why the game is maximation over U instead of A (the left side of the equation from proposition 1). +Why there are A0 and A1 in the Adversarial semi-supervised minimax threat model? And why in this game, A0 and A1 are jointly maximizing L(\tile{F},\tilde{V}’)? + +More question about experiments: +For FPA attacks, is there any baseline method we can compare the DANN with? Currently, I am not sure how to evaluate the performance of DANN. +In experiment (D), it says, “we also evaluate the accuracy of the adapted DANN models in the minimax threat mode”. But where are the results? + +I understand it is pretty hard to squeeze so many contents into limited 8 pages. Personally, I think it is helpful to cut off some content and make the main paper more clear, well organized, and strong. + +Unfortunately, I am not an expert in adversarial robustness. I did not check the technique and experiments deeply. My current score assumes no fatal flaws exist in the theory and experiment. My rating will be changed according to other reviewers’ comments and the author’s updates. +",7,2.0,ICLR2021 +SyTAvN8yf,1,ByZmGjkA-,ByZmGjkA-,"Worthy goal, but implementation felt a bit underwhelming","This paper presents an analysis of an agent trained to follow linguistic commands in a 3D environment. The behaviour of the agent is analyzed by means of a set of ""psycholinguistic"" experiments probing what it learned, and by inspection of its visual component through an attentional mechanism. + +On the positive side, it is nice to read a paper that focuses on understanding what an agent is learning. On the negative side, I did not get many new insights from the analyses presented in the study. + +3 A situated language learning agent + +I can't make up the chair from the refrigerator in the figure. 
+ +4.1 Word learning biases + +This experiment shows that, when an agent is trained on shapes only, it will exhibit a shape bias when tested on new shapes and colors. Conversely, when it is exposed to colors only, it will have a color bias. When the training set is balanced, the agent shows a mild bias for the simpler color property. How is this interesting or surprising? The crucial question, here, would be whether, when an agent is trained in a naturalistic environment (i.e., where distributions of colors, shapes and other properties reflect those encountered by biological agents), it would show a human-like shape bias. This, however, is not addressed in the paper. + +Minor comments about this section: + +- Was there noise also in shape generation, or were all object instances identical? + +- propensity to select o_2: rather o_1? + +- I did not follow the paragraph starting with ""This effect provides"". + +4.2 The problem of learning negation + +I found this experiment very interesting. + +Perhaps, the authors could be more explicit about the usage of negation here. The meaning of commands containing negation are, I think, conjunctions of the form ""pick something and do not pick X"" (as opposed to the more natural ""do not pick X""). + +modifiation: modification + +4.3 Curriculum learning + +Perhaps the difference in curriculum effectiveness in language modeling vs grounded language learning simulations is due to the fact that the former operates on large amounts of natural data, where it's hard to define the curriculum, while the latter are typically grounded in toy worlds with a controlled language, where it's easier to construct the curriculum. + +4.4 Processing and representation differences + +There is virtually no discussion of what makes the naturalistic setup naturalistic, and thus it's not clear which conclusions we should derive from the corresponding experiments. Also, I don't see what we should learn from Figure 5 (besides the fact that in the controlled condition shapes are easier than categories). For the naturalistic condition, the current figure is misleading, since different classes contain different numbers of instances. It would be better to report proportions. + +Concerning the attention analysis, it seems to me that all it's saying is that lower layers of a CNN detect lower-level properties such as colors, higher layers detect more complex properties, such as shapes characterizing objects. What is novel here? + +Also, since introducing attention changes the architecture, shouldn't the paper report the learning behaviour of the attention-augmented network? + +The explanation of the attention mechanism is dense, and perhaps could be aided by a diagram (in the supplementary materials?). I think the description uses ""length"" when ""dimensional(ity)"" is meant. + +6. Supplementary material + +It would be good to have an explicit description of the architecture, including number of layers of the various components, structure of the CNN, non-linearities, dimensionality of the layers, etc. (some of this information is inconsistently provided in the paper). + +It's interesting that the encoder is actually a BOW model. This should be discussed in the paper, as it raises concerns about the linguistic interest of the controlled language that was used. 
+ +Table 3: indicates is: indicates if +",4,5.0,ICLR2018 +ZF1DUm4mdXI,1,DGttsPh502x,DGttsPh502x,Simple method but unclear results,"------------------- +Summary +------------------- +This paper proposes a simple approach to discover interpretable latent manipulations in trained text VAEs. The method essentially involves performing PCA on the latent representations to find directions that maximize variance. The authors argue that this results in more interpretable directions. The method is applied on top of a VAE model (OPTIMUS), and the authors argue that different directions discovered by PCA correspond to interpretable concepts. + +------------------- +Strengths +------------------- +- The method is simple, and can be applied on top of existing text VAEs. +- Learning interpretable and controllable generative models of text is an important research area, and this paper contributes to this important field. + +------------------- +Weaknesses +------------------- +- There are only mostly qualitative results presented. While I agree that performing quantitative results is difficult with this style of work, the authors could have (for example) adopted methods from the style transfer literature to show quantitative results. These metrics include perplexity (to see how fluent the generations are), reverse perplexity, and style transfer accuracy (this may not be applicable since there is no ground truth ""style"" in this work, but the ground truth style could be heuristically defined for some transformations, e.g. for singular/plural transformations). +- Human evaluation seems nonideal since it is only tested on 12 people. +- The generations are actually not so good in my opinion? E.g. many of the generations in the appendix are ungrammatical and/or semantically nonsensical. Again, metrics such as perplexity could quantify the fluency of generated text. +- The method is only applied to one text VAE mode which specifically uses BERT/GPT-2 , so it is not clear if this will generalize to other models (e.g. models trained from scratch). + +------------------- +Questions/Comments +------------------- +- In Figure 2, are these the top 4 principal directions? If not, how were these directions discovered? +- ""It is known that variational autoencoders trained with a schedule for the KL weight parameter (equation 1) obtain disentangled representations (Higgins et al., 2016; Sikka et al., 2019; John et al., 2019). Since OPTIMUS is also trained with KL annealing, canonical coordinates in its latent space are likely to be disentangled."" I believe this is only valid for beta > 1 so it is not really applicable here. +----------------------- +Edit after rebuttal: Thank you for the rebuttal and clarifying some of my questions. I have decided to keep the original score.",3,4.0,ICLR2021 +ryxyH98XYr,1,SygagpEKwB,SygagpEKwB,Official Blind Review #1,"After rebuttal edit: +No clarifications were made, so I keep my score as is. + +------------------------------------------------------ +Claims: Explicitly requiring a small number of labels allows for successful learning of disentangled features by either using them as a validation set for hyper parameter tuning, or using them as a supervised loss. + +Decision: Reject. This paper needs a substantial rewrite to make clear what specific contributions are from the multitude of experiments run in this study. 
As is, the two contributions stated in the introduction are both obvious and not particularly significant -- that having some labels of the type of disentanglement desired helps when used as a validation set and as a small number of labels for learning a disentangled representation space. There are no obviously stated conclusions about which types of labels are better than others (4.2). Section 3.2 seems to have some interesting findings that small scale supervision can help significantly and fine-grained labeling is not necessarily needed, but I don't understand why that finding is presented there when Fig. 4 seems to perform a similar experiment on types of labels with no conclusion based on its results. Conclusion sentence of 4.3 is hard to decipher, but I assume is just saying S^2/S beats U/S even when S^2/S is subject to noisy labels. Overall, I find it very difficult to absorb the huge amount of results and find the analysis not well presented. + +",1,,ICLR2020 +S1xCmqT0tB,3,ByxoqJrtvr,ByxoqJrtvr,Official Blind Review #2,"This paper proposes a method to learn to reach goals in an RL environment. The method is based on principles of imitation learning. For instance, beginning with an arbitrary policy that samples a sequence of state-action pairs, in the next iteration, the algorithm treats the previous policy as an expert by relabeling its ending state as a goal. The paper shows that the method is theoretically sound and effective empirically for goal-achieving tasks. + +The paper is relatively clear and experiments are okay. I would then recommend it is on the positive side of the borderline. + +Comments: +* The method is interesting but is still an ""RL"" method. So it is really learning to reach the goal via ""RL"". Note that in the method, the algorithm is not doing effective exploration but just randomly explore until you collect sufficient data to solve for a new goal. +* If you formulate the problem better, you can see that it actually has a reward: add an initial state s0; for each g sampled from p(g), transition s0 to an MDP with goal g. You can now do the usual RL algorithm in this new MDP. I would think you can also do model-based learning -- give the model a good representation and then use the policies to learn the dynamics. It may worth to compare your algorithm with these natural baselines. +",6,,ICLR2020 +HJxrNo3xz,3,SyELrEeAb,SyELrEeAb,"The paper is overall well-written and makes new and non-trivial contributions to model inference and the application. However, not all claims are well-supported by the data provided in the paper. ","The paper presents a non-linear generative model for GWAS that models population structure. +Non-linearities are modeled using neural networks as non-linear function approximators and inference is performed using likelihood-free variational inference. +The paper is overall well-written and makes new and non-trivial contributions to model inference and the application. +Stated contributions are that the model captures causal relationships, models highly non-linear interactions between causes and accounts for confounders. However, not all claims are well-supported by the data provided in the paper. +Especially, the aspect of causality does not seem to be considered in the application beyond a simple dependence test between SNPs and phenotypes. + +The paper also suffers from unconvincing experimental validation: +- The evaluation metric for simulations based on precision is not meaningful without reporting the recall at the same time. 
+ +- The details on how significance in each experiment has been determined are not sufficient. +From the description in D.3 the p-value a p-value threshold of 0.0025 has been applied. Has this threshold been used for all methods? +The description in D.3 seems to describe a posterior probability of the weight being zero, instead of a Frequentist p-value, which would be the probability of estimating a parameter at least as large on a data set that had been generated with a 0-weight. + +- Genomic control is applied in the real world experiment but not on the simulations. Genomic control changes the acceptance threshold of each method in a different way. Both precision and recall depend on this acceptance threshold. Genomic control is a heuristic that adjusts for being too anti-conservative, but also for being too conservative, making it hard to judge the performance of each method on its own. Consequently, the paper should provide additional detail on the results and should contrast the performance of the method without the use of genomic control. + +minor: + +The authors claim to model nonlinear, learnable gene-gene and gene-population interactions. +While neural networks may approximate highly non-linear functions, it still seems as if the confounders are modeled largely as linear. This is indicated by the fact that the authors report performance gains from adding the confounders as input to the final layer. + +The two step approach to confounder correction is compared to PCA and LMMs, which are stated to first estimate confounders and then use them for testing. +For LMMs this is not really true, as LMMs treat the confounder as a latent variable throughout and only estimate the induced covariance. +",6,5.0,ICLR2018 +nkXlOA8d2Y-,3,TiGF63rxr8Q,TiGF63rxr8Q,"Interesting idea, adding motivating examples early will help","This paper addresses the problem of reinforcement learning using limited training samples. They propose a solution by exploiting the invariance property in the tasks. In particular, they present an algorithm that exploits permutation invariance, study its theoretical properties, and propose examples where this property holds and their algorithm can be leveraged. + +I feel the paper could have been better presented by starting with a motivating example where the permutation invariance property holds - for example the portfolio optimization example studied in the experiments. This will make it easy to follow the multiple terminologies of tasks, entities, resources, introduced in Sec 3. + +The setting considered in the paper is one where the state is a concatenation of various entities, while the actions are the fraction of resources allocated to each entity. The permutation invariance property is defined in Def. 1. + +I didi not understand how a network trained using gradient descent alone would satisfy permutation invariance. There is no part of the pseudo code in Alg 1 explicitly making sure that the algorithm is permutation invariant.",5,2.0,ICLR2021 +x0SBDu3Vmz,3,QzKDLiosEd,QzKDLiosEd,Interesting but assumptions are not practical,"Summary: +- This paper studies the effectiveness of inferring a neural network’s layers and hyperparameters using the magnetic fields emitted from a GPU’s power cable. The results show that (under certain assumptions) one can reconstruct a neural network’s layers and hyperparameters accurately, and use the inferred model to launch adversarial transfer attacks. 
+ +Strong points: +- The idea of using magnetic side channels to infer network structure is interesting. +- The paper is well-written with ideas and limitations explained clearly. +- The experiment results are thorough and explained clearly + +Weak points: +- The threat model seems impractical. Attacker assumptions include: + - have physical access to the GPU + - know the exact input feature dimensions and batch size. + - know the deep learning framework, GPU brand and hardware/software versions. +- The main innovation is demonstrating magnetic side channels from GPU cables reveal information about network structures. However, I’m not sure if ICLR is the best venue for this type of contribution. This paper could be a much stronger submission to other security and system conferences. + +Recommendation: +- I’m inclined to recommend a reject. The main reason is that the results are based on multiple impractical assumptions, limiting the impact of this paper in reality. + +Comments & questions: +- How do the authors imagine launching this attack in reality? Specifically, how would one know the input dimensions and batch size of a black-box model? A clear explanation of this will help readers understand the value of this work. +- Using consistency constraints to optimize for hyperparameter estimation is interesting. How effective is this additional optimization compared with only using the initial estimation? + +Minor comments +- “But there is ‘not’ evidence proving” -> ‘no’ + +==== Updates after the response ==== + +I thank the authors for answering the questions in detail. Providing an example application does help readers understand scenarios where the threat model could apply. However, I still think such scenarios are not common but agree that the findings in this paper could be helpful for future security research. I adjusted my rating based on this better understanding. ",5,4.0,ICLR2021 +rJewk1Bphm,3,rygnfn0qF7,rygnfn0qF7,"Reasonable method, but not too much novelty","Reasonable method, but not too much novelty + +[Summary] + +The paper proposed techniques to pretrain two-layer hierarchical bi-directional or single-directional LSTM networks for language processing tasks. In particular, the paper uses the word prediction, either for the next work or randomly missing words, as the self-supervised pretraining tasks. The main idea is to not only train text embedding using context from the same sentence but also take the embedding of the surrounding sentences into account, where the sentence embedding is also context-aware. Experiments are done for document segmentation, answer passage retrieval, extractive document summary. + +[Pros] + +1. The idea of considering across-sentence/paragraph context for text embedding learning is very reasonable. +2. The random missing-word completion is also a reasonable self-supervised learning task. +3. The results are consistently encouraging across all three task. And the performance for “answer passage retrieval” is especially good. + +[Cons] + +1. The ideas of predicting the next word (L+R-LM) or missing words (mask-LM) have been around and widely used for a long time. Apply this idea to an two-layer hierarchical LSTM is a straightforward extension of this existing idea. +2. For document segmentation, no comparison with other methods is provided. For extractive document summary, the performance difference between the proposed method and the previous methods are very minor. +3. 
Importantly, the experiments can be stronger if the learned embedding can be successfully applied to more fundamental tasks, such as document classification and retrieval. + +Overall, the paper proposed a reasonable method, but the significance of the paper can be better justified by more solid experiments. + + +",6,4.0,ICLR2019 +oHa3sD2uhfI,4,F8whUO8HNbP,F8whUO8HNbP,"idea is novel, more technical details needed","This paper was motivated from an observation the common lack of texture and shape variations on synthetic images often leads the trained models to learning only collapsed and trivial representations without any diversity. The authors made a hypothesis that the diversity of feature representation would pay an important role in generalization performance and can be taken as an inductive bias. + +Seeing that, they proposed a synthetic-to-real generalization framework that simultaneously regularizes the synthetically trained representations while promoting the diversity of the features to improve generalization. Their strategy can be exactly formulated with a contrastive loss, which reminds me of Wang & Isola (2020) but was also customized for the synthetic-to-real generalization scenario. The framework was further enhanced by the multi-scale contrastive learning and an attention-guided pooling strategy. Besides, the dense contrastive loss (6) provided spatially denser patch-level supervision; that may be a novel idea that I haven’t seen before. However, the authors did not clarify where and how they use loss in their experiments. + +Experiments on VisDA-17 and GTA5 supported the hypothesis: though assisted with ImageNet initialization, fine-tuning on synthetic images tends to give collapsed features with poor diversity in sharp contrast to training with real images. This indicates that the diversity of learned representation could play an important role in synthetic-to-real generalization. Their experiments showed that the proposed framework can improve generalization by leveraging this inductive bias and can outperform previous state-of-the-arts without bells and whistles. + +I also feel more analysis and insights could have been provided for the segmentation experiments in 4.2. Currently there is no more information beyond Table 5. For example, some feature diversity measure like Table 1 could be reported for segmentation too, since revealing the feature diversity inductive bias is the main novelty in this paper. Also, more clarifications are needed on comparing the settings fairly with prior work like Pan et al. (2018) and Yue et al. (2019). +",6,4.0,ICLR2021 +r1x93hOWtH,1,HJgySxSKvB,HJgySxSKvB,Official Blind Review #2,"In this paper, the authors propose generalize the FM to consider both interaction between features and interaction between samples. For the interaction between features, the authors propose to use graph convolution to capture high-order feature interactions. Moreover, the authors construct a graph on the instances based on similarity. Then a GCN is applied to the sample graph where the feature embedding is shared between the two components. Experiments are carried out on four datasets with tasks of link prediction and regression. Comparison to several baselines demonstrate the superior performance of the proposed method. + +Strength: +1. The idea of utilizing GCN on the feature co-occurrence graph is interesting and innovative. The idea could possibly be combined with other variants of Deep FM models. +2. 
It is an interesting idea to combine sample similarity with feature co-occurrence for better prediction accuracy. + +Weakness: +1. Many descriptions in the paper are not very clear. First, the authors only mention how prediction is carried out with trained parameters; there is no description of the training process, such as what target is used for the two components. What is the training procedure? Are the two components trained jointly? Second, the authors provide little description of how the sample similarity graph is constructed, except for the Ad campaign dataset. Third, it is not clear how the link prediction evaluation is carried out. From the size of the graph, the authors seem to include both users and items in the graph. However, users and items have disjoint feature sets. It is not clear how the GCN is computed for the heterogeneous nodes in the graph. Moreover, is link prediction carried out by taking the inner product (cosine similarity) of the final representations? +2. For equation (8) in section 4.1, why do we need to compute h_i^{RFI}? This should be the feature representation of sample i. However, the average is computed without including sample i itself. Also, are the neighbors defined in the sample similarity graph? Should we use the sample interaction in section 4.2 to capture that? +3. Though it is an interesting idea to use graph convolution on the feature co-occurrence graph, it would be much better if the authors could provide more intuition on the output of the GCN. It would be helpful to study a few simple cases, such as the case without non-linearity. Is it a generalization of high-order FM without non-linearity? Also, it would be interesting to see experimental results using the graph-convolved feature representation directly as the final representation. Some visualization of the learned feature embeddings would also help. +4. The authors should carry out an ablation study for the different components of the model. Moreover, it would be much better if the authors could carry out experiments on some widely used recommendation datasets and use standard evaluation metrics for ranking. +",1,,ICLR2020 +2vlqdG32EuB,2,VqzVhqxkjH1,VqzVhqxkjH1,"the threat model is unclear","This paper proposes a fingerprinting approach for identifying stolen models. To distinguish stolen models from reference models, the proposed approach generates conferrable adversarial examples, which can be transferred only to the stolen models, but not to the reference models. + +Pros: +1. The idea of using adversarial examples to identify stolen models is interesting. +2. The paper provides comprehensive experiments, including model extraction attacks, adaptive model extraction attacks, etc. The results outperform popular adversarial attacks such as FGM, PGD, and CW. + +Cons: +1. The key concern about the paper is the unclear threat model. Why does the model stealing attacker have white-box access to the source model? In general, white-box access means the attacker has all the information about the source model. The definition of “strong attacker” is confusing: if the attacker requires access to domain data, then why is the attacker strong? Do attackers have access to the data? It seems the attackers have only the input data and partial label data, which is a strong assumption. Many recent works show that model stealing attacks can use surrogate datasets to extract the victim models. Can these attacks be detected by the proposed approach? +2. The paper is very hard to follow. 
Many definitions are missing in the paper. +3. What is Transfer(S, x; t) in Eq (1)? What is Classify(S, x) in Eq (2) and (3)? What is H in Eq (5)? +4. In the conclusion section, should CW-L2 be L-infinity?",6,3.0,ICLR2021 +rJx3GcA6KS,2,ByxJO3VFwB,ByxJO3VFwB,Official Blind Review #27," +Main contribution of the paper +- The paper argues that the base assumption held by existing methods (Lee et al., 2018), namely that the activated elements (activations) in the hidden layers are i.i.d., is not convincing. +- Instead, the author proposes a new way to probabilistically model the hidden layers, activations, and layer/layer connections. +- Based on the probabilistic model, the paper proposes a new regularizer. + +Methods +- The author argues that the activations are not i.i.d. by empirically showing that the trained MLP (in most cases) is not uncorrelated. +- The author proposes a new probabilistic model for MLPs and CNNs, assuming a Gibbs distribution for each activation and a product-of-experts (PoE) model to explain the layer/layer relationship. +- According to their model, a CNN can be explained by an MRF model. +- The author proposes a regularization term regarding the layer/layer connection. +- They argue that SGD training can be seen as a first-order approximation of the inference of the hidden activations in the MLP. + +Questions +- See the Concerns + +Strongpoints +- The probabilistic explanation of the MLP and the CNN seems novel and was interesting to the reviewer. +- The proposed explanation assumes a weaker condition compared to the existing methods. + +Concerns +- The main concern is that the reviewer is not fully convinced that the i.i.d. assumption is wrong. Even though the trained MLP does not support the i.i.d. condition, one can suppose that the reason is the typical training method (SGD), which just finds local minima in a deterministic way. Maybe the proof in Appendix G supports the argument of the author, but the reviewer failed to clearly agree with the argument. A clearer explanation regarding this issue would be required. +- As far as the reviewer understands, the paper proposes a probabilistic (Bayesian) model for explaining the MLP, but it seems that they just used SGD for training the model. In that case, the reviewer is a little suspicious of the role of the proposed regularization, in that the regularization comes from a Bayesian formulation but the model was trained in a deterministic way. The reviewer wants to ask the author: (1) is it possible to infer the model in a Bayesian manner, such as by sampling? (2) Is there any justification for using SGD when conducting the experiments regarding the regularization? If it is related to Appendix G, a clearer explanation would be appreciated. +- As far as the reviewer understands, the regularization deals with the practical part of the paper. It would be better to see the effect of the regularization on widely used networks such as a small-layered ResNet or others. If the proposed formulation has other practical strong points, it would be nice to clarify them. +- The explanation using the Gibbs distribution and PoE looks similar to an RBM. The reviewer strongly wants a clear explanation of the differences and the strong points compared to an RBM. + +Conclusion +- The author proposed a new probabilistic explanation of the neural network, which seems novel and worth reporting. +- However, the reviewer failed to fully agree on some steps in the process of the paper. 
+Therefore, the reviewer temporary rates the paper as weak-reject, but this can be adjusted after seeing the answers of the author. + +Inquiries +- See the concerns parts.",6,,ICLR2020 +3oSIzHJI2HE,1,_CrmWaJ2uvP,_CrmWaJ2uvP,Clear reject. Lacks clarity. Findings not likely to be very general (weak significance).,"The authors present a method for incorporating basic concepts from linear systems theory into the standard structure for training artificial neural networks. They compare results of their approach against standard approaches for 3 simple datasets. + +Broadly speaking the work is quite unclear, and takes several passes over to have a basic sense of the approach. There are too many shortcomings to enumerate them all, so I will just present one. Figure 5 is presented in the section ""example architecture"" which might lead one to believe the authors implement this network (which it appears they do not). I believe this is included only to indicate a hypothetical architecture, but the presentation is too poor to glean this with any quickness. This is of course, in itself, not sufficient grounds for rejection, but speaks broadly to the poor presentation of the work. It does not seem ready for publication. + +As for the significance, the work clearly falls short. Although the motivation of constructing a ""more explainable model"" is a good one, this should not come at an extreme cost of model expressivity. It seems obvious that richer models, such as LSTMs etc., correctly trained, should be able to account for the linear transformations the authors include in their ""novel layer."" That their work is competitive with these richer models is simply an indication of the simplicity of the tasks they chose, which (as far as I can tell) can all be accounted for using linear systems analysis (although it's hard to say, since they work so poorly explains the second two tasks). It's completely unclear how effective the authors' approach would be over standard, richer models, on tasks that cannot be accounted for by linear systems analysis, and I am doubtful that the suggested approach could offer much over these richer models. + +Likewise, an alternative view of the authors' work is as a learnable filter bank applied to data to create a representation of the data better suited for post-hoc learning with a richer model, which is certainly an useful idea, but it is not clear to me (and the authors haven't shown) that their choice for this filter-bank is superior to many other choices (e.g. convolutional layers applied prior to FC layers, which is standard for deep networks).",3,5.0,ICLR2021 +HJxj6VX0tr,2,rJerHlrYwH,rJerHlrYwH,Official Blind Review #1,"The authors augment contrastive predictive coding (CPC), a recent representation learning technique organized around making local representations maximally useful for predicting other nearby representations, and evaluates their augmented architecture in several image classification problems. Although the modifications to CPC aren't particularly original, the authors show first that these yield a significant improvement in linear classification accuracy. They then use this improved model to obtain impressive performance in classification within semi-supervised and transfer learning settings, giving strong support for the use of such methods within image processing applications. 
+ +Pros: +Owing to its generality (CPC assumes only a weak spatial prior in the input data), and cheap computational cost relative to earlier generative approaches, CPC is already a promising unsupervised representation learning technique. The paper gives more evidence of this usefulness for image data, yielding leading performance on several different image classification benchmarks. + +The authors also make the observation that linear separability, the standard benchmark for evaluating unsupervised representations, correlates poorly with efficient prediction in the presence of limited labeled data. This observation should be of interest in the broader community, and points to the need for more diverse metrics for unsupervised representations. + +Cons: +The improvements given in the paper are quite useful within their stated domain (image data), but aren't directly applicable to other types of input data. Although the authors make a point of emphasizing the relevance of CPC for other problem domains, they don't currently provide any suggestions for how this current work could be generalized to handle these other cases. In this sense, I think it is a bit deceptive to refer to their model as ""CPC v2"", as the majority of their changes have no bearing on the intrinsic CPC algorithm itself. + +I am sure that some of the methods used here could lead to improvements in the use of CPC for other types of data, but the authors currently don't provide any insight on this issue. In line with that, I think their work would be improved by some commentary on this, in particular by any concrete suggestions they have about how similar augmentations to CPC could be carried out in text, audio, and/or video data. + +Verdict: +Owing to the reasons given above, I recommend acceptance. + +Minor suggestions: +Please use a different color scheme for your figures that is still meaningful if the paper is printed in greyscale.",6,,ICLR2020 +HkxHBiPAKr,2,rkg-TJBFPB,rkg-TJBFPB,Official Blind Review #3,"This paper proposes a new intrinsic reward method for model-free reinforcement learning agents in environments with sparse reward. The method, Impact-Driven Exploration, learns a state representation of the environment separate from the agent to be trained, based on a combied forward and inverse dynamics loss. The agent is then separately trained with a reward encouraging sequences of actions that maximally change the learned state. + +Like other latent state transition models (Pathak et al. 2017), RIDE learns a state representation based on a combined forward and inverse dynamics loss. However, Pathak et al. rewards the agent for taking actions that lead to large difference between the actual next state and the predicted next state. RIDE instead rewards the agent for taking actions that lead to a large difference between the actual next state and the current state. However, because rewarding one-step state differences may cause an agent to loop between two maximally-different states, the RIDE loss term is augmented with a state visitation count term, which decreases intrinsic reward for a state based on the number of times that state has been visited in the current episode. + +The experiments compare RIDE to a selection of other intrinsic reward methods in the MiniGrid, Mario, and VizDoom environments. RIDE provides improved performance on a number of tasks, and solves challenging versions of the MiniGrid tasks that are not solved by other algorithms. + +Decision: Weak Accept. 
+ +The main weakness of the paper seems to be a limitation in novelty. +Previous papers such as (Pathak et al. 2017) have trained RL policies using an implicit reward based on learned latent states. Previous papers such as (Marino et al. 2019) have used difference between subsequent states as an implicit reward for training an RL policy. It is not a large leap to combine these two ideas by training with difference between subsequent learned states. However, this paper seems to be the first to do so. + +Strengths: +The experiments section is very thorough, and the visualizations of state counts and intrinsic reward returns are insightful. +The results appear to be state of the art for RL agents on the larger MiniGridWorld tasks. +The paper is clearly-written and easy to follow. +The Mario environment result discussed in section 6.2 is interesting in its own right, and provides some insight into previous work. + +Despite the limited novelty of the IDE reward term, the experiments and analysis provide insight into the behavior of trained agents and the results seem to improve on existing methods. +Overall, the paper seems like a worthwhile contribution. + +Notes: +In section 2 paragraph 4, ""sintrinsic"" should be ""intrinsic"". +In section 3, at ""minimizes its discounted expected return,"" seems like it should be ""maximizes"". +The explanation of IMPALA (Espeholt et al., 2018) should occur before the references to IMPALA on page 5. +Labels for the axes in figures 4 and 6 would be helpful for readability. + +The motivation for augmenting the RIDE reward with an episodic count term is that the IDE loss alone would cause an agent to loop between two maximally different states. +It would be interesting to know whether this suspected behavior actually occurs in practice, and how much the episodic count term changes this behavior. +It is surprising that in the ablation in section A.5, removing the state count term does not lead to the expected behavior of looping between two states, but instead the agent converges to the same behavior as without the state count term. + +Also, in Figure 9, was the OnlyEpisodicCounts ablation model subjected to the same grid search described in A.2, or was it trained with the same intrinsic reward coefficient as the other models? +Based on the values in Table 4, it seems like replacing the L2 term with 1 without changing the reward coefficient would multiply the intrinsic reward by a large value. +",6,,ICLR2020 +qRc354yIWoL,1,8YFhXYe1Ps,8YFhXYe1Ps,Interesting idea needing more work,"Update after revision +------------------------------ +I thank the authors for their work on this paper. The second reading was more pleasant. I agree with the authors that performing a user-study is an important effort, that should be encouraged. I however still believe that, if not benefitial to the user, the complexity of the method can be a drawback. I also wished that more comparisons, but especially other data modalities were investigated. I have updated my rating to reflect the improvement in the text. + +Short summary +----------------------- +The authors propose a technique based on an invertible network to provide counterfactuals relative to one class of interest. The counterfactuals can be interpolated across an isosurface, displaying parameters which do not affect the model’s decision. 
The authors propose an attribution map based on those counterfactuals and evaluate counterfactuals in a qualitative manner, based on their own observations on 3 datasets, as well as based on a human-grounded evaluation on a synthetic dataset. + +Strengths +--------------- +The use of an invertible dataset is rather novel in the field of explainability, and the relationship between the obtained counterfactuals and gradient-based interpolation methods is interesting. The human-grounded evaluation is definitely a large undertaking that is not often performed to assess the usefulness of interpretability techniques. + +Weaknesses +------------------- +I have identified several weaknesses of the work that justify my recommendation: +- the (lack of) clarity of the text. +- the assessment of the technique, as the results of the human-grounded evaluation are mixed, with users not being significantly more accurate in finding confounding factors compared to a baseline technique. +- the limitations of the technique, not discussed in depth. For instance, I can see difficulties in evaluating the effect of classes that are not present as “training classes” in the dataset, which requires a large labeling effort. In addition, how the technique would transpose to non-image datasets, or whether there are limitations in the invertible architectures to consider should be mentioned. + +Novelty +----------- +The “Related works” section is rather limited, which makes it difficult to evaluate. In general, the use of invertible networks as interpretable networks is novel. + +Clarity +--------- +Clarity was a major weakness of this work for me: +- the datasets are illustrated in figures but not mentioned until much later +- the maths are described in sections that seem unrelated to each other, without depicting the relationships between the different steps +- multiple concepts are unclear (see detailed comments) +- the motivations are not clearly explained + +Rigor +-------- +I found the qualitative evaluation on the 3 datasets unconvincing, as it is unclear whether the same conclusions could not have been reached using other techniques. +While I was most interested by the discussion around the generation of counterfactuals based on the invertible network compared to based on the integration of gradients, I wished there was a definition of an “ideal” counterfactual, qualitative or (preferably) quantitative. The single example provided in the main text is appealing but this requires more evidence to me. +Finally, the “saliency” maps defined in this work do not seem to be used later on in the work. I doubt that looking at them would improve human evaluation of a model’s behavior. + +Detailed comments +----------------------------- +- Counterfactuals: their quality seems subject to appreciation and confirmation bias, especially on potentially cherry picked examples. To assess their quality, I would suggest to use the BAM dataset (Yang and Kim, 2019, https://github.com/google-research-datasets/bam) which was generated to benchmark attribution methods. I would overall suggest the use of this dataset for assessing the faithfulness (sensitivity, specificity) of the proposed approach. +- The choice of the mice dataset should be justified as this doesn’t seem like an obvious choice to assess the quality of attribution techniques. It is quite difficult to estimate any effect, and feels like qualitative evaluation is biased by the authors’ remarks given the lack of knowledge of the problem. 
+- There should be more details about the Two4Two dataset and its motivations, as well as how it relates to other datasets (e.g. Goyal et al., 2019) +- How does the proposed approach relate to “completeness” (Sundararajan et al., 2017)? +- What is the mathematical justification to resize the saliency map of an intermediate layer to the input resolution? Is there a citation for this process showing that this is a reasonable assumption? +- I am confused by the section on saliency maps: what does h represent? The activations at an intermediate layer? The motivation is unclear: what are the authors trying to highlight in these “saliency maps”? Are these computed attributions or are these L1 distance between activations (in %) between x and x_tilde? Or is it a cosine distance (as suggested by the next sentence mentioning the angle?) +- The tasks used for illustration are not described in the text. Examples of y and epsilon should be provided. +- Is the technique limited to the model’s predicted classes? +- How is “ideal” counterfactual described and mathematically verified? +- The relationship between counterfactuals and e.g. integrated gradients is unclear: the first clearly needs a model that can generate data, while the latter integrates the gradients between a baseline (defined by the user) and the input. More details and explanations are required to make this relationship clearer. +- What are the participants in the human-based study viewing? Are they comparing the counterfactuals to e.g. SmoothGrad maps, or the saliency as defined per the proposed approach? +- It is unclear what the participants answered: Figure 5a mentions that the main score is “strongly disagree” for “arms” (both baseline and interpolation) while the text refers to “strongly agree”. Example questions would help. +- The results of the human-grounded study are not very conclusive. Note: please correct for multiple comparisons due to multiple statistical testing of the same effect. +- Kim et al., 2018 already displayed that human users were performing poorly at identifying a network’s decision behavior based on saliency maps. A better comparison could have relied on TCAV instead, especially as the concepts can easily be mapped to the features given the synthetic dataset. This could have made a stronger case for the use of invertible networks, especially as Goyash et al (2019) mention the use of counterfactuals based on concepts. +- How about non-image datasets? + +Minor +------- +- Intro: I would suggest using “transparency” rather than “interpretability” when referring to logistic regression (e.g. Lipton, 2016). The interpretability of linear model weights is indeed debatable, as weights will depend on the regularization and signal-to-noise ratio in the data (Haufe et al., 2014). +- No clear flow between the different works in the intro. No clear motivation behind counterfactuals. +- proofreading: paper is quite hard to follow and minor changes to grammar (e.g. “Their similarity is easy to seen”) makes it more difficult to assess. The quality of the writing deteriorates in sections 3, 4 and 5. +- It is unclear what scale delta epsilon represents, and whether we can expect the norm of the different techniques to be comparable. 
",6,4.0,ICLR2021 +DDVq9bkjgRp,1,RLRXCV6DbEJ,RLRXCV6DbEJ,A strong empirical contribution on hierarchical VAEs,"Summary +-------------- + +This paper provides evidence that ""very deep"" hierarchical VAEs can outperform autoregressive and flow-based models albeit using less parameters on image density estimation tasks. + +It seems natural to think that a hierarchy of latent variables progressively compressing information would be useful for image modelling, with top latent variables capturing more abstract/general features and bottom latent variables capturing lower-level details. However, recent success of flow based and autoregressive based models such as PixelCNN seemed to invalidate the need of such hierarchy of latent variables and to ""compress"" pixel-level information. Here, the authors show that a simple hierarchical VAE architecture inspired by previously proposed ones can outperform autoregressive models if it's made sufficiently ""deep"". I think this is an important contribution. With respect to previous work, this work relates to the concurrently proposed ""Nouveau VAE"" but obtains better results with less parameters and considerably less involved customization of the architecture. The authors report impressive results on multiple datasets generally using less parameters than competing models. Additionally, sampling from very deep VAEs is considerably cheaper than in autoregressive models. + +The authors also attempt at showing that learnt latent variables implement a hierarchy of information which could be useful to have in general. This point is a bit weak and not well demonstrated in the paper. + +Pros +------ + +- Strong results on multiple tasks with a method that was previously thought to have plateaued in performance + +Cons +------- + +- Originality / novelty is a bit weak +- Clarity can be improved + + +Detailed Remarks +------------------------- + +- Figure 4 is not totally convincing as high-level features in the first image are not always maintained in higher-resolution realizations (sample in the first row seem to have glasses then they disappear?). Could you include more samples to back this claim? Do you think of a way of understanding whether high-level variables maintain general info (maybe by probing the posterior samples for some downstream attribute ?) + +- I find it hard to understand what is going on in Table 1 (left). In Section 5.1, referring to Table 1, what do you mean by ""grouping layers to output variables independently instead of conditioning on each other"" ? In Table 1, what do you mean by ""masking"" in the sentence ""with masking introduced such that the effective stochastic depth is lower"" ? I cannot find any other references to masking. + +- In Figure 3, what do you use as the pooling operation? (2, 2) max pooling ? + +- An ablation study of the proposed modifications to the architecture and training tricks would be useful, e.g. what's the most single important modification that makes the model work ? Is it the neighbour upsampling ? Is it the 1/\sqrt(N) init of the last layer ? Is it the skipping gradient trick ? + +- How is Figure 4 obtained ? When you say ""The rest of the high-resolution variables can be output in parallel, largely independent of each other"", are you referring to the fact that you sample from the top 1x1 layer and then sample independently the other zs from the learnt prior e.g. p(z_4x4) ... p(z_64x64) without ancestral sampling ? 
+ +- Section A.1: ""Without loss of generality, we simplify notation by assuming each vector-valued latent variable zi +only has one element, which we write as zi"", do you mean each latent variable is \in R ? It'd be good to mention that +you assume an architecture with an auto-regressive learnable prior p_\theta(z_i | z_ ""a learned"" +- Section 4.1: ""the the"" -> ""the""",7,4.0,ICLR2021 +ryxFdPLRYH,2,B1lgUkBFwr,B1lgUkBFwr,Official Blind Review #4,"*Summary.* The paper presents and addresses the problem of performing domain adaptation when the target domain is systematically (i.e., not the result of a stochastic process) missing subsets of the data. The issue is motivated by applications where one modality of data becomes unavailable in the target domain (e.g., when deciding which ads to serve to new users, the predictor may have access to behavior across other websites but not on a specific merchant's website). The proposed method learns to map source and target data to a latent space where the representations for the source and target are aligned, the missing components of the target can be inferred, and classification can be performed successfully. These are achieved by adversarial/optimal transport loss on source and target features, a mean-squared error and adversarial loss on latent generation/imputation, and a cross entropy loss on source label prediction, respectively. Experiments are performed on digits and click-through rate (CTR) prediction and include a thorough set of baselines/oracles for comparison. + +*Review.* While the problem statement is novel, I am unconvinced that the advertising experiment includes both a domain adaptation and imputation problem. I describe this in detail below. For this reason, I am giving the paper a weak reject. + +*Questions that impacted rating.* +1. Ads experiment: From my understanding, the source domain is the traffic of users who have interacted with (clicked through to?) a specific partner and the target domain is the traffic of the users who have not interacted with that specific partner. The data that needs to be imputed is the click through rate for target users with that specific partner. In this case, it is not obvious to me why there is a domain shift between these two groups of users. This would imply that the traffic of source users and target users is different for other partners. I don't see why this would need to be true. Could the authors provide an explanation as to why this is the case (e.g., by showing that CTRs differ with other (partner, publisher) pairs between source and target). From my understanding, Table 5 only shows CTR averaged across all users in each domain, but does not show that the CTRs differ between source and target users for contexts/(partner, publisher) pairs (i.e., the results in table 5 could be due to the fact that the prior distribution over context is different for source and target users). + +*Additional notes. Immaterial to rating.* +1. I personally felt that the motivation for UDA vs imputation in the first paragraph was a bit muddled. I think sticking to one example would make the motivation more clear to the reader. 
E.g., explain the prediction problem for medical imaging (which I assume is disease diagnosis, but it is not stated explicitly), describe how some medical imaging may be missing for certain patients (imputation), then explain that there may be noise across different medical imaging systems (UDA), then list the other applications where this arises with citations (e.g., These phenomena have also been documented in advertising applications [1], ...). +2. I was surprised by the difference between Adaptation-Partial and the other two train/test conditions in Figure 2 when p=30%. Out of curiosity, do the authors have an explanation for this discrepancy? I would have predicted that, if most of the information necessary for prediction was available in the remaining 70% of the image that the performance of these cases would be very similar. I think it would be helpful to see the accuracy on the source domain and the labeled target domain to better understand that result.",3,,ICLR2020 +H1xeyoxG9S,3,H1eWGREFvB,H1eWGREFvB,Official Blind Review #2,"I have the rebuttal of the authors, the paper improved indeed and some point on role of M is better clarified now although it is still a bit convoluted. The paper would be stronger if the analysis shows any theoretical advantage to the presented method. I think the author put a good effort in addressing some of my concerns and I m raising my score to 6. + + +#### +Summary of the paper: + +The paper proposes stein self repulsive dynamics for sampling from an unnormalized distribution. The method starts by using Langevin dynamic for up to time $Mc$ and then uses those pasts samples to guide the trajectories of the langevin sampling to explore new areas of the densities using the stein witness function between the current particles and the past samples( similar to Stein Variational gradient descent). + +The paper analyses the mixing properties under standard assumptions of the potential of the Boltzmann distribution and the kernel used in the Stein discrepancy, and shows convergence to the boltzman distribution as the number of particles goes to infinity and the step size goes to zero. + +Authors validate their methods and show that it indeed explores on a synthetic example new areas wrt to pure langevin dynamics. Applications in sampling from the posterior of bayesian neural networks and contextual bandits compare the performance of the proposed method to langevin dynamic and pure stein descent favorably. + +Clarity/presentation : + +The paper is well written and the intuition are well presented . + +The notation $\hat{delta}_{M} $ is not great in denoting direct measures, please using another symbol. + +My main concerns with the paper are the following: + +- the definition of $\bar{\delta}_{M}$ averages only $M$ particles choosen for $M$ time stamps. Something is off here for the continuous approximation to work at each time stamp you need $N$ particles and then you have a past horizon $M$. As $M\to \infty$ this does not matter, but I think in your implementation you are considering at each time step $N$ particles and you average on a horizon of size $M$. is this correct? Please clarify? + +- The theorem show only asymptotic behavior and don't quantify the intuition behind the paper , that the ""coverage of the samples"" is higher. 
+ Can you for instance bound the wasserstein distance between the pure langevin and Stein Repulsive dynamic , and between SVGD and your method, as it was done in ""The promises and pitfalls of Stochastic Gradient Langevin Dynamics"". Basically you can find a coupling between trajectories of your methods and the langevin dynamics and bound the wasserstein distance between the two methods. This will be insightful to see if one would mix faster with respect to the other one. + +While the appendices of the paper are lengthy I don't think they explain the most selling point of the method, since they are asymptotic and there is a confusion between the time horizon for the past , and the number of particles. if $M$ is $\infty$ in this defintion , we are just running langevin dynamic, and not using Stein witness function. I see an important issue in this definition of the time horizon, I think this time horizon should be finite, and that from time steps less we sample N particles , and we let this $N$ go to infinity. +",6,,ICLR2020 +8dBWt3UAeMD,1,GFsU8a0sGB,GFsU8a0sGB,A well motivated and effective method for deriving local updates on the client,"The authors propose a new method of generating local (client) updates in Federated Learning (FL), where the clients return an adjusted version of their usual local updates to the server. The authors derive this new local update rigorously from the viewpoint of estimating the posterior distribution of the data (under Gaussianity assumptions). They also provide an efficient method for calculating this new update, and show that it outperforms Federated Averaging on several datasets. + +Pros: + +The new method (FEDPA) is well motivated and is 'as simple as possible, but not simpler' based on the derivation. + +FEDPA has standard Federated Averaging (FEDAVG) as a special case, and suggests a family of new methods based on approximations of the covariance matrix, which likely would exhibit bias-variance-tradeoff-esque behaviour. + +The authors provide a practical way of calculating the required new quantities efficiently. + +Experimental results show FEDPA has superior performance compared to FEDAVG, especially in regimes where client compute is high. + +Cons: + +The big O analysis of the dynamic programming method for computing the local updates is very useful, but it would also be good to have empirical results on the additional cost on the client. It seems like the cost should not be too significant, but evidence of this would be very valuable, since the cost of FEDPA is strictly greater than the cost of FEDAVG. + +The addition of more tuning parameters in FL is never ideal, especially with the knowledge that using too small of a burn in time can lead to arbitrarily bad behaviour. However since the positive effects of FEDPA appear very quickly after the burn in ends, this may be less of a concern in practice. + +Would be valuable to include the results of the FEDAVG-1E in the experiments in the main paper, especially since it outperforms both FEDAVG-5E and FEDPA-5E on the StackOverflow LR, which is a surprising and unexpected result. + +In all experiments the FEDAVG locally updated using SGD (likely to maintain the connection and comparison with SGD used in IASG). However for FEDAVG, the optimization procedure CLIENTOPT could be something else, such as adam. It would be valuable to know how this FEDPA compares to FEDAVG when the local optimizer is a more powerful method than SGD, since FEDAVG has the freedom to change the local optimizer. 
(Of course it is only fair to also empirically compare FEDPA where the CLIENTOPT can be the same method. However since the use of SGD in IASG provides it with certain properties, it is less clear if this substitution can be safely made). + +The authors provide ample justification for their new method and sufficient evidence that it outperforms the existing standard method FEDAVG. ",7,3.0,ICLR2021 +r1g2osJOnX,1,Syzn9i05Ym,Syzn9i05Ym,Incremental Contribution,"The paper proposes the inclusive neural random field model. Compared the existing work, the model is different because of the use of the inclusive-divergence minimization for the generative model and the use of stochastic gradient Langevin dynamics (SGLD) and stochastic gradient Hamiltonian Monte Carlo (SGHMC) for sampling. Experimental results are reported for unsupervised, semi-supervised, and supervised learning problems on both synthetic and real-world datasets. Specific comments follow: + +1. A major concern of the reviewer is that, given the related work mentioned in Section 3, whether the proposed method exerts substantial enough contribution to be published at ICLR. The proposed method seems like an incremental extension of existing works. + +2. A major claim by the authors is that the proposed techniques can help explore various modes in the distribution. However, this claim can only seem easily substantiated by experiments on synthetic data. It is unclear whether this claim is true in principle or in reality. + +Other points: +3. the experimental results of the proposed method seems marginally better or comparable to existing methods, which call in question the necessity of the proposed method. + +4. more introduction to the formulation of the inclusive-divergence minimization problem could be helpful. The presentation should be self-contained. + +5. what makes some of the statistics in the tables unobtainable or unreported? + + +============= After Reading Response from Authors ==================== + +The reviewer would like to thank the authors for their response. However, the reviewer is not convinced by the authors’ argument. + +“The target NRF model, the generator and the sampler are all different.” +It is understandable that modeling continuous data can be substantially different from modeling discrete data. Therefore, it is non-surprising that the problem formulations are different. + +As for SGLD/SGHMC and the corresponding asymptotic theoretical guarantees, this reviewer agrees with reviewer 2’s perspective that it is a contribution made by this paper. But this reviewer is not sure whether such a contribution is substantial enough to motivate acceptance. + +The explanation for better mode exploration of the proposed method given by the authors are the sentences from the original paper. The reviewer is aware of this part of the paper but unconvinced. + +In terms of experiments, sample generation quality seems to be marginally better. Performances in multiple learning settings are comparable to existing methods. + +A general advice on future revision of this paper is to be more focus, concrete, and elaborative about the major contribution of the paper. The current paper aims at claiming many contributions under many settings. But the reviewer did not find any of them substantial enough. 
+ +",5,3.0,ICLR2019 +rJ7ZTaYxf,1,SyqShMZRb,SyqShMZRb,Interesting idea but poor presentation,"The paper presents an approach for improving variational autoencoders for structured data that provide an output that is both syntactically valid and semantically reasonable. The idea presented seems to have merit , however, I found the presentation lacking. Many sentences are poorly written making the paper hard to read, especially when not familiar with the presented methods. The experimental section could be organized better. I didn't like that two types of experiment are now presented in parallel. Finally, the paper stops abruptly without any final discussion and/or conclusion. ",3,2.0,ICLR2018 +acgIJLTnQSx,4,6Lhv4x2_9pw,6Lhv4x2_9pw,Review on the usage of BNNs to gain insights on the physical procedure of earthquake ruptures through the study of the parameters,"### Summary +In the present paper, the author intends to get further insights into the physics behind earthquake ruptures using a BNN to model simulated data from the literature. By using a BNN, the parameters of the model are not deterministic scalar values, but complete probability distributions. Studying the change of the distributions in the parameters before and after training, the author tries to extract information about the relative importance of the input variables, and also comprehend the physical mechanisms behind earthquake ruptures. Results are shown in figure 3, on which the change of behavior of the distributions of the parameters can be observed, as well as in figure 4 where the mean and standard deviations for all the parameters are presented. The pattern in figure 6 seems to indicate that variables previously thought to be important in the task of predicting the presence of the rupture, such as normal stress and friction, are also pointed out as being important in this case. Finally, the authors also claim an improvement in the F1 metric in comparison to previous NN methods. + +##### Pros +* The idea of the paper seems well directed, i.e., gaining insight on complex physical procedures using an approach that results in the combination of NNs and a Bayesian approach. +* Using a Bayesian approach is a good way of dealing with small datasets, and also allows to account for the uncertainty of all the latent parameters, while also providing more robust and sensible predictions when new data is presented +* The approach seems to provide results consistent with the literature findings regarding the important variables in the prediction +* The final performance of the algorithm seems to improve on the previous state-of-the-art methods by taking advantage of the properties that the Bayesian approach offers + +##### Cons +* The key concern with this paper is that NNs, as well as BNNs, are notoriously black-box algorithms with no easy way of interpreting the inner parameters in most cases. Taking this into consideration, I would suggest the author to motivate in a stronger manner why the usage of BNNs is desirable for the proposed problem, and why not use other already established Bayesian approaches to assess the importance of the input variables. +* Taking into account the previous point, I consider there is a general lack of rigorous experiments that could, in principle, suggest a clear advantage of using BNNs instead of any other approaches. No systematic comparisons with previous methods are present, such as for example with the Random Forest Feature Importance algorithm, which is mentioned a couple of times. 
If the main goal is to gain insights on the main variables involved in the presence of an earthquake rupture, I would expect a more detailed analysis comparing how good these insights provided by BNNs are and how do they stand in comparison with the established literature. +* Other basic techniques for assessing the importance of the variables in the prediction tasks are not mentioned, although it would be nice to use them as a baseline to compare against. Examples such as PCA, LOO cross-validation and others could be used here. +* The claim of an improvement of 2.34% w.r.t. NNs is not strongly addressed, since the NN experiments are not included here or, at least, there is no mention of the setup of these NNs. As before, there is no systematic comparison between the BNNs trained and the NNs that are used as baselines. +* There is a lengthy discussion on how to obtain the ELBO for VI. However, in the end, there is no final expression for the loss function which is going to be employed. I would appreciate in section 4 an explicit description of the objective of the system since there's no mention of the final binary classification problem anywhere. +* The prediction uncertainties lack a systematic evaluation as well since all that is provided is presented in figure 5. How well do the predictions provided stand against other methods for obtaining final predictive distributions? +* VI is a method whose performance and final predictions are constrained due to its formulation. Is there any reason why using VI instead of any other approach to BNNs? In case that we wanted to study the final predictive distributions, why not use HMC or other, more flexible approaches than VI? +* At the end of the section 5.1, first paragraph, it is claimed that ""positive and high magnitude weights contribute to the earthquake rupture and vice versa"". This sentence seems a bit confusing since it seems to imply a causal relation between the high magnitude of the weights and the appearing of ruptures. This, I think, is the other way around: very clear rupture conditions imply positive high magnitude weights, which, in turn, return a higher predicted probability of rupture. + +##### Minor comments: +* Even though the paper tackles physical phenomena such as earthquake rupture, it does not provide any description of such process or the variables involved. Concepts such as ""nucleation"" and ""fault barrier"" should be at least briefly introduced, as well as the ""slip weakening law"" or the ""critical slip distance"". A short description of these terms and their relevance to the problem would help to interpret the final results obtained. Also, explicit expressions for the rupture physics would help a lot in section 2 to understand the different roles of the variables and their relations. +* Throughout the whole text it is used the first person while writing. In case there is only one author this can be okay, I only point it out since it seems to be a uncommon choice. +* There are a lot of typos all through the paper! Please perform a careful reading and correct them. +* The description on figure 4 is confusing, does not seem to correspond to the presented images (either that or the text is unclear when selecting the important parts of the figures for the nodes mentioned). +* 4th paragraph of introduction - not all ML algorithms are black boxes! NN are, but other such as linear regression, decision trees, etc., can be very interpretable! 
+* 5th paragraph of introduction - ""exciting"" - avoid usage of these type of subjective adjectives all through the paper. +* 5th paragraph of introduction - BNNs may work better with fewer data, but we have to pay close attention to the prior formulation to not introduce unreasonable biases. +",4,3.0,ICLR2021 +Syg22hUs2Q,2,HkzSQhCcK7,HkzSQhCcK7,"Interesting new architecture, but some clarity issues","This paper introduces a new stochastic neural network architecture for sequence modeling. The model as depicted in figure 2 has a ladder-like sequence of deterministic convolutions bottom-up and stochastic Gaussian units top-down. + +I'm afraid I have a handful of questions about aspects of the architecture that I found confusing. I have a difficult time relating my understanding of the architecture described in figure 2 with the architecture shown in figure 1 and the description of the wavenet building blocks. My understanding of wavenet matches what is shown in the left of figure 1: the convolution layers d_t^l depend on the convolutional layers lower-down in the model, thus with each unit d^l having dependence which reaches further and further back in time as l increases. I don't understand how to reconcile this with the computation graph in figure 2, which proposes a model which is Markov! In figure 2, each d_{t-1}^l depends only on on the other d_{t-1} units and the value of x_{t-1}, which then (in the left diagram of figure 2) generate the following x_t, via the z_t^l. Where did the dilated convolutions go…? I thought at first this was just a simplification for the figure, but then in equation (4), there is d_t^l = Conv^{(l)}(d_t^{l-1}). Shouldn't this also depend on d_{t-1}^{l-1}…? or, where does the temporal information otherwise enter at all? The only indication I could find is in equation (13), which has a hidden unit defined as d_t^1 = Conv^{(1)}(x_{1:t}). + +Adding to my confusion, perhaps, is the way that the ""inference network"" and ""prior"" are described as separate models, but sharing parameters. It seems that, aside from the initial timesteps, there doesn't need to be any particular prior or inference network at all: there is simply a transition model from x_{t-1} to x_{t}, which would correspond to the Markov operator shown in the left and middle sections of figure 2. Why would you ever need the right third of figure 2? This is a model that estimates z_t given x_t. But, aside from at time 0, we already have a value x_{t-1}, and a model which we can use to estimate z_t given x_{t-1}…! + +What are the top-to-bottom functions f^{(l)} and f^{(o)}? Are these MLPs? + +I also was confused in the experiments by the >= and <= on the reported numbers. For example, in table 2, the text describes the values displayed as log-likelihoods, in which case the ELBO represents a lower bound. However, in that case, why is the bolded value the *lowest* log-likelihood? That would be the worst model, not the best — does table 2 actually show negative log-likelihoods, then? In which case, though, the numbers from the ELBO should be upper bounds, and the >= should be <=. Looking at figure 4, it seems like visually the STCN and VRNN have very good reconstructions, but the STCN-dense has visual artifacts; this would correspond with the numbers in table 2 being log-likelihoods (not negative), in which case I am confused only by the choice of which model to bold. + + + +UPDATE: + +Thanks for the clarifications and edits. 
FWIW I still find the depiction of the architecture in Figure 2 to be incredibly misleading, as well as the decision to omit dependencies from the distributions p and q at the top of page 5, as well as the use in table 3 of ""ELBO"" to refer to a *negative* log likelihood. +",6,3.0,ICLR2019 +BJgyrccuh7,1,Byxz4n09tQ,Byxz4n09tQ,"interesting idea, some important experiments missing","I like this paper. What the authors have done is of high quality. It is well written and clear. However, quite a lot of experiments are necessary to make this paper publishable in my opinion. + +Strenghts: +- The idea to use a GAN for model compression is something that many must have considered. It is good to see that someone has actually tried it and it works well. +- I think the compression score is definitely an interesting idea on how to compare GANs that can be of practical use in the future. +- The experimental results, which are currently in the paper, largely support what the authors are saying. + +Weaknesses: +- The authors don't compare how good this technique is in comparison to simple data augmentation. My suspicion is that the difference will be small. I realise, however, that the advantage of this method over data augmentation is that it is harder to do it for tabular data, for which the proposed method works well. Having said that, models for tabular data are usually quite simple in comparison to convnets, so compressing them would have less impact. +- The experiments on image data are done with CIFAR-10, which as of 2018 is kind of a toy data set. Moreover, I think the authors should try to push both the baselines and their technique much harder with hyperparameter tuning to understand what is the real benefit of what they are proposing. I suspect there is a lot of slack there. For comparison, Urban et al. [1] trained a two-layer fully connected network to 74% accuracy on CIFAR-10 using model compression. + +[1] Urban et al. Do Deep Convolutional Nets Really Need to be Deep (Or Even Convolutional)? 2016.",5,4.0,ICLR2019 +HJxBByxuaX,3,HJMXus0ct7,HJMXus0ct7,The paper is not written well and needs major modifications. The contribution of the paper which is analyzing RDA with arbitrary init point is a small incremental contribution. ,"iRDA Method for sparse convolutional neural networks + +This paper considers the problem of training a sparse neural network. The main motivation is that usually all state of the art neural network’s size or the number of weights is enormous and saving them in memory is costly. So it would be of great interest to train a sparse neural network. To do so, this paper proposed adding l1 regularizer to RDA method in order to encourage sparsity throughout training. Furthermore, they add an extra phase to RAD algorithm where they set the stochastic gradient of zero weights to be zero. They show experimentally that the method could give up to 95% sparsity while keeping the accuracy at an acceptable level. +More detail comments: + +1- In your analysis for the convergence, you totally ignored the second step. How do you show that with the second step still the method converge? + +2- \bar{w} which is used in the thm 1, is not introduced. + +3- In eq 5, you say g_t is subfunction. What is it? + +4- When does the algorithm switch from step 1 to step 2? + +5- In eq 35 what is \sigma? + +6- What is the relation between eq 23 and 24? The paper says 23 is an approximation for 24 but the result of 23 is a point and 24 is a function. + +7- What is MRDA in the Fig 1? 
+ +",3,5.0,ICLR2019 +HkxENuvTFH,1,H1g4M0EtPS,H1g4M0EtPS,Official Blind Review #1,"This paper employs Markov random fields to exploit the input data structure and further model the covariance structure of the gradients. This embeds covariance structures of input data space into the gradient operator for an adversary attack. +They further use this gradient operator with a fast gradient sign Method. The numerics show effective for using fewer queries to obtain high attack accuracy. This paper is well written with clear derivations. I suggest the publication of the paper. + + In fact, modeling the structure of an input space structure into the adversary attack is a good direction. For similar intuitions in this direction, I recommend a related work: + + ""A. Lin, Y. Dukler, W. Li, G. Montufar, Wasserstein Diffusion Tikhonov Regularization"" + ",8,,ICLR2020 +S1lfYa-IcH,3,rkgOlCVYvB,rkgOlCVYvB,Official Blind Review #1,"This paper studied the landscape of linear neural networks using algebraic geometry tools. They introduced a distinction between ""pure"" and ""spurious"" critical points. They showed that for quadratic loss, there are no spurious local minimum in both the filling and non-filling case. For other convex loss, there are no spurious local minimum in the filling case, but there are spurious local minimum in the non-filling case. They gave a precise description of the number of topologically connected components of the variety of global minima. + +My concern to this paper is that the landscape of linear neural networks, which is the subject of this paper, has already been analyzed in the literature. The final results of this paper, though derived using a different approach, are not new. Another limitation of this paper is, the approach of algebraic geometry used in the analysis seems hard to be generalized to non-linear multi-layers neural networks. + +A contribution of this paper is that the viewpoint of pure and spurious critical points made the landscape results of linear neural networks more intuitive. However, I don't have the expertise to assess whether this viewpoint was implicitly contained in the proof of previous results. Given this, I am not sure the contribution of this paper is enough for ICLR. + +I don't hold a strong opinion, since there could potentially be great ideas inside the algebraic geometry tools. +",3,,ICLR2020 +H1xhlGTQqH,3,rkxawlHKDr,rkxawlHKDr,Official Blind Review #1,"The paper proposes a straightforward method for end-to-end learning of active contours, based on predicting a dense field of 2D offsets, and then iteratively evolving the contour based on these offsets. A differentiable rendering formulation by Kato et al is employed to make the process of aligning a contour to a GT mask differentiable. + +The model shows rather compelling results on small datasets, and is very simple, with very strong parallels to active contours, which is a strength. The results improve those of DARNet, which to the best of my knowledge is the main published work in the space other than Curve-GCN. One thing that would be helpful, is to have an experiment on a large dataset, such as Cityscapes -- right now all the datasets are testing the model in only the small-data regime. Perhaps in a supplement, it would also help to do ablation of how input image / dense deformation resolution affects the result quality -- the input can be subsampled by powers of 2 for the experiment. 
+ +As Amlan Kar helpfully points out, the work heavily overlaps with his approach ""Fast Interactive Object Annotation with Curve-GCN"", CVPR 2019, which is not cited or compared to. Curve-GCN similarly utilizes differential rendering (only a different variant) to match the GT masks. To me, the main difference wrt Curve-GCN is that explicit dense displacement fields are generated by the net and used directly for the iterative refinement steps, while Curve-GCN leverages implicit feature embeddings and uses GCN layers for their iterative updates. A second main difference is that Curve-GCN supports splines and interactive editing, while the proposed approach does not. Beyond these, there are multiple other differences that the authors point out, but those are more of a technical nature. Unfortunately, without a more direct comparison, it is very difficult to evaluate the design choices in the two approaches, which I feel is necessary for proper understanding of the paper. + +AFTER REBUTTAL: The authors made additions that covered my concerns, so I have switched my recommendation. + +A few more minor clarity / presentation issues. +-- “The recent learning-based approaches are either non-competitive or proven to be effective in the specific settings of building segmentation"". It's not exactly clear what the point is in the context. Which ""learning-based approaches""? +-- Typo 'backpropogation'. +-- A little better explanation of how a differentiable renderer of Kato works would have been helpful. +-- Figure 3 is not referenced in the text, takes a little bit of thought why it is relevant (helps explain Fig 1, but maybe better to show it prior to Fig 1). +-- In Eq 4 it’s not clear what F is. (I see it is explained in Algorithm box, but that's much later) + + + + + +",8,,ICLR2020 +Bye_X4Dt2m,1,SyxwW2A5Km,SyxwW2A5Km,Concern of invalid evaluation and a weak contribution,"Quality: +- In 4.4, the authors have vigorously explored the space of hyperparameters. However, they do not describe how to determine the hyperparameters, e.g., set aside a validation set from a part of the training set and determine the hyperparameters using this validation set, while the authors split the two datasets into only training and test sets, respectively. Without this procedure, the results may overfit to the test set via repeated experiments. Even though the used datasets are of few-million, this procedure guarantees a minimum requirement for a reliable outcome from the proposed model. I firmly recommend the authors to update their results using a validation set to determine the hyperparameters and then report on the test set. Please describe these experimental details to ensure that the performed experiments are valid. + +Clarity: +- Overall, the writing can be improved via proof-reading and polishing the sentences. In Introduction section, ""there is little work applying..."" can be specified or rephrased with ""it is underexplored to apply"", and ""input features are not independent"" can be specified on what there are not independent. Moreover, the last two sentences in the second paragraph in the Introduction section is unclear what the authors want to argue: ""The combinations in linear models are then made by cross product over different fields. Due to the sparsity problem, the combinations rely on much manual work of domain experts."" +- The authors use top-k restriction (Shazeer et al., 2017) to consider sparse relationships among the features. 
For this reason, have you tried to use the L1 loss on the probability distributions, which are the outputs of softmax function? +- In 4.5, the authors said, they ""are in most concern of complementarity."" What is the reason for this idea and why not the ""relevance""? +- In Table 4, I'm afraid that I don't understand the content (three numbers in parenthesis) of the third column. How does each input x_i or x_j, or a tuple of them get their own CTR? + +Originality and significance: +- They apply self-attention to learn multiple categorical features to predict Click-Through-Rate (CTR) with a top-k non-zero similarity weight constraint to adapt to their categorical inputs. Due to this, the scientific contribution to the corresponding community is highly limited to providing empirical results on the CTR task. +- The authors argue that ""most of current DNN-based models simply concatenate all feature embeddings""; however, this argument might be an over-simplified statement for the existing models in section 2. +- Similar works can be found but missed to cite: [1] proposes a general framework to self-attention to exploit sequential (time-domain) and parallel (feature-domain) non-locality. [2] learns bilinear attention maps to integrate multimodal inputs using skip-connections and multiple layers on top of the idea of low-rank bilinear pooling. + +Pros: +- Strong empirical results on two CTR tasks using the previous works of self-attention and top-k restriction techniques. + +Cons: +- This work fairly lacks its originality since the proposing method heavily relies on the two previous works, self-attention and top-k restriction. They apply them to multiple categorical features to estimate CTR; however, their application seems to be monotonic without a novel idea of task-specific adaptation. + +Minor comments: +- In Figure 1, ""the number of head"" -> ""the number of heads"". + + +[1] Wang, X., Girshick, R., Gupta, A., & He, K. (2018). Non-local Neural Networks. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'18). +[2] Kim, J.-H., Jun, J., & Zhang, B.-T. (2018). Bilinear Attention Networks. In Advances in Neural Information Processing Systems 32 (NIPS'18).",5,4.0,ICLR2019 +rklrTtQRFB,1,BkleBaVFwB,BkleBaVFwB,Official Blind Review #1,"# Response to rebuttal + +I would like to thank their authors for their rebuttal. + +After reading the other reviews, the author response and the revised manuscript, I have decided to keep my score of weak reject for the time being. + +In short, while I appreciate the effort the authors put in partly addressing some of the most important comments raised during the review process, I think the paper would greatly benefit from some additional work. In particular: + +(1) Given the emphasis on scalability, I still believe the authors should carry out more thorough experiments to characterize the runtime of their approach with respect to different characteristics of the graphs. While the result provided in the response to Reviewer #3 is a first step, I recommend the authors to extend it by (1) varying graph size (in terms of nodes and edges); (2) varying graph type and (3) reporting the speedup with respect to other baselines. + +(2) To the best of my knowledge, the ablation experiment in Section A.7.4 does not provide results for the setting in which no graph attention mechanism is used at all, neither for the case where the graph attention mechanism used is identical to GAT (restricted to 1-step neighbourhoods). 
+ +(3) While NSPDK might be a reasonable choice, I still am of the opinion that the choice of graph kernel for this purpose is highly arbitrary and, thus, should be investigated further. Given that such a choice is being used to define a performance metric, which moreover is being highlighted as a contribution, the authors should study the robustness of the metric to the choice of graph kernel, as well as its sensitivity to known perturbations. + +(4) Finally, I did not see any error bars added to the main results in the paper. + +Despite these shortcomings, I would like to reiterate that I believe the proposed approach is promising and, with some additional work, would be a contribution definitely worth publishing. Therefore, I would like to encourage the authors to further revise the manuscript. + +# Summary + +In this paper, the authors propose an auto-regressive deep generative model for graph-structured data, motivated by the goal of scalability with respect to graph size, graph density and sample size. + +In a nutshell, the approach follows closely the ideas in [1, 2], which model graph generation as an auto-regressive process after fixing or sampling an ordering for the nodes. Unlike [1, 2], however, the proposed method makes use of graph convolutions and a graph attention mechanism, closely related to GAT [3], to parametrize the conditional distributions of node/edges given the previously generated graph elements. + +The performance of the proposed approach is evaluated in comparison to [1, 2] in several synthetic and real-world datasets, using MMD [4] between generated and held-out test graphs as metric. Unlike [2], which applies MMD on three graph statistics (degree, clustering coefficient and average orbit counts), this manuscript proposes to evaluate MMD using a graph kernel as well [5]. + +# High-level assessment + +The main contribution in this paper is to combine a graph attention mechanism, which can be seen as a simplification of GAT [3], with deep autoregressive graph models, such as DeepGMG [1] or GraphRNN [2]. In this way, the manuscript has a large conceptual overlap with the method in [6], which can be nevertheless be regarded as concurrent rather than prior work. From a methodological perspective, I believe the contribution is sound and sufficiently novel, although perhaps slightly on the incremental side. + +However, the current version of the manuscript has shortcomings regarding (i) lack of clarity in the exposition of the method’s relation to prior work, low-level implementation details and experimental setup and (ii) insufficient experimental results to back up some of the authors’ claims. + +Nonetheless, I believe the proposed approach is promising, and encourage the authors to address or clarify these issues during the author discussion phase. + +# Major points / suggestions + +1. The manuscript presents the proposed approach in a way that does not clearly differentiate between prior work and original contributions. + +In particular, I believe that the ideas in Section 3.1 and 3.2 are almost identical to those in [1, 2], the graph attention mechanism in Section 3.3 can be seen as a minor modification of GAT [3], and Section 3.4 also has a strong conceptual overlap with [1, 2]. + +I would encourage the authors to be more clear with respect to what is novel and what is borrowed from prior work. Moreover, when slightly departing from prior work (e.g. 
the modifications applied to the graph attention mechanism in Section 3), I would also encourage the authors to focus on explaining what specifically has changed and what is the rationale behind those design choices, rather than explaining the entire mechanism “from scratch”, leaving up to the reader to figure out what is novel. + +2. The paper’s clarity could be improved, with some parts presented in an unnecessarily complicated manner (e.g. the graph attention mechanism) and others without sufficient detail (e.g. the edge estimator module, the zero-ing heuristic for attention or the generation of graphs based on “seed graphs”, which is only mentioned in the appendix). + +For example, regarding the graph attention mechanism, I would recommend: (i) explaining more clearly what the “feature vector of node $v_{i}$” is exactly in relation to the notation of Section 3.1; (ii) if the query, key and value matrices are identical, as the text seems to imply, I would rewrite the equations directly in terms of $X$ which would simplify the notation significantly; (iii) perhaps most importantly, the bias functions $b^{Q}$, $b^{K}$ and $b^{V}$ should be defined mathematically and discussed in greater detail and (iv) the output FNN should also be described mathematically. Finally, as mentioned above, I would emphasise the differences between the proposed attention mechanism and GAT. + +The edge estimator mechanism is described too imprecisely in Section 3.4.4. While Section A.4 definitely helps, I would recommend defining the entire operation mathematically in Section 3.4.4 as well. Likewise, a precise mathematical definition of GRAM-A in Section 3.5.2 would also be helpful. + +Finally, as mentioned in this forum by Prof. Ranu prior to this review’s writing, the graph generation procedure described in Section A.7.2 seems unconventional. I would encourage the authors to both clarify what they mean by “for the convenience of implementation” and to investigate whether the experimental conclusions are affected by this departure from prior practices. + +3. Key details about the experimental setup, such as the hyperparameter selection protocol for the proposed approach and baselines, as well as the resulting architectures, seems to be missing, making it difficult to assess if the experimental setup is “fair”. + +In particular, all methods should be allowed to use a similar number of parameters or, alternatively, have their hyperparameters tuned equally carefully for each dataset separately. + +4. Most importantly, I believe the experimental results are insufficient to back up some of the claims made in the introduction. + + 4.1. Despite the focus on scalability throughout the motivation, there are no experiments systematically exploring how the runtime at train and test time of the proposed approach and the main baselines scales with respect to sample size, number of nodes per graph and graph density. Moreover, no results are provided for large graphs (e.g. ~5k nodes as in [6]). + + 4.2. The graph attention mechanism was claimed to be an original contribution. However, no results are provided to evaluate its advantages with respect to the different GAT variants nor ablation studies to see its usefulness relative to a variant of the proposed approach using only graph convolutions. + + 4.3 The idea of using MMD in conjunction with graph kernels as a performance metric is interesting. 
However, there is no investigation of key aspects such as (i) its relation to other metrics and (ii) the impact that the choice of graph kernel, among the many available, and/or of graph kernel hyperparameters has on the resulting metric (see [7] for a comprehensive review on graph kernels). + + 4.4. Finally, the results have been reported without error bars, making it difficult to quantify the statistical significance of the observed performance differences between approaches. + +# Minor points / suggestions + +1. I strongly believe the authors should adapt the manuscript to mention [6] and related/concurrent work. Ideally, including it as an additional baseline would be even better, but not necessary given the limited rebuttal time. Nevertheless, this point was not taken into consideration when scoring the manuscript, given how recent [6] is. + +# References + +[1] Li, Yujia, et al. ""Learning deep generative models of graphs."" *International Conference on Machine Learning.* 2018. +[2] You, Jiaxuan, et al. ""Graphrnn: Generating realistic graphs with deep auto-regressive models."" *International Conference on Machine Learning.* 2018. +[3] Veličković, Petar, et al. ""Graph attention networks."" *International Conference on Learning Representations*. 2018. +[4] Gretton, Arthur, et al. ""A kernel method for the two-sample-problem."" Advances in Neural Information Processing Systems. 2007. +[5] Costa, Fabrizio, and Kurt De Grave. ""Fast neighborhood subgraph pairwise distance kernel."" Proceedings of the 26th International Conference on Machine Learning. Omnipress; Madison, WI, USA, 2010. +[6] Liao, Renjie, et al. ""Efficient Graph Generation with Graph Recurrent Attention Networks."" *Advances in Neural Information Processing Systems.* 2019. +[7] Kriege, Nils M., Fredrik D. Johansson, and Christopher Morris. ""A Survey on Graph Kernels."" *arXiv preprint arXiv:1903.11835* (2019).",3,,ICLR2020 +S1ecMH32YS,2,Hygy01StvH,Hygy01StvH,Official Blind Review #2,"The work performs a systematic empirical study of how the latent space design in a GAN impacts the generated distribution. The paper is well-written and was easy to read. Also, I find this to be an interesting and promising direction of research. + +The convergence proof in Goodfellow (2014) assumes that the data distribution has a density, and essentially states that the JS-divergence is zero if and only if the two distributions are the same. In practice, the data distribution is discrete, while the latent distribution has a probability density function. It is not possible to transform a density into a discrete distribution by a continuous map and neural networks are always continuous by construction. In theory, as training progresses, more and more latent mass will be pushed on the discrete samples and no minimizer exists (unless the function space of generator is constrained or the real distribution is smoothed out a bit). + +Since it is not possible to assess whether the GAN training has converged due the nonconvexity of the energy and non-existence of a global optimizer, the empirically observed results might be very specific to the chosen optimization procedure, stopping criterion, dataset, hyper-parameters, initialization, network architectures, etc etc. It is a challenge to study the choice of latent space in a somewhat ""isolated"" way. These issues should be discussed in the paper and the reader should be made aware of such problems. 
+ +Another point, could it be, that by increasing the dimension of the latent space, one makes it easier for the nonconvex optimization in (5) to converge to ""unlikely but realistic looking samples""? I think this is not too far-fetched, as increasing the dimension of an optimization problem often makes local optimization less likely to get stuck at local optima. Also it might not be the best idea to optimize (5) with Adam since it is not a stochastic optimization problem and there are provably convergent solvers out there for this problem class. + +Since it is possible to evaluate the likelihood of the optimized reconstructions that are nearby the data points, one could check whether this is indeed the case. While constrained not to be too unlikely, I wonder whether the likelihood increases or decreases with the dimensionality of the latent space and this would make an interesting plot. + +Unfortunately, I did not understand the connections to auto-encoders, as they might optimize a fundamentally different criterion than GANs. In particular ""In principle, getting from an AE to a GAN is just a rearrangement of the NNs. "" is unclear to me. + +Also, what is meant by lower-bound? Is the claim that the reconstruction error in an auto encoder will be lower, than if one optimizes the latent code in a GAN to reconstruct the input? Figure 3 seems to support this hypothesis, but I don't have an intuition why this should be true and have some doubts. A mathematical proof seems out of reach. + +I have trouble to understand the ""intuition that the AE complexity lower-bounds the GAN complexity."" Before reading this paper, my intuition was the opposite: If the generator distribution covers the real distribution, the reconstruction error for GAN is zero. Intuitively, it seems a much easier task to somehow cover a distribution than to minimize an average reconstruction error. + +The connection of WGANs to the L2 reconstruction loss in the auto-encoder is very hand-wavy. It is still an open question whether WGANs actually have anything to do with the Wasserstein distance. People working in optimal transport doubt this, due to huge amount of approximations going on. + +At this point I'm reluctant to recommend acceptance, as the paper tries to connect things, which for me are quite disconnected and the evaluations of reconstruction error, etc. might depend in intricate ways on the nonconvex optimization procedures. + +Minor suggestions, typos, etc (no influence on my rating): + +- What is the ""generated manifold"" that is talked about in the introduction, contributions and throughout the paper? To me, it is not directly clear that the support of the transformed distribution will be a manifold (especially if G is non-injective). Anyway, the manifold structure is nowhere exploited in the paper, so I suggest to call it ""transformed latent distribution"". + +- Had to pause a little bit to understand Eq. 2 (simple polynomial interpolation). It is unnecessary to show the explicit form, as I'm sure no one doubts the existence of a smooth curve interpolating a finite set of points in R^d. + +- Equations should always include punctuation marks. + +- Eq. 5: dim --> \text{dim} and s.t. --> \text{s. t.} + +- Fig 3b: the red curve is missing or hidden behind another curve. +",1,,ICLR2020 +SkxR9kKJ5H,2,ryxW804FPH,ryxW804FPH,Official Blind Review #1,"This paper proposes to compare different methods to build BERT/GPT representations of long documents, to bypass the limitation of the input size of these models. 
One of the proposed methods uses an attention mechanism to discover the most significant portions of the text, which are used to backpropagate the error on the language model. Three combination methods (concatenation, RNN and attention) are tested on 2 databases plus one modified version of one of the databases to show the impact of the presentation bias in the texts (the most important parts are at the beginning). +
Results show that the largest improvement comes from the base BERT model over the previously proposed model: this aspect should be commented on: what is the reason for the improvement? +Combining the textual parts also yields an improvement, but to a smaller extent. Hyper-parameters and training/testing times are reported, which is useful from a practical point of view when deciding whether or not to implement the proposed method, considering the extra computational load and the relatively small improvement. The shuffling experiment demonstrates an interesting behaviour of the models, which should be confirmed on a real dataset. + + ",6,,ICLR2020
9BXM6zs1gV,3,HP-tcf48fT,HP-tcf48fT,Learning-based formulation of the MCS problem. Interesting approach based on GNNs and RL. Experiments on larger graphs would be interesting to consider.,"The paper deals with the problem of Maximum Common Subgraph (MCS) detection, following a learning-based approach. In particular, it introduces GLSEARCH, a model that leverages representations learned by GNNs in a reinforcement learning framework to allow for efficient search. The proposed model has been experimentally evaluated on both artificial and real-world graphs, and its performance has been compared against traditional and learning-based baselines.

Strong points:

--- The paper deals with an important problem, and the overall learning-based formulation and solution look very interesting.

--- The paper is well-written and most concepts, especially the proposed approach, have been clearly presented. Besides, the supplementary material describes in detail most of the aspects of the paper.

--- The ablation study is interesting and demonstrates that the chosen architecture of GLSEARCH has consistent behavior.


Weak points:

--- My main concern is related to various aspects of the experimental evaluation of the proposed model. First, most of the datasets used in the evaluation seem to be unlabeled. In the basic formulation of the model though, the input graphs are allowed to be labeled. In my view, this makes the overall task more challenging. How consistent are the results in the case of labeled graphs?

--- Second, the size of the input graphs is also an important parameter. Most heuristic baselines certainly might not be able to scale to graphs with more than a few thousand nodes, but I would expect the evaluation of GLSEARCH to consider some large-scale networks containing a few tens of thousands of nodes.


Typos:

--- Page 4, first paragraph: *F*or MCS, …
--- Caption of Table 3: *Ablation*
",7,4.0,ICLR2021
HJetRuM0KH,3,rJgUfTEYvH,rJgUfTEYvH,Official Blind Review #2,"This paper extends the flow-based generative model to stochastic video prediction. The proposed model takes advantage of flow-based models, which provide exact latent-variable inference, exact log-likelihood evaluation, and efficiency. The paper uses an autoregressive model and the multi-scale Glow architecture. 
The experiments on the stochastic movement dataset (synthetic) and the BAIR Robot push dataset show a performance improvement over other state-of-the-art stochastic video generation models (SV2P and SAVP-VAE). +
+The main contribution in this paper is the use of flow-based models for video prediction, and it is the first work in this direction. The main idea is sound and the paper is clearly written. +
+Below are my concerns and feedback. +
+It looks like low-temperature sampling is important to achieve better prediction scores. Can the low-temperature sampling trick be applied to SV2P and SAVP-VAE as well? If so, how does their performance compare to the proposed model? +
+The authors reported the best possible values of PSNR, SSIM and VGG perceptual metrics by choosing the video closest to the ground-truth. However, I believe this evaluation does not show the benefit of the stochastic models. A better comparison, I believe, is to report the median/mean with the range between the best and worst values. +
+The BAIR robot push dataset has a pretty limited setting: small robot and/or object motion between frames and small variation of the background between videos. It would be interesting to see more dynamic scenarios such as driving or human motion scenes. ",6,,ICLR2020
WQVTWtyRHqa,1,FcfH5Pskt2G,FcfH5Pskt2G,Interesting work," +This work aims to demonstrate that VAE-based architectures can take advantage of inherent correlation between data to produce representations. They claim that small perturbations of the data prevent these architectures from taking advantage of such correlations, resulting in a failure to produce disentangled representations. This work therefore claims that the perturbed datasets should be used to design and learn models that can discover the true structure of the data. +
+This work highlights a characteristic of leading models for disentanglement. It demonstrates that these models are taking advantage of the correlation in the data and not fully learning the true structure. Therefore, while the current models are certainly useful, models that can learn the true structure might be even more beneficial. +
+They take a structured approach by perturbing the data in a manner such that the local variance is `misaligned' from the global variance. They demonstrate empirically that the performance of the models goes down for the data set when it is perturbed in the proposed structured manner. This also shows the merit of the perturbation procedure, as adding uniform noise to pixels does not produce a similar drop in performance. +
+The work, therefore, does seem to have merit. However, there are some clarifications that are needed. +
+
+Figure 1 is a bit unclear. If 1a and 1b are supposed to show a distribution, then why does the vertical axis have -ve values? I am basically not fully sure what exactly is being depicted by 1a and 1b. They look similar to the raw data in the inset. If it is just the principal components, then should it not be directions and not a patch? Perhaps the figure can be clarified. +
+Eq8 is a bit unclear. How is $c_j$ selected from the set $\{1,2,3,4\}$? Also, assuming s = s(x^{(i)}), is it possible that $i'$ = $i$ for all $i$? Furthermore, I am a bit confused about the notation. For example, $f:\mathbb{R} \rightarrow \mathbb{R}^{r} \times \mathbb{R}^{r}$. This seems to imply that $f(s) = (v,u)$ where $v,u \in \mathbb{R}^{r}$. In eq8, however, it seems that f(s) is a matrix of scalars? 
This would imply that $f:\mathbb{R} \rightarrow P^{2}$ where $P = \{ -1,1 \}$. In addition to this, why do i,j belong to \mathbb{N} (and not related to the size of input x). + +In the context of Figure 4, How are the $1 - \mathbb{E}(\sigma_{i}^{2})$ calculated for shape scale orient, PosX and PosY. Are they calculated using the reconstruction? + + +Were any other types of f() explored. What influence does the choice of f() (that abides by all the conditions in Section 4.2) have on the performance. If it is specifically the proposed Eq8 that produces this influence then what might be the major characteristics of the f() that produce the drop in performance? + + +",5,2.0,ICLR2021 +rJeOPGyda7,3,rkl6As0cF7,rkl6As0cF7,A good first step towards endowing deep reinforcement learning agents with recursive reasoning capabilities,"The high-level problem this paper tackles is that of endowing RL agents with recursive reasoning capabilities in a multi-agent setting, based on the hypothesis that recursive reasoning is beneficial for the agents to converge to non-trivial equilibria. + +The authors propose the probabilistic recursive reasoning (PR2) framework for an n-agent stochastic game. The conceptual difference between PR2 and non-correlated factorizations of the joint policy is that, from the perspective agent i, PR2 augments the joint policy of all agents by conditioning the policies of agent i's opponents on the action that agent i took. The authors derive the policy gradient for PR2 and show that it is possible to learn these action-conditional opponent policies via variational inference in addition to learning the policy and critic for agent i. + +The proposed method is evaluated on two experiments: one is an iterated matrix game and the other is a differential game (""Max of Two Quadratics""). The authors show in the iterated matrix game that baselines with non-correlated factorization rotate around the equilibrium point, whereas PR2 converges to it. They also show in the differential game that PR2 discovers the global optimum whereas baselines with non-correlated factorizations do not. + +This paper is clear, well-motivated, and well-written. I enjoyed reading it. I appreciated the connection to probabilistic reinforcement learning as a means for formulating the problem of optimizing the variational distribution for the action-conditional opponent policy and for making such an optimization practical. I also appreciated the illustrative choice of experiments that show the benefit of recursive reasoning. + +Currently, PR2 provides a proof-of-concept of recursive reasoning in a multi-agent system where the true equilibrium is already known in closed form; it remains to be seen to what extent PR2 is applicable to multi-agent scenarios where the equilibrium the system is optimizing is less clear (e.g. GANs for image generation). Overall, although the experiments are still small scale, I believe this paper should be accepted as a first step towards endowing deep reinforcement learning agents with recursive reasoning capabilities. + +Below are several comments. + +1. Discussion of limitations: As the authors noted in the Introduction and Related Work, multi-agent reinforcement problems that attempt to model opponents' beliefs often become both expensive and impractical as the number of opponents (N) and the recursion depth (k) grows because such complexity requires high precision in the approximate the optimal policy. 
The paper can be made stronger with experiments that illustrate to what extent PR2 practically scales to problems with N > 2 or K > 1 in terms of how practical it is to train. +2. Experiment request: To what extent do the approximation errors affect PR2's performance? It would be elucidating for the authors to include an experiment that illustrates where PR2 breaks down (for example, perhaps in higher-dimensional problems). +3. Minor clarification suggestion: In Figure 1: it would be clearer to replace ""Angle"" with ""Perspective."" +4. Minor clarification suggestion: It would be clearer to connect line 18 of Algorithm 1 to equation 29 on Appendix C. +5. Minor clarification suggestion: In section 4.5: ""Despite the added complexity"" --> ""In addition to the added complexity."" +6. Minor clarification: How are the importance weights in equation 7 reflected in Algorithm 1? +7. Minor clarification: In equation 8, what is the significance of integrating over time rather than summing? +8. Minor clarification: There seems to be a contradiction in section 5.2 on page 9. ""the learning outcomes of PR2-AC and MASQL are extremely sensitive to the way of annealing...However, our method does not need to tune the the annealing parameter at all..."" Does ""our method"" refer to PR2-AC here?",8,3.0,ICLR2019 +XqINH1hRp3-,5,qU-eouoIyAy,qU-eouoIyAy,Low significance and ethical concerns,"Short summary +--------------------- +The authors propose a framework combining a GAN and linear encoder and decoders to reconstruct perceived face stimuli from fMRI data. They compare their framework to two baselines in the field and display a higher similarity (in different spaces) between the reconstructed images of their method and the stimuli, compared to the baselines. + +Strengths +-------------- +The framework is simple and allows for using different models to generate the stimuli, encode brain activity, decode brain activity and decode the stimuli. Appropriate baselines are selected and the authors quantify the similarity between generated and reconstructed images in different spaces. + +Weaknesses +------------------- +I am uncertain of the impact of the proposed approach, as it does not propose techniques to investigate brain functioning, nor does it provide a means of communication with disabled patients (as claimed by the authors). I see Ethical concerns with this type of model, which, in my opinion, are not counterbalanced by its usefulness. + +Novelty +------------- +The approach combines well established models and techniques into a novel framework. While I found some creativity in the setup, the novelty is overall low and I am not convinced that some of the techniques used are not acting as bottlenecks (see detailed comments). + +Clarity +----------- +The paper is relatively clear. I appreciated the presence of multiple figures and of various examples. I believe that the methods could be clearer (see detailed comments). + +Significance +---------------- +To me this was a major concern with this work as I found some claims bold and not substantiated by the experimental setup or the results. For instance, the authors claim to decode naturalistic stimuli. However, they can decode GAN generated images, which is substantially different, especially given the approach to average the fMRI signals over 14 repetitions in the test set to increase SNR. + +Rigor +-------- +Overall, I found that the work was relatively well performed, although not excellent. 
I wished that the comparison with the VAE approach was fair and would suggest that the authors work towards achieving SOTA in their baselines for a fair comparison. + +Detailed comments +--------------------------- +- Faces generated by a GAN cannot be deemed naturalistic +- Isn’t the setup circular? How would that model help understand face processing in naturalistic settings? +- Ethics concerns: how is the application helping people with disabilities or understanding brain function? The authors mention this as a potential communication means for locked-in patients. However, (1) the results are not strong enough to suggest a potential communication tool, (2) the reliance on fMRI signals makes this impractical and expensive. There are no conclusions regarding brain function that this technique brings, especially given the voxel selection based on a linear regression model. +- Novelty and technical sophistication is rather low: combining existing techniques in a novel system. +- A better test would have been to reconstruct stimuli that were not generated by the GAN +- “Importantly, we only took the centers of the activation maps to exclude surrounding background noise”. I believe the authors refer to “activation maps” as the fMRI z-score maps. This can however be confusing for an ML reader. Please clarify the language. +- It is unclear to me what the goal of the “five additional loss functions” is, or how they are formulated. +- why is the test set not sampled the same way as the training set? While taking the average of 14 repetitions increases signal-to-noise ratio, this setting further departs from any naturalistic decoding. +- An fMRI study with 2 subjects is not representative, given inter-subject variability. This is further compounded by the low number of test images (36 per subject after averaging). +- There is a clear imbalance in the generated stimuli in terms of age, race or whether they wear glasses. This limits the impact of the scores for the 5 different attributes. It is also a reason why neuroscience experimental stimuli are thoroughly controlled for. While this is touched upon in the discussion, this could be an indication that the proposed approach would fail on naturalistic stimuli. +- Some reconstructed stimuli are highly similar, despite different generated images (e.g. 23 and 24). What could explain this phenomenon? Could there be some type of mode collapse in the reconstruction? It would be interesting to compute the pair-wise similarity between reconstructed images compared to the pairwise similarity of the generated images (in the different spaces mentioned, i.e. latent, feature, attributes). +- Isn’t the linear regression model a bottleneck here? Why not use the “raw” BOLD signal instead of z-scores? This reflects an assumption that the encoding between stimulus and activation map is linear. Couldn’t there be a non-linear mapping between stimulus perception and the latent space? Overall, the proposed framework relies on established techniques without questioning or reflecting on their assumption. + +Minor +------- +- Given the linear model used, why limit to 4096 voxels? This seems like an arbitrary number. Is it related to the z-scores or a specific p-value threshold on the activation map? +- The relationship between the trained models and the ResNet mentioned in section 2.4 is unclear. Is it used for evaluation as another way to estimate the similarity between the reconstruction and the GAN generated images? Is there a justification/reference for this technique? 
+- Figure 5C, could the authors use the same y-scale? The legend mentions “We found high correlations for gender, pose, and age, but no significant correlation for the smile attribute.” Pose however had the lowest correlation values. How about eyeglasses? +- “[…] permutation test), indicating the probability that a random latent vector or image would be more similar to the original stimulus”. This sentence is unclear to me: is the permutation test assessing whether HYPER has significantly higher similarity between the reconstructed image and the generated image than if using a random latent vector? Please provide more details on the hypothesis tested by the permutation test in each space, as well as how these tests compare the different techniques. Couldn’t the permutation tests be applied to the baselines techniques as well?",2,5.0,ICLR2021 +Sy8xDgYxG,2,H13WofbAb,H13WofbAb,"Overall, the proposed method is not well-motivated, simple, with no theoretical support, and experimental results are not convincing.","This paper considers distributed synchronous SGD, and proposes to use ""partial pulling"" to alleviate the problem with slow servers. + +The motivation is that the server may be a straggler. The authors suggested one possibility, namely that the server and some workers are located on the same machine and the workers take most of the computational resource. However, if this is the case, a simple solution would be to move the server to a different node. A more convincing argument for a slow server should be provided. + +Though the authors claimed that they used 3 techniques to accelerate synchronous SGD, only partial pulling is proposed by them (the other 2 are borrowed straightforwardly from existing papers). The mechanism of partial pulling is very simple (just let SGD proceed after pulling a partial parameter block instead of the whole block). As mentioned by the authors in section 1, any relaxation in synchrony brings more noise and higher variance to the updates, and also may cause slow convergence or convergence to a poor solution. However, the authors provided no theoretical study on any of these aspects. + +Experimental results are not convincing. Only one relatively small dataset (cifar10) is used Moreover, the slow server problem is only simulated by artificially adding delays to the server.",3,4.0,ICLR2018 +HJlnHqQhFH,1,rkeNr6EKwB,rkeNr6EKwB,Official Blind Review #3,"Summary: +This paper addresses the challenging problem of how to speed up the training of GANs without using large mini-batch sizes and causing significant performance drop. To achieve this, the authors propose to use the method of core-sets, mainly inspired by recent use of core-set selection in active learning. The proposed method allows us to generate effectively large mini-batches though actually small during the training process, or more concretely, drawing a large batch of samples from the prior and then compress that batch using core-set selection. To address the curse of dimensionality issue for high-dimensional data like images, the authors suggest using a low-dimensional embedding based on Inception activations of each training image. Regarding the experimental evaluation, it is clearly shown that the proposed core-set selection greatly improves GAN training in terms of timing and memory usage, and allows significantly reducing mode collapse on a synthetic dataset. As a by-product, it is successfully applied to anomaly detection and achieves state-of-the-art results. 
+ +Strengths: +The paper is generally well written and clearly presented. As mentioned in the text, the use of core-sets is not novel in machine learning, but unfortunately not yet sufficiently explored in deep learning, and there are still few useful tools available in the literature. I believe this work will have a positive impact on the community and especially help establishing more efficient methods for training GANs. + +Weaknesses: +- Experimental results are indeed very promising, however, GAN implementation details and hyperparameters used for training, such as optimizer and learning rate, do not seem to be mentioned in the text. I think this would be helpful for readers to better understand how this all works. +- There does not seem to be any discussion on the convergence and stability of GAN training, which should be clarified in the experimental section. +- On page 3, in Sect. 3.2, I find “random low dimensional projections of the Inception Embeddings” is not clear, more technical details should be provided.",6,,ICLR2020 +Sye48n16FB,2,Byg9A24tvB,Byg9A24tvB,Official Blind Review #3,"The paper compares between SCE loss, large-margin Gaussian Mixture (L-GM) loss and proposes the Max-Mahalanobis center (MMC) loss as an alternative to explicitly learn more structured representations and induce high-density regions in the feature space. Overall the paper is well written, with sufficient theoretical reasoning and experiments. However, the reviewer has the following concerns and questions, +The theoretical analysis depends largely on the Gaussian assumption and argues that when the loss is distributed as Gaussian, it seems to be not even a fair comparison since assuming L_{MMC} is gaussian is totally different from assuming L_{g-SCE} is Gaussian. Also in practice it is hard to justify whether certain loss function really behaves like a Gaussian distribution, which makes the application of the theorem more limited. In fact, if the samples are concentrated (which can be common in practice), is the proposed method still able to induce high density sample region? +The experiments give very competitive results for MMC loss. It would also be interesting to see if implementing other defenses or do an adversarial training would still make MMC loss much better than other loss (at least from the AT example, it seems that MMC does not perform uniformly better than SCE as before). +Are the experiment results sensitive to the choice of parameters C_MM and L? + +I have read the author responses and I think they are quite solid. I have updated my score. +",6,,ICLR2020 +HyeKbrJc27,2,SylU3jC5Y7,SylU3jC5Y7,Interesting paper that needs more work,"This work proposes Variational Beta-Bernoulli Dropout, a Bayesian way to sparsify neural networks by adopting Spike and Slab priors over the parameters of the network. Motivated by the Indian Buffet Process the authors further adopt Beta hyperpriors for the parameters of the Bernoulli distribution and also propose a way to set up the model such that it allows for input specific priors over the Bernoulli distributions. They then provide the necessary details for their variational approximations to the posterior distributions of both such models and experimentally validate their performance on the tasks of MNIST and CIFAR 10/100 classification. + +This work is in general well written and conveys the main ideas in an clear manner. 
Furthermore, parametrising conditional group sparsity in a Bayesian way is also an interesting venue for research that can further facilitate for computational speedups for neural networks. The overall method seems simple to implement and doesn’t introduce too many extra learnable parameters. + +Nevertheless, I believe that this paper needs more work in order to be published. More specifically: + +- I believe that the authors need to further elaborate and compare with “Generalized Dropout”; the prior imposed on the weights for the non-dependent case is essentially the same with only small differences in the approximate posterior. Both methods seem to optimise, rather than integrate over, the weights of the network and the main difference is in how to handle the approximate distributions over the gates. Why would one prefer one parametrisation rather than the other? Furthermore, the authors of this work argue that they employ asymptotically unbiased gradients for the binary random variables, which is incorrect as the continuous relaxation provides a biased gradient estimator for the underlying discrete model. + +- At section 3.2 the authors argue about the inherent sparsity inducing nature of the IBP model. In the finite K scenario this is not entirely the case as sparsity is only encouraged for alpha < K. + +- At Eq. 11 the index “n” doesn’t make sense as the Bernoulli probability for each point depends only on the global pi_k. Similarly for Eq. 12. + +- Since you tie q(z_nk|pi_k) = p(z_nk|pi_k) then it makes sense to phrase Eq.16 as just D_KL(q(pi) || p(pi)). Furthermore, I believe that you should properly motivate on why tying these two is a sensible thing to do. + +- Figure 1 is misleading; you start from a unimodal distribution and then you simply apply a scalar scale and shift to the elements of that distribution. The output of that will always be a unimodal distribution but somehow you end up with a multimodal distribution on the third part of the figure. As a result, I believe that in this case you will not have two clear modes (one at 0 and one at 1) when you apply the hard-sigmoid rectification. + +- The motivation for 21 seems a bit confusing to me; what do you mean with insignificant dimensions? What overflow does the epsilon prevent? If the input to the hard sigmoid is a N(0, 1) distribution then you will approximately have 1/3 of the activations having probability close to 1. Furthermore, it seems that you want beta to be small / negative to get sparse outcomes but the text implies that you want it to be large. + +- It would be better to rewrite eq. 22 to include also the fact that you have a separate z per layer as currently it seems that the there is only one z. Furthermore, you have written that the variational posterior distribution depends on x_n on the RHS but not on the LHS. + +- Above eq. 23 seems that it should be q(z_nk| pi_k, xn) = p(z_nk| pi_k, xn) rather than q(z_nk| pi_k) = p(z_nk| pi_k, xn) + + +Regarding the experiments; the MNIST results are not particularly convincing as the numbers are, in general, similar to other methods. Furthermore, Figure 2 is a bit small and confusing to read. Should FLOPS be on the y-axis or something else? Almost zero flops for the original model doesn’t seem right. Finally, at the CIFAR 10/100 experiment it seems that both BB and DBB achieve the best performance. However, it seems that the accuracy /sparsity obtained for the baselines is inferior to the results obtained on each of the respective papers. 
For example, SBP managed to get a 2.71x speedup with the VGG on CIFAR 10 and an error of 7.5%, whereas here the error was 8.68% with just 1.34x speedup. The extra visualisations provided at Figure 3 do look interesting though as it shows what the sparsity patterns learn.",5,4.0,ICLR2019 +XFTo9ZrMlR0,2,jphnJNOwe36,jphnJNOwe36,"A demonstration of a few methods for post-hoc improvements for bias in overparametrized models. Interesting, but not sure if there is enough new information.","This paper is concerned with potential improvements to the worst-case +(mainly minority class/group) generalizations in over-parametrized +neural networks through post-hoc corrections. The authors demonstrate +the problem and the suggested corrections (that were used in previous +literature) on one artificial classification task as well as two +image classification tasks. The paper shows that post-hoc corrections +may improve the worst-subgroup scores similar to an earlier +state-of-the-art system that modifies the learning objective. + +This topic is interesting. The paper is in general written well, and +demonstrates the problem convincingly. The post-hoc fix solution +suggested also seem to be performing reasonably well on the problems / +data sets used in the study. + +That being said, it feels the study/paper builds on a few earlier +studies heavily, and I am not fully convinced that there is enough new +findings in the present paper to warrant publication in ICLR. + +I also have a few minor notes/suggestions: + +- The findings presented in Figure 4 is interesting (mainly the fact + that the overparametrization seem to be improving the worst-group + performance with threshold tuning). It would be interesting to see + more investigation/discussion of what could be the underlying + reason. + +- Page 2 (middle): ""also common the fairness literature"" -> ""also common in the fairness literature"" +- Figure 2 is not very readable. If increasing size/scale is not an + option due to limited space, taking legend out the figures may also + improve it. Colorblind friendly colors may also be a good idea. +- There are case (normalization) issues in the references: + ""ml"", ""t-sne"" (not exhaustive, a through check is + recommended). +",5,3.0,ICLR2021 +gy3wUQnkBC,4,QoWatN-b8T,QoWatN-b8T,An extension of Mar,"Summary: +The authors develop a hierarchical Bayesian memory allocation scheme to bridge the gap between episodic and semantic memory via a hierarchical latent variable model. They take inspiration from traditional heap allocation and extend the idea of locally contiguous memory to the Kanerva Machine, enabling a novel differentiable block allocated latent memory. In contrast to the Kanerva Machine, the authors simplify the process of memory writing by treating it as a fully feed forward deterministic process, relying on the stochasticity of the read key distribution to distribute information within the memory. + +Pros: +- The authors combine the idea of differentiable indexing in Spatial Transformer (Jaderberg et al., 2015) into the memory of Kanerva Machine (Wu et al, 2018a;b) and prove by experiments that this allocation scheme on the memory helps improve the test negative likelihood. Also, its speed is about 2x faster than the Dynamic Kanerva Machine. +- The authors show the efficiency of Temporal Shift Module (TSM) (Lin et al., 2019) in the encoder of memory models. Replacing a standard convolutional stack by TSM improves the ELBO in Dynamic Kanerva Machine by 6.32 nats/image for the Omniglot dataset. 
+- The experiments are well reported for different tasks, such as reconstruction and generation, on various datasets. + +Cons: +- The whole article is just like a mechanical mixture of old ideas. To be specific, the K++ model is Kanerva Machine + Spatial Transformer + a powerful encoder (namely, Temporal Shift Module). The authors do not introduce any significant improvement or novel insight for old models and techniques. +- Is there any theory support for the idea that we should use an allocated deterministic memory instead of a full variational memory? The authors mention the theory of complementary learning system in the Abstract and heap allocation at the beginning of Section 1, but there is no further analysis for these two theoretical intuitions. + +Comments and questions: +- The basic idea is clear and reasonable, but the authors should provide deeper analysis as well as deeper insights for their new model (and for old models that they use, if possible). +- The authors use q_phi(Z|X) in the ELBO (eq. 4) instead of q_phi(Z|X, Y, M) as in the Kanerva Machine. How is the memory used in the read model? +- Where do the results in Table 1 come from? For example, in Kanerva Machine and Dynamic Kanerva Machine paper (Wu et al, 2018a;b), the authors did not report the negative likelihood for CIFAR10 dataset. +- Is there any explanation for the significantly low negative likelihood (-2344.5 bits/dim) of K++ for CIFAR10 dataset? +- The authors should include a brief introduction section about Kanerva Machine and Dynamic Kanerva Machine. Moreover, the function δ in eq. 5 is not defined beforehand. + + +REFERENCES +James L McClelland, Bruce L McNaughton, and Randall C O’Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review, 102(3):419, 1995. +Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in neural information processing systems, pp. 2017–2025, 2015. +Yan Wu, Greg Wayne, Alex Graves, and Timothy Lillicrap. The Kanerva machine: A generative distributed memory. ICLR, 2018a. +Yan Wu, Gregory Wayne, Karol Gregor, and Timothy Lillicrap. Learning attractor dynamics for generative memory. In Advances in Neural Information Processing Systems, pp. 9379–9388, 2018b. +Ji Lin, Chuang Gan, and Song Han. TSM: Temporal shift module for efficient video understanding. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7083–7093, 2019. +",6,4.0,ICLR2021 +H1gyh8vq2m,2,SkzK4iC5Ym,SkzK4iC5Ym,A momentum based approach for batch normalization with asymptotic convergence analysis,"The authors propose a momentum based approach for batch normalization and provide an asymptotic convergence analysis of the objective in terms of the first order criterion. To my understanding, the main effort in the analysis is to show that the sequences of interest are Cauchy. Some numerical results are reported to demonstrate that the proposed variant of BN slightly outperforms BN with careful adjustment of some hyper parameter. The proposed approach is incremental, and the theoretical results are somewhat weak. + +The most important issue is that the zero gradient of the objective function does not imply that it attains an (even local) minimum point. 
As for the 2-layer case, the objective function can be nonconvex in terms of the weight parameters with stationary points being saddle points, it is crucial to understand whether an iterative algorithm (GD or SGD) converges to a minimum point rather a saddle point. Thus, the first order criterion alone is not enough for this purpose, which is why extensive studies are carried out for nonconvex optimization (e.g., using both first and second order criteria for convergence [1]) and considering the specific structure of neural nets [2]. + +The analysis is somewhat confusing. The authors assume that the objective of interest have stationary points (\theta*, \lambda*), and also show that the sequence of the norm of gradient convergence to zero, with the \lambda^(m) converges to \bar{\lambda}. What is the relationship between \lambda* and \bar{\lambda}? It is not clear whether they are the same point or not. Moreover, since there is no converge of the parameter, it is not clear what the convergence for the \lambda imply here, as we also discussed above that the zero gradient itself may mean nothing. + +In addition, the writing need improvements. Some statements are not accurate. For example, on page 3, after equation (2), the authors state “The deep network …”, though they mentioned it is for a 2-layer net. Also, more explicit explanation and definitions are necessary for notations. For example, it is clearer to define explicitly the parameters with \bar (e.g., for \lambda) as the limit point. + +[1] Ge et al. Escaping from saddle points—online stochastic gradient for tensor decomposition. +[2] Li and Yuan. Convergence Analysis of Two-layer Neural Networks with ReLU Activation.",3,3.0,ICLR2019 +VPFq0MXCWge,3,YPm0fzy_z6R,YPm0fzy_z6R,Graph Diffusion Network on signed networks,"The paper proposes to leverage signed random walk as hidden representation propagations to construct a signed graph diffusion network model. + +Pros: +1. The idea of the proposed model is simple but very effective according to the experiment evaluations. +2. It's very interesting that the proposed model achieves better performance as the number of layers increases, even up to K=10. +3. The writing of the paper is in a good shape and everything is clear to follow. +4. The authors also prove the convergence of the proposed diffusion layer. + +Cons: +1. My major concern is the novelty of the paper. The key part of the model (i.e., signed diffusion layer) totally borrows from the existing work. Even the figure used in this paper is same as the reference paper (e.g., Figure 2 (a) in this paper vs. Figure 2 (b) in reference paper). + Jung, Jinhong, et al. ""Personalized ranking in signed networks using signed random walk with restart."" 2016 IEEE 16th + International Conference on Data Mining (ICDM). IEEE, 2016. +2. In addition, authors need to compare with the above paper as well as it's very effective in link prediction even though it's not an embedding-based method.",4,4.0,ICLR2021 +BkgzQj3b9S,2,rygtPhVtDS,rygtPhVtDS,Official Blind Review #1,"This paper proposes a noise regularization method which adds noise on both x and y for conditional density estimation problem (e.g., regression and classification). The writing is good and the whole paper is easy to follow. However, I vote for reject, since the novelty is somehow limited, the claims made in the paper is not well supported and experiments are not very convincing. + +1. Adding noise on x (e.g., [1]), y (e.g., [2]) is not new. 
It is claimed that this paper extends previous results on classification/regression to conditional density estimation, which is a more general case, but this claim is not well supported. Experiments are still evaluated only on classification/regression tasks. + +2. Theorems 1 & 2 in Sec 4.2 only show the asymptotic case; they are quite obvious and do not help in understanding the advantage of adding noise regularization in conditional density estimation. + +3. Sec 4.1. The explanation on Page 5, ""The second term in (6) penalizes large negative second derivatives of the conditional log density estimate..."", is hard for me to understand. Large positive second derivatives also lead to poor smoothness. + +[1] Learning with Marginalized Corrupted Features, ICML 2013 +[2] Learning with Noisy Labels, NIPS 2013",3,,ICLR2020 +h_YnaFs8lb-,1,Z3XVHSbSawb,Z3XVHSbSawb,Simple attacks for reinforcement learning with pixels based on image distortions,"################################################ + +Summary: +Previous work on crafting attacks for deep reinforcement learning has relied on computing adversarial examples using knowledge of the environment, policy, and optimizer. Using Atari games and DDQN, this paper shows that simple image distortions, such as brightness changes, blurring, and rotations, often have greater impact on the agent's performance and are perceptually more similar to the original images. + +################################################ + +Pros: +1. The proposed attacks are simple, intuitively meaningful, and computationally cheap. They are black box, not requiring information about the environment and policy, which is a more realistic setting. +2. The proposed attacks seem to be better than Carlini & Wagner for most games. I was a little surprised by how much, especially for JamesBond. + +Cons: +1. The results are shown for only one algorithm, DDQN, only on Atari. It would be good to provide results for newer algorithms like Ape-X or other environments like DMControl. +2. Deep RL algorithms usually have large variance. It would be good to provide standard errors of the results, to accurately compare Daylight to Carlini & Wagner. +3. The writing has some weaknesses; please see below for details. + +################################################ + +Overall: I would lean toward accepting this paper. Technically it is not very sophisticated and the proposed attacks have been considered for image classification [1], but I believe that the results have strong practical implications. That is, simple image distortions are surprisingly impactful for attacking deep RL agents. + +[1] Samuel Dodge and Lina Karam. ""Understanding How Image Quality Affects Deep Neural Networks"". ArXiv 1604.04004. + +################################################ + +Suggestions for writing: +1. I think it would be better if the experimental setup preceded the results in Section 3. In addition, more details should be given, such as: How were the specific games and algorithm chosen? +2. I think the paper [1] is relevant for related work, as it shows similar distortions are good attacks for image classification. +3. The notation D(s) at the bottom of page 2 is undefined. +4. The differences in Figure 2 are not very apparent. It may be helpful to add bounding boxes. + +Further comments and questions: +1. What happens if we combine several of these attacks? Would that lead to even greater impact? +2. Is there any intuition for why certain games are more robust to certain image distortions than others? 
For example, why are the results for compression artifacts more mixed? + +################################################ + +Update after reading other reviews and author responses: + +I am happy to keep my score and remain positive about this paper; the authors have answered my questions and partially addressed my main concerns in the revised paper. Like Reviewer 3, I would hope to see complete results for A3C in the final paper.",6,4.0,ICLR2021 +juhaQEPLmZG,1,AWOSz_mMAPx,AWOSz_mMAPx,Finite Timescale Separation for Gradient Descent Ascent,"Motivated by the many applications of min-max optimization problems in Machine Learning, the authors examine the effect of using different learning rates for each player in Gradient Descent Ascent (GDA) for non-convex non-concave optimization problems. Prior work has already established that making the learning rate of one player infinitely larger than the other player's learning rate alleviates the cycling problems of GDA and makes game-theoretically meaningful equilibria the only asymptotically stable fixed points. The main contribution of this work is that it proves that we can get the same stability guarantees while keeping the learning rates of both players finite. This is crucial for practical applications where using unbounded learning rates is not an option. The authors employ this result to prove a variety of local convergence results in both deterministic and stochastic settings. + +Pros: +1) The finite time scale separation is necessary in order to make the theoretical intuitions in prior work applicable to practical problems like training GANs. +2) The proof techniques used for these separation results are, to the best of my knowledge, significantly different from and more elaborate than the ones used in prior work (Jin et al., 2020). +3) The theoretical findings are complemented with empirical evaluations both on small min-max problems and on complex ones like training GAN architectures. + +Cons: +Theorem 28 in the arxiv version of Jin et al. 2020 does not explicitly reference the existence of a finite time scale that satisfies their inclusion results. However, it is clear from the proof of Theorem 28 on page 24 that such a finite time-scale separation exists even though they do not provide an explicit formula for it. At least for one of the inclusion statements they explicitly mention that it holds for $\epsilon < \epsilon_0$ for some $\epsilon_0$ where $\epsilon$ corresponds to $1/\tau$. + +Of course, the result of the authors gives a more direct construction of the threshold $\tau^*$ by reducing the search for it to an eigenvalue problem. From a practical standpoint though, both results are existential. Neither proof approach gives particular intuition on how this time scale can be found in a computationally efficient way. The added value of leveraging an array of mathematical tools to provide this explicit construction is unclear to me. + +Given the above concern and that the convergence results essentially leverage the asymptotic stability properties provided by Theorems 1 and 2, I am assigning a weak reject score. However, I am willing to substantially increase my score if the authors address the above concern. ",6,3.0,ICLR2021 +ryrFPSQEx,3,HkEI22jeg,HkEI22jeg,"Review of ""Multilayer Recurrent Network Models of Primate Retinal Ganglion Cells""","This paper explores the ability of nonlinear recurrent neural networks to account for neural response properties that have otherwise eluded the ability of other models. 
A multilayer rnn is trained to imitate the stimulus-response mapping measured from actual retinal ganglion cells in response to a sequence of natural images. The rnn performs significantly better, especially in accounting for transient responses, than conventional LN/GLM models. + +This work is an important step in understanding the nonlinear response properties of visual neurons. Recent results have shown that the responses of even retinal ganglion cells in response to natural movies are difficult to explain in terms of standard receptive field models. So this presents an important challenge to the field. If we even had *a* model that works, it would be a starting point. So this work should be seen in that light. The challenge now of course is to tease apart what the rnn is doing. Perhaps it could now be pruned and simplified to see what parts are critical to performance. It would have been nice to see such an analysis. Nevertheless this result is a good first start and I think important for people to know about. + +I am a bit confused about what is being called a ""movie."" My understanding is that it is essentially a sequence of unrelated images shown for 1 sec. each. But then it is stated that the ""frame rate"" is 1/8.33 ms. I think this must refer to the refresh rate of the monitor, right? + +I would guess that the deviations from the LN model are even stronger when you show actual dynamic natural scenes - i.e., real movies. Here I would expect the rnn to have an even more profound effect, and potentially be much more informative. +",8,5.0,ICLR2017 +JdNx_mlInBy,1,Qun8fv4qSby,Qun8fv4qSby,interesting paper analyzing how non-stationarity affects generaliztion,"This paper presents empirical evidence that non-stationarity data typical in deepRL settings can affect the intermediate representation of deep neural network and affect testing performance. The paper is easy to read and the authors provide experiments to support the their observations and claims. Overall I think this is a good paper and in the following I suggest some good to have additions. + +(1) The examples for the supervised learning setting clearly demonstrates the impact of non-stationary data. However, given that this is inspired by the problems under DRL setting, it will be interesting to do more analysis of this effect on some DRL tasks. For example, an analysis for offline RL might be a good setting to study this effect. + +(2)Imitation learning algorithm like Dagger might be another good example to demonstrate the effect of nonstationarity. The data under the Dagger setting is also changing overtime and it will be interesting to see how it affects the student policy. + +(3) The RL experiment is mainly done in the on policy (PPO) settings. Some experiments with off policy RL setting might be useful, and the effect of the non-stationarity might be more pronounced as well.",8,4.0,ICLR2021 +BkxAILESqH,1,rJxlc0EtDr,rJxlc0EtDr,Official Blind Review #2,"Summary: + +This paper proposes two main changes to the End2End Memory Network (EMN) architecture: a separation between facts and the items that comprise these facts in the external memory, policy to learn the number of memory-hops to reason. The paper also introduces a new Paired Associative Inference (PAI) task inspired by neuroscience and shows that most of the existing models including transformers struggle to solve this task while the proposed architecture (called MEMO) solves it better. MEMO also works well in the shortest path finding tasks and bAbI tasks. 
+ +My comments: + +Overall, I see this paper as an improvement over EMN. The proposed PAI task can be seen as an example task where Transformers struggle while recurrent architectures learn better. Interestingly, the authors use a separate halting policy network to reduce computation time. + +1. Section 2.1 requires more clarity. There is a confusion in the usage of I and S. I represents the number of stories or the number of sentences in the stories? +2. Scaling up NTM/DNC to larger memory work was done in Sparse Access Memory (SAM) by Rae et al. 2016. This needs to be included in section 3.1 +3. In Table 2, why is the prediction of the second node easier for the model than the prediction of the first node? I see this trend only for EMN, UT, DNC. Not in MEMO. +4. Are the authors willing to release the code and data to reproduce their results? + +Minor comments: + +1. Page 2, second para: ENM should be EMN. +2. Vec inverse in Eqn 14 was never introduced. +3. Table 3: The notation of (20/20) was never introduced. I can guess what it means. But please be explicit. + +=============== +After rebuttal: Authors have addressed all my questions. I recommend ""Accept"". +",8,,ICLR2020 +rJPMO6YxG,1,S1lN69AT-,S1lN69AT-,pruning efficacy in deep learning,"This paper presents a comparison of model sizes and accuracy variation for pruned version of over-parameterized deep networks and smaller but dense models of the same size. It also presents an algorithm for gradual pruning of small magnitude weight to achieve a pre-determined level of sparsity. The paper demonstrates that pruning of large over-parameterized models leads to better classification compared to smaller dense models of relatively same size. This pruning technique is demonstrated as a modification to TensorFlow on MobileNet, LSTM for PTB dataset and NMT for seq2seq modeling. + +The paper seems mainly a comparison of impact of pruning a large model for various tasks. The novelty in the work seems quite limited mainly in terms of tensorflow implementation of the network pruning using a binary mask. The weights which are masked in the forward pass don't get updated in the backward pass. The fact that most deep networks are inherently over-parametrized seems to be known for quite sometime. + +The experiments are missing comparison with the threshold based pruning proposed by Han etal. to ascertain if the gradual method is indeed better. A computational complexity comparison is also important if the proposed pruning method is indeed effective. In Section 1, the paper claims to arrive at ""the most accurate model"". However, the validation of the claim is mostly empirical and shows that there lies a range of values for increase in sparsity and decrease in prediction accuracy is better compared to other values. + +Overall, the paper seems to perform experimental validation of some of the known beliefs in deep learning. The novelty in terms of ideas and insights seems quite limited.",5,4.0,ICLR2018 +HkxRkOgRnX,3,HklJV3A9Ym,HklJV3A9Ym,Useful result on universality. Probably not extremely relevant to ICLR,"The paper investigates the approximation properties of a family of neural networks designed to address multi-instance learning (MIL) problems. The authors show that results well-known for standard one layer architectures extend to the MIL models considered. The authors focus on tree-structured domains showing that their analysis applies to these relevant settings. + +The paper is well written and easy to follow. 
In particular the theoretical analysis is clear and pleasant to read. + +The main concern is related to the relevance of the result to ICLR. As the authors themselves state, the result is not surprising given the standard universality result of one-layer neural networks (and indeed Thm. 2 heavily relies on this fact to prove the universality of MIL architectures). In this sense the current work might be more suited to a journal venue. + +",6,3.0,ICLR2019 +S1xfZCnTKB,2,BJgWE1SFwS,BJgWE1SFwS,Official Blind Review #1,"This paper introduces a novel approximate inference method, called PCMC-Net, for models from the family of Pairwise Choice Markov Chains (PCMC). The method relies on training a neural network. Consequently, the authors claim that inference is amortized, but its computational complexity is still quadratic in the number of choice alternatives due to separate processing of all pairs of alternatives. PCMC-Net bakes the definition of PCMC into the neural net structure and therefore satisfies the theoretical properties of contractability and uniform expansion, which are desired properties of choice models +Moreover, since choice probabilities are a function of choice candidates’ features (and features of an individual making the choice), this method allows for new (unseen) choice candidates at test time, which was not possible with previously proposed maximum-likelihood (ML) inference. The approach is evaluated on modelling the choice of airline itinerary, on which it outperforms all considered baselines by a significant margin. + +I recommend REJECTing this paper. This paper tackles the problem of efficient inference and test-time generalization (to unseen choice alternatives) for choice modelling, and the proposed approach is interesting, seems to be theoretically sound, and outperforms evaluated baselines. Experimental evaluation is insufficient, however, with the method assessed only on a single dataset---in which case it is unclear if the method is better than baselines in general, or whether it is a quirk of the considered dataset. Moreover, the authors do not compare to ML inference in PCMC, which seems to be the closest possible baseline; instead, the authors only mention that ML would overfit on this dataset. Finally, the paper is full of complicated terms and cumbersome notation, which makes it difficult to read. Technical terms are often used without definition (e.g. framing effects, Luce’s axiom, asymmetric dominance), which makes the paper inaccessible to an inexperienced reader like myself. + +I think that this work could be improved in the following ways. The exposition should be made simpler and easier to follow (especially section 2), and all technical terms should be appropriately defined. Additionally, the method should be evaluated on at least one more dataset and compared to ML inference for PCMC. I am happy to increase my score if (all) the above points are addressed. +",6,,ICLR2020 +BJge242NjH,3,S1lNWertDr,S1lNWertDr,Official Blind Review #3,"Claim: Backpropagation of gradients from a higher to lower level in a HRNN can be removed and replaced with auxiliary losses predicting input tokens at the lower level without affecting performance. + +Significance: The significance of the claim hinges on whether HRNNs are more effective than other methods designed to help RNNs capture long-term dependencies (e.g. stacking RNNs or using different architectures). 
I think the authors could make a more substantive argument why this would be the case in the introduction, but they do a nice job of situating their work in the context of the present literature. + +Novelty: The proposed method is not very original, since augmenting RNNs with auxiliary losses in order to better capture long-term dependencies has been used in many previous papers. The authors mention some of these papers in the related work section. + +Clarity: The paper's description of the proposed method is well-written. Some parts of the experiment section could be made clearer. +-- I encourage the authors to invent a new acronym to refer to ""our model"" (perhaps aux-HRNN?). In the description of the mr-HRNN (pg. 5), I find the sentence ""trained using only as much memory as our model requires for training"" confusing. I initially thought our model referred to the mr-HRNN in the setence. +-- Training settings (e.g. the number of ticks of the upper RNN) should be described at the beginning of each section. +-- A seeming contradiction is made when discussing the results in 4.3. First, it said that because short term dependencies dominate long term dependencies it is expected that the proposed method will suffer greatly (pg. 6, bottom). In the next paragraph, it is claimed that all three models perform similarly due to the same reason. Which is it? + +Supporting evidence: The claim is empirical and the supporting evidence is experimental. As such, I find the comprehensiveness of the experiments wanting. There are several ways the experiments could be improved. +-- Results for each \beta value should be included, to see how placing increasing significance on the auxiliary loss impacts the results. +-- Include all relevant details necessary to reproduce the results, such as the length of training or stopping criterion used. +-- Additional results when varying the number of ticks. +-- More results with deeper hierarchies, since the ability to capture salient information at different levels of coarseness is a key selling point of HRNNs. +-- Results on larger scale tasks besides character level language modelling on Penn TreeBank. + +Other comments: +-- In the intro, I think some mention of parallel architectures such as transformers or convolutional architectures is warranted here, since parallelizability of training is a significant reason why these architectures are becoming preferred over RNNs. +-- Citations are mishandled throughout the paper. Citations should be enclosed in parentheses unless used as a subject in the sentence (e.g. ""Sordoni et al. make the case that...""). There is no need to refer to a citation twice in a sentence, like you do in ""More recently, Koutnik et al. introduced the Clockwork RNN Koutnik et al. (2014)..."" +-- I don't understand why the permuted accuracy of the gr-HRNN is so much higher than the non-permuted accuracy. One possible explanation is that the important pixels ended up at the end in each of the three trials, hence the gr-HRNN did not have to remember much information from the past. This should be addressed in the paper. +-- I would welcome some theoretical analysis as to why replacing the gradient path with this particular auxiliary loss does not impact results. I also think some discussion of what this means HRNNs are actually doing might be nice as well.",1,,ICLR2020 +HJl-jdkaKB,1,rkx1b64Fvr,rkx1b64Fvr,Official Blind Review #3,"Summary + +This paper introduces a new model architecture for doing text classification. 
The main contribution is proposing a deeper CNN approach, using both word and character embeddings (as well as label embeddings with attention). The paper claims improved performance over baselines. + +Decision + +I reject the paper for 3 main reasons: +1) Very misleading claims regarding establishing a new state of the art. The baselines used for comparison don't include any of the best existing published results. +2) Lack of positioning within the literature. In particular, no mention nor discussion of Transformers (self-attention) networks, including BERT and XLNet approaches, which are the state of the art in text classification. +3) Lack of justification/explanation for the proposed architecture. One key argument made is that current models are shallow, but it appears that only CNN models are considered for that comparison. More discussion is needed to understand why the new aspects of the proposed network are importantly different from other existing approaches. + +Additional details for decision + +The results from this paper are significantly inferior to the best results published. With a few quick searches, I found that there are several approaches performing better than the proposed model on every dataset considered in the analysis, as you can see below. + +http://nlpprogress.com/english/text_classification.html +https://github.com/sebastianruder/NLP-progress/blob/master/english/sentiment_analysis.md +https://paperswithcode.com/sota/text-classification-on-yahoo-answers + +Extra notes (not factoring in decision) + +- Consider spacing out the 3 rightmost blocks in Figure 1, I found the layout confusing and there's space available. +- In section 3, I would have liked more explanation for the motivation of the various design choices.",1,,ICLR2020 +BJfgoy9xz,2,SJu63o10b,SJu63o10b,"Interesting idea, but lacks technical clarity","This paper presents a scheme for unsupervised metric learning using coherent point drifting (CPD)-- the core idea is to learn a parametric model of CPD that shifts the input points such that the shifted points lead to better clustering in a K-Means setup. Following the work of Myronenko & Song, 2010, this paper uses a linear parametric model for the drift (in CPD) after mapping the input points to a kernel feature space using an RBF kernel. The CPD model is directly used within the KMeans objective -- the drift parameter matrix and the KMeans cluster assignment matrix are jointly learned using block-coordinate descent (BCD). The paper uses some interesting properties of the CPD model to derive an efficient optimization solver for the BCD subproblems. Experiments are provided on UCI datasets and demonstrate some promise. + +Pros: +1) The idea of using CPD for unsupervised metric learning is quite interesting +2) The exploration into the convexity of the CPD parameter learning -- although straightforward -- is also perhaps interesting. +3) The experiments show some promise. + +Cons: +1) Lacking motivation/Intuition +The main motivation for the approach, as far as I understand, is to learn cluster boundaries for non-linear data -- where K-Means fails. However, it is unclear to me why would one need to use K-Means for non-linear data, why not use kernelized kmeans? The proposed CPD model also is essentially learning a linear transformation of the kernelized feature space. So in contrast to kernelized kmeans, what is the advantage of the proposed framework? 
I see there is an improvement in performance compared to kernelized kmeans; however, intuitively I do not see where that improvement comes from. Perhaps providing some specific examples/scenarios or graphic illustrations would help the reader appreciate the method. + +2) Novelty/Significance +I think the novelty of this paper is perhaps marginal. The main idea is to directly use CPD from a prior work in a KMeans setup. There are a few parameters to be estimated in the joint learning objective, for which a block-coordinate descent strategy is proposed. The derivations are perhaps straightforward. As noted above, it is not clear what the significance of this combination is or how it improves performance. As far as CPD goes, it seems to me that the performance depends heavily on the choice of the Gaussian RBF bandwidth parameter, and it is not clear to me how such a parameter can be selected in an unsupervised setting, when class labels are not available for cross-validation. The paper does not provide any intuitions on this front. + +3) Technical details. +There are a few important details that I do not quite follow in the paper. + +a) The CPD is originally designed for the point matching problem, and its parametric form (\Psi) is derived using a different Tikhonov regularized regression model as described just above (1). The current paper directly uses this parametric form in a KMeans setup and solves the resultant problem jointly for the CPD parameter and the clustering assignment. However, it is not clear to me how the paper could use the optimal parametric form for Tikhonov regression as the optimum for the clustering problem. Ideally, I would think that, when formulating the joint optimization for the clustering problem, the optimal functional v(x) should also be learned/derived for the clustering problem, or some proof should be provided showing the functionals are the same. Without this, I am not convinced that the proposed formulation indeed learns the optimum drifts and the clusters jointly. + +b) The subproblem on Y (the assignment matrix) looks like a standard SVD objective. It is not clear why it would be necessary to resort to Ky Fan's theorem for its optimal solution. + +c) The paper talks about manifold embedding in the abstract and in Sec. 2.2. However, it appears to be a straightforward dimensionality reduction (PCA) of the data. If not, what is the precise manifold that is described here? + +d) In Eq. 9, the definition of Y_c is incorrect and unclear; p is defined as a vector of ones earlier. + +e) Although the assignment matrix Y has orthogonal columns, it is a binary matrix. If it is approximated by an orthonormal frame, how do you reduce it to a binary matrix? Does taking the largest values in each column suffice -- it does not look like it. However, in the paper, Y is relaxed to an orthonormal frame, which is estimated using PCA, the data points are then projected onto this low-dimensional subspace, and then k-means is applied to get the Y matrix. The provided math does not support any of these steps. Thus, the technical exposition is imprecise and the solutions appear rather heuristic. + +f) The kernelized variant of the proposed scheme, described in Sec. 2.4, is missing important details. How precisely is the kernelization done? How is CPD extended to that setup, what would be the Gaussian kernel G in that case, and what does \Psi signify? + +g) In Figure 2, it seems that kernel kmeans and the proposed CPD-UML show similar cluster boundaries for low kernel widths. 
Why are the high kernel widths beneficial? + +4) Experiments +There is some improvement of the proposed method -- however overall, the improvements are marginal. The discussion is missing any analysis of the results. Why it works at times, how well it improves on kernelized kmeans, and why? What is the advantage over other competitive schemes, etc. + +In summary, while there is a minor novelty in connecting two separate ideas (CPD and UML) into a joint UML setup, the paper lacks sufficient motivations for proposing this setup (in contrast to say kernelized kmeans), the technical details are unconvincing, and the experiments lack sufficient details or analysis. Thus, I do not think this paper is ready to be accepted in its current form. + + +",4,5.0,ICLR2018 +SklCA18chQ,2,Bygre3R9Fm,Bygre3R9Fm,"review on ""DEFactor: Differentiable Edge Factorization-based Probabilistic Graph Generation""","This paper proposed a variant of the graph variational autoencoder [1] to do generative modeling of graphs. The author introduced an additional conditional variable (e.g., property value) into the decoder. By backpropagating through the discriminator, the model is able to find the graph with desired property value. + +Overall the paper reads well and is easy to follow. The conditional generation of graphs seems also helpful regarding the empirical performance. However, there are several concerns regarding the paper: + +1) The edge factorization-based modeling is not new. In fact [1] already uses the node embeddings to factorize the adjacency matrix. This paper models extra information including node tags and edge types, but these are not fundamental differences compared to [1]. + +2) The paper claims the method is ‘cheaper’ and ‘scalable’. Since essentially the computation cost is similar to [1] which requires at least O(n^2) to generate a graph with n nodes, I’m not super confident about the author’s claim. Though this can be parallelized, but the memory cost is still in this order of magnitude, which might be too much for a sparse graph. Also there’s no large graph generative modeling experiments available. + +3) Continue with 2), the adjacency matrix of a large graph (e.g., graph with more than 1k nodes) doesn’t have to be low rank. So modeling with factorization (with typically ~256 embedding size) may not be suitable in this case. + +Some minor comments: +4) Regarding Eq (2), why the lstm is used, instead of some simple order invariant aggregation? + +5) the paper needs more refinement. E.g., in the middle of page 2 there is a missing citation. + +[1] Kipf & Welling, Variational Graph Auto-Encoders, https://arxiv.org/pdf/1611.07308.pdf +",5,3.0,ICLR2019 +B1Mcr6c4l,3,SJc1hL5ee,SJc1hL5ee,Review,"The paper presents a few tricks to compress a wide and shallow text classification model based on n-gram features. These tricks include (1) using (optimized) product quantization to compress embedding weights (2) pruning some of the vocabulary elements (3) hashing to reduce the storage of the vocabulary (this is a minor component of the paper). The paper focuses on models with very large vocabularies and shows a reduction in the size of the models at a relatively minor reduction of the accuracy. + +The problem of compressing neural models is important and interesting. The methods section of the paper is well written with good high level comments and references. However, the machine learning contributions of the paper are marginal to me. 
The experiments are not too convincing mainly focusing on benchmarks that are not commonly used. The implications of the paper on the state-of-the-art RNN text classification models is unclear. + +The use of (optimized) product quantization for approximating inner product is not particularly novel. Previous work also considered doing this. Most of the reduction in the model sizes comes from pruning vocabulary elements. The method proposed for pruning vocabulary elements is simply based on the assumption that embeddings with larger L2 norm are more important. A coverage heuristic is taken into account too. From a machine learning point of view, the proper baseline to solve this problem is to have a set of (relaxed) binary coefficients for each embedding vector and learn the coefficients jointly with the weights. An L1 regularizer on the coefficients can be used to encourage sparsity. From a practical point of view, I believe an important baseline is missing: what if one simply uses fewer vocabulary elements (e.g based on subword units - see https://arxiv.org/pdf/1508.07909.pdf) and retrain a smaller models? + +Given the lack of novelty and the missing baselines, I believe the paper in its current form is not ready for publication at ICLR. + +More comments: +- The title does not make it clear that the paper focuses on wide and shallow text classification models. Please revise the title. +- The paper cites an ArXiv manuscript by Carreira-Perpinan and Alizadeh (2016) several times, which has the same title as the submitted paper. Please make the paper self-contained and include any supplementary material in the appendix. +- In Fig 2 does the square mark PQ or OPQ? The paper does not distinguish OPQ and PQ properly at multiple places especially in the experiments. +- The paper argues the wide and shallow models are the state of the art in small datasets. Is this really correct? What about transfer learning? +",5,4.0,ICLR2017 +ryxWqYyctr,1,S1evHerYPr,S1evHerYPr,Official Blind Review #1,"The paper proposes a meta reinforcement learning algorithm called MetaGenRL, which meta-learns learning rules to generalize to different environments. The paper poses an important observation where learning rules in reinforcement learning to train the agents are results of human engineering and design, instead, the paper demonstrates how to use second-order gradients to learn learning rules to train agents. Learning learning rules in general has been proposed and this paper is another attempt to further generalize what could be learned in the learning rules. The idea is verified on three Mujoco domains, where the neural objective function is learned from one / two domains, then deployed to a new unseen domain. The experiments show that the learned neural objective can generalize to new environments which are different from the meta-training environments. + +Overall, the paper is a novel paper and with clear motivation, I like the paper a lot! Hope that the authors could address the following concerns and make the paper even better: + +1. The current experiment setup is a great proof-of-concept, however it seems a bit limited to support the claims in the paper. The meta-training has only at most two environments and the generalization of the neural objective function is only performed at one environment. It would be great if the authors could show more results with more meta-training environments (say, 10 meta-training environments) and more meta-testing environments (the current setup is only with one); + +2. 
The paper states a hypothesis that LSTM as a general function approximator, it is in principle able to learn variance and bias reduction techniques. However, in practice, due to learning dynamics and many other factors, it's not necessary true, i.e., how many samples are required for an LSTM to learn such technique is unclear. At the same time, at Page 8, Section ""Dependence on V"" actually acts as an example of LSTM couldn't figure out an effective variance-reduction method during the short meta-training time. The authors may want to put more words around the learnability of variance-bias trade-off techniques. + +Notation issues which could be further improved: +1. Page 2, ""Notation"" section and all of the following time indexing. Note that in Equation (1), r(s_1, a_t) has discount gamma^1, which is not true, I'd recommend the authors to follow the time indexing starting from 0, so that the Equation (1) is correct. (Alternatively, the authors could change from gamma^t into gamma^{t-1}); +2. Section ""Human Engineered Gradient Estimators"" is missing the formal introduction of the notation \tau; +3. Overall, the authors seem to use \Phi and \theta interchangeably, it's better to use a unified notation across the paper; +4. In the paper, the authors choose \alpha to represent the neural net for learning the objective function, to make it clearer for the readers, the authors could consider to change \alpha into \eta, because \alpha is often considered as learning rate notation; +5. I'd suggest the authors to rewrite the paragraph in Page 3 ""MetaGrenRL builds on this idea of ...., using L_\alpha on the estimated return"". This describes a key step in the algorithm while at the moment it's not very clear to the readers what's going on there; +6. Section 3.1 is missing a step to go from Q into V; +7. The authors could consider to describe the details of the algorithms in a more general actor-critic form, instead of starting from DDPG formulation. It would make the methods more general applicable (for example, extension to discrete action space). + + ",8,,ICLR2020 +rkxR6zqqhQ,3,SyzjBiR9t7,SyzjBiR9t7,"Interesting idea, however, more explanations needed. ","This paper proposes a method to use weighted Frechet Mean (wFM) for the operation on Manifold valued data for CNN. The novel point is to view wFM as a convolutional layer. Overall, this paper is mathematically well written, however, how each theory improves CNN and the model used in experiments are not clear enough. + +Pros ++ The use of wFM instead of a convolutional layer is an interesting idea. ++ This paper is mathematically well written. + +Cos +- It is hard to understand how each theory presented in this paper helps to improve CNN. For example, the invariance to group operation. Some experimental results would help to understand the advantage of the group invariance. + +- It is also unclear why the authors constructed the invariant last layer although the inputs of the last layer are invariant under group operations. + +- In the introduction section, the authors raised the omnidirectional camera, diffusion magnetic resonance imaging, elastography as examples of manifold-valued data. However, experiments are limited to standard video sequences. + +- It is unclear how to obtain the weights {w_i} of wFM by backpropagation. + +- Since the contribution of this paper is to to use wFM instead of a convolutional layer, it is more interesting to visualize the weights {w_i}. + +- More explanation needs for the model used for experiments. 
Especially in dimensional reduction experiments, I could not understand how each subspace is obtained and averaged. If each frame is a subspace, by averaging frames, the reconstruction would be blurred. +",5,3.0,ICLR2019 +B1e6e0PInX,1,B1xFVhActm,B1xFVhActm,Method description confusing; empirical comparison against previous work is lacking,"This paper proposes a method for learning sentences encoders using artificially generated (fake) sentences. While the idea is interesting, the paper has the following issues: + +- There are other methods that aim at generating artificial training data, e.g.: Z. Zhao, D. Dua, S. Singh. Generating Natural Adversarial Examples. International Conference on Learning Representations (ICLR). 2018, but no direct comparison is made. Also InferSent (which is cited as related work) trains sentence encoders on SNLI: https://arxiv.org/pdf/1705.02364.pdf. Again a comparison is needed as the encoders learned perform very well on a variety of tasks. Finally, the proposed idea is very similar to ULMfit (https://arxiv.org/pdf/1801.06146.pdf) which trains a language model on a lot of unlabeled data and then finetunes it discriminatively. Finally, there should be a comparison against a langauge model without any extra training in order to assess the benefits of the fake sentence classification part of the model. + +- It is unclear why the fake sentence construction method proposed by either swapping words or just removing them produces sentences that are fake and/or useful to train on. Sure it is simple, but not necessarily fake. A language model would be able to discriminate between them anyway, by assigning high probability to the original ones, and low probability to the manipulated ones. Not sure we need to train a classifier on top of that. + +- I found the notation in section 2 confusing. What kind of distribution is P(enc(x,theta1)|theta2, theta3)? I understand that P(x|theta) is the probability of the sentence given a model, but what is the probability of the encoding? It would also be good to see the full derivation to arrive at the expression in the beginning of page 3. + +- An argument in favour of the proposed method is training speed; however, given that less data is used to train it, it should be faster indeed. In fact, if we consider the amount of time per million sentences, the previous method considered in comparison could be faster (20 hours of 1M sentences is 1280 hours for 64M sentences, more than 6 weeks). More importantly, it is unclear from the description if the same data is used in training both systems or not. + +- It is unclear how one can estimate the normalization factor in equation 2; it seems that one needs to enumerate over all fake sentences, which is a rather large number due to the number of possible word swaps in the sentence, + +- I am not sure the generator proposed generates realistic sentences only, ""Chicago landed in John on Friday"" is rather implausible. Also there is no generation method trained here, it is rule-based as far as I can tell. There is no way to tell the model trained to generate a fake sentence as far as I can tell. + +- It is a bit odd to criticise other methods ofr using LSTMs with ""millions of parameters"" while the proposed approach also uses them. A comparison should calculate the number of parameters used in either case. 
+ +- what is the motivation for having multiple layers without non-linearity instead of a single layer?",3,5.0,ICLR2019 +rkoKvifef,1,HkxF5RgC-,HkxF5RgC-,The paper devises sparse GPU kernels for RNNs,"The paper devises a sparse kernel for RNNs, which is urgently needed because current GPU deep learning libraries (e.g., CuDNN) cannot exploit sparsity when it is present and because a number of works have proposed to sparsify/prune RNNs so as to be able to run on devices with limited compute power (e.g., smartphones). Unfortunately, due to the low-level and GPU-specific nature of the work, I would think that this work will be better critiqued in a more GPU-centric conference. Another concern is that while experiments are provided to demonstrate the speedups achieved by exploiting sparsity, these are not contrasted with the loss in accuracy caused by introducing sparsity (in the main portion of the paper). It may be the case that by reducing density to 1% we can speed up by N-fold, but this observation may not have any value if the accuracy becomes abysmal. 
+ +Pros: +- Addresses an urgent and timely issue of devising sparse kernels for RNNs on GPUs +- Experiments show that the kernel can effectively exploit sparsity while utilizing GPU resources well + +Cons: +- This work may be better reviewed at a more GPU-centric conference +- Experiments (in main paper) only show speedups and do not show loss of accuracy due to sparsity",6,2.0,ICLR2018 +SPHoj_xdLXu,4,XJk19XzGq2J,XJk19XzGq2J,"Tackling an interesting question, experiments could be improved","== Summary == + +The paper studies the relationship between the intrinsic dimension of images and sample complexity and generalization. The authors suggest to use a variant of the MLE method of Levina & Bickel (2004), which is based on computing the distances to nearest neighbors in pixel space, which is fairly easy to implement and + +== Pros == + +- The authors aim to investigate two relevant hypotheses for the field of representation learning. 1) intrinsic dimension of images is much lower than extrinsic dimension, and 2) extrinsic dimension has little effect on sample complexity. + +- To test hypothesis 1), they use an estimator of the intrinsic dimension, and measure its fitness in a controlled setting for which the know the real intrinsic dimension (images generated by an state-of-the-art GAN). Hypothesis 1) is confirmed under this controlled setting and under a real scenario. + +- Section 5.1. shows the (inverse) correlation between intrinsic dimension and sample complexity, and shows that extrinsic dimension (i.e. number of pixels) has a much weaker correlation. This section aims to confirm hypothesis number 2). + +- I also find interesting the experiments in section 5.2, which studies the (inverse) correlation between intrinsic dimension and generalization (i.e. test accuracy). + +== Cons == + +- Caption in Figure 3 states that the authors ""observe the estimates to converge around the expected dimensionality of 10"". However, the dimensionality estimate greatly depends on k, the number of neighbours used for each image. No variance/confidence interval methods are reported, in this figure, so it's unclear whether the differences between 12 and 10 are large or not (although they seem small if one compares against the extrinsic dimension of the images: 128x128x3). + +- This paper uses yet another intrinsic dimension estimator, different from Gong et al. 2019 and Ansuini et al. 2019. It's unclear what's the impact of the estimator in the predicted value of the intrinsic dimension. + +- One of the emphasized contributions of the paper is that it's ""the first to show that intrinsic but not extrinsic dimensionality matters for the generalization of deep networks"" (page 6). As far as this reviewer is aware, indeed this paper is the first to measure intrinsic dimensionality *of images* and its relationship with generalization, but there are others that compare the intrinsic dimensionality of the final embedding with accuracy, showing the same conclusion (e.g. Gong et al. 2019, Ansuini et al 2019). Thus, the authors should be more specific when talking about intrinsic/extrinsic dimensionality (it refers to the image, not the embedding representation of a given deep neural network classifier). + +- Both the intrinsic dimensionality of the images and the classifier will impact the accuracy. This paper only focuses on the former, while other papers focus on the latter. 
Since this paper is posterior to the aforementioned papers, it would be appreciated if the authors could comment on which intrinsic dimensionality shows larger correlation with generalization, and draw some relationship among them. + +- Some figures can be hardly read if printed in grayscale (Figures 3, 6, 7, 8). I would suggest to use different line styles to better discern among curves in the plots, and using hatches in the histograms (Figure 1). + +- In the introduction, it seems that the authors missed important seminal works on autoencoders (e.g. ""Reducing the dimensionality of data with neural networks"" by Hinton and Salakhutdinov, 2016), since their references for autoencoders and regularization methods date back only to 2018. + +- I have some minor concerns regarding computational cost. The authors use a fraction of images as ""anchors"" and compute the nearest neighbour against the rest of the images in the dataset. This still leads to a quadratic cost in the number of images in the dataset, which may become problematic with modern datasets (ImageNet or even bigger ones). Given that the dimensionality estimates don't change much (e.g. Figure 3), why not fixing the number of samples to a constant number (e.g. 1000)? + +== Rationale for the score == + +Although I raised many points in my ""Cons"" section, many of these are more questions rather than specific issues that I have with the presented paper. The paper tackles an important question of interest for the ICLR community: how to estimate the intrinsic dimensionality of our datsets, and which impact does it have on generalization and sample complexity, and it does so with a quite convincing experimental setup. The method proposed by the authors could have important applications, such as estimating the number of required training samples for reaching a target accuracy. + +I hope that the authors can address my questions/concerns during the rebuttal to increase my score. + +*Update after discussion*: The authors have addressed all the points that I raised during the discussion. I appreciate the effort, and I'm increasing my score accordingly.",7,4.0,ICLR2021 +r1lcfqfsYS,1,H1gHb1rFwr,H1gHb1rFwr,Official Blind Review #3,"In this paper, a new network architecture called EVPNet was proposed to improve robustness of CNNs to adversarial perturbations. To this end, EVPNet employs three methods to leverage scale invariant properties of SIFT features in CNNs. + +The proposed network and the methods are interesting, and provide promising results in the experiments. However, there are several issues with the paper: + +- The authors claim that Gaussian kernels are replaced by convolution kernels to mimic DoGs. However, it is not clear (1) how this replacement, or employment of convolution kernels can mimic DoGs, or (2) more precisely, how the corresponding learned convolution kernels approximate Gaussian kernels. In order to verify and justify this claim, please provide detailed theoretical and experimental analyses. + +- It is also claimed that “a 1 × 1 conv-layer, can be viewed as a PCA with learnable projection matrix”. However, this statements is not clear. How do you assure that a 1x1 conv layer employs a PCA operation or the corresponding projection? + +- What does \| \|_p denote? Does it denote \ell_p norm? + +- What does x denote in d = w x h? Previously, it was used to denote matrix size. + +- Why do you compute \ell_2 norm for row vectors instead of column vectors? 
How do the results change when they are calculated for column vectors? + +- According to the notation, s_0 and s_1 are vectors. Then, what does max denote in (14)? That is, how do you compute max(s_0, s_1), more precisely? + +- In the statement “PNL produces a hyper-ball in the manifold space”, what do you mean by the “manifold space”? What are the structures (e.g. geometry, metrics etc.) and members of this space? + +- Please conceptually and theoretically compare the proposed method with state-of-the-art methods following similar motivation, such as the following: + +Weng et al., Evaluating the Robustness of Neural Networks: An Extreme Value Theory Approach, ICLR 2018. +",3,,ICLR2020 +SyxwlXOjcH,3,HJe88xBKPr,HJe88xBKPr,Official Blind Review #1,"Originality: The paper proposed a new scaling loss strategy for mixed-precision (8-bit mainly) training and verified the importance of rounding (quantization) error issue for low-precision training. + +Quality: The authors clearly illustrated the benefit of their proposed loss strategy and the importance of quantization error for two different tasks (image classification and NMT). The experiments are very clear and easy to follow. + +Clarity: The paper is clearly written with some visualizations for readers to understand the 8-bit training. + +Significance: +1. The enhanced loss scaling strategy is interesting but the method seems hand-tuning. Is there any automatical way or heuristic deciding way? +2. The stochastic rounding method is very intuitive. How do you choose the value of ""r"" in the equation? Is it a sensitive hyper-parameter or not? + +Typos: +Page 7: with with roughly 200M -> with roughly 200M +",6,,ICLR2020 +Skh5eWVWG,4,B1zlp1bRW,B1zlp1bRW,Very strong paper with novel and interesting results presented clearly and engagingly.,"This paper explores a new approach to optimal transport. Contributions include a new dual-based algorithm for the fundamental task of computing an optimal transport coupling, the ability to deal with continuous distributions tractably by using a neural net to parameterize the functions which occur in the dual formulation, learning a Monge map parameterized by a neural net allowing extremely tractable mapping of samples from one distribution to another, and a plethora of supporting theoretical results. The paper presents significant, novel work in a straightforward, clear and engaging way. It represents an elegant combination of ideas, and a well-rounded combination of theory and experiments. + +I should mention that I'm not sufficiently familiar with the optimal transport literature to verify the detailed claims about where the proposed dual-based algorithm stands in relation to existing algorithms. + +Major comments: + +No major flaws. The introduction is particular well written, as an extremely clear and succinct introduction to optimal transport. + +Minor comments: + +In the introduction, for VAEs, it's not the case that f(X) matches the target distribution. There are two levels of sampling: of the latent X and of the observed value given the latent. The second step of sampling is ignored in the description of VAEs in the first paragraph. + +In the comparison to previous work, please explicitly mention the EMD algorithm, since it's used in the experiments. + +It would've been nice to see an experimental comparison to the algorithm proposed by Arjovsky et al. (2017), since this is mentioned favorably in the introduction. + +In (3), R is not defined. Suggest adding a forward reference to (5). 
+ +In section 3.1, it would be helpful to cite a reference to support the form of dual problem. + +Perhaps the authors have just done a good job of laying the groundwork, but the dual-based approach proposed in section 3.1 seems quite natural. Is there any reason this sort of approach wasn't used previously, even though this vein of thinking was being explored for example in the semi-dual algorithm? If so, it would interesting to highlight the key obstacles that a naive dual-based approach would encounter and how these are overcome. + +In algorithm 1, it is confusing to use u to mean both the parameters of the neural net and the function represented by the neural net. + +There are many terms in R_e in (5) which appear to have no effect on optimization, such as a(x) and b(y) in the denominator and ""- 1"". It seems like R_e boils down to just the entropy. + +The definition of F_\epsilon is made unnecessarily confusing by the omission of x and y as arguments. + +It would be great to mention very briefly any helpful intuition as to why F_\epsilon and H_\epsilon have the forms they do. + +In the discussion of Table 1, it would be helpful to spell out the differences between the different Bary proj algorithms, since I would've expected EMD, Sinkhorn and Alg. 1 with R_e to all perform similarly. + +In Figure 4 some of the samples are quite non-physical. Is their any helpful intuition about what goes wrong? + +What cost is used for generative modeling on MNIST? + +For generative modeling on MNIST, ""784d vector"" is less clear than ""784-dimensional vector"". The fact that the variable d is equal to 768 is not explicitly stated. + +It seems a bit strange to say ""The property we gain compared to other generative models is that our generator is a nearly optimal map w.r.t. this cost"" as if this was an advantage of the proposed method, since arguably there isn't a really natural cost in the generative modeling case (unlike in the domain adaptation case); the latent variable seems kind of conceptually distinct from observation space. + +Appendix A isn't referred to from the main text as far as I could tell. Just merge it into the main text? + + + +",8,3.0,ICLR2018 +ehP8gRQpDtC,4,JoCR4h9O3Ew,JoCR4h9O3Ew,Review: clear paper with solid motivation and experimental evaluation. Some questions about the interpretation of the results.,"## Summary +This paper introduces a semi-supervised learning procedure that does not require labeled adversarial data to learn an ensemble model that is robust to adversarial attacks on classification tasks. + +## Quality +This paper is very well written; the design decisions of the training procedures are all supported by ablation tests and comparisons to other modern adversarial baselines. However, I am slightly worried by the ablation study results -- from Table 3, it seems like the DPP component of the loss does not provide a statistically significant lift. + +## Clarity +This paper is very clearly written! The choice of parameterization, experiment settings, and overall proposed methodology were clearly explained and motivated. + +## Originality +As the authors have mentioned, improving diversity in ensembles is not a novel approach to improving robustness to adversarial attacks; the originality of the method lies in the parameterization of diversity through a DPP component and multi-view complementarity. + +## Significance +Improving robustness to adversarial attacks is in of itself an important contribution to this field. 
This paper proposes a method that achieves a significant improvement over other competitors (Tables 1, 2); the proposed method is additionally intuitive, and can incorporate prior beliefs over the the data distribution, making it adaptable to several different settings. + +## Questions +### Methodology +My main question lies in the uses of the DPP loss: did the authors use the determinant directly? Depending on the choice of kernel, I would expect the determinant portion of the loss to eclipse the other terms, potentially hampering learning. + +On a related note, I am curious to know how sensitive the ARMOURED performance is to the $\lambda_{DPP}$ and $\lambda_{NEM}$ hyper-parameters. + +### Experiments +Do the authors have insight in the significant gap between ARMOURED-B and its variants for a large $\epsilon$ budget with $L_\infty$ PGD? I am surprised by the significant gap in Table 3. + +Again in Table 3, comparing the -F, no DPP, and -B variants, it seems like the highest lift in performance comes from the diversity in the NEM component. Do you know if a similar lift is observed by removing the NEM component and keeping the DPP? + +At inference time, you state that the augmentation step is skipped. Naively, I would not have been surprised to see these augmentations improve resilience to adversarial attacks. Am I incorrect?",7,3.0,ICLR2021 +Hym5uLBVx,3,S1HEBe_Jl,S1HEBe_Jl,"Interesting thought experiment, but strong concerns about the practicality of the approach","The submission proposes to modify the typical GAN architecture slightly to include ""encrypt"" (Alice) and ""decrypt"" (Bob) modules as well as a module trying to decrypt the signal without a key (Eve). Through repeated transmission of signals, the adversarial game is intended to converge to a system in which Alice and Bob can communicate securely (or at least a designated part of the signal should be secure), while a sophisticated Eve cannot break their code. Examples are given on toy data: +""As a proof-of-concept, we implemented Alice, Bob, and Eve networks that take N-bit random plain-text and key values, and produce N-entry floating-point ciphertexts, for N = 16, 32, and 64. Both plaintext and key values are uniformly distributed."" + +The idea considered here is cute. If some, but not necessarily all of the signal is meant to be secure, the modules can learn to encrypt and decrypt a signal, while an adversary is simultaneously learned that tries to break the encryption. In this way, some of the data can remain unencrypted, while the portion that is e.g. correlated with the encrypted signal will have to be encrypted in order for Eve to not be able to predict the encrypted part. + +While this is a nice thought experiment, there are significant barriers to this submission having a practical impact: +1) GANs, and from the convergence figures also the objective considered here, are quite unstable to optimize. The only guarantees of privacy are for an Eve that is converged to a very strong adversary (stronger than a dedicated attack over time). I do not see how one can have any sort of reliable guarantee of the safety of the data transmission from the proposed approach, at least the paper does not outline such a guarantee. +2) Public key encryption systems are readily available, computationally feasible, and successfully applied almost anywhere. The toy examples given in the paper do not at all convince me that this is solving a real-world problem at this point. 
Perhaps a good example will come up in the near future, and this work will be shown to be justified, but until such an example is shown, the approach is more of an interesting thought experiment.",5,4.0,ICLR2017 +S1lmbNX0cB,3,HkgeGeBYDB,HkgeGeBYDB,Official Blind Review #4,"This paper proposes a novelty detection method by utilizing latent variables in auto-encoder. Based on this, this paper proposes two metrics to quantifying the novelty of the input. Their main contribution is the NAP metric based on SVD. Their method is empirically demonstrated on several benchmark datasets, and they compare their proposed metrics with other competing methods using AUROC and experiments results are encouraging. + +The metrics proposed in this paper are intuitive and interesting. The experiments shown in Table2 is very convincing, and it could be better to extend Table3 to include other datasets (STL,OTTO, etc. ) + + +",6,,ICLR2020 +SJwrBiWVl,1,HkLXCE9lx,HkLXCE9lx,Review,"The paper proposes to use RL methods on sequences of episodes instead of single episodes. The underlying idea is the problem of 'learning to learn', and the experimental protocol proposed here allows one to understand how a neural network-based RL model can keep memory of past episodes in order to improve its ability to solve a particular problem. Experiments are made on bandit problems, but also on maze problems and show the interesting properties of such an approach, particularly on the maze problem where the agent seems to learn to first explore the maze, and then to exploit its knowledge to quickly find the goal. + +The paper is based on a very simple and natural idea which is acutally a good point. I really like the idea, and also the experiment on the maze which is very interesting. Experiments on bandits problem are less interesting since meta-learning models have been already proposed in the bandit problem with interesting results and the proposed model does not really bring additionnal information. My main concerns is based on the fact that the paper never clearly formally defines the problem that it attempts to solve. So, between the intuitive idea and the experimental results, the reader does not understand what exactly the learning problem is, what is its impact and/or to which concrete application it belongs to. From my point of view, the article clearly lacks of maturity and does not bring yet a strong contribution to the field. + +Good: +* Interesting experimental setting +* Simple and natural idea +* Nice maze experiments and model behaviour + +Bad: +* No real problem defined, only an intuition is given. Is it really useful ? For which problems ? What is the performance criterion one wants to optimize ? ... +* Bandit experiments do not really bring relevant informations +",3,4.0,ICLR2017 +o7lbMx2EyM,1,PUkhWz65dy5,PUkhWz65dy5,Overall interesting idea. Need some clarification.,"This was a well-written, and interesting paper to read! + +I went over the paper many times, and I am still failing to see the use case for such an approach. I have some questions and comments that need some clarification for me to properly evaluate the submission. Please, take the time to answer the following, so my review can better reflect the paper. + +1 - The theory developed in the paper relies on reward functions that can be represented as linear combinations of the features of the MDP. This seems to be restrictive, and intuitively, this would be the exception rather than the rule. 
+What class of problem could be modeled under this restriction? In many problems, there is no linear reward function that would allow an agent to achieve the desired behavior, so these techniques would not be helpful. What are some practical setting where this approach would be beneficial? + +2 - Lemma 3... this statement is putting an upper-bound on the worst case performance, but since the paper focuses on improvement of worst case performance, it would be beneficial to have a lower bound, but an upper bound doesn't seem too useful. Essentially, this lemma is saying ""I can guarantee that the worst-case won't be better than this upper bound, and that for some MDP with linear reward function this upper bound is attainable."" The problem is that we don't know what that MDP is, how likely it is that we would find it, and this lemma allow for the worst case performance to be arbitrarily bad. +I don't think this lemma, as is, is particularly useful. + +3 - On the experimental section, I think there's a baseline that should be included that's missing. What if we have 1 policy and add a task descriptor or extra features to the features vector that corresponds to the type of task? How would the performance empirically compare? + +4 - In the learning curves for fig 1.a or 2.a, what does ""value"" (y-axis) represent? If it the return of the agent after training? If so, is it using the extrinsic reward or the transformed linear reward described in line 5 of ""DeepMind Control Suite""? + +5 - Based on equation 4, for definition 2 of SIP. There is always a trivial set improving policy, right? That would correspond to picking the policy for max(v^i_w). +",6,4.0,ICLR2021 +B1loCbGk5S,3,ryeRn3NtPH,ryeRn3NtPH,Official Blind Review #2,"the paper studies transfer learning, which addresses the inconsistencies of the source and target domains in both input and output spaces. usually, we only worry about the inconsistencies in the input domain but here we worry about input and output. the paper proposes adversarial inductive transfer learning which uses adversarial domain adaptation for the input space and multi-task learning for the output space. + +the main contribution of the paper is in identifying a new type of problem that may be worth studying. the proposal of the paper is sensible and its potential application to pharmacogenomics seems appealing. the paper shows promising performance of the proposal across datasets.",6,,ICLR2020 +SyeKgE2JqS,3,Syl38yrFwr,Syl38yrFwr,Official Blind Review #1,"To improve the privacy-utility tradeoff, this manuscript proposes a voting mechanism used in a teacher-student model, where there is an ensemble of teachers, from which the student can get gradient information for utility improvement. The main idea of the proposed approach is to add a constant C to the maximum count collected from the ensemble, and then noise is furthermore added to the new counts. I can understand that by adding the large constant C, the identity of the maximum count could be preserved with high probability, leading to a better utility on the student side. However, equivalently, this could also be understood as that the noise is not added uniformly across all the counts, but instead a relatively smaller noise is added to the maximum count. Hence it is not clear to me whether the final composition will still be differentially private? 
+",1,,ICLR2020 +B1klq-5lG,2,ryiAv2xAZ,ryiAv2xAZ,interesting idea for robust classification,"The manuscript proposes a generative approach to detect which samples are within vs. out of the sample space of the training distribution. This distribution is used to adjust the classifier so it makes confident predictions within sample, and less confident predictions out of sample, where presumably it is prone to mistakes. Evaluation on several datasets suggests that accounting for the within-sample distribution in this way can often actually improve evaluation performance, and can help the model detect outliers. + +The manuscript is reasonably well written overall, though some of the writing could be improved e.g. a clearer description of the cost function in section 2. However, equation 4 and algorithm 1 were very helpful in clarifying the cost function. The manuscript also does a good job giving pointers to related prior work. The problem of interest is timely and important, and the provided solution seems reasonable and is well evaluated. + +Looking at the cost function and the intuition, the difference in figure 1 seems to be primarily due to the relative number of samples used during optimization -- and not to anything inherent about the distribution as is claimed. In particular, if a proportional number of samples is generated for the 50x50 case, I would expect the plots to be similar. I suggest the authors modify the claim of figure 1 accordingly. + +Along those lines, it would be interesting if instead of the uniform distribution, a model that explicitly models within vs. out of sample might perform better? Though this is partially canceled out by the other terms in the optimization. + +Finally, the authors claim that the PT is approximately equal to entropy. The cited reference (Zhao et. al. 2017) does not justify the claim. I suggest the authors remove this claim or correctly justify it. + +Questions: + - Could the authors comment on cases where such a strong within-sample assumption may adversely affect performance? + - Could the authors comment on how the modifications affect prediction score calibration? + - Could the authors comment on whether they think the proposed approach may be more resilient to adversarial attacks? + +Minor issues: + - Figure 1 is unclear using dots. Perhaps the authors can try plotting a smoothed decision boundary to clarify the idea?",7,3.0,ICLR2018 +gLNIRUfCIVR,3,yoem5ud2vb,yoem5ud2vb,Test environments/tasks are not persuasive,"########################################################################## + +Summary: + +The paper aims at achieving learning-based graph representations of MDPs with the goal of improving results in sparse-reward RL problems via better exploration. Experimens are mostly done on synthetic environments proposed by the paper itself (besides MountainCar environment). Baselines are SPTM, SoRB and HER. + + +########################################################################## + +Reasons for score: + +Experimental evidence in the paper is insufficient to estimate the value of the approach. + + +########################################################################## + +Pros: + +1. Interesting ideas about planning for exploration and learnable graph representations for MDPs. +2. Visually the paper looks nice. + +########################################################################## + +Cons: + +1. Test environments/tasks don't look persuasive. 
Why not pick some more established setups from prior work (e.g., SPTM or SoRB or HER papers) and show advantage on their ground? Creating new setups is only justified if the existing ones are not sufficient with respect to the goals of the paper. +2. The main algorithm idea is not clear. There are many components and the description does not comprehensively show how they fit together and what motivated those design choices. +3. Some statements in the paper would profit from softening/more accurate formulation. To give a few examples: +- ""SPTM rely on human sampled trajectories to generate the graph, which is infeasible in RL exploration"" - collecting human demonstrations/trajectories is indeed costly but it's not infeasible. In fact, it is a common recipe for kick-starting RL which powered the first versions of AlphaGo and AlphaStar. +- ""Another drawback is that existing methods cannot be used for facilitating exploration which is important in RL."" - it is worth keeping in mind a paper ""Episodic Curiosity through Reachability"" https://arxiv.org/pdf/1810.02274.pdf. While it doesn't plan for exploration, it still keeps some kind of topological map in memory, although without explicitly storing edges. +- ""In cognitive science society, researchers summarize these discoveries in cognitive map theory (Tolman, 1948), which states that animals can extract and code the structure of environment in a compact and abstract map representation."" - that paper came many years earlier than those previously mentioned, so it can't possibly summarize them. In fact, O'Keefe was sceptical about cognitive maps from what I read in those papers. + +########################################################################## + +Questions during rebuttal period: + +Unfortunately, to make paper meet high ICLR standards would require too large a change in my opinion. I would encourage the authors to fix the cons mentioned above and resubmit. + +######################################################################### + +Minor suggestions and typos: + +(1) ""which mark three disjoint regions [0; 1]; (1; 3]; (3;+1),"" - how did the magical constant 3 come to be?",3,5.0,ICLR2021 +sITqER2ztS,4,NgZKCRKaY3J,NgZKCRKaY3J,Improved Calibration Metric Based on Empirical Evidence,"### Summary + +This paper highlights a major flaw with the commonly-used $ECE_\text{BIN}$ calibration metric, namely that it is biased for perfectly-calibrated models. Through a large number of empirical experiments (including through simulation), the authors show that a newly proposed metric, $ECE_\text{SWEEP}$ is able to produce less biased estimates of calibration error. + +### Paper Strengths + +1. The paper has many empirical experimental and simulation data to support the claim that $ECE_\text{SWEEP}$ is a useful calibration metric. + +2. The paper points drawbacks of commonly-used calibration metrics such as $ECE_\text{BIN}$, namely that it is biased for perfectly-calibrated models, and suggests alternatives. + +### Major Concerns + +1. In the theoretical discussion, the paper assumes binary classification. However, the empirical experiments are conducted on multi-class classification datasets. I did not find any mention of how the binary classification discussion is generalized to the multi-class setting. + +2. The discussion about how the simulations were conducted was difficult to follow. Are the $m$ simulated datasets essentially subsets of the full CIFAR-10/100 or ImageNet datasets? (This seems to be implied, but is never made explicit. 
Otherwise, it can read as if these datasets are synthetic images.) And how do you draw samples such that ""the confidence score distribution matches the neural model's best fit Beta distribution and the true calibration curve matches the neural model's best fit GLM""? + +3. The equation for $ECE_\text{SWEEP}$ seems misleading. Are you really trying to maximize the quantity in parentheses over all possible values of $b$? And if so, why does that mean that this maximization will result in the largest number of bins, under the monotonicity constraint? + +4. Is it always possible to satisfy the monotonicity constraint for $ECE_\text{SWEEP}$? Some more theoretically-sound discussion of $ECE_\text{SWEEP}$ would be much appreciated. + +5. The discussion on Page 3 about Figure 2 can be expanded upon. Why does the ""optimal bin count grow with the sample size""? This is stated as fact without explanation. + +6. In general, while the experiments are mostly convincing, it would be useful to have some theoretical notions for why/when $ECE_\text{SWEEP}$ is less biased than $ECE_\text{BIN}$. + + +### Minor Concerns + +1. There are many typos, grammatical errors, and mis-referenced figures throughout the manuscript. Please correct them. + +2. Running experiments on new datasets is understandably time-consuming. However, given that this paper is heavily dependent on empirical evidence, it would be helpful to see experiments on datasets beyond image classification. + +### Original Rating + +**Rating** - 5: Marginally below acceptance threshold + +**Confidence** - 3: The reviewer is fairly confident that the evaluation is correct + +### Post-Rebuttal Update + +I applaud the authors for providing detailed responses to my (and other reviewers') questions and for updating the manuscript appropriately. Several follow-up thoughts to my original questions: + +1. **Binary classification.** I think I understand what you mean now. Basically you are still performing multi-class classification, but then you treat the calibration problem as a binary: is the top-1 model output correct or not? I think this still can be made clearer in the manuscript. + +2. **Simulation procedure**: Thank you for the clarifications. + +3. **$\text{ECE}_\text{SWEEP}$ equation**: I am still concerned that the equation given for $\text{ECE}_\text{SWEEP}$ differs from Algorithm 1. (Thanks for moving the algorithm into the main manuscript!) The issue is that I don't think Algorithm 1 is taking the maximum over the quantity in the parentheses of the $\text{ECE}_\text{SWEEP}$ equation. Algorithm 1 is not actually computing $\max_b g(b)$ for some function $g(b)$. Instead, it is first finding the maximum $b$ that satisfies some criterion, then calculating $g(b)$ at that selected $b$. + + If the equation and the algorithm are the same, it would imply that increasing the number of bins will increase the estimated calibration error. However, this does not seem to be true. Consider a binary dataset that is perfectly balanced: $\bar{y_1} = \frac{1}{n} \sum_{i=1}^n y_i = 0.5$, and consider a constant model, i.e. $\forall x: f(x) = 0.6$. Then for the case $b=1$, $ECE = 0.1$. But for the case $b=2$, $ECE = 0$ (assuming equal-width binning). + +4. **Monotonicity constraint**: Thanks for pointing out the trivial constraint satisfaction. + +5. **""optimal bin count grows with the sample size""**: I understand the empirical and intuitive reasoning for why this should be true. However, I still wish that this notion could be made more formal. + +6. 
**Theoretical notions for why/when $\text{ECE}_\text{SWEEP}$ is less biased than other estimators of calibration error**: I am now more convinced that this is difficult to show, and I understand that a lot of the literature is based on empirical evidence. However, given that the results are empirical, I still would like to see experiments are non-image datasets. + +**Updated Rating** - 6: Marginally above acceptance threshold",6,3.0,ICLR2021 +Bke-0uyAtr,1,BklxI0VtDB,BklxI0VtDB,Official Blind Review #3,"This paper proposed a hierarchical approach to perform robotic object search (ROS). +The idea is to use a high-level policy which produces subgoals and a low-level policy which produces atomic actions conditioned on both the subgoals and the true goal, and which is trained with a weighted sum of the original extrinsic reward and a reward for reaching the subgoal. Subgoals are consist of different objects in the field of view which the robot can choose to approach. +The approach is evaluated on the House3D dataset, where it is shown to perform well. + +Recommendation: weak reject. + +This isn't a bad paper, but I'm not sure it will be of broad interest at ICLR. +It is very specific to ROS problem and House3D dataset and doesn't seem to propose a general algorithm which can be broadly applicable elsewhere. The mechanism for generating subgoals and training the low-level policy is very task dependent (subgoals are constrained to be objects in the field of view, the intrinsic reward for training the low level policy is dependent on the size of the bounding box of the object defining the subgoal). While this probably accounts for improved performance on the House3D dataset, I think the audience of ICLR would be more interested in a general approach which can be used in many different domains (even at the cost of performing less well on a specific domain than something more tailored). This paper may be a better fit for a robotics conference. + +Another point concerning the experimental evaluation. Sparsity of the rewards is mentioned as a main motivation for the hierarchical approach. However, there are a number of methods which use exploration bonuses to address this issue (pseudocounts, random network distillation, ICM etc. [1, 2, 3]). At least one of these should be included as a baseline. + +Specific comments: +- using two letters for denote a single variable is confusing, since it seems like a product. I.e. using ""sg"" to denote a subgoal, ""at_t"" to denote area. Please use a single letter and add subscripts if necessary to disambiguate. +- in the various equations, please use ""\log"" instead of ""log"" so that it is not italicized. +- bottom of page 4: ""Q-leaning"" -> ""Q-learning"" +- page 2: ""way pf"" -> ""way of"" +- please use more informative names for Settings A/B +- First paragraph in Section 2: ""hierarchical policy for the robot to perform the object search, motivated by how human beings conduct object search"". Saying the method is similar to how humans behave is a fairly big claim that should be substantiated by appropriate references, or not made at all. + + +References: +[1] https://arxiv.org/abs/1810.12894 +[2] https://arxiv.org/abs/1703.01310 +[3] https://arxiv.org/abs/1705.05363 +",3,,ICLR2020 +v-D8VRGthFe,2,0h9cYBqucS6,0h9cYBqucS6,"I would recommend weak accept. The paper studies an important topic in private federated learning, i.e., improving the communication/computational efficiency. The main concern that limits my score is its novelty/contribution (See reviews). 
","Summary: + +The paper proposes a secure aggregation framework for federated learning that is communication-computation efficient. Specifically, instead of sharing its public keys and secret shares to all the other clients as done in the existing scheme, each client only shares to a subset of selected clients, which reduces the communication and computational costs. The sufficient condition on the graph topology of the selected pairs (assignment graph) is identified under which the private and reliable learning is guaranteed. Experiments on real-world datasets validate the theory. + + +Strength: + +The paper provides the rigorous theoretical guarantee for the algorithm, including the followings, +1. Sufficient conditions on the assignment graph are identified under which the federated learning is reliable and private; +2. For Erdos-Renyi random graph where two nodes are connected with probability $p$, the lower bound of $p$ is given such that the algorithm is asymptotically almost surely reliable and private. The upper bound of error probability is also given for a finite number of nodes such that the algorithm is reliable and private. + + +Weakness/Comments: + +1. My biggest concern is the novelty of the paper, which seems to be not significant. Specifically, the proposed algorithm's main contribution is generalizing the existing secure aggregation framework (Bonawitz et al., 2017) from the complete assignment graph to an arbitrary graph. The only modification is on the assignment graph, while the framework itself is still the same as (Bonawitz et al., 2017). Moreover, the idea of limiting communications over a li +distributed learning over the Erdos-Renyi random graph has been studied and analyzed extensively in the literature. + +2. In experiments (Fig 4), when p >0.795, the proposed algorithm can achieve the same test accuracy as SA. It seems communication/computational efficiency can be attained ""for free. "" However, intuitively, if the fewer nodes are aggregated each round in update, the convergence rate ought to decrease. Is the same test accuracy because of the generalization property? How does the training accuracy of two algorithms compared to each other? + +3. Related work: secure multiparty computation (MPC) has been studied extensively in distributed learning, with or without a central server. Federated learning essentially is a special type of distributed learning. I suggest authors can include more related work about MPC in distributed learning. For example, +(1) K. Tjell and R. Wisniewski, ""Privacy Preservation in Distributed Optimization via Dual Decomposition and ADMM,"" 2019 IEEE 58th Conference on Decision and Control (CDC), Nice, France, 2019 +(2) C. Zhang, M. Ahmad and Y. Wang, ""ADMM Based Privacy-Preserving Decentralized Optimization,"" in IEEE Transactions on Information Forensics and Security, vol. 14, no. 3, pp. 565-580, March 2019 +(3) Shen, S., Zhu, T., Wu, D., Wang, W. and Zhou, W., ""From distributed machine learning to federated learning: In the view of data privacy and security. Concurrency and Computation: Practice and Experience"", 2020. + + +",6,3.0,ICLR2021 +HJeobaJAFS,2,rkxNelrKPB,rkxNelrKPB,Official Blind Review #1," + +This paper focuses on signSGD with the aim of improving theoretical understanding of the method. 
The main contribution of the paper is to identify a condition SPB (success probability bounds), which is necessary for convergence of signSGD and study its connections with the other conditions known in the literature for signSGD analysis. One important point here is that the norm in which the authors show convergence now depends on SPB, meaning that the probabilities in SPB are used to define the norm-like function they use in the theorems. + +This paper is well-written and nicely structured and I like the relationships of SPB with other conditions. However, I have some concerns on the generality of SPB that I will detail below. + +- First of all, Lemma 2 is not clear to me at all. The authors say that the variance is bounded by a constant (0 \leq c_i < 1/sqrt{2}) multiplied by the true gradient norm and then they show that this assumption implies SPB. I do not know how restrictive this condition is. For example, what happens when all elements of true gradient is close to zero, I don’t know if it is reasonable to assume the noise to be small for this case. I cannot make the connection of this assumption and the classical bounded variance assumption (E((\hat{g_i}-g_i)^2)\leq\sigma_i). I can believe the result of Lemma 3 with specific constants $c_i$ as given, but I feel that it is then much stronger than standard bounded variance assumption. Because it would be asking the variance to be smaller than some specific constant. + +- Related to first point, I did not understand the remark in the first footnote of page 2. The authors argue that SPB is weaker than bounded variance assumption. But at the same time, it is known that bounded variance assumption is not enough to make signSGD work, with counterexamples given in Karimireddy et. al. 2019. So, it is quite weird that an assumption weaker than bounded variance (for which signSGD provably does not convergence) makes signSGD converge. So I think it is more natural for SPB to be stronger than bounded variance, because it is enough to make signSGD work. The only proof in the paper that would support this claim is Lemma 2, as I discussed above is stronger than standard variance bound. I hope that authors can clarify this point. + +- After Theorem 1, the authors compare their result with Bernstein et. al. 2019 and mention that Bernstein et. al. needs to use mini-batches depending on $K$ where $K$ is the iteration and unimodal symmetric noise assumption. But when I check Bernstein et. al. 2019, I see that these are different cases. Specifically, Theorem 1 in Bernstein et. al. 2019, uses mini-batch size 1 under unimodal symmetric noise assumption. The case where they would use mini-batches of size $K$ is in Section 3.3 of Bernstein et. al. 2019 where they *drop* unimodal symmetric noise assumption. So, I would suggest the authors to be more exact on this comparison because it is confusing. In fact, in Section 3.3 of Bernstein et. al. 2019, the authors also identify SPB as it is implied by unimodal symmetric noise assumption. It is the paragraph under Lemma 1 in Bernstein et. al. 2019. + +- My other concern is the comparison with Karimireddy et. al. 2019 both in theory and practice. Karimireddy et. al. 2019 modifies signSGD and under unbiased stochastic gradients and bounded variance assumption, obtains similar guarantees as this paper. I am aware that this paper does not assume unbiasedness, but like I said before, I do not know how SPB compares to variance bound. So, I see Karimireddy et. al. 
2019 and this paper as similar results, so I would want to see some practical comparison as well. In Appendix E, the authors mention that Karimireddy et. al. 2019 has storage need but I think that need is negligible since they only need to store one more vector. + +- A side-effect of SPB is that now the convergence results are given in $\rho$-norm where $\rho$ is determined by SPB. I understand why this is needed from the proof of Theorem 1, and its implications in the theorem, but given that Karimireddy et. al. 2019’s result is given in l_2 norm which is easier to interpret, I think more comparison is needed. + +- Lastly, I like the fact that SPB is implied by the previous assumption in Bernstein et. al. 2019, namely unimodal symmetric noise, I am not convinced that SPB is much weaker than this assumption. The authors mention in several places in the paper that their assumption is very weak, but looking at Lemma 1, Lemma 2 and Lemma 3: Lemma 1 and Lemma 3 are the already known cases where signSGD works, and Lemma 2 is a new case where signSGD works but as I explained before, it is not clear to me how restrictive this assumption is. Therefore, I am rather unsure if this generalization of SPB is practically significant. + +Minor comments: +- page 2, Table 1: I think it would be useful to add the results of Karimireddy et. al. 2019. +- page 2, Table 1 and footnote 1: Footnote sign is given for the bounded stochastic gradient assumption but the explanation in the footnote text talks about the bounded variance assumption. Of course bounded stochastic gradient implies bounded variance, but this should be clarified. In addition, the footnote text is not clear to me, could the authors either point out to some references or give a proof? +- page 2, Adaptive methods paragraph: The end of the paragraph says that signSGD is similar to Adam so, studies on signSGD *can* improve the understanding of Adam. I would be happy if the authors are more exact about this, such as when signSGD is equivalent to Adam etc. +- page 3 discussion after Assumption 1: I do not understand the sentence starting with ‘Moreover, we argue that’. Can the authors give more details on why it is reasonable? +- page 4 Lemma 2: I think the authors should include the definition of variance in the paper. Since the assumption in this Lemma is rather non-standard, I think it makes sense to be as exact as possible. +page 21, Appendix E: It is written that ‘SPB is roughly a necessary and sufficient condition’. I could not understand what *roughly* means in this sentence. From what I have read, the authors have a counter-example showing without SPB, signSGD does not work and with SPB, it works, so I could not understand why it is written roughly here. + +Overall, I like the generalization of SPB, but as I detail above, I am not sure how significant the generalization is compared to other results and more specifically how it compares to standard bounded variance (which I believe is weaker than Lemma 2). Therefore, I remain not convinced about the impact of this generalization hence the results. In addition, I would have liked to see more comparisons (both theoretical and practical) with Karimireddy et. al. 
2019.",3,,ICLR2020 +xn8vr0P4wAI,1,DigrnXQNMTe,DigrnXQNMTe,new kernels incorporating probability notion are proposed to perform two-sample test; but not yet clearly defined/explained.,"This paper tries to propose a kernel-based discrepancy measure called generalised probability kernel that can unify MMD and KSD which is an interesting topic of discussion. The paper applies the new discrepancy to perform two-sample tests. +The new kernel proposed, unlike the previous RKHS kernels that only depend on data-points, incorporate the notion of probability. e.g. kernel K_{p,q} depends on density p and q. also a symmetric version on discrete KSD is discussed. +Despite the idea is interesting, there are several flaws which can be reviewed. + +Firstly, I think the paper is not clearly presented, with some confusing notation. +--in Definition 1, you defined a kernel, on distributions p and q, that is a k x k matrix; while in definition 2, the notion of K, are on samples and is a scalar output. +it is unclear of how \phi is defined in general; only examples are given later for specific cases so that we got an conjuncture. +--in Definition 5, why is it different from stein operator of KDSD? or it is supposed to say difference operator? + +In addition I have several confusions: +1. why is MMD_E^2 an unbiased estimator? what happened to k(x_i, x_i)? it is not clear from the Bernstein polynomial introduced in appendix. +2. in abstract, it claims that the kernels are between distributions instead of samples, but in the main text it is still evaluation at p_i=p(x_i) on samples; I m confused of the difference and novelty claimed. +3. The above concern brings up the question while applying on two sample test. +--When the MMD is used to perform two-sample test, it is assumed that both p and q are unknown. however, to my understanding, we need to know p and q to define k_{prob}; how is this going to be applied to two-sample test? +--for KSD setting, when the symmetric KDSD is introduced, it also seems to require p and q to known for two-sample testing. In the Liu2016 setting, where goodness-of-fit test is proposed with KSD, q is known (up to normalization) while p is unknown with samples; that is a key point why KSD is useful for goodness-of-fit test. +In addition, is there any argument on why the symmetric-KDSD might be better than KDSD Yang et.al 2018? + +An additional point is regarding literature review, which is yet throughout to check; e.g. as +Chwialkowski, et. al ""A kernel test of goodness of fit."" proposed independently as Liu et.al for KSD goodness-of-fit test, that might be useful to cite. + +In my point of view, ICLR may not be a venue of fit either. More reviews and clarifications may be required, for both kernel construction and application. ",2,4.0,ICLR2021 +MNWzxywcNVF,2,lfJpQn3xPV-,lfJpQn3xPV-,Empirical work. Results are expected but not exciting.,"This work empirically evaluates the sliding-window strategy for training GNNs with temporal graphs. One may cast the temporal nature of the graph data in an online setting, under which the change of the graph structure as well as the variation of the classes cause distribution shift. The authors conduct a series of experiments to show that the sliding-window strategy is as effective as using the entire historical data for training. + +Pluses: + ++ For different temporal graphs, the duration of a time step and the number of time steps (window size) are often ad-hocly defined and are not comparable. 
The authors introduce a measure of temporal difference that facilitates a more principled definition of the time step and the window size so that they are comparable across datasets. + ++ The authors pose four important questions and conclude clear answers based on experimentation. The findings are: (1) incremental training is necessary to account for distribution shift, compared to a once-trained, static model; (2) incremental training with warm start does not always yield better performance than cold start; (3) the window size needs be large enough for incremental training to catch up with the performance of full-data training (e.g., covering at least 50% receptive field); and (4) these findings extend to several GNN models. + ++ The authors compile three temporal graphs, which enrich the availability of benchmark datasets. + +Minuses: + +- The empirical findings are very much expected, which means that they are not exciting. From the methodological point of view, using sliding windows to train temporal GNNs is a no brainer choice if certain RNN modeling is involved. Since most of the presented results are naturally expected and there lacks theory/method contribution, the reader is unsure about the value of the paper. + +- A common pattern of the contributed datasets is that nodes and edges are inserted but never deleted. While the empirical findings are quite natural in this simple scenario, there will be a lot more uncertainty when the scenario becomes increasingly complex. For example, in social networks, accounts represented by nodes may be deleted and relationships represented by edges may dynamically change. + + For another example, in communication networks where an edge denotes communication between two entities, the edges are instant and time stamped. The challenge in this case is less about distribution shift, but more about how to handle edges and what are the consequences. The online learning of this kind of data necessarily goes beyond a simple GNN such as the ones experimented in this paper, but the findings will be more valuable. +",5,4.0,ICLR2021 +ENuFZ4RneTa,4,qkLMTphG5-h,qkLMTphG5-h,submission 831 review,"The paper proposes to reutilize pretrained MAML checkpoints for out-of-domain few-shot learning, combining with uncertainty-based adversarial training and deep ensembles. + +Pros: + +1. The idea of combining meta-learning, uncertainty learning and adversarial training is well-structured. In particular, the related work part provides a clear introduction of background work. + +2. It is quite novel to leverage adversarial learning as data augmentation for meta-testing in MAML. +3. The paper provides extensive and convincing experiment results over evaluating the proposed model’s robustness to the choice of base stepsizes. + +Cons: + +1. It would be better if an optimization equation is provided, especially if there are generated adversarial examples. + +2. For the ablation study, the authors mention that the best absolute performance (Top-1) is always obtained through some use of adversarial training. Actually, it would be more convincing if they can discuss more choices of \lambda_{AT} and \lambda_{AUG} and \lambda_{a} to present the sensitivity analysis of the hyper-parameters.",5,2.0,ICLR2021 +gSP20LdcKgC,4,9EKHN1jOlA,9EKHN1jOlA,Review 3,"Summary +---------- + +This paper presents an approach to uncertainty modeling in recurrent neural networks through a discrete hidden state. 
The training of this discrete model is done using a reparameterizable approximation (in particular, using the Gumbel-Softmax trick). The authors show the utility of this method on a variety of problems, including showing effective out of distribution detection and improved calibration in classification tasks. + +Comments +---------- + +This paper presents a relatively simple idea that builds relatively directly from previous work, and uses the now common Softmax-Gumbel trick to enable differentiability. The main strength of this paper is the thorough experimental evaluation, on a wide variety of problems. + +The main weakness of this paper is the very unclear presentation of the method. In section 2.1, the authors do not define all quantities, the mathematics of the method is interspersed with discussions of the approaches of others, and the writing is unclear. The authors must clarify the presentation of their method, and have this presentation be distinct from discussion of previous work. + +Overall, the experimental results seem compelling and interesting. The authors should clarify their discussion of the partially observed RL task. In the partially observed task, is the agent only provided lagged measurements of the state? The presentation if quite confusing and the authors should state what this task is as clearly as possible. + +Post-Rebuttal +---------- +I thank the authors for their response. Both of the sections are now more clear, although the authors should make an effort to polish the narrative of the paper and the clarity of exposition throughout. The discussion of epistemic versus aleatoric uncertainty in the appendix is also interesting. I have increased my score from 6 to 7. ",7,2.0,ICLR2021 +yKDcZ1zqZx3,3,tqc8n6oHCtZ,tqc8n6oHCtZ,Official Blind Review #2,"This work introduces a method, called LengthDrop, to train a Length-Adaptive Transformer that supports adaptive model architecture based on different latency constraints. In order to make the model robust to variable input lengths, the method stochastically reduces the length of a sequence at each layer during training. Once the model is trained, the method uses an evolutionary search to find subnetworks that maximize model accuracy under a latency budget. + +Pros: +- Accelerating the inference speed of Transformer networks is an important problem. +- The idea of training a length-adaptive Transformer once and using it in different scenarios with different latency constraints is interesting. + +Cons: +- The discussion with several state-of-the-art work is lacking. +- The experimental setup is vague, and the evaluation results are inadequate. + +The paper looks from an interesting angle to build adaptive Transformers for inference -- reducing the input sequence at each Transformer layer. However, there are a few concerns. + +First, the paper proposes to use a series of techniques to make LengthDrop work but lacks the ablation studies to show how those techniques help to make Transformer length adaptive. For example, Section 3.1 states that LengthDrop requires LayerDrop[1], which also supports adaptive Transformer by stochastically dropping layers during training for adaptive inference. However, there are no ablation studies or comparison results on LengthDrop vs. LayerDrop in terms of the accuracy-vs-latency trade-off. This raises the question of whether LengthDrop is necessary to obtain the given accuracy-vs-latency or perhaps simpler alternatives such as LayerDrop would be sufficient. 
+ +In addition to LayerDrop, it appears that the paper also incorporates several other fixes, such as the sandwidth rule and inplace distillation, which are borrowed from prior work. However, how these fixes contribute to LengthDrop is not clearly explained, and there are no studies nor experimental results to explain how each technique contributes to the final accuracy-vs-latency results. + +Second, the comparison with related work is weak. In particular, LengthDrop is built on top of PoWER-BERT, yet the evaluation does not compare with PoWER-BERT. Furthermore, the paper compares with DistillBERT, but there are multiple knowledge distillation based work that show better performance than DistillBERT, such as TinyBERT[2]. In terms of adaptive architecture, the evolutionary search of length configurations is similar to the NAS process in the Hardware-aware Transformer[2], which seems to be very related as it also uses evolutionary search to find a specialized sub-network of Transformer models with a latency constraint. The paper briefly mentioned [2], but it is not clear the advantage of this work as compared with [2]. + +Third, the paper lacks enough information on the evaluation setups, raising several questions on the reported speedups. For example, it is unclear what's the batch size used in the evaluation. Figure 3(a) shows that reducing FLOPs on GPU does not lead to a reduction of latency for batch size 1, which is the common setting for online inference scenarios as queries come in one-by-one. It is unclear whether input length reduction may actually bring significant latency reduction when the batch size is small (e.g., 1), as the large matrix multiplications have been highly optimized on modern CPU and GPU through efficient kernels (e.g., cuDNN). Even for results on CPU and GPU with batch size >= 16, it is less clear whether the linear correlation between FLOPs and latency is a fact of failing to use highly optimized BLAS libraries, because the paper does not report the details on the hardware, the inference frameworks, and libraries it uses for the experimental results. + +In addition to the batch size and lack of hardware/platform/library information, the experimental setup for training a Length-Adaptive Transformer is also not very clear. For example, it is unclear what's the maximum sequence length is used in training. Whether mixed sequence length is used in training BERT? What is the sequence length(s) to obtain the results in Table 1? Without associating the actual length reduction ratio, it is hard to evaluate the reported FLOPs reduction. + +[1] Fan et. al. ""Reducing Transformer Depth on Demand with Structured Dropout"", https://arxiv.org/abs/1909.11556 + +[2] Jiao et. al. ""TinyBERT: Distilling BERT for Natural Language Understanding"", https://arxiv.org/abs/1909.10351",4,4.0,ICLR2021 +H1xTS8cptB,1,ryguP1BFwr,ryguP1BFwr,Official Blind Review #3,"Summary: +This paper studies some of the properties of fully convolutional autoencoders (CAE) as a function of the shape and total size of the bottleneck. They train and test CAEs with bottlenecks consisting of different ratios of spatial resolution versus number of channels, as well as different total number of neurons. The authors investigate which type of change in the bottleneck is most influential on training behavior, generalization to test set, and linear separability for classification/regression. 
Their first main finding is that the spatial resolution of the bottleneck is a stronger influencer of generalization to the test set than the number of channels and the total number of neurons in the bottleneck. The second main finding is that even when the total number of neurons in the bottleneck is equal to the data input size, the neural network does not appear to simply learn to copy the input image into the bottleneck. + + +Decision: +Weak reject: It is always refreshing to see papers that address/challenge/investigate common assumptions in deep learning. However, I find the experimental findings and discussion of borderline quality for a full conference paper. It might be more suitable as a good workshop paper. + +Supporting arguments for decision: +It is unclear to me why the authors have chosen to only take a subset of the CelebA and STL-10 datasets for training and testing. It seems like dataset size is also an important factor that increases the complexity for training a model, and it certainly affects how a model can generalize. When auto-encoders are studied in other literature it is uncommon practice to restrict the dataset sizes like this, so this makes me question the applicability of this paper’s results to the literature. + +It seems that the experimental validation is based on one run per CAE model with one single seed. This is on the low side of things, especially when quite extensive claims are made. An example of such a claim is on page 6 when discussing a sudden jump in training and test scores for 8x8x48 model trained on the Pokemon dataset. Because the same behavior appeared when the authors repeated the experiment with the same seed, the authors conclude “This outlier suggests, that the loss landscape might not always be as smooth towards the end of training, as some publications (Goodfellow et al., 2014) claim and that ‘cliffs” (i. e., sudden changes in loss) can occur even late in training.” Making this claim based on something that occurs with one single model for a single seed is not convincing and overstating this finding. +Another example is on page 7 under bullet point 4, where the authors discuss the obtained evidence against copying behaviour when the bottleneck is of the same size as the input. The authors state “We believe this finding to have far-reaching consequences as it directly contradicts the popular hypothesis about copying CAEs.” The paper definitely shows some empirical evidence that supports the claim that copying does not occur, but these findings are all done with a single seed and by considering small subsets of datasets (celebA and stl-10). In my opinion, it is therefore too much to state that the current findings have far reaching consequences. It has potential, but I wouldn’t go much further than that. + +On page 7 in the second to last paragraph the influence of dataset complexity is discussed. The authors state “the loss curves and reconstruction samples do not appear to reflect the notion of dataset difficulty we defined in Section 2.3” and “This lack of correspondence implies that the intuitive and neural network definitions of difficulty do not align. Nevertheless, a more detailed study is required to answer this question definitively as curriculum learning research that suggests the opposite (Bengio et al., 2009) also exists.” It is unclear to me what the authors expected to find here. 
Moreover, the absence of major differences across the chosen datasets does not immediately make me doubt or question results from curriculum learning. My skepticism is again enhanced by the fact that the authors have taken a subset of the data for the more complex celebA and STL-10 datasets. Dataset size seems like a crucial part of dataset complexity. + +An interesting smaller finding is that linear separability of the latent codes for classification is better on the test set for the pokemon dataset, even though the training and test reconstruction losses showed signs of overfitting. The authors hypothesize that overfitting might occur more in the decoder than in the encoder. + +Additional feedback to improve the paper (not part of decision assessment) +- Section 2.3.1: what was the original resolution of the pokemon dataset? +- Section 3.1, c_j is used as the compression level, but in eq 3 and 4 c^i is also used to indicate the number of channels. - For clarity please use n_ci in eq 2 and 3 or choose a different all together for the compression level. +- Please increase the font size of the plots in figures 1, 3 and 4. +",3,,ICLR2020 +6FJuWMBxSp,4,MLSvqIHRidA,MLSvqIHRidA,A new theoretical understanding of contrastive divergence as adversarial training,"# General statements +This paper has a special flavour, in the sense that it provides new light on a very established training method for energy-based models: contrastive divergence. Its core contribution is to provide a theoretically grounded understanding of CD as it is widely used, avoiding the common assumption that this algorithm stems out of a simplifying assumption. + +This is done through a connection between CD and adversarial training. On their way, the authors show how some minor corrections suggested by their interepretation may dramatically improve performance of CD, at least on their toy example. + +Since CD is a widely accepted method, I feel that the deliberate choice of restricting their experiments on toy data is legitimate. + + +All in all, I would say that the paper is a very nice read, and its english usage is good, as well as the references that are appropriate. +I think that it is appropriate for presentation at ICLR, since it may stimulate new research on CD. + + +# Detailed comments +Below are some minor comments in chronological order +## Introduction +* ""Thus, Our"": uppercase + +## Toy example +* In figure 4, you probably mean ""from left to right"" +* To be extra sure, are you effectively disabling gradient recording when computing \tilde{x} as I assume you do ? I'm asking because \tilde{x} actually appears as a function of x, parameterized by \theta, i.e. as \tilde{x}_\theta(x), since it involves the transition kernel q_theta for its computation. As you write below eq. (17), you are considering that the kernel q as kept fixed, explaining such a choice. +however, and if I'm not mistaken, it should not be too difficult with autograd mechanics to include this dependency in the updates. Did you try it ? Did it break the algorithm ? +* I would appreciate more steps in your derivations (22) and (24): I don't follow easily the transitions to lines 2 and 4 of each. +* The neural net used for the toy data looks impressively large (8 layers of FC+leakyReLU with 512 hidden size). Was it really necessary ?",8,4.0,ICLR2021 +rm1bVAYFuPs,2,AMoDLAx6GCC,AMoDLAx6GCC,The paper needs more work,"This paper describes a method to generate symmetric and asymmetric uncertainty estimates. 
The method is proposed to work for the non-stationarity processes found in real-world applications. The paper introduces a meta-modelling concept as an approach to achieve high-quality uncertainty quantification in deep neural networks for sequential regression tasks. The paper also introduces metrics for evaluating the proposed approach. A proposed meta-modelling approach is related to the work of Chen et al. (2019) which is mainly used for classification task in a non-sequential setting, however, the proposed method is mainly for the sequential setting. + +The paper has interesting explorations for handling uncertainty and its evaluation. However, the paper needs more work for clarity and rigorous analysis. For example, equation 2 is not clear to me and needs explanation and/or related citations. For instance, why even function reflects symmetric uncertainty and how separating network nodes produces lower and upper band estimates to accomplish asymmetric prediction. + +The evaluation metrics were introduced (Eq: 3,4,5,6) to measure the effectiveness of the proposed method. Is there any existing evaluation metric for symmetric and asymmetric uncertainty estimates that can be used for measuring the effectiveness of the proposed method? + +There are two proposals in this paper: one is a meta-modelling concept for uncertainty measurement and another is the evaluation metrics to evaluate the uncertainty. It will be better to separate them, the first part needs to be evaluated with exiting uncertainty evaluation metrics with some well-known benchmark datasets. This part is to measure the effectiveness of the meta-modelling concept using the existing metric. The second part can be for proposing new evaluation metric and justify the reason behind the proposed the new metrics. + +Two real datasets was used for experimentation which is a good idea for evaluating on real applications. For the purpose of evaluating the proposed systems, it is also necessary to utilize some benchmark datasets from the literature. + +Minor comment: In figure 1, change the subscript N to M for consistency with the text. +",5,3.0,ICLR2021 +sfZV3fxO9xU,1,q_Q9MMGwSQu,q_Q9MMGwSQu,"The observation is interesting, but more experiments are required","- Summary: +This paper shows that introducing an abstention class for out-of-distribution (OOD) works well for detecting it when the in-distribution dataset is CIFAR and TinyImageNet is available during training as an OOD dataset. + +- Reasons for score: +1. The proposed setting with a large OOD dataset has already been proposed by [Hendrycks et al.], and the proposed method has been experimented in [Lee et al. (a)] and [Dhamija et al.], so the technical novelty of this work is limited. +2. The experiments are not thoroughly conducted. The only value I can find in this paper is the empirical observation in a limited condition. More specifically, this paper found that adding an abstention class is better than prior methods when the in-distribution dataset is CIFAR and TinyImageNet is available during training as an OOD dataset. I am not sure this observation is consistent in other settings, so I recommend to conduct more thorough experiments, as done in [Hendrycks et al.]. Again, [Hendrycks et al.] considered a similar setting, but they proved the effectiveness of their method in image and natural language domains, with 7 in-distribution datasets and 3 large OOD datasets. 
However, even with more experiments, I am not sure this work is significant enough for publication in ICLR, because of the lack of novelty in both the experimental scenario and method. +3. The comparison is unfair. Performances of prior methods are borrowed from original works, but they are mostly experimented in settings different from this paper. In particular, some works like [Lee et al. (b)] and [Hsu et al.] considered training/validating without OOD data, because it is hard to assume to have a large OOD dataset covering all possible OOD in practice. + +- Minor Comments: +4. Citation format issue: you can use \citet for noun and \citep for adverb. + +[Lee et al. (a)] Training confidence-calibrated classifiers for detecting out-of-distribution samples. In ICLR, 2018. + +[Lee et al. (b)] A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In NeurIPS, 2018. + +[Dhamija et al.] Reducing network agnostophobia. In NeurIPS, 2018. + +[Hendrycks et al.] Deep anomaly detection with outlier exposure. In ICLR, 2019. + +[Hsu et al.] Generalized ODIN: Detecting Out-of-distribution Image without Learning from Out-of-distribution Data. In CVPR, 2020. + +**After rebuttal** + +I'd like to thank authors for their efforts to address my concerns. I didn't change my initial rating, due to the two main concerns below: + +(1) To me, the main argument of this paper sounds ""when a large (and maybe diverse) OOD is given, adding an OOD class to the classifier is better than baselines."" Since the large OOD setting has already been proposed by [Hendrycks et al.], the only contribution of this work is on the empirical observation that the proposed method is better than baselines. While the observation is interesting, I think the contribution is not enough as a full ICLR paper at this point. + +During the rebuttal period, R3 corrected it that ""the main question investigated by the paper is how to best use the outlier exposure set,"" and this sounds better. However, authors didn't emphasize the setting but their method, such that their main argument is (if they intended to say as like what R3 understood) misleading. Training with a large OOD dataset like [Hendrycks et al.] is not common, and the observation in this paper is limited to this setting. However, the only statement about the setting I could find in the intro is that ""as in Hendrycks et al. (2018), we uses additional samples of real images and text from non-overlapping categories to train the model to abstain, ..."" i.e., rather than elaborating/emphasizing the setting (together with their method), they just cited a prior work. + +In short, I recommend authors to rewrite abstract/intro as suggested by R3, to properly emphasize their contribution. + +(2) The comparison is unfair, as authors didn't re-evaluate baselines in the same setting (they had to make it the same as much as possible) but just pasted numbers from original papers. Even the comparison with the closest prior work [Hendrycks et al.] is unfair, as the prior work fine-tuned the model while the proposed method trained the model from scratch. + +Regarding the performance of similar methods evaluated in [Lee et al. (a)] and [Dhamija et al.], I think the main reason why they didn't get the same observation is on the size of OOD dataset, i.e., they didn't train their model with a large OOD dataset like [Hendrycks et al.] or this work. 
As this work claims, it might be true that when a large OOD dataset is available, adding an OOD class to the classifier is simply good enough.",4,4.0,ICLR2021
+n4WD6iv-6hr,3,V8jrrnwGbuc,V8jrrnwGbuc,Nice contribution in understanding generalization and memorization of deep neural networks,"The paper empirically studies the reason for the phenomenon that deep neural networks can memorize the data labels, even when the labels are randomly generated.
New geometric measures by replica mean-field theory are applied in the analysis. + +The findings of the paper are interesting. It shows the heterogeneity in layers and training stage of the neural net: + +i) Memorization occurs in deeper layers; rewinding the final layer to the early weights mitigates memorization. + +ii) When memorization happens, the early layer still learn representations that can generalize. + +iii) In the training, early activations stabilize first, and deeper layers weights stabilize first. + +iv) Near initialization, the gradient is dominated by unpermuted examples. + +I have the following questions/comments: + +- It is better to further explain the intuition of the Manifold Geometry Metrics. The current Figure 1(B) is not very clear. + +- In Manifold Capacity, what do P and N exactly mean? Is this P the number of classes as used elsewhere? + +- The paper explains that by training on permuted examples, the network can learn generalizable representations at the initial training stage because the gradient ignores permuted examples. But why in the later training stage, the early layers and later layers show different generalization properties? + +In general, this paper carries well-organized experiments. One shortcoming is that the paper does not provide a methodology to solve the generalization problem or further theoretical analysis of the observations. But the empirical discoveries are novel and can be beneficial to the deep learning community. + +########### + +Updates: Thanks for the authors' response. The modified version improves clarity. I think this paper provides nice observations and initial analysis to the community and can be beneficial to future work, so I recommend this paper to be accepted.",7,3.0,ICLR2021 +rJRJeMoxz,3,H1I3M7Z0b,H1I3M7Z0b,Review,"This paper presents a method for reducing the number of parameters of neural networks by sharing the set of weights in a sliding window manner, and replicating the channels, and finally by quantising weights. The paper is clearly written and results seem compelling but on a pretty restricted domain which is not well known. This could have significance if it applies more generally. + +Why does it work so well? Is this just because it acts on audio and these filters are phase shifted? +What happens with 2D convnets on more established datasets and with more established baselines? +Would be interesting to get wall clock speed ups for this method? + +Overall I think this paper lacks the breadth of experiments, and to really understand the significance of this work more experiments in more established domains should be performed. + +Other points: +- You are missing a related citation ""Speeding up Convolutional Neural Networks with Low Rank Expansions"" Jaderberg et al 2014 +- Eqn 2 should be m=m* x C +- Use \citep rather than \cite",5,5.0,ICLR2018 +BklXppaonm,3,HkGmDsR9YQ,HkGmDsR9YQ,Not unknown but nice systematic exploration,"I totally disagree with the authors that any of their observations are surprising. Indeed the fact that an RL agent does not generalizes to small modifications of the task (either visual or in the dynamics) is well known. If the agent should generalize though is a different question. And I do not mean this in the sense that it is an undesirable property but rather if it is outside of what “learning one task” means. Particularly I feel this is a very pessimistic view of RL and potentially not even in-line with what happens in supervised learning. 
+ +I think one mantra of deep learning (and deep RL needs to obey by it) is that one should test in the same setting as training for things to truly work. For supervised, there is a distribution of data, and the test set are samples from the same distribution. However the testing scenario used here is slightly different. During training, if I do not see car accelerating, I think it makes no sense to expect to generalize to a new game that has this property as it is out-of-distribution. Of course it would be ideal if it could do that. And to clarify, while for us some of these extensions seem very similar and minimal changes, hence it should generalize to rather than transfer to, this is just the effect of imposing our own biases on the learning process. Deep Nets do not learn like we do, and in their universe they have never seen a car accelerating -- it makes sense that it might not to be able to generalize to it. Again, I’m not arguing that we don’t want this, but rather if we should expect it as part of what the system should normally generalize to. + +To that end I think this paper enters in that unresolved dispute of what generalization should be versus what is transfer. At what point do we have truly a new task vs a sample from the same task. I don’t think there is an answer. + +Going back to the observations in this work. I think the fact that the environment is not stochastic reinforces this overfitting (as in the extreme you end up with a policy that just repeats the optimal sequence of actions). I think in this particular case I can see how finetuning to a variation of the task fails. However true stochasticity in the environment (e.g. having a distribution of variations) like is done in Distral paper (where each episode is a different layout) can behave as a regularizer that will mitigate a bit the overfitting. That is to say that I believe the observed behaviour will be less pronounced in complex stochastic setting. + +Nevertheless the paper seems to highlight an important observation (and back it up with empirical evidence), namely we should use more regularization like L2 or otherwise in practice. Which is mostly absent from publications. And I think this on its own is valuable. +",6,3.0,ICLR2019 +SyxONoAPpX,3,HyGh4sR9YQ,HyGh4sR9YQ,"Interesting exploration, but lacks needed rigor.","This paper demonstrates that Genetic Algorithms can be used to train deep neural policies for Atari, locomotion, and an image-based maze task. It's interesting that GAs can operate in the very high-dimensional parameter space of DNNs. Results show that on the set of Atari games, GAs perform roughly as well as other ES/DeepRL algorithms. + +In general, it's a bit hard to tell what the contribution of this paper is - as an emperical study of GA's applied to RL problems, the results raise questions: + +1) Why only 13 Atari games? Since the algorithm only takes 1-4 hours to run it should only take a few days to collect results on all 57 games? + +2) Why not examine a standard GA which includes the crossover operator? Do the results change with crossover? + +3) The authors miss relevant related work such as ""A Neuroevolution Approach to General Atari Game Playing"" by Hausknecht et al., which examines how neuroevolution (GA based optimization which modifies network topology in addition to parameters) can be used to learn policies for Atari, also scaling to million-parameter networks. This work already showed that GA-based optimization is highly applicable to Atari games. 
+ +4) Is there actual neuroevolution going on? The title seems to imply there so, but from my reading of the paper - it seems to be a straightforward GA (minus crossover) being applied to weight values without changes to network topology. + +I think this paper could be strengthened by providing more insight into 1) Given that it's already been shown that random search can be competitive to RL in several Mujoco tasks (see ""Simple random search provides a competitive approach to reinforcement learning"") I think it's important to understand why and in what scenarios GAs are preferable to RL and to ES given similar performance between the various methods. 2) Analysis as to whether Atari games in particular are amenable to gradient-free optimization or if GA's are equally applicable to the full range or RL environments? +",4,4.0,ICLR2019 +Syl8NXg6tH,2,rkxawlHKDr,rkxawlHKDr,Official Blind Review #2,"This paper investigates an image segmentation technique that learns to evolve an active contour, constraining the segmentation prediction to be a polygon (with a predetermined number of vertices). The advantage of active contour methods is that some shapes (such as buildings) can naturally be represented as closed polygons, and learning to predict this representation can improve over pixelwise segmentation. + +The authors propose to learn an image-level displacement field to evolve the contour, and a neural mesh renderer to render the resulting mask for comparison with the ground truth mask. The performance compared to prior learning-based active contour methods is impressive. + +In section 4.3, there’s a reference to a “gap in performance” between the proposed method and DARNet and a reference to a ""low number of vertices,"" but a comparison between the two methods as the numbers of vertices is varied seems to only be present in Fig. 6 -- it would be interesting to see an explanation of the discrepancy for the lower number of vertices seen in this figure. + +Overall, due to the relative simplicity of the approach and impressive performance compared to prior learning-based approaches I recommend to accept. + +Post-rebuttal: I maintain my recommendation.",8,,ICLR2020 +BJgQvDCcn7,1,rJlpUiAcYX,rJlpUiAcYX,"The paper is well written and is, in my opinion, a good contribution to the literature. ","The authors introduce a novel distance function between point sets, based on the ""permutation invariance"" of the zeros of a polynomial, calling it ""holographic"" distance, as it essentially depends on all the points of the sets being compared. They also consider two other permutation invariant distances, and apply these in an end-to-end object detection task. These distance functions have time-complexity O(N^2) unlike the previously proposed ""Hungarian distance"" based on the Hungarian algorithm which is O(N^3) in general. Moreover, they authors show that in two dimensions all local minima of the holographic loss are global minima. + +Pros: The paper is well written, the ideas are clearly and succinctly presented. Exploiting the connection between 2D point sets and zeros of polynomials is an interesting idea. + +Cons: The experimental section could be better. For example, the authors could do simple experiments to show how an optimization algorithm would explore the holographic loss surface (in terms of hitting global/local minima) in dimensions greater than two. Also, in the object detection example, no comparison is given with the Hungarian loss based algorithm of Stewart et al. 
(2016) (at the very least, the authors could train their neural nets using the Hungarian loss, choosing one optimal permutation at the ""transitioning points"") .",7,3.0,ICLR2019 +Bygxx5IpFH,2,Skgaia4tDH,Skgaia4tDH,Official Blind Review #1,"Authors of this paper propose to utilize the embedding model for the other local structure as a weak form of supervision based on the insight that the real-world datasets often shares some structural similarity between each neighborhood. The Local VAE is proposed to have the different model parameters for each local subset and train these local parameters by the gradient-based meta-learning. + +Local VAE incorporates local information by using prior distributions of local parameters in VAE. The overall model performs probabilistic inference via the conditional distribution from the meta parameters. There are several concerns: +1. In section 2, authors discussed LLE. It is unclear the purpose of the section 2.1? +2. LLE does not require W is nonnegative only, and \sum_j W_{i,j}=0 is also contradictory with the nonnegative assumption. +3. Authors claimed that Local VAE algorithm corresponds to the assumption that the dataset approximately lies on multiple subsets and each subset is generated from different parameters. It is unclear what is the connection of the Local VAE to multi-scale structures of the datasets. +4. Authors evaluated neighbors by sampling and k-nearest neighbors on latent space. It is unclear why not use the common k-nearest neighbors on the input data. K-nearest search should not be a computational problem for large datasets by using fast approach. + +As the motivations of this paper, existing methods require massive amount of data and extended training time. However, authors did not demonstrate these points by comparing the proposed method with existing methods. +",3,,ICLR2020 +Uvxbqhu7Tbg,1,#NAME?,#NAME?,A Theory of Self-Supervised Framework for Few-Shot Learning,"The paper establishes a relationship between self-supervised learning (SSL) and supervised few-shot learning (FSL) method and shows that when both are equivalent. The whole analysis and proof are based upon the two main assumptions: mean classifier and balanced class training data. The paper shows that if we have a too large number of classes in the SSL, then it is equivalent to the supervised learning scenario and model enjoy the same generalization ability. Always supervised loss is the upper bound by the SSL loss. + +Comment: +1: The paper theoretically connects the SSL and FSL and shows when both will be equivalent. Theorem-1 shows that the supervised loss is upper bound by SSL loss by a linear relation (mostly scale+shift) when |C|-->infinity then both loss is equivalent. It seems that Theorem-1 is trivial since it is obvious that for the large class there will be very less chance of the negative pair is incorrect (i.e. false negative). If all the negative pair is correct, then it is same as we know the class label and we make the negative pair using the class information of all samples. I believe this theorem provides less useful information for a practical perspective. + +2: Theorem 2 provides the underlying factor between the L_sup and L_U, and shows that L_sup loss is upper bound by the loss of the true-negative and the intraclass variance. For the small variance, we can reduce the gap between the supervised loss and SSL loss. Once a trivial solution is when |C|--> infinity. 
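To make the role of |C| concrete: under the assumption of balanced classes with negatives drawn uniformly at random (my own simplification, which may not match the paper's exact sampling scheme), the expected fraction of false negatives among the sampled negatives is roughly 1/|C|, independent of how many negatives are drawn. A quick numerical check:

```python
import random

def false_negative_fraction(num_classes, num_negatives, num_trials=2000):
    # Fraction of sampled negatives that secretly share the anchor's class,
    # assuming balanced classes and uniform sampling.
    hits = 0
    for _ in range(num_trials):
        anchor = random.randrange(num_classes)
        hits += sum(random.randrange(num_classes) == anchor
                    for _ in range(num_negatives))
    return hits / (num_trials * num_negatives)

for C in (10, 100, 1000):
    for N in (64, 1024):
        print(C, N, round(false_negative_fraction(C, N), 4))
```

The estimate stays near 1/C whether N is 64 or 1024, which is why the effect of batch size (raised in the next comment) needs a separate explanation.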
This theorem shows then when |C| is not large still we can still focus on reducing the intraclass variance and reduce the gap. + +3: It is clear that if we have large number of class, we can reduce the gap between the supervised loss and self-supervised loss, but why the large batch size help in to get a practically better result? In this case, the probability of the false-negative samples is the same, and it does not depend on the batch size. Could you please explain that? It is written that ""We can increase N by increasing the total negative samples N_k"", is true but in the total negative samples the probability of the false-negative will be same, and it depends on the number of class only. Then how large batch size help? + +4: In the N-way and M-shot, it is intuitive that when M increase the model performance will increase, but why with the increase of the N model performance will increase? + +5: Omniglot dataset has 1623 classes, while in the paper it is written that ""Omniglot involves up to 4800 classes"" please check that. +https://github.com/brendenlake/omniglot",4,3.0,ICLR2021 +s6F3cyu449,2,sfy1DGc54-M,sfy1DGc54-M,Study of a threat model that decomposes an image into foreground and background with results in-line with expectations,"Summary +======= +The paper studies threat models for images which take into account the foreground and background within the image. Specifically, they use DeepGaze II to identify pixels as either foreground or background, and use an Lp threat model with a larger radius in the background and a smaller radius in the foreground. The results are largely what one would expect: defenses for different threat models perform poorly against this attack, in comparison to adversarial training on adversarial examples generated by the attack. These attacks have good foreground score, though this is according to the same DeepGaze II model used to generate the examples to begin with. + +Overall, my impression is lukewarm. Although there are certainly some aspects which could be improved, there does not appear to be anything glaringly incorrect and the results are in line with expectations. The rest of this review is separated into comments which I would be happy to discuss with the authors, and other minor aspects that the authors can take into account in their revision. + +On a general note, I'm not so sure if ""unsuspicious"" is the right qualifier for the work in this paper. The most relevant work to this paper is that by Xiao et al., however the authors simply give a very brief statement saying that Xiao et al. do not ensure that their attack is unsuspicious. Suspicious is very subjective: a human may very well look at the adversarial example Figure 2 with the corrupted background and think that this looks suspicious, and it's not at all obvious to me why this is any more or less suspicious than the discrete background changes studied by Xiao et al. To be clear, this is not a point of contention (as in this was not incorporated this into my score), but perhaps the authors can take this into account. + +Comments for discussion +======================= +1. The datasets considered are quite small (STL-10 and ImageNet-10 are 5k training, 100 test, and Segment-6 is 18k train, 1.2k test), all of which are smaller than even CIFAR10. Is this a limitation of the the DeepGaze model or some other aspect of the approach? + +2. In section 3.2, the paper describes how the projection for the PGD attack is a computationally challenging problem since it is not an Lp ball. 
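As a concrete illustration of the point made next, here is a minimal sketch of the projection for an L-infinity version of the threat model (my own code with hypothetical names and radii, not the paper's implementation): it reduces to two clipping operations merged by the foreground mask.

```python
import numpy as np

def project_dual_linf(delta, fg_mask, eps_fg, eps_bg):
    # Foreground and background pixels are disjoint, so projecting onto
    # {|d| <= eps_fg on foreground, |d| <= eps_bg on background} decomposes
    # into independent per-region clipping steps.
    clipped_fg = np.clip(delta, -eps_fg, eps_fg)
    clipped_bg = np.clip(delta, -eps_bg, eps_bg)
    return np.where(fg_mask, clipped_fg, clipped_bg)

delta = np.random.randn(3, 32, 32)
fg_mask = np.zeros((3, 32, 32), dtype=bool)
fg_mask[:, 8:24, 8:24] = True  # hypothetical foreground region
projected = project_dual_linf(delta, fg_mask, eps_fg=2/255, eps_bg=16/255)
```

For an L2 threat model the per-region step would be a norm rescaling instead of a clip, but the decomposition argument is the same.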
However, I believe this setting is not any more difficult than the standard PGD attack: since the perturbation is separated into disjoint foreground and background pixels, the projection is exactly equivalent to the standard projection over each set. Upon looking at the algorithm in the Appendix, this is exactly what is done by the authors. The resulting dual-perturbation attack is consequently a natural and straightforward application of PGD to the proposed threat model. This is not a negative aspect; however the text implies that this is splitting and merging is somehow substantially different from PGD when it isn't. Is my understanding here correct? + +3. The presentation of the claims of improvement in standard adversarial robust accuracy are somewhat misleading. For example, in Figure 4 the paper shows improved standard robust accuracy past epsilon=2 for the dual training, and the authors attribute this in the text at the end of section 5.3 to their defense, which trains on larger epsilon in the background. This is a misleading comparison because the baseline L2 adversarially trained model is only trained to be robust up to epsilon=2. In order to meaningfully conclude the final sentence of the section, i.e. that AT-Dual can achieve 20% more accuracy at a larger threshold of epsilon=3, it would only be fair to compare to a baseline which has been trained to be adversarially robust up to the same threshold that is being compared, otherwise the new approach will work better simply because the baseline is handicapped by the radius used at training and not necessarily because splitting foreground and backgrounds is any better. + +Minor comments +============== ++ In section 2.2, the paper cites a couple of papers for adversarial training as an empirical defense. Among these, Cohen et al 2019 is listed. However, this is not adversarial training; Cohen et al 2019 is a certified defense with guarantees, and is quite distinct from the adversarial training paradigm described in the text. + ++ Given the closeness in setting of Xiao et al. 2020 to the work done in this paper, there should be an additional sentence or two in the related work distinguishing the work in this paper from Xiao et al. 2020 other than mentioning this arbitrary concept of ""unsupiciousness"" of their attacks. In my understanding, the specifics are actually quite different (i.e. the threat model and how to get the backgrounds), but this is not at all indicated in the text which does not give a very meaningful description beyond ""suspiciousness"" of how these works are actually different. + +Update +====== +I have read the author response, which largely does not change my score. I would kindly point out to the authors that the ""the technical challenge is that the feasible region is not an ball, and computing the projection is challenging in high-dimensional settings"" is in fact not very challenging at all. As mentioned in my initial review, the ""split and merge"" is exactly the standard projection operator on the proposed set and is not a ""new heuristic"", and so it is exactly PGD and not an adaptation of PGD. + +I also understand the other reviewers concerns on suspiciousness, which we all brought up. This likely needs to be thought about and presented more carefully, for example by posing it more formally if the authors insist on this framing. 
",6,4.0,ICLR2021 +b1Hy1zNAOmS,4,AWOSz_mMAPx,AWOSz_mMAPx,Local analysis of Gradient Descent-Ascent with a finite timescale separation,"The paper studies  the local asymptotic stability of a specific class of solutions points, referred to as strict local minmax equilibria (or differential Stackelberg equilibria), in the case of Gradient Descent-Ascent Dynamics with a finite time-scale separation. The time-scale separation (\tau) is being captured by the ratio of the step-sizes between the min and max agents respectively. Recently, Jin et al. showed the set of asymptotically stable critical points of gradient descent-ascent coincide with the set of differential Stackelberg equilibrium as the time separation goes to infinity. The paper shows that an infinitely large separation is not needed and some finite but large enough separation suffices.  The paper provides a close analogue of another previous result by Mescheder about local stability of gradient descent dynamics in GANs under strong technical assumptions. The paper ends with GAN experiments where \tau=1, 2, 4, 8 are tested and the performance seems to peak at 4. + +The paper performs a detailed theoretical analysis of the coupling between GDA and diff Stackelberg equilibria. Although this is positive, the results are not particularly surprising given the prior work. The writing of the paper could also be significantly improved. + +One issue that I had reading the paper is that at times and especially in the introduction the  +treatment of (asymptotically stable), stable, unstable fixed points seem to be a little ambiguous.The paper only formally defines locally exponentially stable equilibrium in the preliminaries which is a notion that is not used in the introduction. I think it is important to set early on a clear terminology that is consistent throughout the whole paper.  +I am also a bit confused about some statements in the paper about which type of solution concepts are game theoretically meaningful. The paper seems to state that any critical point that does not satisfy the definitions of differential Stackelberg equilibria lack game theoretic meaning.  From the paper + +the stable critical points of gradient descent-ascent coincide with the set of differential Stackelberg equilibrium as \tau goes to infinity. All ‘bad critical points’ (critical points lacking game-theoretic meaning) become unstable and all ‘good critical points’ (game-theoretically meaningful equilibria) remain or become stable as \tau goes to infinity. + +This seems like a strong statement. It seems to me that min-max solutions of bilinear zero-sum games does not satisfy the differential Stackelberg definition. Such statements would imply that min-max solutions are 1) bad critical points and 2) lack game theoretic meaning despite being the golden standard of a solution concept in game theory.  +Maybe I am missing something here?  + +[1] has recently shown that alternating GDA with fixed time-separation does not converge in the case of bilinear zero-sum games but is instead recurrent with the min-max equilibrium being stable but not asymptotically stable. This seems to be exactly the setting that you are studying. How are the results of [1] connected to yours? I think that due to the tight match between the two settings a thorough discussion is needed.  
+ +The definitions of differential/strict Nash/Stackelberg equilibria are two of several alternative definition/solution concepts that have only been recently introduced in the context of non-convex non-concave games. The paper should compare and contrast to other notions ideas  (e.g. proximal equilibria are only mentioned briefly [2], see also [3], [4]). + + Although one should of course not expect a global convergence as such a result would be too ambitious, the title could be interpreted as such a result by non-specialists. I think it might be better if the term local analysis is used instead. A more thorough discussion about non-convergence results for GDA and variants in zero-sum games could be also helpful [1,3,5-8] to dispel any possible confusion. + +In terms of the experimental results why do the simulations stop with \tau=8? The theoretical results as well as the prior work by Jin et. al are supportive of arbitrary large \tau. What happens e.g. for tau= 2^4, 2^8, .... It already seems that performance starts dropping for \tau>4. Does this trend continue? Does the performance have a unique peak? or does it fluctuate? +The regularization results (Theorem 3) remain true even for \tau<<1 e.g. \tau = 2^{-4}, 2^{-8}, ... +What would experiments show for such \tau under regularization? + + +[1] Bailey et al. Finite Regret and Cycles with Fixed Step-Sizevia Alternating Gradient Descent-Ascent. COLT 2020 +[2] Farnia et. al Do GANs always have Nash equilibria? ICML 2020 +[3] Vlatakis-Gkaragkounis et al. Poincaré Recurrence, Cycles and Spurious Equilibria in Gradient-Descent-Ascent for Non-Convex Non-Concave Zero-Sum Games. NeurIPS 2019. +[4] Zhang, et al. Optimality and Stability in Non-Convex-NonConcave Min-Max Optimization. arXiv e-prints, art. arXiv:2002.11875, February 2020. +[5] Mertikopoulos et al. ""Cycles in adversarial regularized learning."" Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2018. +[6] Cheung et al. ""Vortices instead of equilibria in minmax optimization: Chaos and butterfly effects of online learning in zero-sum games."" arXiv preprint arXiv:1905.08396 (2019). +[7] Letcher ""On the Impossibility of Global Convergence in Multi-Loss Optimization."" arXiv preprint arXiv:2005.12649 (2020). +[8] Hsieh, et. al. ""The limits of min-max optimization algorithms: convergence to spurious non-critical sets."" arXiv preprint arXiv:2006.09065 (2020).",6,4.0,ICLR2021 +r1xXqpp6n7,2,BJfguoAcFm,BJfguoAcFm,interesting paper but somewhat unclear,"The paper considers a solution to a statistical association problem. The proposed solution involves a decomposition they call a kolmogorov model (what sort is not justified in any way and confused me a lot). The decomposition has two parts 1) a discrete basis function that needs to be discovered and 2) a discrete distribution over the basis elements. The define an optimization problem (2) which has a data term and some binary and simplex constraints and they propose a relaxation and decomposition of this optimization problem. They go on to claim that (mutual) causal relations can be then inferred by inspecting the representations they have learnt but they give little details on how and what impacts this distinction has in practice. This may be obvious to a subfield expert but it is not clear to me at all. The paper is locally consistent but I have trouble understanding the contribution and placing in the broad machine learning field. 
+ +I am not an expert in causality so I cannot evaluate the contribution but I can say that what interests me are section 2.2 and sec. 4-5. And they both require a lot better writing. 2.2 made things much more intuitive but i fail to see how the indicator variable annotations (action, scifi, etc.) can possibly come out of the data. I think this is an important point to support the interpretability claim. As for 4 I think there is room for intuition building there as well as limitations (e.g. what sort of inferences can be made and not etc.) Finally for 5 i find that very interesting but i find it difficult to have the right intuition about what the support condition means and how that helps in a practical setting. + +pros: +- causality and interpretability are major directions of research +- seems like a valid contribution on an interesting problem +cons: +- the highlevel picture is relatively clear but i find important things very difficult to grasp +- the kolmogorov model definition i find confusing but i am not an expert in causality (the introduction should give some intuition about what that is and why it is a good idea). +- find it very hard to have a coherent picture of the limitations and assess the contributions of the paper.",5,2.0,ICLR2019 +rkdmp2J-f,3,BJuWrGW0Z,BJuWrGW0Z,"Interesting application, but lacks clarity","This paper considers the task of learning program embeddings with neural networks with the ultimate goal of bug detection program repair in the context of students learning to program. Three NN architectures are explored, which leverage program semantics rather than pure syntax. The approach is validated using programming assignments from an online course, and compared against syntax based approaches as a baseline. + +The problem considered by the paper is interesting, though it's not clear from the paper that the approach is a substantial improvement over previous work. This is in part due to the fact that the paper is relatively short, and would benefit from more detail. I noticed the following issues: + +1) The learning task is based on error patterns, but it's not clear to me what exactly that means from a software development standpoint. +2) Terms used in the paper are not defined/explained. For example, I assume GRU is gated recurrent unit, but this isn't stated. +3) Treatment of related work is lacking. For example, the Cai et al. paper from ICLR 2017 is not considered +4) If I understand dependency reinforcement embedding correctly, a RNN is trained for every trace. If so, is this scalable? + +I believe the work is very promising, but this manuscript should be improved prior to publication.",6,2.0,ICLR2018 +HyxrliSJcS,3,SkxoqRNKwr,SkxoqRNKwr,Official Blind Review #1,"Overall, this paper provides valuable insight into the trade-off between privacy preservation and utility when training a representation to fight against attribute inference attack. + +Detail comments: + +Strength: ++ The whole paper is well organized with logic. Notations are well defined and distinguished. ++ The final results have a good intuitive explanation. ++ The most impressive part of this paper is the analysis of the trade-off between privacy and utility from which the upper bound is quantified. + +Weakness: ++ The minimax method looks trivial. The difficulty of using such an objective should be emphasized for practical implementation. ++ The major weakness is the experiments. The experiments only on two datasets may not be convincing for me. 
And the repetition times, 5 or 3, for each dataset are pretty small. Considering the experiments are conducted with random noise, e.g., DP methods, such a small repetition time is not fair since there a large chance these results could be selected. ++ Which DP Laplacian mechanism is used is not specified. Since there are already many improvements on the DP Laplacian mechanism, e.g., [A], it is necessary to make sure the baseline should be state-of-the-art for a fair comparison. ++ The result is not intuitively surprising that the privacy loss and utility loss will be balanced toward an upper bound (Theorem 4.2). I am not sure how the Jensen-Shannon entropy between D^Y_0 and D^Y_1 can be calculated in practice since the true conditional distribution is not observed. For example, when the data distribution is heavily biased, then the conditional distribution might show less correlation between Y and A. Then, the privacy protection will be a pretty simple task with a very high upper bound. Based on this, it is worth to ask what the upper bound looks like in the dataset used in the experiments. ++ How efficient this algorithm could be? In comparison to other baselines, does this method provide a more efficient solution? + +[A] Phan, N., Wu, X., Hu, H., & Dou, D. (2017). Adaptive Laplace Mechanism: Differential Privacy Preservation in Deep Learning. 2017 IEEE International Conference on Data Mining (ICDM), 385–394. https://doi.org/10.1109/ICDM.2017.48",3,,ICLR2020 +r1xcwnpq3Q,2,SkVe3iA9Ym,SkVe3iA9Ym,"Interesting work, but need further improvement","This paper presents NMBM, a general inverse reinforcement learning (IRL) model that considers multifaceted human motivations. The authors have motivated and proposed the algorithm (Section 2 and 3), and demonstrated some experiment results based on a real-world dataset (WoWAH, Section 4). + +-- Originality and Quality -- + +To the best of my knowledge, the proposed NMBM algorithm is new. However, I feel that the derivation of this algorithm is relatively straightforward based on existing literature. Specifically, this algorithm is based on (1) Theorem 3 and (2) the linear program defined in equation 9. My understanding is that both Theorem 3 and the derivation of the linear program in equation 9 are relatively straightforward based on existing literature. + +On the other hand, the experiment results in Section 4 are very strong and interesting. It is the main strength of this paper. + +-- Clarity -- + +My understanding is that the writing of Section 3 and 4 can be (and should be) further polished. + +Some key notations in the paper seem to be wrong: + +(1) In Theorem 3, how can the value function v^\pi(s) be in the convex hull of policies? Also, e_i is not a set. + +(2) In equation 9, the linear program, \eta should be another decision variable. + +-- Pros and Cons -- + +Pros: + +1) Strong experiments. + +Cons: + +1) Insufficient novelty for algorithm design. + +2) No performance analysis for the proposed algorithm. 
+ +3) Clarity needs to be further improved.",4,4.0,ICLR2019 +r1xK6GH5YS,1,Skl4mRNYDr,Skl4mRNYDr,Official Blind Review #3,"Summary: +- key problem: expert-like probabilistic online motion planning to reach arbitrary goals without reward shaping thanks to off-line learning from expert demonstrations; +- contributions: 1) an imitative planning procedure via gradient-based log-likelihood maximization leveraging ""imitative models"" q(future states | features), 2) multiple proposals to define flexible goals in this probabilistic framework, 3) a complete implementation for end-to-end navigation in CARLA, 4) an extensive experimental evaluation showcasing the performance, flexibility, interpretability, and robustness of the proposed approach w.r.t. the previous state of the art and several Imitation Learning (IL) and Model-Based Reinforcement Learning (MBRL) baselines. + +Recommendation: weak accept (leaning towards strong accept) + +Key reason 1: principled probabilistic framework bringing the best of both IL and MBRL worlds. +- this planning as inference method is very succinctly and elegantly described in the paper with enough details in appendix (+ code) to suggest a high chance of reproducibility; +- the flexibility of defining different interpretable goals (6 different types explored in the paper) highlights the versatility of the approach; +- the additional benefits in terms of plan reliability estimation (Appendix E) are significant; +- the paper showcases how powerful and useful a good ""imitative model"" can be, therefore, reinforcing the interest of the research community in the important topic of off-line learning from large datasets of demonstrations (without requiring costly on-line data collection). + +Key reason 2: thorough experimental evaluation with convincing results. +- the experimental protocol used is the standard one on CARLA and the results are state of the art; +- the comparison with related works, including recent ones, is thorough and well explained; +- the additional claims regarding robustness are substantiated by multiple experiments. + +Suggested improvements: +- Not needing reward engineering is a major claim of this approach, but it seems that constructing goal likelihoods could be seen as a form of reward engineering, no? Table 3 indeed reports significant performance differences (absolute and relative) depending on how the goals are specified, especially in dynamic environments. As the experiments aim at maximizing the same performance metrics, is there a preferred goal type that works well across all experiments? If not, how is ""goal definition"" different than ""reward engineering""? +- Could the authors please include a variance analysis (using different seeds) in Tables 3 and 4? Previous papers have reported high variance in similar settings (cf., Codevilla et al 2019), and this is a common issue in IL/RL. +- How important is knowing \lambda (traffic light state) perfectly in practice? Can the robustness to noise in \lambda be experimentally assessed? I would also clarify in section 4 and Table 2 that other methods do not use \lambda (the traffic light state), which is a signal very strongly correlated with the ""ran red light"" metric. +- More generally, what is the robustness of this approach to uncertainty / noise in \phi? Although it is typically available (as the authors mentioned) it is never perfect in practice. Can this be handled in a principled probabilistic way as an extension of the current formulation? 
+- The current model does not factor the influence of the agent on its environment (\phi := \phi_{t=0}). Is this framework limited to open loop planning, or does this open interesting future research directions towards closing the loop? It seems to be a key open problem to at least discuss in Section 5. + +Additional Feedback: +- Figure 5 is confusing, not sure it adds much value to the paper; +- typos in Appendix (""pesudocode"", ""baselines that predicts"", ""search search""). + + ## Update post rebuttal + +Thanks to the authors' excellent replies and my initial inclination towards strong accept, I am happy to bump my score to 8. The authors did an excellent job, their rebuttal is on point, not avoiding hard questions, running additional requested experiments (incl. in a clever way for the most computationally expensive ones), and showing clear insights in future steps. Great job!",8,,ICLR2020 +n4WD6iv-6hr,3,qbRv1k2AcH,qbRv1k2AcH,Learning to Reason in Large Theories without Imitation,"In this paper, they tackle the challenge of learning to automated theorem (ATM) proving without any human imitation. Specifically the problem they focus on is premise selection. They find that while a vanilla RL scheme for ATM without any imitation learning of existing proofs frequently gets stuck and is not able to prove very many theorems, when a portion of the premises are selected via a simple term frequency-inverse document frequency (tf-idf) rule, it dramatically increases the fraction of theorems proved — approaching the fraction of theorems proved with imitation learning of existing proofs. + +In general I like this idea. While the approach is simple, the authors convincingly demonstrate that it is effective and it’s certainly interesting to learn that it is effective in this novel and highly important context. The authors also do several ablation experiments to thoroughly evaluate the components of the system. + +A couple of questions — +(1) Would taking into account the similarity of the newly generated sub-goals as per tf-idf (or some other metric) with respect to past sub-goals that had short proofs help? +(2) Is the allowed maximum runtime the same for the systems compared with and the new system?",6,4.0,ICLR2021 +YMGeABT7Fqi,3,ascdLuNQY4J,ascdLuNQY4J,Official Blind Review #3,"Summary: + +The paper introduces Kaleidoscope-operations to reprameterize convolutions. The reparameterization results in a more general search space of convolutions. Moreover, it enables one-level optimization on the convolution search as well as a supernet optimization, which avoids post-search discretization and post-discretization re-training. + + +Pros: + +The paper seems a good reading material to teach readers about a big picture of convolution search problem or even neural architecture search problem. + +The paper also shows a huge ambitious motivation to touch the boundary of the current main-stream NAS methodologies. + +The idea of introducing repratermeterized convolutions (i.e., K-operations that was originally proposed by Dao et al., 2020) to convolution search seems novel and promising to me. + +The evaluation is comprehensively conducted on several novel search spaces over vision and text data, and the results show the effectiveness of the proposed method. When being evaluated on permuted CIFAR and spherical MNIST, the new method shows some superiorities. + + +Cons: + +It seems that the proposed model is designed to merely search for convolutions with a reprameterization approach (i.e, K-operation). 
Compared to regular NAS algorithms like DARTS that search for a much larger architecture space including convolutions, poolings, skip connections, this paper’s search space is merely on reprameterized convolutions. This makes me disappointed as both the title and the beginning parts somehow mislead readers that the paper aims at making a good innovation in the big scope of NAS. However, I finally realize that it actually searches for better convolutions rather than an entire neural architecture, after I went to the last paragraph of Page 5. Moreover, the search merely on convolutions is very likely to result in a much easier neural architecture optimization task, and make the so-called supernet (without post-search discretization & post-discretization retraining) work. This also reminds me that there exists one work [Stamoulis et al., 2019] which shares a similar motivation with this submission. In particular, [Stamoulis et al., 2019] proposes one single-path over-parameterized ConvNet to encode all architectural decisions with convolutional kernel parameters. The overall network loss is directly a function of the “superkernel” weights, where the learnable kernel- and expansion ratio-related parameters can be directly derived as a function of the kernel weights. This strategy also enables one-level optimization and has a potential for supernet optimization as suggested by the submission. + + +[Stamoulis et al., 2019] Single-path NAS: Designing hardware-efficient convnets in less than 4 hours. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2019. + + +In Table 1, I find the best performances on CIFAR-10 and the transferring to CIFAR-100 are still far away from the state of the art. For example, DARTS (Liu et al., 2019) can obtain about 97% and 74% accuracies on CIFAR-10 and CIFAR-100 respectively. Please explain which causes this gap. For a fair comparison, I see the results of the proposed K-op with supernet SGD/SGDR are worse than Conv (fixed operation baselines, offline), while they are better when warm starting with convolution. What if warm-start is also applied to Conv (fixed operation baselines, offline)? + +As presented in the paper, Fig.2 shows that learned K-operations use more global information to extract features. But what is the benefit for deep learning? + + +Table 2 is not self-contained. It should clarify the meaning of CR, MPQA, …, TREC in the caption. I guess they are the 7 used datasets according to the presentation in the main text. Again, I find the proposed method can obtain better results only when using the warm-start strategy, while it generally performs worse than the competitor (i.e., convolution) when training from scratch. + +The last column of Table 4 seems confusing to me whether it corresponds to the case that uses warm start or from scratch. + +In the paragraph of “utterfLeNet: Unpermuting Image Data”, the index of the referred section is missing. + +Overall I think the paper oversells the new idea and the corresponding technology. Thus I tend to suggest a major revision by tuning down its current tone throughout the paper, while I like the idea and really expect the paper can be baked better for publication. + + +",5,4.0,ICLR2021 +HJlEmq1jFS,1,B1elCp4KwH,B1elCp4KwH,Official Blind Review #1,"Overview: + +The paper proposes a method to learn discrete linguistic units in a low-resource setting using speech paired with images (no labels). 
The visual grounding signal is different from other recent work, where a reconstruction objective was used to learn discrete representations in unsupervised neural networks. In contrast to other work, a hierarchy of discretization layers are also considered, and the paper shows that, with appropriate initialization, higher discrete layers capture word-like units while lower layers capture phoneme-like units. + +Strengths: + +The paper is extremely well-written with a clear motivation (Section 1). The approach is novel. But I think the paper's biggest strength is in its very thorough experimental investigation. Their approach is compared to other very recent speech discretization methods on the same data using the same (ABX) evaluation metric. But the work goes further in that it systematically attempts to actually understand what types of structures are captured in the intermediate discrete layers, and it is able to answer this question convincingly. Finally, very good results on standard benchmarks are achieved. + +Weaknesses: + +Although I think the paper is very well-motivated, my first criticism is that discretization itself is not motivated: why is it necessary to have a model with discrete intermediate layers? Does this give us something other than interpretability (which we obtain due to the sparse bottleneck)? In the detailed questions below, I also specifically ask whether, for instance, the downstream speech-image task actually benefits from including discrete layers. + +My second point is that it is unclear why word-like units only appear when the higher-level discrete layers are trained from scratch; as soon as warm-starting is used, the higher level layers capture phoneme-like units (Table 1). Is it possible to answer/speculate why this is the case? + +Overall assessment: + +The paper presents a new approach with a thorough experimental investigation. I therefore assign an ""accept"". The weaknesses above asks for additional motivation and some speculation. + +Questions, suggestions, typos, grammar and style: + +- Section 3.3: It maybe makes less sense for the end-task, but did the authors consider discretization on the image side of the network? This could maybe lead to parts of objects being composed to form larger objects (in analogy to the speech network). +- Section 3.3, par. 3: ""with the intention that they should capture discrete word-like and sub-word-like units"" -> ""with the intention that they should capture discrete *sub-word-like and word-like units*"" (easier to read with first part of sentence) +- Section 3.3: The more standard VQ-VAE adds a commitment loss and a loss for updating the embeddings; was this used or considered at all, or is this all captured through the exponential moving average method? +- Section 3.4: ""with same VQ layers"" -> ""with *the* same VQ layers"" +- Section 3.5: Can you briefly outline the motivation for adding the two losses (so that it is not required to read the previous work). +- Section 4.1: Following from the first weakness listed above, the caption under Figure 2 states that the non-discrete model achieves a speech-image retrieval R@10 of 0.735. This is lower than some of the best scores achieved in Table 1. Can this be taken as evidence that discretization actually improves the downstream task? If so, it would be worth highlighting the point more; if there is some other reason, that would also be worth knowing. 
+- Figure 1: Did the authors ever consider putting discrete layers right at the top of the speech component, just before the pooling layer? Would this more consistently lead to word-like units? +",8,,ICLR2020 +SJgeVwIWqH,1,BJxt60VtPr,BJxt60VtPr,Official Blind Review #1,"This paper studies the problem of visual representation learning from 2.5D video streams by exploring the 2D-3D geometry structures in the 3D visual world. Building upon the previous work GRNN (Tung et al. 2019), this paper introduced a novel view-contrast objective applied to its internal 2D and 3D feature space. To facilitate the 3D view-contrast learning, this paper proposed a novel 2D-3D inverse graphics networks with a 2D-to-3D un-projection encoder, a 2D encoder, a 3D bottlenecked RNNs, an ego-motion stabilization module, and a 3D-to-2D projection module. Compared to previous work (Tung et al. 2019), view-contrastive inverse graphics networks decode in the feature space rather than RGB space. Experimental evaluations are conducted using CARLA simulator (sim) and KITTI dataset (real). Results demonstrate the strengths of the proposed view-contrastive framework in feature learning, 3D moving object detection, and 3D motion estimation. + +Overall, this paper studies an important problem in computer vision with a novel solution using unsupervised feature learning. While the technical novelty is clear, reviewer has several questions regarding the implementation and experimental details. + +(1) For 3D box detection on KITTI (see Table 1), the comparisons to state-of-the-art models are currently missing. While the benefit of unsupervised feature learning has been demonstrated, it would be more convincing to compare against the following papers (at least with a paragraph of discussion). + +(2) The 3D-to-2D projection module seems very expensive. Can you possibly report the training and inference time compared to baselines? Also, the design of the projection module is a bit counter-intuitive as it has 8x8 convolutions. In principle, such projection should be learning-free or with only 1x1 convolutions (aggregation along depth channel). It would be good to consider such ablation studies in the final version. + +-- Multi-view Supervision for Single-view Reconstruction via Differentiable Ray Consistency. Tulsiani et al. In CVPR 2017. +-- Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision. Yan et al. In NIPS 2016. +-- MarrNet: 3D Shape Reconstruction via 2.5D Sketches. Wu et al. In NIPS 2017. + +(3) It seems that the proposed method assumes slow moving background across consecutive frames. In principle, the view-contrastive objective should mask out new pixels in frame T+1. Also, because the view-contrastive loss is applied at feature-level, reviewer would like to know performance on detecting small objects. + +(4) As the latent map update module uses an RNN, it would be good to consider consistency beyond 2 frames (given mask is applied to view-contrastive objective). Curriculum learning could be helpful for further improvements. + +-- Weakly-supervised Disentangling with Recurrent Transformations for 3D View Synthesis. Yang et al. In NIPS 2015. + +(5) How does the proposed method perform when applied to indoor environments? + +(6) Additional ablation study to consider: what if 2D/3D contrastive loss is turned off? 
+ +",6,,ICLR2020 +BJl39ddR37,3,HyGBdo0qFm,HyGBdo0qFm,"Potentially interesting results, very dense and confusing writing","The paper shows Turing completeness of two modern neural architectures, the Transformer and the Neural GPU. The paper is technically very heavy and gives very little insight and intuition behind the results. Right after surveying the previous work the paper starts stacking definitions and theorems without much explanations. + +While technical results are potentially quite strong I believe a major revision to the paper might be necessary in order to clarify the ideas. I would even suggest to split the paper into two, one about each architecture as in the current form it is quite long and difficult to follow. + +Results are claimed to hold without access to external memory, relying just on the network itself to represent the intermediate results of the computation. I am a bit confused by this statement -- what if the problem at hand is, say EXPSPACE-complete? Then the network would have to be of exponential size (or more generally of arbitrary size which is independent of the input). In this case the claim about not using external memory seems to be kind of vacuous as the network itself has unbounded size. The whole point of Turing-completeness is that the program size is independent of the input size so there seems to be some confusion here. +",6,2.0,ICLR2019 +Hkeyp1RFqB,4,SkeXL0NKwH,SkeXL0NKwH,Official Blind Review #4,"This paper proposes a low-rank training method targeting for edge devices. The main contribution is an algorithm called Streaming Kronecker-Sum Approximation. The authors claim that the proposed method addresses four key challenges of low weight update density, weight quantization, low auxiliary memory, and online learning. + +The paper should be rejected because of the following reasons: +(1) The paper is a little hard to follow and the writing can be significantly improved. In particular, the authors introduce four main challenges in section 3. However, I found they are not that accessible and hard to understand. In section 4.4.2, the objective is to get a minimum variance rank-r approximation to the diagonal matrix \Sigma, but I think the authors mix ""m"" up with ""r"". +(2) The novelty of the algorithm is limited. From section 4.1 to 4.5, most discussions are about previously proposed methods. The algorithm proposed by the author (i.e., SKS) only involves some basic manipulations of linear algebra. I don't think it's novel enough to be a new algorithm. +(3) Experimental results are limited. The authors spent a lot of time discussing on-device computing, but all their experiments are just simulations on standard benchmarks. For such a paper concerning training on edge devices, I would expect to see some experiments on real edge devices. + +Overall, I think the paper needs further improvements to be qualified for being accepted. + +----------------------------------------------------------------------------------------------------------------------------------------------------------------- +post rebuttal: + +I've read the authors' responses and the updated paper. Though my concern on writing has been resolved to some extent, I'm still unsatisfied with the empirical experiments. I believe the authors need to do experiments on edge devices since they have emphasized a lot about on-device computing. That being said, I'm not an expert in hardware and have no idea how hard it is to conduct those experiments. 
I've increased the score to 3 but still vote for rejection.",3,,ICLR2020 +rkg5v8DDpQ,3,B1lG42C9Km,B1lG42C9Km,"Interesting paper, possibly some confusion on some causal modelling (especially Section 2.1)","The paper introduces a new intrinsic reward for MARL, representing the causal influence of an agent’s action on another agent counterfactually. The authors show this causal influence reward is related to maximising the mutual information between the agents’ actions. The behaviour of agents using this reward is tested in a set of social dilemmas, where it leads to increased cooperation and communication protocols, especially if given an explicit communication channel. As opposed to related work, the authors also equip the agents with an internal Model of Other Agents that predicts the actions of other agents and simulates counterfactuals. This allows the method to run in a decentralized fashion and without access to other agents’ reward functions. + +The paper proposes a very interesting approach. I’m not a MARL expert, so I focused more on the the causal aspects. The paper seems generally well-organized and well-written, although I’m a bit confused about the some of the causal modelling decisions and assumptions. This confusion and some potential errors, which I describe in detail below, are the reason for my borderline decision, despite liking the paper otherwise. + +First, I’m a bit confused about the utility of the Section 2.1 model (Figure 1), mostly because of the temporal and multiple agents aspects that seem to be dealt with (“more”) correctly in the MOA model. Specifically in Figure 1, one would need to assume that there is only one agent A influencing agent B at the same time (and agent B does not influence anything else). For example, there is no other agent C which actions also influence agent B, and no agent D that is influenced by agent B, otherwise the backdoor-criterion would not work, unless you add also the action of agent C to the conditioning set (or its state). Importantly, adding the actions of all agents, also a potential agent D that is downstream of B would be incorrect. So in this model there is some kind of same time interaction and there seems to be the need for a causal graph that is known a priori. These problems should disappear if one assumes that only the time t-1 actions can influence the time t actions, as in the MOA model. I assume the idea of the Figure 1 model was to show a relationship with mutual information, but for me specifically it was quite confusing. + +I was much less confused by the MOA causal graph represented in Figure 4, although I suspect there are quite some interactions missing (for example s_t^A causes u_t^A similarly to the green background? s_t causes s_{t+1} (which btw in this case should probably be split in two nodes, one s_{t+1} and one s_{t+1}^B?). Possibly one could also add the previous time step for agent B (with u_{t+1}^B influenced by u_t^B I would assume?). As far as I can see there is no need to condition on a_t^B in this case to see the influence of a_t^A on a_{t+1}^B, u_t^A and s_t^A should be enough? + +Minor details: +Is there possibly a log missing in Eq. 2? +",5,3.0,ICLR2019 +ryl__Ymjh7,3,HyEtjoCqFX,HyEtjoCqFX,"Interesting idea, more experimental results needed","** Summary: ** + +The authors use the reformulation of RL as inference and propose to learn the prior policy. 
The novelty lies in learning a state-independent prior (instead of a state-dependent one) that can help exploration in the presence of universally unnecessary actions. They derive an equivalence to regularizing the mutual information between states and actions. + +** Quality: ** +The paper is mathematically detailed and correct. + +** Clarity: ** +The paper is sufficiently easy to follow and explains all the necessary background. + +** Originality & Significance: ** +The paper proposes a novel idea: Using a learned state-independent prior as opposed to using a learned state-dependent prior. While not a big change in terms of mathematical theory, this could lead to positive and interesting results empirically for exploration. Indeed they show promising results on Atari games: It is easy to see how Atari games could benefit as they have up to 18 different actions, many of which are redundant. + +My two main points where I think the paper could improve are: +- More experimental results, in particular, how strong are the negative effects of MIRL if we have actions that are important, but have a lower probability in the stationary action distribution? +- A related work section comparing their approach to the many recent similar papers in Maximum Entropy RL",7,4.0,ICLR2019 +BylxH1MM6X,3,ryxxCiRqYX,ryxxCiRqYX,Review of Deep Layers as Stochastic Solvers,"Overview: This paper shows that the forward pass of a fully-connected layer (generalized to convolutions) followed by a nonlinearity in a neural network is equivalent to an iteration of a prox algorithm, where different regularizers in the objective of the related prox problem correspond to different nonlinearities such as ReLu. This connection is quite interesting. They further relate different stochastic prox algorithms to different dropout layers and show results of improved performance on CIFAR-10 and CIFAR-100 on several architectures. The paper is well-written. + +Major Concerns: + +1. While the equivalence of one iteration of a prox algorithm and a single forward pass of the block is understandable, it is not clear what happens from making several iterations (10 in the case of fully-connected layers in the experiments) of the prox algorithm. It seems that this would be equivalent to making a forward pass through 10 equivalent blocks (i.e., 10 layers with the same weights and biases). But then the backward pass is still through the original network, so the problem being solved is not clear. Clarity on this would help. + +2. Since the equivalence of 10 forward passes of a block are done at each iteration, using solvers does more computations (can be thought of as extra forward passes through extra layers as noted above), which makes the comparison not completely fair. Either adding more batches or more passes over the same batch multiple times (or at least for a few batches just to use the some computational power) would be more fair and likely improve the performance of the baseline networks. + +Minor Issues: + +1. missing definitions such as g(x) at beginning of Section 3 and p in Proposition 1. + +2. Give examples of where the prox problems in Table 1 show up in practice (outside of activation functions in neural networks) + +3. It says ""for different choices of dropout rate the baseline can always be improved by..."" in the Experiments. This is not provable. + +4. Include results for Dropout rate p=0 in Table 5.",7,4.0,ICLR2019 +HJglemDstB,1,B1gskyStwr,B1gskyStwr,Official Blind Review #3,"Summary: +This paper basically built upon [1]. 
The authors propose to do sampling in the high-frequency domain to increase the sample efficiency. They first argue that the high-frequency part of the function is hard to approximate (i.e., needs more sample points) in section 3.1. They argue that the gradient and Hessian can be used to identify the high-frequency region. And then they propose to use g(x)=||gradient||+||Hessian || as the sampling metric as illustrated in Algorithm 1. To be noticed that, they actually hybrid the proposed metric (6) and the value-based metric (7, proposed in [1]) in their algorithm. + +Strength: +Compared to [1], their experiment environment seems more complicated (MazeGridWorld vs. GridWorld). +Figure 3 shows that their method converges faster than Dyna-Value. +Figure 5 is very interesting. It shows that their method concentrates more on the important region of the function. + +Weakness: +In footnote 3: I don't see why such an extension is natural. +In theorem 1, why the radius of the local region has to be? +Theorem1 only justifies the average (expectation) of gradient norm is related to the frequency. The proposed metric $g$, however, is evaluated on a single sample point. So I think if adding some perturbations to $s$ (and then take the average) when evaluating $g$ will be helpful. +The authors only evaluate their algorithm in one environment, MazeGridWorld. +I would like to see the experiment results of using only (6) as the sampling rule. +What kind of norm are you using? (||gradient||, ||hessian||) +Why $g$ is the combination of gradient norm and hessian norm? What will be the performance of using only gradient or hessian? +Figure 4(b), DQN -> Dyna + +Reference: +[1] Hill Climbing on Value Estimates for Search-control in Dyna",6,,ICLR2020 +BkghGs6ZpQ,3,B1g29oAqtm,B1g29oAqtm,Incremental advance on model-based reinforcement learning methods,"The authors learn a model that predicts the state R steps in the future, given the current state and intervening actions, instead of the predicting the next time step state. The model is then used for standard model predictive control. The authors find numerically that their method, termed Plan-Conditional Predictor (PCP), performs better over long horizon times (~100 time steps), than other recent model-based and model-free algorithms. This because for long horizon time scales, the model predicting the state for the next time step accumulates error when used recursively. + +The key idea is to use a model that directly predicts multiple time steps into the future. While seemingly an obvious extension, it does not appear to have been used in current algorithms. A main issue that I find with this approach is: since only the state after R steps is predicted, reward r(s_t,a_t) can only be used every R steps, not at every step. The authors gloss over this issue because for both MuJoCo environments that they tested, they only need to consider reward at the end of the planning horizon. Thus to make their algorithm generally applicable, the authors also need to show how or whether their method can deal with rewards that may appear at any time step. + +Further, rather than speculate on the cause of the difference between their PCP and PETS (Chua et al 2018) on half-cheetah to be their different settings for CEM optimization (Fig 7b), the authors should just use the same settings to compare. Possibly the authors ran out of time to do this for the current submission, but should certainly do it for the final version. 
+ +While the authors have already compared to other algorithms with similar aims, eg Chua et al 2018, they may also wish to compare to a recent preprint Clavera et al Sep 2018, which also aims to combine the sample efficiency of model-based methods while achieving the performance of model-free ones, by using an ensemble of models, over a 200 time step horizon. However, given the recency of this algorithm, I don't consider this essential. + +Overall, I feel that the authors idea of an R-step model is worth spreading in the community, if the above two main points are addressed. At the same time, I can only rate it at the border of the cutoff mark.",6,4.0,ICLR2019 +KqOKh1MHzym,3,LIOgGKRCYkG,LIOgGKRCYkG,"The idea of target training is novel, but several key issues on potential limitations are not evaluated","Summary: +This paper proposes target training to defend against adversarial attacks on machine learning models. Target training doubles the number of output classes, and aims to trick untargeted attacks into attacks that target at designated classes. Experimental results show that targeting training can achieve slightly better performance than adversarial training. + +Pros: +1. Target training applies simple changes to the structure of machine learning models, to defend against adversarial attacks. +2. The idea of tricking adversarial attacks is novel. +3. Target training can partially break transferability of adversarial samples. + +Cons: +1. What is the overhead of target training? +2. Would target training degrade the performance on clean data? +3. More discussion should be included on whether target training is effective against adaptive attacks. + + +Detailed comments: +While the idea of applying target training to defend against adversarial attacks is interesting, I have the following questions regarding the proposed method (performance, limitation, etc.). + +1. Target Training aims to convert “untargeted attacks to attacks targeted at designated classes”, but doesn’t Minimization 1 in Section 2.1 correspond to targeted attacks rather than untargeted attacks (l is the target label)? + +2. What is the overhead of target training, especially Algorithm 3? How does that compare with normal training and adversarial training? + +3. Adversarial training degrades the model’s performance on clean data. Does target training have the same limitation? + +4. Would Algorithm 1 (Algorithm 3) also be working against attacks that do not (do) minimize perturbations? How to choose between these two algorithms? + +5. Not effective against adaptive attacks is one of the main limitations of many existing defence method. It is unclear whether target training has the same limitation. Not using techniques that have been broken does not mean that target training is robust.",5,3.0,ICLR2021 +BkxLPtYt3Q,1,B1VWtsA5tQ,B1VWtsA5tQ,Algorithm description is unclear,"I have to say that this paper is not well organized. It describes the advantage function and CMA-ES, but it does not describe PPO and PPO-CMA very well. I goes through the paper twice, but I couldn't really get how the policy variance is adapted. Though the title of section 4 is ""PPO-CMA"", only the first paragraph is devoted to describe it and the others parts are brief introduction to CMA. + +The problem of variance adaptation is not only for PPO. E.g., (Sehnke et al., Neural Networks 2009) is motivated to address this issue. They end up using directly updating the policy parameter by an algorithm like evolution strategy. 
In this line, algorithm of (Miyamae et al. NIPS 2010) is similar to CMA-ES. The authors might want to compare PPO-CMA with these algorithms as baselines.",4,2.0,ICLR2019 +SJK13DGEg,2,BJAA4wKxg,BJAA4wKxg,A well-executed NLP paper,"This paper is the first (I believe) to establish a simple yet important result that Convnets for NMT encoders can be competitive to RNNs. The authors present a convincing set of results over many translation tasks and compare with very competitive baselines. I also appreciate the detailed report on training and generation speed. I find it's very interesting when position embeddings turn out to be hugely important (beside residual connections); unfortunately, there is little analysis to shed more lights on this aspect and perhaps compare other ways of capturing positions (a wild guess might be to use embeddings that represent some form of relative positions). The only concern I have (similar to the other reviewer) is that this paper perhaps fits better in an NLP conference. + +One minor comment: it's slight strange that this well-executed paper doesn't have a single figure on the proposed architecture :) It will also be even better to draw a figure for the biLSTM architecture as well (it does take some effort to understand the last paragraph in Section 2, especially the part on having a linear layer to compute z).",6,5.0,ICLR2017 +SkxBghB537,3,BJgolhR9Km,BJgolhR9Km,"An interesting idea, but needs more comprehensive/diverse evaluations","This paper introduces a new neural network layer for the purposes of defending against ""white-box"" adversarial attacks (in which the adversary is provided access to the neural network parameters). The new network unit and its activation function are constructed in such a way that the local gradient is sparse and therefore is difficult to exploit to add adversarial shifts to the input. To train the networks in the presence of a sparse gradient signal, the authors introduce a ""pseudogradient"", and optimize this proxy-gradient to optimize the parameters. This training procedure shows competitive performance (after training) on the permutation-invariant MNIST dataset versus other more standard network architectures, but is more robust to both adversarial attacks and random noise. + +High-level comments: +- Using only a single dataset, and one on which the classification problem is rather easy, is cause for concern. I would need to see performance on another dataset, like CIFAR 10, to be more convinced that this is a general pipeline. In Sec 4, the authors mention that, using the pseudogradient, ""one may be concerned that ... we may converge ... and yet, we are not at a minimum of the loss function"". They claim that ""in practice it does not seem to be a problem"" on their experiments. This claim is a bit weak considering only a single, simple dataset was used for training. It is not obvious to me that this would succeed for more complex datasets. +- I would also like to see an additional set of adversarial attacks that are ""RBFI-aware"". A motivated attacker who is aware of this technique might replace the gradient in the adversarial attack with the pseudogradient instead; I expect such an attack would be effective. 
While problematic in general, I do not think this is necessarily an overall weakness of the paper (since we, the community, should be investigating methods like these to obfuscate the process of exploiting neural network models), but I would still like to see results showing the impact/performance of adversarial training over the pseudo-gradient. (I do not expect this will be very much effort.) +- What is the purpose of showing robustness of your network models to random noise? It is nice/interesting to see that your results are more robust to random noise, but what is the intuition for why your network performs better? + +Wording and minor comments: +- The abstract is rather lengthy, but should probably contain somewhere a spelling-out of RBFI, since it informs the reader that the radial basis function (with infinity-norm) is the structure of the new network unit. +- Sec 4: ""...indicate that pseudogradients work much better than regular gradients"" :: Please be more clear that this is context specific ""...than regular gradients for training RBFI networks"". +- Sec. 4 :: Try to be consistent to how you specify ""z"" in this section, you alternate between the 'infinity-norm' definition and the 'max' definition from Eq. (2). Try to homogenize these. +- In general, the paper was well-proofed and well-written and was easy to read (high clarity). +- To my knowledge, this work is a rather unique foray into solving this problem (original). + +Overall, I think this work is an interesting idea to address a rather important concern in the Deep Learning community. While the idea has merit, the small set of experiments in this paper is not sufficiently compelling for me to immediately recommend publication. With a bit more work put into exploring the performance of this method on other datasets, this paper could be made more complete. (Also, since I am aware that space is limited, some of the details on the adversarial attacks from other publications can probably be moved to an appendix.) +",5,3.0,ICLR2019 +7XjGztR6OX0,1,bQf4aGhfmFx,bQf4aGhfmFx,"With a few small adjustment, I think it is acceptable, with a few bigger adjustments I would favour it more.","UPDATE: + +After the reviewers clarifications and some further explanations of the implications of Theorem 2 (in Appendix E) I think now that the paper tells an interesting story and thus I will vote to accept. + + + +======================== + +Summary: + +The paper addresses the setting of meta-learning loss functions and in particular analyses the effect of the loss function on the entropy of the resulting learned function. In particular it shows that TaylorGLO learned functions tend to lead to higher entropic, and thus more regularized, neural networks, than when they are trained with the cross entropy loss. The paper also discusses that the property of high entropy predictions can lead to better robustness against adversarial attacks. + +======================== + +Pros: + +- Well written and structured, and thus easy to follow. (With a few exceptions, but I think with a bit effort that can be fixed. See additional feedback) + +- Fairly unexplored but interesting setting. + +======================== + +Cons: + +- Some things are bit too informal, or not defined, see additional feedback. + +- In my opinion the results are not very strong. In particular the one shown in Table 1. 
The result from Theorem 2 is to me weak in the sense that by itself it does not give any intuition in what is important for a loss function to reduce entropy, and what is important for the magnitude of it. (See also additional feedback). + +======================== + +Scoring: + +For now I will vote for a weak accept, under the assumption that some of the smaller problems would be fixed for a final submission, see additional feedback. There you can also find what is missing for me for a stronger accept. I think the paper addresses an interesting and not much explored topic, and adds sufficient new insights to warrant a publication. + +======================== + +======================== + +Additional feedback (along some questions.): + +Recommendation for smaller adjustments: + +- Third page first paragraph: ..'is important [for] the network's...' + +- Introduce somewhere the Baikal and TaylorGLO loss, those are not that known. + +- After Cross-Entropy analysis you refer to TaylorGLO's parameters a,b,c, which is at that point not introduced yet. Should somehow change the order. + +- Below Theorem 1, there is a bracket that never opens. + +- Theorem 2 is too informal. But in Theorem 2 you miss to introduce the gamma_T notation. For me both Theorem 1 and 2 rely too much on intuition, or are 'not attractors' and 'strength of entropy reduction' well-defined terms? (If so, then it should go to the appendix. As the intuition was at least clear to me, there is for me no strict need to change that though.) + +- Under Table 1, you say that you use Theorem 2 to calculate the strength of the bias, but Theorem 2 only holds for the case in where all non-targets receive the same probability or not? + +- Under Theorem 3, I don't understand why you need the inverse of the contraints to avoid non-usable loss functions. + + +Adjustments that would increase my score: + +- Theorem 2 by itself is pretty void for me. I am missing that you draw some conclusions from it, at least for the loss functions you analyzed, in particular to analyze the magnitude of the bias. + +- The results in Table 1 are essentially the same after adding the invariant, the experiment is not convincing to me. I think you should create maybe even a toy example where you really can highlight the potential benefit of the invariant. + +- I do like the adversarial part. I would have found it very interesting to see how it compares to actually adversarially trained models. (I understand that this is not the point of the paper, but to me that would be still an interesting comparison) + +===================== + + + + +",7,3.0,ICLR2021 +SJg58BU6Fr,2,BJxYUaVtPB,BJxYUaVtPB,Official Blind Review #1,"This paper provides a technique to solve match prediction problem -- the problem of estimating likelihood of preference between a pair of M-sized sets. The paper replaces the previously proposed conventional statistical models with a deep learning architecture and achieve superior performance than some of the baselines. The experiments show the efficacy of the proposed methods. + +I have some major concerns with this paper. These are described below: + +1. The paper presentation is not very clear. The paper contains many imprecise statements. The problem is not setup very well. +(a) The abstract and the introduction contains vague statements with no proper description of what the technique is about. 
+(b) In introduction, the authors start with discussing 1-sized pairwise comparisons, and then suddenly are discussing in-group items effects at the end of the third paragraph, so that means they are talking about M-sized comparisons with M > 1. This adds to the confusion. +(c) The same things holds true for the first paragraph in ""Main Contribution"". Since the authors are using deep learning frameworks, this does not mean the authors ""can infer the underlying models accurately."" I am not sure what the authors wanted to convey with this statement. +(d) The reason I am being very stringent with impreciseness of the statements is majorly after reading the first statement in the Motivation section. It says, ""Our decision to incorporate two modules R and P into our architecture has been inspired by SOME state-of-the-art algorithms developed under CERTAIN statistical models, which have been shown +therein to be OPTIMAL."" Please avoid using some, certain, optimal when they are not defined properly in the paper yet. + + 2. The paper mentions multiple times that it does not use statistical models tailored for a dataset or application but instead use deep neural networks. In my opinion, NN is just another form of statistical model that captures statistical patterns of comparisons and in-group interaction. They are not following the same modeling assumptions as others but have created their own in some sense. The authors may want to rephrase those statements. + +3. The evaluation metric for the datasets that author consider is the prediction accuracy. I am not sure why the authors evaluate on cross entropy as well in Table 1, since both are closely related (CE is a consistent surrogate of accuracy). Can you please explain? However, I liked that they compared the methods with other metrics in Table 2. + +4. In problem setup, please mention M > 1. For M = 1, the problem is similar to [1] and many solutions have been proposed for that. + +5. I am not sure how the R and P modules capture what the authors want them to capture. I would suggest the authors to include that when they first discuss R and P modules. + +6. In equations 7, it is not clear how the function R(.,.,.,) and P(.,.,.) defined? Is it similar to what is described in equation 5? + +7. The training procedure contains the standard details; however, it is not clear to me how the modules R, P, and G interact during training. + +Overall, I believe the paper has a descent idea and contains satisfactory experimental results; however, the presentation of the paper is very weak at this moment. + +[1] Joachims, Thorsten. ""Optimizing search engines using clickthrough data."" Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2002. + +---- After Rebuttal --- + +I thank the authors for providing response to my questions and making edits to the paper. My clarity on boxes R and G has become better; however, I am still not totally convinced. Also, the presentation of the paper could be further improved. Therefore, I would keep the same score. ",3,,ICLR2020 +Syg2YyWTYH,1,Skl4LTEtDS,Skl4LTEtDS,Official Blind Review #1,"The paper presents a method of scaling up towards action spaces, that exhibit natural hierarchies (such as a controllable resolution of actions), throughout joint training of Q-functions over these. 
Authors notice, and exploit a few interesting properties, such as inequalities that emerge when action spaces form strict subsets that lead to nice parametrisation of policies in a differential way. The evaluation is performed in simple toy-ish tasks, and in micro-management problem in 5 scenarios in the game of SC2. + +On a fundamental level, proposed method resembles that of Mix & Match, that authors discuss in the paper. In the M&M paper authors use the matching (distillation) of the policies, to ensure knowledge transfer, while in GAS, authors share information through said differential reparametrisation. Ablations provided imply that this part is indeed crucial (as with independent Qs, called ""Sep-Q"" learning flat-lines). The ablation testing the off-policy modification, seems a bit less conclusive, despite authors claiming that ""This ablation performs slightly, or considerably, worse in each scenario. "" We can see in Figure 3 that there were 5 experiments: +- in one [95z vs 50m] GAS works much better +- in two [80m v 80m, 80m v 85m] there seems to be no difference (in terms of longer term performance) +- in one [60m v 65m] GAS works slightly better +- in one [50h v 50h] GAS works slightly worse +This mixed bag of results would rather suggest that the offpolicy part is not the main contributing factor, and might require closer investigation to really understand which part of the system proposed brings the benefits. Could this ablation be also done on the toy-ish tasks from experiment 1? Given its simplicity it should be cheap enough to run these extra experiments (I am assuming the SC2 ones are quite expensive?) + +Reviewer finds it hard to understand, given current description, how was M&M baseline adapted to the Q-learning setup? Was the distillation loss replaced with L2 one? Was the exact same architecture used for these experiments? How were the missing parent actions handled? Or did M&M experiments use an actor critic learning instead (which would make the comparison more about Q-learning vs Actor-Critic learning, than about methods of action space scaling). Methods section (mentioning entropy loss) looks like M&M was indeed trained with actor-critic, which would make the baseline hardly comparable. Adapting M&M strategy to Q-learning (or picking other baselines, that work on the same RL setup), in reviewer's opinion, is crucial for actual evaluation of author's contributions (given that this is the only baseline). + +On a minor point - given how unique SC2 environment (and problem of unit micromanagement) is, would it be possible to provide baselines results also for the toy-ish experiment 1? + +Overall, I recommend a weak rejection of the paper, and I am happy to revisit this evaluation given that authors address the above comments.",3,,ICLR2020 +B1x90GNcn7,1,rygVV205KQ,rygVV205KQ,"Potentially practical improvement of sparse-reward RL using IL, but a bit unclear when it helps","The submission describes a sort of hybrid between reinforcement learning and imitation learning, where an auxiliary imitation learning objective helps to guide the RL policy given expert demonstrations. The method consists of concurrently maximizing an RL objective--augmented with the GAIL discriminator as a reward—and minimizing the GAIL objective, which optimizes the discriminator between expert and policy-generated states. Only expert states (not actions) are required, which allows the method to work given only videos of the expert demonstrations. 
Experiments show that adding the visual imitation learning component allows RL to work with sparse rewards for complex tasks, in situations where RL without the imitation learning component fails. + +Pros: ++ It is an interesting result that adding a weak visual imitation loss dramatically improves RL with sparse rewards ++ The idea of a visual imitation signal is well-motivated and could be used to solve practical problems ++ The method enables an ‘early termination’ heuristic based on the imitation loss, which seems like a nice heuristic to speed up RL in practice + +Cons: ++ It seems possible that imitation only helps RL where imitation alone works pretty well already ++ Some contributions are a bit muddled: e.g., the “learning with no task reward” section is a little confusing, because it seems to describe what is essentially a variant of normal GAIL ++ The presentation borders on hand-wavy at parts and may benefit from a clean, formal description + +The submission tackles a real, well-motivated problem that would appeal to many in the ICLR community. The setting is attractive because expert demonstrations are available for many problems, so it seems obvious that they should be leveraged to solve RL problems—especially the hardest problems, which feature very sparse reward signals. It is an interesting observation that an imitation loss can be used as a dense reward signal to supplement the sparse RL reward. The experimental results also seem very promising, as the imitation loss seems to mean the difference between sparse-reward RL completely failing and succeeding. Some architectural / feature selection details developed here seem to also be a meaningful contribution, as these factors also seem to determine the success or failure of the method. + +My biggest doubt about the method is whether it really only works where imitation learning works pretty well already. If we don’t have enough expert examples for imitation learning to work, or if the expert is not optimizing the given reward function, then it is possible that adding the imitation loss is detrimental, because it induces an undesirable bias. If, on the other hand, we do have enough training examples for imitation learning to succeed and the expert is optimizing the given reward function, then perhaps we should just do imitation learning instead of RL. So, it is possible that there is some sweet spot where this method makes sense, but the extent of that sweet spot is unclear to me. + +The experiments are unclear on this issue for a few reasons. First, figure 4 is confusing, as it is titled ‘comparison to standard GAIL', which makes it sound like a comparison to standard imitation learning. However, I believe this figure is actually showing the performance of different variants of GAIL used as a subroutine in the hybrid RL-IL method. I would like to know how much reward vanilla GAIL (without sparse rewards) achieves in this setting. Second, figure 8 seems to confirm that some variant of vanilla imitation learning (without sparse rewards) actually does work most of the time, achieving results that are as good as some variants of the hybrid RL-IL method. I think it would be useful to know, essentially, how much gain the hybrid method achieves over vanilla IL in different situations. + +Another disappointing aspect of the paper is the ‘learning with no task reward’ section, which is a bit confusing. 
The concept seems reasonable at a first glance, except that once we replace the sparse task reward with another discriminator, aren’t we firmly back in the imitation learning setting again? So, the motivation for this section just seems a bit unclear to me. This seems to be describing a variant of GAIL with D4PG for the outer optimization instead of TRPO, which seems like a tangent from the main idea of the paper. I don’t think it is necessarily a bad idea to have another discriminator for the goal, but this part seems somewhat out of place. + +On presentation: I think the presentation is a bit overly hand-wavy in parts. I think the manuscript could benefit from having a concise, formal description. Currently, the paper feels like a series of disjoint equations with unclear connections among them. The paper is still intelligible, but not without knowing a lot of context relating to RL/IL methods that are trendy right now. I feel that this is an unfortunate trend recently that should be corrected. Also, I’m not sure it is really necessary to invoke “GAIL” to describe the IL component, since the discriminator is in fact linear, and the entropy component is dropped. I think “apprenticeship learning” may be a more apt analogy. + +On originality: as far as I can tell, the main idea of the work is novel. The work consists mainly of combining existing methods (D4PG, GAIL) in a novel way. However, some minor novel variations of GAIL are also proposed, as well as novel architectural considerations. + +Overall, this is a nice idea applied to a well-motivated problem with promising results, although the exact regime in which the method succeeds could be better characterized.",6,4.0,ICLR2019 +SksyD3Dgz,1,SyfiiMZA-,SyfiiMZA-,Paper of broad interest for control tasks,"This is a well written paper, very nice work. +It makes progress on the problem of co-optimization of the physical parameters of a design +and its control system. While it is not the first to explore this kind of direction, +the method is efficient for what it does; it shows that at least for some systems, +the physical parameters can be optimized without optimizing the controller for each +individual configuration. Instead, they require that the same controller works over an evolving +distribution of the agents. This is a simple-but-solid insight that makes it possible +to make real progress on a difficult problem. + +Pros: simple idea with impact; the problem being tackled is a difficult one +Cons: not many; real systems have constraints between physical dimensions and the forces/torques they can exert + Some additional related work to consider citing. The resulting solutions are not necessarily natural configurations, + given the use of torques instead of musculotendon-modeling. But the current system is a great start. + +The introduction could also promote that over an evolutionary time-frame, the body and +control system (reflexes, muscle capabilities, etc.) presumably co-evolved. + +The following papers all optimize over both the motion control and the physical configuration of the agents. +They all use derivative free optimization, and thus do not require detailed supervision or precise models +of the dynamics. + +- Geijtenbeek, T., van de Panne, M., & van der Stappen, A. F. (2013). Flexible muscle-based locomotion + for bipedal creatures. ACM Transactions on Graphics (TOG), 32(6), 206. + (muscle routing parameters, including insertion and attachment points) are optimized along with the control). + +- Sims, K. (1994, July). 
Evolving virtual creatures. In Proceedings of the 21st annual conference on + Computer graphics and interactive techniques (pp. 15-22). ACM. + (a combination of morphology, and control are co-optimized) + +- Agrawal, S., Shen, S., & van de Panne, M. (2014). Diverse Motions and Character Shapes for Simulated + Skills. IEEE transactions on visualization and computer graphics, 20(10), 1345-1355. + (diversity in control and diversity in body morphology are explored for fixed tasks) + +re: heavier feet requiring stronger ankles +This commment is worth revisiting. Stronger ankles are more generally correlated with +a heavier body rather than heavy feet, given that a key role of the ankle is to be able +to provide a ""push"" to the body at the end of a stride, and perhaps less for ""lifting the foot"". + +I am surprised that the optimization does not converge to more degenerate solutions +given that the capability to generate forces and torques is independent of the actual +link masses, whereas in nature, larger muscles (and therefore larger masses) would correlate +with the ability to generate larger forces and torques. The work of Sims takes these kinds of +constraints loosely into account (see end of sec 3.3). + +It would be interesting to compare to a baseline where the control systems are allowed to adapt to the individual design parameters. + +I suspect that the reward function that penalizes torques in a uniform fashion across all joints would +favor body configurations that more evenly distribute the motion effort across all joints, in an effort +to avoid large torques. + +Are the four mixture components over the robot parameters updated independently of each other +when the parameter-exploring policy gradients updates are applied? It would be interesting +to know a bit more about how the mean and variances of these modes behave over time during +the optimization, i.e., do multiple modes end up converging to the same mean? What does the +evolution of the variances look like for the various modes? +",9,5.0,ICLR2018 +c3LTo2vo3So,2,H6ATjJ0TKdf,H6ATjJ0TKdf,Reviews,"Summary +- The authors propose LAMP, a layerwise adaptive magnitude-based pruning method. The authors conduct extensive experiments on CIFAR10/CIFAR100/SVHN and Penn Treebank to validate the method. + +Pros +- somewhat novel pruning method, based on new weight score +- extensive experiments on image and language datasets + +Cons +- In Equation (2), the authors point out that LAMP score is align with the order of weight squares. Therefore, one can directly prune the network based on weight squares. Why is it necessary to prune the network based on LAMP? +- The comparisons are not sufficient. The authors should compare other ""pruning-retraining"" methods, like network slimming [1], soft filter pruning [2], etc. Though they focus on structured pruning, the core idea can be borrowed and adapted for unstructured pruning. +- Lacking of experiments on large-scale datasets and large models, for example, on ImageNet. The performance of pruning methods can be very sensitive and versatile when only evaluating on small datasets and models. And usually, a pruning method can be invalid when testing on ImageNet models.",5,4.0,ICLR2021 +7Gca-Le9ank,2,HxzSxSxLOJZ,HxzSxSxLOJZ,Good contribution but algorithm is not sound,"- Paper makes a good contribution by pointing an intrinsic flaw in the NeuralODE technique. 
The problem is that even with an error accruing step size, the results of a NerualODE can be good, leading to a false belief that the ODE used in the construct represents the phenomena, but instead it is the dynamic behaviour arising from the mixture of the ODE and the solver that separates the classes well. +- However, the proposed solution does not seem convincing. It seems like a work in progress. The solution is proposed in algorithm 2,4,6 which should have been algorithm 1,2,3. +- It is not made clear how solvers would be able to use this algorithm. +-The algorithm is not nicely constructed. Putting a function like ""calculate accuracy higher order solver();"" in an algorithm without fully describing what it does , is not advised. +- Figures are not illustrative, there is too much clutter. I believe a point could be made with the same amount of figures but with less clutter. +- Based upon the contribution made by the authors, it seems appropriate that their results are published right now. + + + +",7,4.0,ICLR2021 +HkgCK-1RKB,1,SylVNerFvr,SylVNerFvr,Official Blind Review #1,"This work focuses on learning equivariant representations and functions over input/output words for the purposes of SCAN task. Basically, the focus is on local equivariances (over vocabulary) such that the effect of replacing and action verb like RUN in the input with the verb JUMP causes a similar change in the output. However, effects requiring global equivariances like learning relationship between ""twice"" and ""thrice"", or learning relationships between different kinds of conjunctions are not handled in this work. For learning equivariant functions over vocabulary, group convolutions are used at each step over vocabulary items in both the sequence encoder and decoder. The results on SCAN task are impressive for verb replacement based experiments and improve over other relevant baselines. Also, improvement is shown on another word replacement task (""around right""), which requires learning corresponding substitutions in output based on the word changes in the input. As expected, for experiments that require global equivariances or no equivariance (simple, length), the difference ion performance is not very pronounced over other baselines. +While this paper does show that modelling effects of word substitution can be handled by the locally equivariant functions, it still cannot account for more complex generalization phenomena which are likely to be much more prevalent especially for domains dealing with natural language that are other than SCAN. Therefor, I think the applicability of the proposed equivariant architectures is rather limited if interesting.",6,,ICLR2020 +wa8gYEDg1tZ,1,uUX49ez8P06,uUX49ez8P06,Extension of reinforced continual learning (RCL),"This paper falls into a class of continual learning methods which accommodate for new tasks by expanding the network architecture, while freezing existing weights. This freezing trivially resolves forgetting. The (hard) problem of determining how to expand the network is tackled with reinforcement learning, largely building upon a previous approach (reinforced continual learning, RCL). Apart from some RL-related implementation choices that differ here, the main difference to RCL is that the present method learns a mask which determines which neurons to reuse, while RCL only uses RL to determine how many neurons to add. Experiments demonstrate that this allows reducing network size while significantly improving accuracy on Split CIFAR-100. 
The runtime is, however, increased here. + +The obvious downside to this approach (and to RCL) is the potentially very increase in runtime stemming from RL, which requires fully training many networks just to solve one additional task. This renders the approach impractical for large models; consistent with this, the authors only study models of modest dimensions. + +The present paper is mostly an extension of RCL. Thus, limited novelty is its main weakness. But the experimental gains are significant, the extension to RCL is meaningful and the paper is easy to follow. + +I leave some questions and comments for the authors below: +- Did the authors re-implement the RCL objective using their own RL algorithmic choices? The RL implementation in the RCL paper differs in many ways (for example, actor-critic learning is used), which leaves some doubt as to whether (part of) the benefits stem from the changes to RL, or from the actual neuron-level control proposed here. I would like to hear the authors' reply to this point. + +- ""We point out that we do not strictly follow the usual $\epsilon$-greedy strategy; an exploration step consists of starting an epoch from a completely random state as opposed to perturbing an existing action."" +Isn't this still $\epsilon$-greedy? I do have a question on this point, though: is the exploration probability annealed? Picking random states (with a high probability of 30%) seems very extreme. Can the authors provide learning curves for the controller? + +- Permuted MNIST is admittedly not a great dataset to study transfer learning. Can the authors repeat the analysis of Figure 6 on Split CIFAR-100? Is allocation decreasing as new tasks are learned, suggesting that some form of transfer is occurring at the architecture search level, or does it remain roughly constant, on that dataset? Still on transfer learning: it would be good to report the actual training curve of the resulting nework (not the controller) over tasks and investigate whether learning becomes faster as more CIFAR splits are learned, compared to a baseline which does not benefit from architecture search. This would make the paper much stronger, in my opinion. + +- MWC: how is this method different from standard EWC? I couldn't find any explanation. + +- While EWC is a relevant baseline, it would be good to report as well the performance of simple joint multitask training as an upper baseline. + +- ""The results show that compared to CLEAS, this version exhibits an inferior performance of -0.31%, -0.29%, -0.75% in relative accuracy"" +What is relative accuracy? Relative to MWC, as in Fig. 3? In any case, as these improvements are somewhat modest, how does runtime compare for the two options? + +- Regarding training time, Fig. 7: while runtime in seconds is important, can the number of (controller network) training iterations of RCL vs. CLEAS be provided as well? + +- Readability could be improved by using \citep{} instead of \citet{}. + +-- +Post-rebuttal edit: I read the authors' reply and thank them for the clarifications. I maintain my score of 6.",6,3.0,ICLR2021 +AwW_BLVg59,4,n1HD8M6WGn,n1HD8M6WGn,"Interesting idea, but I have questions","This is an interesting idea where the authors propose ""SurfaceFusion"", where they use the source embeddings learned by the encoder to modulate the output of the decoder at the final layer. 
The authors claim this is because the embeddings contain valuable information that is lost during encoder processing because the encoder lacks the capacity to represent both semantic and surface features. The authors then show through a series of experiments that attending over the encoder embeddings is useful, and propose a way to integrate the information from the embeddings directly into the last layer of the decoder, showing that this improves experimental results. + +Few comments: + +1) there are some grammatical and spelling errors, e.g.: ""analyses"", ""which is the EncoderFusion expected to solve"" +2) while the premise and analysis are interesting, i am curious about the reasons behind the design choice of ""SurfaceFusion"". If the encoder source embeddings are very important, and layerwise/finegrained attention perform similarly well, why not simplify the approach and simply add/concat them to the output of the encoder (or somewhere else where it makes sense)? have you tried this instead of introducing additional hyperparams? +3) what are the costs in terms of wall-clock time when introducing an additional softmax operation for every token in the decoder? + +--- +update: + +Thanks for the clarifications. I've read the response and other reviews and have updated my rating.",7,4.0,ICLR2021 +DIKqZkFzVpe,4,OQ08SN70M1V,OQ08SN70M1V,A simple and effective method for preserving generalizability of a model when fine-tuning.,"Summary + +The paper proposes a method for finetuning pre-trained models that ensures the generalization ability of the representation is maintained. The key innovation is that the computationally expensive ascent step in the mirror descent method of SMART can be replaced by simply injecting noise. The results support the hypothesis that this works well for keeping the generalization-ability of the model. The authors also define the degradation of the generalizability of the representation during finetuning as “representational collapse”. + +Strengths +- The proposed approach is based on the change of the model in the output space g.f which seems like a very sensible way to constrain the model. The proposed approach therefore shares the advantage that the “change” being minimised has some meaningful interpretation. This is in contrast to many continual learning approaches which operate purely in weight space. + +- Constraining the output function g to be 1-Lipschitz is also sensible and well explained in the paper as it ensures the Bregmann-divergance-based smoothness constraint applied on the output will also constrain the representation, f. + +- The experiments are quite strong. The method has been evaluated on a large range of NLP tasks using various transformers as the base model. All experiments include multiple runs and the average/median statistics have been reported. + +- The approach is much faster than the closest existing method, SMART and achieves comparable accuracy in most cases. + +- Overall the paper is very well written and easy to understand. The proposed novelty compared to the closest existing approach is clearly highlighted and validated by the experiments. + +Concerns +- The generalization experiments in Figure 4 only compares the proposed method to standard fine-tuning with best practices (i.e. Standard++), why has a more sophisticated methods like SMART not been included in this figure? 
Also, the authors state that “R3F/R4F consistently outperforms the adversarial fine-tuning method SMART”, but from Figure 3 it seems that the converse is also true - in at least 2/6 of the tasks, SMART outperforms all the variants of the proposed method and is on par in two others. + +- There is quite a range of performance between the variants R3F and R4F, but there aren’t any guidelines or suggestions on why this is the case or which one should be used in a particular situation. + +- The results in Table 3 and Table 2 show fractional improvements over the existing methods, however not variance is reported for these numbers. Another issue is that Table 2 uses median whereas Table 3 uses average. Is there a reason for this discrepancy? + +- The need for a new term “representational collapse” is not really justified in the paper. Most authors just use the term generalization. What exactly is the difference between “representational collapse” and just saying the models lacks the ability to generalize? + +- Perhaps the most significant weakness of the paper is that the novelty seems a bit limited. The difference compared to SMART is not really justified in a theoretical or principled manner. For example, what are the implications for using noise samples in Eqn. 4? Is it simply a heuristic to encourage smoothness? It would be good if the authors could explain this in more detail in the paper. At the moment it just appears as if ad-hoc modifications have been made to the cost function. + +Minor comments +- There are some very minor typos throughout the paper that can be fixed. Eg. “even great degree” +",6,3.0,ICLR2021 +DwVTcjLAAk,1,QxQkG-gIKJM,QxQkG-gIKJM,An interesting paper which applies optimism principle to practical algorithms,"This paper studies optimistic exploration for deep reinforcement learning. They propose an algorithm called OEB3, which utilizes the disagreement of bootstrapping Q-functions as the confidence bonus to guide exploration. They show that their confidence bonus matches the confidence bonus of optimistic LSVI in linear setting. After that, they evaluate their algorithm on plenty of benchmark problems including Mnist maze and 49 Atari games. Their algorithm outperforms other SOTA exploration approaches. + +Pros: + +- The optimistic exploration problem is very important for the RL community. From the theoretical perspective, the classical $\epsilon$-greedy exploration method has been shown to be inefficient, and UCB-based algorithms are shown to be sample efficient with regret guarantee. However, how to apply these theoretical insights into practical algorithms is still an open problem. The paper proposes a method that uses bootstrapping to construct confidence bonus, and connects the theoretical findings to practical algorithms. + +- The experimental results on 49 Atari games are rich enough to show the improvement of the algorithm, with visualization results and ablation studies. + +Other comments: + +- The connection with optimistic LSVI is interesting but not satisfying. Theorem 2 assumes that $\epsilon$ follows the standard Gaussian distribution. Why does this assumption hold? A more natural and reasonable assumption is that the rewards and transitions are sampled from a Gaussian distribution. Can theorem 2 still hold under such assumptions? + + +-------Post Rebuttal--------- + + +Thanks for the feedback from the authors. After the rebuttal and discussion period, I believe that the contribution of the paper are mainly empirical. 
The Theorems (Theorem 1 and 2) are proposed to show the intuition of the bootstrapping and backward-update methods, and to connect the algorithm with recent theoretical results. During the discussion, there are mainly three concerns about the theorems among reviewers.

- Firstly, the algorithm uses eps-greedy to help exploration, which is theoretically unclear.
- Secondly, the theorems focus on the episodic setting, while the algorithm is designed for the discounted setting.
- Lastly, it is unclear whether the algorithm is a frequentist approach or Bayesian.

I believe that the former two concerns can be addressed in the following way, and don't weaken the contribution of the paper.
- From the theoretical perspective, the feedback from the authors does tackle this problem. The eps-greedy will only add an epsilon-term to the final regret. I am eager to see the discussion in the final version. From the experimental view, the authors also claim that all the algorithms in the experiments use eps-greedy to help exploration. In that case, the improved performance of their algorithm is mainly due to their bootstrapping and backward-update methods, instead of the eps-greedy trick. Besides, I think applying several widely-used tricks (such as eps-greedy) to the algorithm is acceptable, which makes the results comparable to other methods that also use the tricks.
- To the best of my knowledge, there is currently no paper studying linear MDPs in the discounted setting from the theoretical perspective. I think this is the reason why this paper studies the connection of the theorems in the episodic setting. Meanwhile, it is hard to conduct experiments in the episodic setting (in that case, the algorithm needs to learn value functions for all possible horizons). Moreover, we can reduce an MDP with discounted rewards to an episodic MDP by setting the effective horizon $H = \frac{1}{1-\gamma}$. This means that the theorem for episodic MDPs can hold in the discounted setting with slight modification. Maybe such a discussion should be added to the paper.

However, for the last concern, I am still puzzled whether the algorithm is a frequentist approach or Bayesian, as there are many unclear statements.

Overall, I believe the main contribution of the paper is from the empirical perspective. The theorems are intended to show the intuition of the algorithm and to connect the approach to the recent theoretical results. The experimental results look nice in general. I am a little disappointed that the authors missed the comparison with two related works in the initial version. I agree with R5 that these results may have been done under a time crunch. As a result, I change my score to 6, and I hope the above problems can be addressed in the next version.
",6,3.0,ICLR2021
alY5qxYMKRL,2,C5th0zC9NPQ,C5th0zC9NPQ,Not novel enough and poor presentation,"The idea and research direction itself is definitely interesting and worthy of pursuit. However, the execution is really poor. In addition to needing many improvements in clarity and writing, the proposed method is not at all novel, and various variants for reconstructing one sensory modality from others have been proposed in the past:

Gu et al. ""Improving domain adaptation translation with domain invariant and specific information."" arXiv preprint arXiv:1904.03879, 2019.

Murez et al. “Image to image translation for domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Salt Lake City, UT), 4500–4509, 2018.

Luo et al. (2018).
“ViTac: feature sharing between vision and tactile sensing for cloth texture recognition,” in 2018 IEEE International Conference on Robotics and Automation (ICRA) (Brisbane, QLD: IEEE), 2722–2727.

Lee et al. (2019). “Touching to see” and “seeing to feel”: Robotic cross-modal sensory data generation for visual-tactile perception,” in 2019 International Conference on Robotics and Automation (ICRA) (Montreal, QC), 4276–4282.

Tatiya et al. ""A Framework for Sensorimotor Cross-Perception and Cross-Behavior Knowledge Transfer for Object Categorization."" Frontiers in Robotics and AI 7 (2020): 137.

These papers should be considered by the authors, discussed in related work, and used as a basis to propose something new in future work, as there are still problems in that area that need solving.

Finally, the notation is way more complicated than it needs to be for a simple encoder-decoder architecture. My advice to the authors is to simplify it. ",2,5.0,ICLR2021
w7Fzl3jMUU,3,eBHq5irt-tk,eBHq5irt-tk,"Interesting avenue, but requires improvements","This paper explores the effective dimensionality of the Hessian as a possible explanation for why neural networks generalize well despite being massively overparametrized.

While I concur with the intuition, I think in the current state of the paper, some points could be improved and clarified.

### Relationship between Hessian and posterior covariance

While you mainly reason in the Bayesian framework about the posterior, it seems that the networks in Figures 1, 6, and 7 are trained using ML. So why would the Hessian of the ML estimator relate to the covariance of the posterior?

### Theorem 4.1

Theorem 4.1 shows that even with $k \gg n$ parameters, there are only $n$ directions in which the posterior covariance changes from the prior. But the rest of the discussion shows that the effective dimension actually decreases, which is not captured by your theorem. In this regard, I consider this theorem as an illustration of why the effective dimension does not increase with an increasing number of parameters, a statement that is weaker than saying that the effective dimension actually decreases.

Therefore, we can say that your argument for advocating in favor of using the effective dimension as a proxy for generalization is mainly empirical. Then I would have appreciated a more thorough ablation study that would demonstrate that the correlation still occurs while varying other hyperparameters.

### Figure 4

Fig. 4: can you precisely state what is plotted, i.e., for a fixed $z$, why do we have several datapoints?

### Conclusion

As already said in the beginning, I really like the idea of effective dimension playing an important role in generalization. I however think that the relationship between the Hessian of the ML estimator and the covariance of the posterior, as well as the empirical study, should be improved before this is published.",4,4.0,ICLR2021
0HM4Uy_4K0,2,#NAME?,#NAME?,"Solid theoretical results, limited experimental results","##########################################################################

Summary:


The paper provides theoretical analysis of the regularized backup for MCTS. The paper carries out a detailed analysis (regret and error analysis) on three instantiations of the regularization. Finally, the paper provides some empirical gains on certain toy domains and some Atari games.

##########################################################################

Reasons for score:


Overall, I vote for rejection.
I am not very familiar with the theoretical results of the MCTS literature - however, it seems that the idea of adding convex regularization is not new in rl literature overall. I cannot be a good judge for the theoretical contribution and I will focus more on the empirical side of the paper. + +##########################################################################Pros: + + +1. Related work. The paper seems to miss some highly related literature, in particular: + +[1] Buesing et al, Approximate Inference in Discrete Distributions with Monte Carlo Tree Search and Value Functions, AISTATS 2020 + +[2] Grill et al, Monte Carlo Tree Search as Regularized Policy Optimization, ICML 2020 + +[1] uses entropy-regularized MCTS backup; [2] relates MCTS to policy optimization with convex regularization. In particular, [2] proposes that the conventional MCTS backup used in Alpha-Zero (which is different from the max-MCTS backup), is carrying out approximate regularizations. I hope that the author could clarify on the connection between this work and these two pieces of prior work. + +2. Regarding results in Table 2, I wonder how many seeds do the authors run per game per algorithmic baseline. Does each number correspond to a mean value across a few seeds, or it is just a single run? Could the author also clarify how the t-test is done to denote significant differences? I would expect such tests to be run on averages over a collection of seeds. + +########################################################################## + +Please address and clarify the concerns above. + + +#########################################################################",5,4.0,ICLR2021 +ksnRmL86yP,3,Ig-VyQc-MLK,Ig-VyQc-MLK,Official blind review #3,"########################################################################## + +Summary: + +Generating a pruned network falls into two broad categories: 1) spend some extra time and effort to train or fine-tune the pruned model after first training a dense version, or 2) cut out that extra time and effort by generating a sparse network ""from scratch."" While approach (1) has historically given the best accuracy, recent advances (such as the lottery ticket hypothesis) suggest that there are sparse networks hidden in the initialization that don't need to first be trained, if only we could divine the structure of those models. Approach (2) seeks to do just this: determine the connectivity as close to initialization possible. However, even the best results taking this second path fall short when compared to the accuracy of the former path - why is this? The submission pokes at three recent techniques to pull out some commonalities that are *not* shared with (1), suggesting possible issues that need to be overcome to improve accuracy, and proposes a set of experiments and comparisons that should be part of any new technique that claims to discover a good sparse mask at initialization. + +########################################################################## + +Reasons for score: + +Overall, this submission raises some important questions (why is from-scratch pruning falling short of SOTA accuracy?), empirically shows the performance of some leading techniques (and that they fall short in a well-motivated range of sparsities), and, via ablation studies, points towards some potential reasons. 
More importantly, these findings are interesting: +- There's no single SOTA method for sparse-from-scratch training +- There's a need for consistent reporting in this area, and the ablation studies performed have been shown to lead to useful insights; adopting them as standard for future work seems fruitful. + +I think these are enough to warrant an ""above-the-threshold"" rating, but a higher rating would require empirical results on more networks on large-scale data sets, or the inclusion of techniques the submission itself suggests might be a promising direction forward in the current gauntlet of tests. + +########################################################################## + +Pros: + ++ The organization of the paper makes it easy to follow the logical progression and points being raised. ++ The direct comparisons of three recent techniques (SNIP, GraSP, and SynFlow) on different networks and data sets fills in some gaps in the literature. ++ Further, the ablation studies performed on these techniques yield surprising results, both in isolation (inverting GraSP improves accuracy!) and when compared to magnitude pruning after training (these three are invariant to shuffling and re-initialization). + +########################################################################## + +Cons: + +- Experiments on large data sets are limited to RN50 on ImageNet. +- The three particular techniques chosen (SNIP, GraSP, SynFlow) aren't particularly motivated - why these three, and not other recent techniques? (A potential answer may be that there's no training before pruning is finished, but why is this important?) +- Mostly tongue-in-cheek: the submission doesn't answer all the questions it raises, unfortunately. + +########################################################################## + +Questions: + +- What benefit do the three at-initialization ""static"" pruning techniques have over those that reduce training FLOPs and memory requirements but allow the sparse mask to change dynamically, like RigL (Evci et al., 2020) and sparse momentum (Dettmers and Zettlemoyer, 2019)? Is there a reason they do not belong in the current lineup? The overhead of occasionally updating the mask shouldn't be too imposing. + +- In the final paragraph, it is suggested that it may be tricky to compare the training cost of ""a method that prunes to sparsity s at step t against a method that prunes to sparsity s' < s at step t' > t. If method A prunes to a higher sparsity at an earlier time step, shouldn't it cost strictly less than method B, which prunes to a lower sparsity later in training? + +########################################################################## + +Minor suggestions: + +Figure 2 is never referenced in the text (that I could find), and LTR isn't defined until the following page. + + +########################################################################## + +Updates are appreciated + +Hi, Authors, + +I appreciate the updates you've made to the paper and the responses to my questions. You're quite correct that RN50 and ImageNet is sufficient to illustrate deficiencies. + +(I'd still want to see broader experiments for claims of some new method overcoming these deficiencies, though! When broadening scope to other tasks, I'd expect the authors of prior methods designed for vision tasks would be okay with use of their methods as baselines if there are no methods designed specifically for those new tasks.) 
+ +With this in mind, I'll update my rating to a 7.",7,3.0,ICLR2021 +BkgwHNj93X,3,HyxSBh09t7,HyxSBh09t7,Simple combination of existing works,"The paper used the graph scattering network as the encoder, and MLP as the decoder to generate links/graph signals/graphs. + +Pros: +1. Clearly written. Easy to follow. +2. No need to train the encoder +3. Good results on link prediction tasks + +Cons: +1. Lack of novelty. It is a simple combination of existing encoders and decoders. For example, compared to VGAE, the only difference in the link prediction task is using a different encoder. Even if the performance is very good, it can only demonstrate the effectiveness of others’ encoder work and this paper’s correct selection of a good encoder. +2. Lack of insights. As a combination of existing works, if the paper can deeply explain the why this encoder is effective for the generation, it is also beneficial. But we also do not see this part. In particular, in the graph generation task, the more important component may be the decoder to regulate the validness of the generated graphs (e.g. “Constrained Generation of Semantically Valid Graphs via Regularizing Variational Autoencoders. In NIPS 2018” which used the similar decoder but adding strong regularizations in VAE). +3. Results on QM9 not good enough and lack of references. Some recent works (e.g. “Junction Tree Variational Autoencoder for Molecular Graph Generation, ICML 2018”) could already achieve 100% valid. + +",4,4.0,ICLR2019 +HJlQrTscYr,2,ByxJjlHKwr,ByxJjlHKwr,Official Blind Review #2,"This paper presents a technique for model based RL/planning with latent dynamics models, which learns the latent model only using reward prediction. This is in contrast to existing work which generally use a combination of reward prediction and state reconstruction to learn the latent model. The paper suggests that by removing the state reconstruction loss, the agent can learn to ignore irrelevant parts of the state, which should enable better performance in settings where state reconstruction is challenging. + +Overall the motivation for this work is good, and the idea is promising. Difficulty in reconstructing high dimensional states is a challenge for learning latent dynamics models. The paper is also very well written and easy to follow. + +My concerns are centered around the experimental evaluation. Specifically, I see the following issues: (1) the experimental environments seem artificial, and hand tailored for this method, (2) given that the proposed method is a minor modification to the PlaNet paper, it seems that PlaNet should be included as a comparison (especially because it has been shown to work on high dimensional states), and (3) the proposed method seems very prone to overfitting to the given task, and there should be an analysis of how the proposed change affects generalization and robustness. + +(1): The testing environments contain many distractor pendulums/cheetahs, which makes state reconstruction especially challenging. While this does seem to be the point the authors are trying to show, the environments are an extreme, almost artificial, case of difficult state reconstruction. Would the same results hold in more realistic settings, for example, visual robot manipulation in a cluttered scene? Model based RL with video prediction models has been shown to work in such real cluttered robot manipulation environments. Showing that the proposed method can outperform such approaches in robot manipulation settings would be a powerful result. 
The results on images in the appendix seem to show a delta between the true and predicted reward, suggesting that the proposed method does not yet work on images. Why might this be the case? + +(2): From what I can see, the proposed method is very similar to the PlaNet algorithm with state reconstruction loss removed. Given the similarity, PlaNet should be included as a comparison in both the pendulum and cheetah environments. Similarly why was DeepMDP performance not shown in the Cheetah environment? + +(3): One of the strengths of model based reinforcement learning is the ability to plan to reach unseen goals with a model trained via self-supervision or different goals. Does the proposed approach lose some of this, by overfitting to only the task reward? I suspect that in generalizing to unseen tasks, a model trained with state prediction would potentially perform much better. If trained on many tasks, could this method achieve similar generalization? + +Due to some of these questions which remain unanswered by the experimental evaluation my current rating is Weak Reject. If the authors are able to clarify some of the questions above I may adjust my score. +",3,,ICLR2020 +H1S_cEcxM,3,r1h2DllAW,r1h2DllAW,review,"The authors consider the problem of ultra-low precision neural networks motivated by +limited computation and bandwidth. Their approach first posits a Bayesian neural network +a discrete prior on the weights followed by central limit approximations to efficiently +approximate the likelihood. The authors propose several tricks like normalization and cost +rescaling to help performance. They compare their results on several versions of MNIST. The +paper is promising, but I have several questions: + +1) One major concern is that the experimental results are only on MNIST. It's important +to have another (larger) dataset to understand how sensitive the approach is to +characteristics of the data. It seems plausible that a more difficulty problem may +require more precision. + +2) Likelihood weighting is related to annealing and variational tempering + +3) The structure of the paper could be improved: + - The introduction contains way too many details about the method + and related work without a clear boundary. + - I would add the model up front at the start of section 2 + - Section 2.1 could be reversed or equations 2-5 could be broken with text + explaining each choice + +4) What does training time look like? Is the Bayesian optimization necessary?",5,4.0,ICLR2018 +HJlp8ls19B,4,B1x1MerYPB,B1x1MerYPB,Official Blind Review #2,"Summary: +The paper describes a noisy channel approach for document-level translation, which does not rely on parallel documents to train. The approach relies on a sentence-level translation model (from target-to-source languages) and a document level language model (on target language), each is trained separately. For decoding, the paper relies on another proposal model (i.e., a sentence level translation model from source to target) and performs beam-search weighted by a linear combination of the scores of all three models. Experiments show strong results on two standard translation benchmarks. + +Comments: +- The proposed approach is strongly based on the neural noisy channel model of Yu et al. 2017 but mainly extends it to context aware translation. 
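For context, my understanding of the decoding criterion is a weighted combination of the three model scores applied during beam search; a rough sketch with hypothetical interfaces and weights (not the authors' implementation):

```python
def channel_score(candidate_prefix, source_sentence, direct_model, channel_model, doc_lm,
                  w_direct=1.0, w_channel=1.0, w_lm=1.0):
    # direct_model:  proposal model, log p(target | source)
    # channel_model: target-to-source channel model, log p(source | target)
    # doc_lm:        document-level language model over the target prefix
    # The three log-scores are combined linearly; the weights would be tuned on held-out data.
    return (w_direct * direct_model.logprob(candidate_prefix, source_sentence)
            + w_channel * channel_model.logprob(source_sentence, candidate_prefix)
            + w_lm * doc_lm.logprob(candidate_prefix))
```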
While the paper is referenced, I believe more emphasis should be put on the differences of the proposed approach +- It seems that the Document Transformer uses parallel-documents to train, so I am wondering if you can still claim that your approach does not require parallel documents. +- In general, I think the paper is well written and results are compelling. +",6,,ICLR2020 +a3Clmt4sfxW,3,jWXBUsWP7N,jWXBUsWP7N,"Concerns with clarity, attribution and correctness","This paper proposes to learn a Gaussian Mixture Model of the distribution of returns and use it as the critic in an actor-critic RL agent. From what I can tell the principal novel contribution of this work is the Sample-Replacement method, in particular the observation that when paired with a GMM the replacement can be done at the level of modes of the mixture instead of individual samples. Another potential contribution is showing that the GMM can be optimized using the Cramer metric, although obviously this metric has been fairly widely studied previously. + +However, this work has several problems that make it unpublishable at the moment. I'll begin with the least severe (lack of clarity and poor attribution) and then move to the much more problematic (misleading statements and factual errors). + + +Unclear: + +Section 4.3, assumptions: +""1) the reward given a state-action pair R(x, a) follows a single distribution with finite variance"" + +What does this mean? Finite variance is clear, what do you mean ""a single distribution"". + +""3) the policy follows a distribution which can be approximated by Dirac mixtures"" + +What does this mean? Approximated how well? Under what metric? + +Equation 18, second line, \mu and \sigma seem like they should both be functions of (x, a). + + +Poor attribution: + +Existing work has used Gaussian Mixture Models for distributional RL, as well as for the actor-critic setting (D4PG, among others). + +Existing work has considered multi-step returns in distributional RL (Rainbow, Reactor, as well as almost all methods that use AC with Dist. RL). However, the Sample-Replacement method is an interesting contribution that is novel compared with this existing work. + +""The distributional Bellman operator is a [...] contraction mapping in the Cramer metric space, whose proof can be found in Appendix C."" + +This has previously been proven in the Rowland et al. (2019) paper the authors cite, but do not attribute such a result to. + + +Misleading statements: + +""Third, the Wasserstein distance that is commonly used in DRL does not guarantee unbiasedness in sample gradients"" + +While this is true for direct minimization, the quantile regression work cited in this paper does guarantee unbiased sample gradients. + +""The instability issue is not present under the stochastic policy... Combining these solutions, we arrive at a distributional actor-critic..."" (Much later) ""One way to overcome this issue is learning value distributions under the stochastic policy and finding an optimal policy under principles of conservative policy iteration..."" + +The instability issue the authors reference here is that the distribution of returns, though not its mean, can be an expansion under any probability metric when applying the optimality operator. While this is an interesting topic, the authors do not actually address it or contribute towards its understanding or solution in any way. 
The evaluation operator was already known to be a contraction in Wasserstein (as well as for Cramer), which is the relevant operator when considering an actor-critic framework. Unlike the authors' claim that this is due to using a stochastic policy, it is in fact due to performing evaluation as opposed to optimality operators. + +""Barth-Maron et al (2018) expanded DDPG by training a distributional critic through quantile regression."" + +This is completely incorrect, as they considered categorical distributions and Gaussian mixtures, but not quantile regression. + + +""The actor-critic method is a specific case of temporal-difference (TD) learning method in w hich the value function, the critic, is learned through the TD error defined by the difference..."" + +Actor-critic uses TD to learn the critic, but it is not a specific case of TD. + +""However, the Wasserstein distance minimized in the implicit quantile network cannot guarantee the unbiasedness of the sample gradient, meaning it may not be suitable for empirical distributions like equation 13."" + +This is 100% false and shows a lack of understanding of multiple papers being cited in this work. + +Figure 2 and ""Wasserstein distance (labeled as IQ) converges to a local minimum which does not correctly capture the locations of the modes"" + +This seemed off to me so I went ahead and reimplemented this experiment myself. This has nothing to do with the Wasserstein distance and is exceedingly misleading to the reader. Suggestion to read the Rowland et al. (2019) paper that the authors cite for better understanding. Huber-quantiles are not quantiles. The authors learn Huber-quantiles and then treat them as quantiles and observe they look wrong. If you run IQ with the Huber parameter at 0 (corresponding to quantiles) then you get the correct (unbiased) distribution. If you instead learn Huber-quantiles and use the imputation in the Rowland you again get the right distribution. + +The experimental results in the main text look promising for GMAC, but looking at the full set of results in the appendix paints a much more mixed picture.",5,4.0,ICLR2021 +j5JWWZU-rHB,3,LzhEvTWpzH,LzhEvTWpzH,Interesting trick for training NMT systems,"This paper shows that aligning parallel text with fastalign and then randomly replacing source words with their aligned target words, or interpolating their embeddings, improves machine translation. + +This method is different from other data-augmentation methods that try to alter the source sentence without changing its meaning; here, the source sentence is altered into a mixture of the source and target. That’s interesting, but not very strongly motivated. + +The paper doesn’t make clear whether the noise probability / coefficient is optimized on a development set or the test set. Based on Figures 3 and 4, it looks as though these hyperparameters may have been optimized on the test set, which is concerning. For both the baseline systems and your system, hyperparameters should be optimized on a development set and then tested using only a single hyperparameter setting on the test set. If this is what you did, please explicitly state this to reassure the reader. + +Not much attempt is made to explain why this method helps; the only analysis is a measurement of cosine similarity between five German-English word pairs. Do you tie word embeddings between the source and target languages (Press and Wolf, 2017)? 
- If so, one would expect that the transformer would already be able to place words with similar meanings close together, so the fact that your method improves this is interesting; do you know whether it helps more, e.g., for rare words, proper names, technical terms? Why is fastalign able to align some words better than the transformer? Would an even simpler method help, e.g., if (and only if) word f and word e both occur <= k times in the training data and they occur in exactly the same sentence pairs, then allow f to be switched to e?

- If not, I'd suggest doing so and rerunning the experiments to see if you still get an improvement.

Overall, this seems like a good trick for training NMT systems, but I would hope to see more insight either into why the proposed method works, or how NMT works or doesn't work.",3,4.0,ICLR2021
iu0-QGq_8sY,3,zx_uX-BO7CH,zx_uX-BO7CH,Review,"This paper proposes CTN (Contextual Transformer Networks) for online continual learning. In particular, the authors introduce a dual memory framework that contains an episodic memory for base networks and a semantic memory for task controllers. The overall framework is optimized with bi-level optimization. In addition, the base network also uses a KL-divergence loss to prevent catastrophic forgetting. Experiments are conducted on multiple datasets and the authors demonstrated that the proposed framework outperforms other alternative approaches.

####### Strengths######
+ The paper is addressing an important problem, i.e., continual learning.
+ The motivation is clear.
+ Good results have been shown compared to other incremental learning methods.

#######Weakness######
- The novelty is a bit limited. The framework seems like a loose combination of existing well-explored techniques. For example, the modulation part conditioned on tasks (Eqn 4) has been widely used before to modulate feature maps. For example:
[1] TAFE-Net: Task-Aware Feature Embeddings for Low Shot Learning
[2] TADAM: Task dependent adaptive metric for improved few-shot learning
The KL divergence to prevent forgetting has also been used before.

- The semantic memory and the episodic memory are a bit confusing to me. What exactly is stored in the memory? Are image samples stored in the memory? From Algorithm 1, L8 is updating the episodic memory with a batch of samples. But L16 suggests the predictions are added.

Minor:
+ Page 4 L3, the symbols are the same for the semantic memory and the episodic memory.
+ Algorithm L8, the episodic memory is denoted as M^{tr} rather than M^{em}

########After rebuttal#########

I appreciate the authors' effort in addressing my concerns. After reading the rebuttal and other reviews, I am raising my score to 6.",6,3.0,ICLR2021
rJez-cJNcr,3,r1e74a4twH,r1e74a4twH,Official Blind Review #2,"This paper proposes a method for learning disentangled representations. The approach is used in both supervised (where the factors to be disentangled are known) and unsupervised settings. The authors demonstrate the efficacy of their approach in both settings on several datasets with both quantitative and qualitative results.

This task is an important one. However, I found that the contribution of this paper is fairly small. The proposed approach seems reasonable, but it is mostly a work of engineering and provides little insight into the problem or the proposed model.
The setup where labeled data (c) is available also seems a bit unnatural (this also seems to be confirmed by the fact that the authors had to build datasets for the problem). Perhaps the authors could give examples of situations where this would naturally arise. In practice, it seems difficult to obtain these data for all required variables to be disentangled.

The unsupervised results are more interesting but not very much explored (a single set of sampled faces). I was also curious as to why the learned Y's are blurry. This sort of two-stage generation is also potentially interesting; I was wondering if the authors had ideas to generalize this idea.

I also was not convinced by the experiments which are mostly qualitative. I did not find that this set of experiments provides enough support for the proposed method.


Detailed comments:
- It is a bit unclear to me how the authors propose to obtain independent posteriors over z and c. Is it purely empirical or is there a formal reason that guarantees it?

- Some of the figures you report are compelling, but it is a bit unclear to the reader if the results are general (e.g., the examples could have been hand-picked). Are there any quantitative measures you could provide (in addition to Tables 1 and 2, which don't measure the quality of the approach)?

- Comparing to CGAN seems reasonable, but given the task at hand, it seems like other methods could have been tried (although I do realize that no one may have done this before for deep generative models).



Other comments:
- In Figure 3, it would be good to label the upper trapezoid.

- Some paragraphs are very long and the manuscript may benefit from segmenting them into multiple paragraphs.",3,,ICLR2020
LtqlSMX4WBH,2,EKw6nZ4QkJl,EKw6nZ4QkJl,Combination of rule learning and embedding learning for knowledge reasoning,"The work utilizes relational background knowledge contained in logical rules to conduct multi-relational reasoning for knowledge graph (KG) completion. This is different from the superficial vector triangle linkage used in embedding models. It solves the KG completion task through rule-based reasoning rather than using rules to obtain better embeddings. Experiments on FB15K, WN18, and a new dataset FB15K-R demonstrate the effectiveness of the proposed model EM-RBR.

The work is well motivated. As for KG completion, embedding-based methods usually fail to predict long-tail relations, while rules can alleviate the problem. Hence, the work proposes to use a rating mechanism combined with a rule-based reasoning process for ranking test triplets. Generally, the technique sounds reasonable.

A minor question:

1) In Algorithm 1, why is the state pushed into the queue only when H_snew < L_scur? My understanding is that if H_snew >= L_scur, and we already know L_snew >= H_snew, then we can get L_snew >= L_scur. In this case, it is not necessary to push the state into the queue. Right? I think the authors should further clarify this.

Some suggestions:

1) The paper demonstrates that the proposed EM-RBR(X) performs better than the X model, but I think it is better to compare with more models that utilize rules for KG completion.

2) FB15k and WN18 are too old and suffer from the test data leakage issue. Recent studies for KGC no longer use FB15k and WN18. I would suggest the authors try FB15k-237 and WN18RR. Besides, I found that if many of the test triples cannot be matched by rules, the performance improvement would be subtle.
I think in sparse KGs like WN18RR or FB15k-237, the proposed model will also suffer this problem, as the rules mined by AMIE will also be sparse. + +3) It is better to provide a rule-based reasoning path that can explain why the inferred triple is true. I think such a case study is meaningful. + +Minor mistakes (my point of view): + +1) On page 3, it says that ""The initial state is the target triplet itself, whose H is 1 and L is its score under an embedding model"", and at the footnote of page 3, it says that ""Φ is defined as Definition 2.2 and initialized as L_{s0}"". However, Algorithm 1 of Appendix B initializes H and L as 1. And in section 2.3 (page 4), it says that ""The initial state s0 only contains one triplet (h, r, t), and its state score and heuristic score are both 1"". I think it is very confusing. + +2) On page 4 (Section 2.2.2), it says that ""kB1 + B2 − Hk ∈ R 3*k"". I think it may be wrong. + +3) I think Algorithm 1 of Appendix B lacks the procedure to handle the state whose triples are all in the KG. Just like in Figure 2 (I), I don't know which part in Algorithm 1 is used to update \phi using L_s2. + +4) In Table 3, the s3 state is {(h, B5, m3) (m3, B6, m1)* (m1, B2, t)}. And in Table 5, the rule 5 (r5) is ""B9(x, z) ∧B10(z, y) ⇒ B5(x, y) r5"". Based on these, I think the example extension in Figure 2 (III) may be wrong. + +Overall, I think this paper is interesting, but it needs further improvement. + +-- after rebuttal -- + +Thank the authors for their responses. ",4,4.0,ICLR2021 +BJN6F-dBx,3,Bk8BvDqex,Bk8BvDqex,Using metacontroller optimization produces more efficient learning on one-shot control task,"Pros (quality, clarity, originality, significance:): + +This paper presents a novel metacontroller optimization system that learns the best action for a one-shot learning task, but as a framework has the potential for wider application. The metacontroller is a model-free reinforcement learning agent that selects how many optimization iterations and what function or “expert” to consult from a fixed set (such as an action-value or state transition function). Experimental results are presented from simulation experiments where a spacecraft must fire its thruster once to reach a target location, in the presence of between 1 and 5 heavy bodies. + +The metacontroller system has a similar performance loss on the one-shot learning task as an iterative (standard) optimization procedure. However, by taking into account the computational complexity of running a classical, iterative optimization procedure as a second “resource loss” term, the metacontroller is shown to be more efficient. Moreover, the metacontroller agent successfully selects the optimal expert to consult, rather than relying on an informed choice by a domain-expert model designer. The experimental performance is a contribution that merits publication, and it also exhibits the use of an interaction network for learning the dynamics of a simulated physical system. The dataset that has been developed for this task also has the potential to act as a new baseline for future work on one-shot physical control systems. The dataset constitutes an ancillary contribution which could positively impact future research in this area. + +Cons: +It's not clear how this approach could be applied more broadly to other types of optimization. Moreover, the REINFORCE gradient estimation method is known to suffer from very high variance, yielding poor estimates. 
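The standard mitigation I have in mind is subtracting a learned baseline from the return before forming the score-function gradient, roughly as in the generic sketch below (not the authors' implementation; names are invented):

```python
def reinforce_with_baseline_loss(log_prob_action, reward, baseline_value):
    # log_prob_action: log pi(a|s) of the sampled action (a differentiable tensor)
    # reward:          observed return for that action
    # baseline_value:  b(s) predicted by a learned value head (differentiable tensor)
    # Subtracting the baseline leaves the policy gradient unbiased but reduces variance.
    advantage = reward - baseline_value.detach()
    policy_loss = -(advantage * log_prob_action)
    baseline_loss = (reward - baseline_value) ** 2  # fit the baseline by regression
    return policy_loss + 0.5 * baseline_loss
```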
I'm curious what methods were used to ameliorate these problems and if any other performance tricks were necessary to train well. Content of this type could form a useful additional appendix.

A few critiques on the communication of results:

- The formal explication of the paper's content is clear, but Figs. 1A and 3 could be improved. Fig. 1A is missing a clear visual demarcation of what exactly the metacontroller agent is. Have you considered a plate or bounding box around the corresponding components? This would likely speed the uptake of the formal description.

- Fig. 3 is generally clear, but the lack of x-axis tick marks on any subplots makes it more challenging than necessary to compare among the experts. Also, the overlap among the points and confidence intervals in the upper-left subplot interferes with the quantitative meaning of those symbols. Perhaps thinner bars of different colors would help here. Moreover, this figure lacks a legend and so the different lines are impossible to compare with each other.

- Lastly, the second sentence in Appendix B.2 contains a typo and terminates without completion.",8,3.0,ICLR2017
HkeB8Ry92X,1,rkVOXhAqY7,rkVOXhAqY7,"Interesting approach with decent results, but far lacking in related works on mutual information","Update: see comments ""On revisions"" below.

This paper essentially introduces a label-dependent regularization to the VIB framework, matching the encoder distribution to one computed from labels. The authors show good generalization performance, and their approach is relatively robust in a number of tasks, such as adversarial defense.

The idea I think is generally good, but there are several problems with this work.

First, there have been recent advances in mutual information estimation, first found in [1]. This is an important departure from the usual variational approximations used in VIB. You need to compare to this baseline, as it was shown that it outperforms VIB in a similar classification task as presented in your work.

Second, far too much space is used to lay out some fairly basic formalism with respect to mutual information, conditional entropy, etc. It would be nice, for example, to have an algorithm to make the learning objective more clear. Overall, I don't feel the content justifies the length.

Third, I have some concerns about the significance of this work. They introduce essentially a label-dependent “backwards encoder” to provide samples for the KL term normally found in VIB. The justification is that we need the bottleneck term to improve generalization and the backwards encoder term is supposed to keep the representation relevant to labels. One could have used an approach like MINE, doing min information for the bottleneck and max info for the labels.
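Concretely, the kind of alternative objective I have in mind would look roughly like this (a loose sketch with invented names; each mutual-information term would be estimated with a MINE-style critic rather than computed exactly):

```python
import math
import torch

def dv_mutual_information(critic, a, b):
    # Donsker-Varadhan lower bound on I(A; B) with a critic network T(a, b), as in MINE:
    # joint samples are the paired rows; product-of-marginals samples are obtained
    # by shuffling b within the batch.
    joint = critic(a, b).mean()
    b_shuffled = b[torch.randperm(b.size(0))]
    marginal = torch.logsumexp(critic(a, b_shuffled), dim=0) - math.log(b.size(0))
    return joint - marginal

def ib_style_objective(critic_zy, critic_xz, x, y, z, beta=1e-3):
    # Maximize label information while penalizing input information (to be maximized).
    return dv_mutual_information(critic_zy, z, y) - beta * dv_mutual_information(critic_xz, x, z)
```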
If so, is it primarily a heuristic that they will ""contain state-action pairs with similar returns"" (i.e. couldn't a reward achieved mid-segment make this statement incorrect)? As I understand it, segmenting avoids the problem of determining bins on the return distribution a-priori, however it also seems like it will limit the agent's ability to cluster non-local (s,a) pairs with the same returns. It might also mean that the agent is learning to cluster temporally adjacent states in the underlying state space rather than similar returns. + +**Originality** + +The paper builds on existing work in the abstraction literature and auxiliary tasks for deep RL. The primary novel component is using a return-based auxiliary task. + +The Z_\pi abstraction framework also appears to be novel, although its closely related to existing abstractions like Q_\pi abstraction. + +**Significance** + +The RCRL model itself improves on existing model-free approaches and can be easily incorporated into many model-free architectures, although it seems unlikely to beat a strong model-based baseline like MuZero in the low-data regime. + +The description of the Z_\pi abstraction, and the exploration of return-based auxiliary tasks in general, could prove more significant in the long term. + +**Pros:** +- The model improves performance in the low-data regime over existing model-free baselines +- The model can be easily added to many existing architectures +- Description and theoretical results on a new type of abstraction + +**Cons:** +- The paper needs some more clarity around the focus on low-data / sample efficiency and how applicable the model is to higher data regimes +- Unclear if the segment-based sampling strategy is clustering (s,a) pairs with similar returns or just states that are nearby in the underlying state space +- The model seems unlikely to improve on a stronger model-based baseline in the low-data regime +",7,3.0,ICLR2021 +1R222uVdrJG,3,wWK7yXkULyh,wWK7yXkULyh,A principled approach to updating LSH based ANNS for faster matrix multiplication in NN training,"Some neural network runs involve layers with a large number of neurons. These require large matrix-vector or matrix-multiplication which can slow their training/inference. However, if the output of mat-vec/mul is dominated by a few neurons with which the activation has large inner product (a matmul can be thought of as a weighted sum of inner products), then the computation can be sped up by approximating mat-vec/mul by a limited weighted sum with the dominant terms. This requires maintaining an ANNS data structure that is up to data with the back prop. These updates to ANNS have to be done carefully -- too frequent and the training will slow down, or too infrequent and the results of the mat-mul are way off. This paper studies how to do this in a principled way using data-dependent LSH updates and backs it up with experimental data. + +The ideas and algorithms explored in the papers are as follows: +1. The weights changed in a limited manner over time, so it should be possible to take advantage of this. +2. Concentrated changes to a subset of weights can be tracked and patched upon. +3. An LSH update rule to make these changes effectively +4. An earlier algorithm that is reused to decide when the LSH scheme is updated. + +The paper also talks about how to identify the layers that benefit the most from this scheme. 
then it goes on to show the training time benefits of the smart scheduler and the update scheme on extreme classifications tasks as well as transfomers. + + +Few questions: +1. Why no serious consideration of graph based ANNS? They are data dependent and SOTA for search efficiency and it is possible to update them. Why is LSH the better choice for ANNS here? This needs a rigorous argument. +2. Is this really a general and serious enough problem amongst practitioners that a solution merits publication at a popular conference? It might be, in which case a better quantification of potential impact can help. +3. Is wiki-325K really the largest dataset for XC? What about larger language models -- sec 4.1.2 seems to study more medium sized networks. Larger scale experiments could make this paper more compelling. + +I would more strongly recommend this paper if these questions can be addressed.",7,4.0,ICLR2021 +0gVhohYtfQV,3,V69LGwJ0lIN,V69LGwJ0lIN,"While interesting, baselines are weak and OPAL does not directly address problems in offline RL","Summary +------- + +To best leverage diverse datasets of logged experience, the authors +propose to extract a space of primitive behaviors in a continuous space, +and to use these for downstream learning. The primitive behaviors are +learned through a VAE loss, and CQL is applied to learn a policy over +the primitives. The authors claim that this approach to offline RL +avoids known distribution shift and allows for temporal abstraction. The +method is also applied to few-shot imitation learning, exploration and +transfer to online RL. + +Decision +-------- + +Despite this paper being an interesting read, I feel that my concerns +about the experiments lead me to not be confident in the proposed +approach. As such, my preliminary rating for the paper is ""Okay but not +good enough"". The baselines chosen for the experiments do not seem +representative of the problem being addressed. This also leads into the +motivation, where OPAL is motivated from the offline RL perspective but +does not explicitly mitigate the issues in offline RL. Only the +theoretical results investigate the effect of a fixed replay-buffer, but +even these claims are framed in terms of what $\mathcal{D}$ should be to +ensure downstream RL. + +Originality +----------- + +The proposed approach seems limited in its novelty, combining approaches +from skill-discovery, hierarchical RL and somewhat from offline RL. The +theory section is interesting, however the technical arguments seem +similar to that of Nachum et al. (2018). + +Quality and Clarity +------------------- + +The overall quality of the paper is very high. The writing is clear, and +the theoretical arguments are put into an easily understandable context. + +Strengths +--------- + +- OPAL leverages techniques from many areas of machine learning: + unsupervised learning, hierarchical RL and offline RL. This is an + interesting combination, and deserves to be investigated. It seems + like the workhorse of OPAL is the unsupervised learning component + however, and perhaps this should be emphasized and investigated + independently. +- The theoretical section is very clear and well contextualized. The + claims in the paper seem correct, and they do provide much needed + validation for hierarchical RL. The assumption of an optimal + high-level controller seems strong, but the analysis is nonetheless + interesting. + +Weaknesses +---------- + +- Despite Figure 2, OPAL is difficult to disentangle with many moving + parts. 
It doesn't help that there is no discussion of how + hyperparameters were chosen for each component, or an ablation study + to investigate different hierarchical RL approaches, fine-tuning, + offline RL algorithms, or VAE losses. +- The proposed method is not specifically designed to leverage + anything particular in offline RL. As you show, it can be applied to + online RL. However, why is this motivated from the offline + perspective? OPAL itself does not seem to mitigate distributional + shift. +- It would be helpful to directly compare to other skill-discovery + methods that have been applied to online RL but can be adapted to + offline RL. For example, what prevents the work by Campos et + al. (2020) and the baselines therein to be applied in conjunction + with an offline RL algorithm? The baselines for the experiments do + not seem representative of the problem you are addressing. You use + standard offline RL algorithms, yet claim that the temporal + abstraction of skill discovery is crucial to the results. As such, + you should compare OPAL to other skill-discovery algorithms that can + be combined with offline RL. + +Detailed Comments +----------------- + +- Section 4.1: "" Prior: ρω(z|s0) tries to predict the encoded + distribution of the sub-trajectory.."" Prior doesn't seem like the + right word here, since it is being learned. +- Section 4.2: In what way is the task-specific policy $\pi_\psi$ + different from the learned prior in Section 4.1? Both are + distributions over latent $z$ conditioned on a state. While the + prior is designed to only be conditioned on the initial state, this + is still quite similar to the high-level policy $\pi_\psi$. +- Section 4.2: ""\[to\] ensure that the c-step transitions remain + consistent … with the labelled latent action $z_i$"" Why is this + necessary, should CQL ensure consistency at the action-level, while + the latent actions are consistent by design? +- Corollary 4.1.1: How would $\epsilon$ actually approach + $\mathcal{H}_c$ in the algorithm? The condition is never explicitly + enforced because you formulate it as a constrained optimization + problem and you are never able to change $\mathcal{H}_c$ since it is + a constant. +- Section 5.1, baselines: shouldn't baselines be compared to different + unsupervised skill discovery algorithms, paired with offline RL + algorithms? BEAR and EMAQ are offline RL algorithms without any + temporal abstraction. And as you say in the results section with an + ablation study, this temporal abstraction is crucial. +- Section 5.1, results: An ablation study is discussed but I cannot + find the corresponding results table/figure in the paper. +- Section 5.2, Table 3: With so many 0.0's, this leads me to believe + that the baselines are quite weak, or the problem is too hard for + the baselines. How were the baselines chosen. For example, why was + DDCO paired with DDQN instead of SAC? + +Minor Comments +-------------- + +- Section 5.2 mis-capitalized We ""in this setting We use the Antmaze + environ-"" +- Section 5.3, Table 4: why are there no standard errors for SAC? + +Post Rebuttal +-------------- +After discussion with the authors, I have decided to increase my score to a 6. The authors have addressed many of my concerns with respect to motivation, theoretical analysis and empirical evidence. As it stands, I still think OPAL is hampered by the many ""moving parts"" involved. 
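To spell out what I mean by the moving parts: even the first, unsupervised stage couples three learned components, roughly along the lines of the sketch below (my own reconstruction with invented names and a simplified reconstruction term; the actual objective may differ), and that is before the high-level CQL stage and any downstream fine-tuning.

```python
import torch
import torch.nn.functional as F

def primitive_vae_loss(pred_actions, actions, z_mean, z_logstd, prior_mean, prior_logstd, kl_weight=0.1):
    # pred_actions:             low-level decoder outputs pi(a | s, z) over the c-step window
    # actions:                  logged actions to reconstruct (continuous; MSE used for simplicity)
    # z_mean, z_logstd:         trajectory encoder q(z | sub-trajectory)
    # prior_mean, prior_logstd: state-conditioned prior rho(z | s0)
    recon = F.mse_loss(pred_actions, actions)
    kl = (prior_logstd - z_logstd
          + (torch.exp(2 * z_logstd) + (z_mean - prior_mean) ** 2) / (2 * torch.exp(2 * prior_logstd))
          - 0.5).sum(-1).mean()
    return recon + kl_weight * kl
```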
The theoretical analysis and empirical evidence suggests a very effective approach however, and should lead to further work combining variational sequence encoding techniques for primitive extraction in RL. My decision would be to ""accept the paper if there is room"". + +References +-------------- +Campos, Víctor, Alexander Trott, Caiming Xiong, Richard Socher, Xavier Giro-i-Nieto, and Jordi Torres. 2020. “Explore, Discover and Learn: Unsupervised Discovery of State-Covering Skills.” *arXiv:2002.03647*. +. + +Nachum, Ofir, Shixiang Gu, Honglak Lee, and Sergey Levine. 2018. “Near-Optimal Representation Learning for Hierarchical Reinforcement Learning.” *arXiv:1810.01257*. . + +",6,4.0,ICLR2021 +rkAEA0NFB,1,ByxXZpVtPB,ByxXZpVtPB,Official Blind Review #1,"The paper proposes a new faster algorithm to add inequality constraints to neural layers. The paper focuses on a novel constraining approach with seemingly superior scalability, and this is potentially a significant contribution. However the paper does not motivate the constraining at all. I am baffled by this, since one would assume at least some benefits from all of this work could be presented. The only mentions are binarization of the predictions (which softmax already does), and monotonicity/convexity of neurons, with no proposed benefits. The running example of the paper is the chessboard constraint, which is either pointless (fig1) or harmful (fig5). Without justification and motivation the method has no merit and won’t have any impact in the machine learning community. + +The monotonicity constraint could have a huge impact for MCMC sampling of neural parameters since it can reduce away all multimodalities of the posterior caused by reordering nodes or layers. + +I had hard time following the method, and I its not clear how the neural network is modified and how backpropagation is performed with the contraints. It is not defined properly how the constrained optimisation works. Apparently additional neural layers are added that map z's to r's. The backpropagation in the constrained case is undefined. Here an algorithm box or schematic figure comparing unconstrained and constrained NN architectures would be extremely helpful. It’s also not explained how are modelling/domain constraints different. + +The paper does not compare to the earlier constrained methods (Marquaz-Neila or OptNet), and thus there is no demonstration of the methods claimed superior computational efficiency. The paper also does not make very clear the different constraining approach advantages and tradeoffs. A comparison table would be help a lot. + +The method is interesting, novel and seemingly efficient; but it is insufficiently defined, the method is not motivated and experiments are quite weak with little comparisons and no experiments with practical value. +",1,,ICLR2020 +SJeYoULs2m,3,H1lo3sC9KX,H1lo3sC9KX,Interesting paper but the contribution seems not be good enough,"Overall, this paper is well written and clearly present their main contribution. +However, the novel asynchronous distributed algorithm seems not be significant enough. +The delayed gradient condition has been widely discussed, but there are not enough comparison between these variants. +",5,4.0,ICLR2019 +xBaCrn6HGLp,3,ClZ4IcqnFXB,ClZ4IcqnFXB,Feature acquisition method that relies on RL with auxiliary rewards,"This paper operates in the setting where there is a possibility to adaptively acquire features for the prediction on each datapoint. 
The authors study classification, regression, time series and feature completion problems. The proposed solution relies on RL and introduces additional hand crafted rewards and features. + +Strong points: +- The paper tries to solve an important problem that deserves additional attention from the researchers. +- The experimental results are promising and the proposed method outperforms the baselines in several settings. The experimental results include several datasets and types of the problems. +- The approach has shown itself being quite generic in a sense that it is applicable to different prediction problems, such as classification, regression, time series and feature completion. + +Weak points: + +- The datasets for the experiments are rather simple and have only a handful of dimensions (except for MNIST, but even MNIST is downsampled to 16x16). It makes me wonder how the proposed method would scale to high-dimensional realistic datasets. + +- The approach is rather heuristic and relies heavily (as shown in the ablations) on the hand-crafted features for the state representation and engineered rewards. Another heuristic part is the prediction model f_\theta that is sometimes used to make predictions and sometimes not. + +- In the experiments with MNIST the features are the different pixels in the images. However, this is a very special type of the features and many methods that rely on the inductive bias about images could be used in addition. As it is the most complex experiment in this paper, I think the related works that are specific for images could be mentioned. For example, see [1] (but there are many works in this problem setting). + +- The auxiliary loss used in the paper seems to be quite intuitive and helps to eliminate the problem of sparse rewards. However, I am wondering if this auxiliary reward actually modifies the optimal solution to the original MDP [2]. + +- The experimental section is quite short and does not provide empirical evidence towards understanding of the benefits of the proposed method. For example, the authors state that the approach learns a ""non-greedy policy"". It would be informative to see how different the queried features are from the greedy selection. Then, the authors mention the superiority of the adaptive feature acquisition over selecting a fixed set of features for all datapoints. While the argument seems to be very logical, it would be nice to see the empirical confirmation of this and how big the gain of the proposed adaptive approach is. + +- While the paper is reasonably well written at the sentence level, the structure of the paper is a bit hard to follow. For example, the abstract and introduction go into many details about the method, which are hard to understand at that point. Then, it feels that the methods section repeats a lot of things that are already described (at the same level of details). The experimental section is on the contrary very short (for example, it does not even mention what the dataset for time series experiments is and refers directly to the appendix). + +I am leaning towards the rejection of this paper. While I appreciate the challenges of the studied problem and the proposed solution based on RL, I think the paper would benefit from 1) better motivation of the proposed techniques, 2) more experimental analysis to support the claims of the paper, 3) experimental evaluation on more complex datasets, and 4) some restructuring to improve the reader's understanding. 
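To be explicit about the setting I evaluated: the protocol being optimized is essentially the following acquisition loop (my own paraphrase in code, with hypothetical interfaces); my concerns are about how the acquisition policy and its rewards are constructed, not about this basic loop.

```python
import numpy as np

def acquire_then_predict(x_full, policy, predictor, cost_per_feature=0.01, budget=None):
    # x_full:    complete feature vector; entries are only revealed when acquired
    # policy:    maps (observed_values, mask) to a feature index to acquire, or 'predict'
    # predictor: maps (observed_values, mask) to a prediction
    mask = np.zeros(x_full.shape, dtype=bool)
    observed = np.zeros_like(x_full)
    cost = 0.0
    while True:
        action = policy(observed, mask)
        if action == 'predict' or (budget is not None and cost >= budget):
            return predictor(observed, mask), cost
        mask[action] = True
        observed[action] = x_full[action]
        cost += cost_per_feature
```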
+ +Questions: +- What is the computational complexity of this method and how would it scale to more realistic datasets with many features? +- Could the authors elaborate on what effect the auxiliary loss has on the original MDP? +- Could the authors elaborate on the prediction function f_\theta and explain when it is used and when not and why? +- I am not entirely convinced that using the proposed ""surrogate model"" shares a lot in common with model-based RL methods. For example, ""surrogate model"" does not encode much about the transition dynamics in the environment. Could the authors elaborate more on this? + +Additional comments: +- I am not completely convinced by the argument with ""limited power"" of the sensors in time series predictions, especially in the context of the potential computational cost of the method. +- Scales in figures vary, maybe the most informative way of selecting the scale would be to put the upper/lower bound that a method with all features can achieve? +- I didn't understand how varying \alpha is reflected in the plots. + +[1] Learning to Look Around: Intelligently Exploring Unseen Environments for Unknown Tasks, Dinesh Jayaraman, Kristen Grauman. CVPR, 2018. +[2] Policy invariance under reward transformations: Theory and application to reward shaping, Andrew Ng, Daishi Harada, Stuart Russell. ICML, 1999. + +--- After reading the authors' response --- + +The authors' response addressed some of my concerns. However, I could see that some of the concerns regarding novelty, relation to the prior work and experimental results are shared among several reviewers. Thus, I keep my original score and I believe that the revised manuscript would benefit from another round of reviewing.",5,4.0,ICLR2021 +HkeB8Ry92X,1,rkVOXhAqY7,rkVOXhAqY7,"Interesting approach with decent results, but far lacking in related works on mutual information","Update: see comments ""On revisions"" below. + +This paper essentially introduces a label-dependent regularization to the VIB framework, matching the encoder distribution of one computed from labels. The authors show good performance in generalization, such that their approach is relatively robust in a number of tasks, such as adversarial defense. + +The idea I think is generally good, but there are several problems with this work. + +First, there has been recent advances in mutual information estimation, first found in [1]. This is an important departure from the usual variational approximations used in VIB. You need to compare to this baseline, as it was shown that it outperforms VIB in a similar classification task as presented in your work. + +Second, far too much space is used to lay out some fairly basic formalism with respect to mutual information, conditional entropy, etc. It would be nice, for example, to have an algorithm to make the learning objective more clear. Overall, I don't feel the content justifies the length. + +Third, I have some concerns about the significance of this work. They introduce essentially a label-dependent “backwards encoder” to provide samples for the KL term normally found in VIB. The justification is that we need the bottleneck term to improve generalization and the backwards encoder term is supposed to keep the representation relevant to labels. One could have used an approach like MINE, doing min information for the bottleneck and max info for the labels. 
In addition, much work has been done on learning representations that generalize using mutual information (maximizing instead of minimizing) [2, 3, 4, 5] along with some sort of term to improve ""relevance"", and this work seems to ignore / not be aware of this work. + +Overall I could see some potential in this paper being published, as I think the approach is sensible, but it's not presented in the proper context of past work. + +[1] Belghazi, I., Baratin, A., Rajeswar, S., Courville, A., Bengio, Y., & Hjelm, R. D. (2018). MINE: mutual information neural estimation. International Conference for Machine Learning, 2018. +[2] Gomes, R., Krause, A., and Perona, P. Discriminative clustering by regularized information maximization. In NIPS, 2010. +[3] Hu, W., Miyato, T., Tokui, S., Matsumoto, E., and Sugiyama, M. Learning discrete representations via information maximizing self-augmented training. In ICML, 2017. +[4] Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Trischler, A., & Bengio, Y. (2018). Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670. +[5] Oord, Aaron van den, Yazhe Li, and Oriol Vinyals. ""Representation learning with contrastive predictive coding."" arXiv preprint arXiv:1807.03748 (2018).",6,3.0,ICLR2019 +H9U-XB0g8W,5,UOOmHiXetC,UOOmHiXetC,rather weak paper,"summary: +This paper introduces Shoot Tree Search (STS), a planning algorithm that performs a multi-step expansion in Monte-Carlo Tree Search. Standard MCTS algorithms expand the search tree by adding one node to the tree for each simulation. In contrast, the proposed STS adds multiple nodes to the search tree at each simulation, where each node corresponds to the state and action that are encountered during rollout. By multi-step expansion, the evaluation of the trajectory is less-biased, which can be analogous to n-step TD. In the experiments on Sokoban and Google research football domains, STS outperforms baselines that include Random shooting, Banding shooting, and MCTS. + + +Overall, my main concerns are technical novelty and presentation quality. + +The most common MCTS methods assume that the leaf node is expanded one at a time in each simulation (and its evaluation is performed either by rollout policy or by function approximator), but this common practice does not necessarily mean that MCTS should always do that. The main reason for only expanding one node per simulation in standard MCTS is memory efficiency: if we fully expand the rollout trajectory and retain its information to the search tree, we may get slightly more accurate value estimates. However, the nodes located deep in the tree will not be visited more than once in most cases, thus its effect is usually not significant, leading to the common practice of one-step expansion. More importantly, multi-step expansion has already been used in existing works (e.g. in [1], the tree is expanded by adding the whole rollout trajectory), thus I am not convinced that this work introduces a technical novelty. + +It seems that the relative benefit of the STS over MCTS observed in the experiments comes from the bias of the value function approximator. However, to show the effectiveness of 'multi-step' expansion compared to 'single-step' expansion, I think that more thorough ablation experiments should have been conducted. For example, we can consider the setting where both STS and MCTS perform leaf-node evaluation (i.e. 
UPDATE in Algorithm 5) by executing rollout policy rather than by using value function approximator. By doing so, we can focus only on the benefits of STS's retaining information of full rollout trajectory (i.e. multi-step expansion), compared to MCTS's retaining one-step information (i.e. single-step expansion) while eliminating the effect of biased value function estimation. +To relieve too much bias in the current MCTS's leaf node evaluation, mixing MC return of rollout policy and the output of the value network could also have been considered, as in AlphaGo (Silver et al. 2016). It would be great to see if STS still has advantages over MCTS in various leaf node evaluation situations. + +Also, more writing effort may be required, and the current version of the manuscript seems premature to be published. There are some unclear or questionable parts. +- Algorithm 3 and Algorithm 4 are not the contributions of this work, thus they can be removed or moved to the Appendix. Instead, more discussions regarding the proposed method should have been placed in the main text. +- In Algorithm 2: the definition of CALCULATE_TARGET is missing. +- In Algorithm 5: In SELECT, the tree policy is defined by CHOOSE_ACTION that selects purely greedy action. If this describes the MCTS used in the experiments, I would say this is wrong. To make MCTS be properly working, an in-tree policy that balances exploration vs. exploitation is required (e.g. a classical choice is UCB rule). +- In Algorithm 6: In UPDATE, $N(s,a)$ and $quality$ are increased by $c$ times more, which means that the longer rollout length, the more weight is given. What is the reason for assigning more weight to the trajectory that has a longer rollout length? If the entire planning horizon is limited to finite length, this means that early simulations (short $path$ length, long $rollout$ length) have more weight than later simulations (long $path$ length, short $rollout$ length), but I do not think this is desirable. Is my understanding correct? +- For the Sokoban experiments, the pre-trained value function would significantly affect the performance of MCTS and STS, but I could not find the way how the value function was pre-trained. +- In Appendix A.2., the hyperparameters for Shooting and STS are very much different. Why did you set Shooting's hyperparameter differently from STS (e.g. VF zero-initialization, action sampling temp, etc.)? +- It seems that the choice of zero-initialization of the value network is rather arbitrary. I am not convinced that this would always work better. In some situations, optimistic initialization of the value network may be helpful to encourage exploration of the uncertain state regions. +- In Table 2, Why does RandomShooting-PPO underperform PPO? Since RandomShooting-PPO puts additional search efforts upon PPO, I expected that RandomShooting-PPO must outperform PPO. +- Table 5 could have been moved to the main text, replacing Table 2. + +[1] Soemers et al., Enhancements for Real-Time Monte-Carlo Tree Search in General Video Game Playing, 2016 IEEE Conference on Computational Intelligence and Games (CIG 2016) +",3,4.0,ICLR2021 +bitzfW4zgM_,2,pHgB1ASMgMW,pHgB1ASMgMW,"Review of ""Rethinking Uncertainty in Deep Learning: Whether and How it Improves Robustness ""","# Summary +This paper investigates the complementary mechanisms of adversarial training and uncertainty promoting regularizers. In the field of adversarial machine learning, adversarial training as proposed by Madry et al. 
2017 has been the common method. In the field of uncertainty regularization, maximum entropy and label smoothing have been the accepted methods. However, the combination of both has not been investigated before. The paper provides extensive experiments on these methods. The final section gives insights into the theory behind adversarial training and uncertainty promoting regularization, where they show that the combined method increases a notion of normalized margin and a notion of adversarial robustness.

# Strong & Weak points

## Strong points

 * The related work section contains an extensive overview of related and contemporary literature, where the literature on both adversarial ML and uncertainty promotion is discussed.
 * Ablation experiments in Figures 1 and 2 show that combining adversarial training and uncertainty promoting regularizers yields better accuracy under a range of attack settings.
 * Theoretical insights as to why Entropy Maximization would help improve adversarial robustness, complementary to adversarial training, are provided in Section 6.

## Weak points

 * The method is studied on small datasets (CIFAR, MNIST, SVHN) using small models (ResNet18). It remains to be seen how these insights translate to larger datasets (ImageNet) and larger models (ResNet50).
 * In the abstract and introduction, the text speaks of “true” robustness, against “strong” attacks, but these terms are never defined.
 * The increase in (approximate) normalized margin when using entropy maximization as shown in Table 4 provides an important argument for the claim of this paper. However, the numbers were obtained using the CIFAR10 dataset and a ResNet18 model. A stronger case could be made with a larger dataset and a larger model.

# Statement

Recommendation: 6

Reasons
 * The claims for a complementary benefit of adversarial training and uncertainty promoting regularization are backed up with both extensive experiments and theoretical insights.
 * The fields of adversarial training and uncertainty regularizers have been evolving separately, and this paper provides initial insights into how these two lines of research can be combined. [Comment: I am not 100% up to date with the related literature, so I'll be looking to other reviewers to see if they are aware of existing work combining uncertainty regularization and adversarial training.]
 * Ablation experiments in Figure 3 show support for the complementary actions of Entropy Maximization and Adversarial training. Entropy maximization can be shown to increase a notion of “normalized margin width”, adversarial training can be shown to increase a notion of adversarial robustness, and when combining the methods, both metrics increase.

# Additional questions:

Table 3a) misses the point cloud for normal training. How do the normalized margins and adversarial distances of normal training compare in this plot? Table 4 already shows that the normalized margin (0.19) is smaller than that of the three methods, but I miss the numbers for adversarial distances during normal training.

# Minor feedback

These points are minor feedback and not part of the assessment.

 * KL divergence can be decomposed as $KL(p\|q) = CE(p,q) - H(p)$, i.e. cross-entropy minus entropy. Could this decomposition explain the differences between TRADES, PAT, and entropy maximization?

 * Figures 1 & 2: please provide labels for the x axes.
 * None of the derivations in eqns. 9 to 13 depend on $\theta$. 
Consider dropping the $\theta$ variables everywhere to focus the analysis on what really matters: the derivatives w.r.t. $x$. + * Equation 12: the $\delta$ variable is only implicitly defined. It is not clear to me if its direction is parallel or orthogonal to the decision boundary. + +",6,2.0,ICLR2021 +uaGz5M7KT-,2,EsA9Nr9JHvy,EsA9Nr9JHvy,Reviews for The Heavy-Tail Phenomenon in SGD,"This paper studies the relations between the heavy tail phenomenon of SGD and the ‘flatness’ of the local minimum found by SGD and the ratio of the step size $\eta$ to the batch size $b$ for the quadratic and convex problem. They show that depending on the curvature, the step size, and the batch size, the iterates can converge to a heavy-tailed random variable. They conduct experiments on both synthetic data and fully connected neural networks, and illustrate that the results would also apply to more general settings and hence provide new insights about the behavior of SGD in deep learning. +I do not have time to check the proofs and are not familiar with this topic. However the results seem to be novel and interesting. + +pros: 1, They characterize the empirically observed heavy-tailed behavior of SGD with respect to the step size, mini batch size, dimensions, and the curvature, with explicit convergence rates. + +cons:1, The theoretical analysis is only for the quadratic and convex function. The input data is Gaussian, which is quite restrictive. + +After the rebuttal. + +The authors addressed my concerns. I have read other reviewers' comments. I decide to remain the current score.",6,3.0,ICLR2021 +nnkihBavup,1,D3PcGLdMx0,D3PcGLdMx0,Review of MELR: META-LEARNING VIA MODELING EPISODELEVEL RELATIONSHIPS FOR FEW-SHOT LEARNING,"Summary of Paper: +This paper proposes to improve Prototypical Networks by a method MELR which aims to fix the problem caused by poorly represented classes in sampled episodes and to reap benefits from enforcing cross-episode consistency. The proposed method achieves State of Art results on commonly used benchmarks miniImageNet and tieredImageNet. + +Reasons for score: +Overall, I tend towards rejecting this paper. I think cross-episode relationship is an important topic of study in the context of few-shot learning, and the experimental results in this paper are strong. However, I am unconvinced that this ‘hammer’ is the right tool for the ‘nail’ that the authors claim to solve. A more thorough study of the motivating problem and how & why MELR works would greatly improve this paper. + +Pros: +1.The proposed method performs well in standard benchmark datasets miniImageNet, tieredImageNet, and CUB200. Assuming correctness of experimental protocol, the improvement over previous methods is significant. +2. Visualization of embedding space by t-SNE lends further credibility to the strong performance. Samples from each class are well clustered yet still disperse. +3. Ablation of hyperparameters and algorithmic alternatives is mostly complete and honestly presented. + +Cons: +1. The paper is motivated by the supposed “poorly sampled episodes” problem. While it intuitively makes sense that some data points are more representative than others, whether this has a disparate impact on episode few-shot learning is unclear. In my opinion, the small sample problem in few-shot episodes is no worse than that encountered in standard batch training. In standard supervised learning tasks, batch size as small as 1 has been used successfully to train deep networks given sufficient training epochs. 
Without empirical or theoretical illustration, I am not convinced that the problem the authors seek to address is a real problem. +2. The reasoning behind equation 2 is unclear. It is not clear why the attention module output in CEAM is added to the embedding F if the goal is to ignore bad examples and emphasize good examples. The choice to ignore class labels when doing attention is also surprising as it doesn’t use the fact that both episodes have the same classes. +3. The proposed method aims to use inter-episode information to stabilize representation learning, hence samples two episodes at a time and apply CEAM and CECR to the joint episode. I don’t see why the authors restrict the method to just two episodes. Classifier consistency should hold transductively across any number of episodes with the same base classes. Thus, you could make the number of episodes an hyperparameter and experimentally verify what is the best choice. +4. Grammar mistakes are common and writing generally lacks polish. + +Questions: + Please address the points in cons above. + +Minor points and additional feedback: +Introduction +“even may be impossible” -> may even be impossible +“reliance of deep neural networks on sufficient annotated training data”: you always need “sufficient” data, FSL aims to make fewer data be sufficient. +“Meta-training”, “episode” and “base class” used before definition +“Concretely, MELR consists of two key components: a Cross-Episode Attention Module (CEAM) and a Cross-Episode Consistency Regularization (CECR)” -> remove ‘a’ +“two cross-episode components (i.e., CEAM and CECR) to explicitly enforcing” -> to explicitly enforce +Related Work +Very few model-based methods are mentioned but I guess that’s beyond the point here. +“Almost all existing meta-learning based FSL methods ignore the relationships across episodes” -> “Ignore” is probably not true since many works (incl. MAML) frame the problem of Meta-learning as an inter-task learning process (as presented in this review https://arxiv.org/pdf/2004.05439.pdf). There’s also this work (https://arxiv.org/abs/1909.11722) that looks at the role of shots when building episodes during and after meta-training. +Methodology +Too many inline equations. Even with a dedicated definition section there is still a new definition almost every paragraph. Important equations should be made standalone, definitions placed into its own section, and fluff math be removed. +Why does CEAM take S as argument twice? Should they be S_k and S_v instead? +Eqn 2: Is the softmax taken over rows or columns (or both) of F_qS_k^T? + +[Post rebuttal] I have increased the score of my review to 7. Below is a copy-pasta of my comments post discussion: + +While my original concern about how much sampling affects FSL is still not fully addressed, I think it is not a trivial question to answer and a full exploration of the topic could constitute its own paper. So although I'm not fully convinced about the motivation of this paper, I think the thorough experimental evaluation along with the strong empirical results together warrants publication. From my perspective, a particular important strength of this paper is its ablations. I'm fairly convinced that the presented implementation is likely the best way to implement the idea of cross-episode attention + distillation. I think the baseline proposed by the AC makes sense. It would be great if that could be incorporated into the final version of the paper. 
A possible explanation for the drop in performance when using 3 or more episodes is due to the relative decrease in episode diversity. Results on wider datasets could corroborate this hypothesis.",7,4.0,ICLR2021 +B1ls5I2aYr,2,HJeEP04KDH,HJeEP04KDH,Official Blind Review #3,"Training and deployment of DRL models is expensive. Quantization has proven useful in supervised learning, however it is yet to be tested thoroughly in DRL. This paper investigates whether quantization can be applied in DRL towards better resource usage (compute, energy) without harming the model quality. Both quantization-aware training (via fake quantization) and post-training quantization is investigated. The work demonstrates that policies can be reduced to 6-8 bits without quality loss. The paper indicates that quantization can indeed lower resource consumption without quality decline in realistic DRL tasks and for various algorithms. + +The researchers propose a benchmark called QUARL that allows them to evaluate the effectiveness of quantization as well as the impact of quantization across a set of established DRL algorithms (e.g., DQN, DDPG, PPO) and environments (e.g., OpenAI Gym, ALE). Quantizations tested: fp32 -> fp16, int8, uniform affine. + +The idea is simple and carries over from (image-based) supervised learning. The experiments are exhaustive and have to the best of my knowledge not yet been conducted. The conclusions indicate the advantage of quantization, however it is unclear how these results would generalize to real environments (the environments used are after all still simple benchmarks, e.g., half-cheetah or pong). The results are also not entirely surprising or impactful: how is quantization impacting reinforcement learning in a different way than supervised learning? E.g., DQN is supervised learning of a Q-value function against a target. What secondary effects does quantization have on the learning procedure: e.g., does it boost exploration behavior or does it regularize training? We also know that some of these tasks can be solved by extremely small models (https://arxiv.org/abs/1806.01363), while the models used in this work are significantly larger: is quantization working simply because the network capacity is large enough to allow it? These could be investigated in more detail. Furthermore, I'm also missing some experimental setup details: e.g., how many seeds were used for all of the experiments (which is known to greatly affect the results on the benchmarks used in this paper)?",3,,ICLR2020 +HJhn9OtxG,1,rJLTTe-0W,rJLTTe-0W,Solid methodology; its practical performance requires further investigation,"The paper introduces a Bayesian model for timeseries with anomaly and change points besides regular trend and seasonality. It develops algorithms for inference and forecasting. The performance is evaluated and compared against state-of-the-art methods on three data sets: 1) synthetic data obtained from the generative Bayesian model itself; 2) well-log data; 3) internet traffic data. + +On the methodological side, this appears to be a solid and significant contribution, although I am not sure how well it is aligned with the scope of ICLR. The introduced model is elegant; the algorithms for inference are non-trivial. + +From a practical perspective, one cannot expect this contribution to be ground-breaking, since there has been more than 40 years of work on time series forecasting, change point and anomaly detection. 
In some situations the methodology proposed here will work better than previous approaches (particularly in the situation where the data comes from the Bayesian model itself - in that case, there clearly is no better approach); in other cases (which the paper might have put less emphasis on), previous approaches will work better. To position this kind of work, I think it is important that the authors discuss the limitations of their approach. Some guidelines on when to use it and when not to would be valuable. Clearly, these days one cannot introduce methodology in this area and expect it to outperform existing methods under all circumstances (and hence expect practitioners to always choose it over any other existing method).

What is surprising is that relatively simple approaches like ETS or STL work pretty much as well as the proposed approach (in some cases even better in terms of MSE), while more recent approaches - like BSTS - dramatically fail. It would be good if the authors could comment on why this might be the case.

Summary:
+ Methodology appears to be a significant, solid contribution.
- Experiments are not conclusive as to when or when not to choose this approach over existing methods
- Writing needs to be improved (large number of grammatical errors and typos, e.g. 'Mehtods')",6,3.0,ICLR2018
+OAWw-qgpJKf,2,HW4aTJHx0X,HW4aTJHx0X,Official Blind Review #3,"This paper discusses the task of generating scientific paper summaries along two axes: contribution (i.e. novelty) and context/background. The point is to disentangle these two, since often different readers would be interested in only one of these. While I like the idea of this paper, I think it has the following issues:

- Evaluation: There is no evaluation against a manually annotated gold standard. The references used in Section 5 are based on the output of a classifier trained on manual annotations, which itself was not evaluated against manual annotations either. Thus I am rather concerned about what we can learn from the results presented. Section 6 presents some observations in 6.1, but even though they are useful, one can't read much into them. And I am not convinced that the lack of expert annotators means that one cannot assess hallucinations; if one can assess whether the novelty is extracted correctly (in the same section), then hallucinations should be possible to assess. In Section 6.3 the human evaluation essentially compares the proposed model (unclear which of the variants, though) against the output of a summarization model that doesn't separate novelty from context. However, we don't know how good this summarization model is, and asking only about usefulness conflates a number of aspects that matter in a summary, such as informativeness, fluency, factual correctness, readability, etc. Overall, I think the evaluation is not adequate for a paper at a top conference.

- Modelling: Why not experiment with extractive summarization? Abstracts often overlap with the main paper, and you have a sentence-level classifier anyway. And the hallucination wouldn't be an issue then. Also, there is previous work on modelling citation function: https://www.cl.cam.ac.uk/~sht25/papers/emnlp06.pdf which should be acknowledged but also could be useful in creating better data and models. Similarly, there is work on scientific article zoning: http://oro.open.ac.uk/58880/

- I am not sure I follow the second branch in equation 2: why should context surprise the articles citing the one being summarized? 
They would share the context quite often. Also, I don't see the connection to the disentanglement.
",4,4.0,ICLR2021
+J_uYjcV5s8r,3,2nm0fGwWBMr,2nm0fGwWBMr,Interesting paper on universal graph pretraining on heterogeneous graphs,"Summary: This paper proposes a universal and unsupervised GNN-based representation learning (node embedding pretraining) model named PanRep for heterogeneous graphs, which benefits a variety of downstream tasks such as node classification and link prediction. More specifically, employing an encoder similar to R-GCN, PanRep utilizes four different types of universal supervision signals for heterogeneous graphs, i.e. cluster and recover, motif, metapath random walk and heterogeneous info maximization, to better characterize the graph structures. PanRep can be further fine-tuned in a semi-supervised manner with limited labeled data, known as PanRep-FT, for specific applications. Experiments on benchmark datasets have shown the effectiveness and performance gain over other unsupervised and some supervised baseline approaches. One example use case is applied to COVID-19 drug repurposing to identify potential treatment candidates.

Strong aspects:
(1) The motivation, problem formulation, and model illustration are well explained.
(2) The proposed four types of universal supervision signals are interesting and generalizable to all (heterogeneous) graphs.
(3) Extensive experiments, ablation study, and case study are supportive.

Weak aspects:
(1) Some comparable SOTA baselines may not be considered.
(2) The heterogeneous information maximization signal does not seem to be clearly explained. (See detailed comments below.)

Detailed comments:
(1) It is strongly recommended that the related work explicitly explain the difference from other graph pretraining models such as [1,2] and the reason why they are excluded from the baselines. Though some of this work (partially) focuses on graph-level applications, some node-level pretraining methods are generally comparable and applicable for heterogeneous graph representation learning.

(2) Regarding the drug repurposing task, some baseline approaches (TransE, etc.) implemented by DGL-KE on the DRKG dataset are not reported [3]. It would be more beneficial and convincing to include these results as well, comparing the #hits in addition to R-GCN.
(3) Since HIM is an important component according to the observations in the ablation study, it requires a clearer explanation for better understanding. Some questions are: for the bilinear scoring function, why is making a node close to its global summary the ultimate goal of HIM, while the other signals are used to learn its structure, and the nodes differ from each other even when they are of the same type? Why is it reasonable to design contrastive negative samples as introduced in PanRep? What if the type information of the nodes is not available at all (for example, the node type is not defined in the ontology)?

References:
[1] Hu, Weihua, et al. ""Strategies for Pre-training Graph Neural Networks."" ICLR (2020).

[2] Hu, Ziniu, et al. ""Heterogeneous graph transformer."" Proceedings of The Web Conference 2020. 2020.

[3] Ioannidis, Vassilis N., et al. ""DRKG - Drug Repurposing Knowledge Graph for Covid-19."" (2020). 
Github: https://github.com/gnn4dr/DRKG/tree/master/drug_repurpose ",6,3.0,ICLR2021 +aKQHVTKpOcR,2,EVV259WQuFG,EVV259WQuFG,"Good results, demonstrates the value of explicit verification/hierarchies in DL, lower novelty wrt ML elements, manuscript needs significant revision","MACHINE READING COMPREHENSION WITH ENHANCED LINGUISTIC VERIFIERS + +The authors propose two linguistic verifiers for improving extractive question answering performance when the question is answerable. The first replaces interrogatives in the question (who etc.) with candidate answers and evaluates this both in isolation and in combination with the answer-containing sentence to do answer verification. The second verifier jointly encodes individual sentences and spans with questions in a hierarchical manner to improve answer prediction performance. Solid gains on Squad, NewsQA, and TriviaQA are reported for both methods when applied in isolation, and in combination. + +Strengths: + +- The techniques are sound and lead to solid gains on 3 benchmark datasets. +- The approaches, while relatively straightforward, illustrate that explicit verification and hierarchical evaluation continue to improve application results, despite the high capacity and efficacy of the SOTA deep architectures. + +Limitations: + +- The paper is understandable but the presentation could be significantly improved. Figure 1a for example, is a bit overwelming, and should probably be replaced with something more focused, and moved into supplementary material. Several sentences I couldn't understand, for example ""Minimizing span losses of start and end positions of answers for answerable questions is overwhelming in current pretraining+fine-tuning frameworks."" Overall I feel that the paper could use some additional polishing. +- A similar hierarchical (HAN) approach was previously proposed for verifying unanswerable questions, but their approach for answerable questions appears to be more effective. +- The paper has lower novelty wrt ML elements. The component architectures/models that make up their system are well established. +- The replacing of interrogatives with the answer and the associated rules for doing so feel like they have somewhat limited scope (e.g. factoid questions, single interrogative questions, etc.). When there is more than one interrogative, the authors back off to simply appending the answer to the question... perhaps this can be done all the time without compromising the performance gains? +- Verification (esp. for the HAN verifier, where extra forward passes are done for each sentence and sub-paragraph) is more more computationally demanding, but this is not discussed. + +Overall Assessment: + +A solid applications paper on extractive question answering. However, I feel that the paper is perhaps better suited for an NLP-application focused audience (e.g. NAACL, deadline approaching), since the results are strong, but the paper has lower novelty wrt core ML. Furthermore, the manuscript is in need of significant revision before it can be considered for acceptance at ICLR. 
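As a concrete aid for readers unfamiliar with the verification step summarized above, the toy sketch below illustrates the kind of interrogative substitution being described; the interrogative list, back-off rule, and example are invented for illustration and are not the authors' actual rules or code.

```python
# Toy illustration (not the authors' rules): turn a question plus a candidate
# answer into a declarative hypothesis that a verifier can score against the
# answer-bearing sentence.
INTERROGATIVES = ['who', 'what', 'when', 'where', 'which', 'how many']

def hypothesis_from_question(question, candidate_answer):
    q = question.rstrip('?').strip()
    lowered = q.lower()
    for w in INTERROGATIVES:
        if lowered.startswith(w + ' '):
            # Replace the leading interrogative with the candidate answer.
            return candidate_answer + q[len(w):] + '.'
    # Fallback mirrors the described back-off: simply append the answer.
    return q + ' ' + candidate_answer + '.'

print(hypothesis_from_question('Who wrote Hamlet?', 'Shakespeare'))
# -> Shakespeare wrote Hamlet.
```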
+ +quality 5/10 (+results on multiple benchmark datasets, -manuscript needs substantial revision) +clarity 5/10 (+understandable for the most part, -manuscript/figures not clear in many places) +originality 6/10 (+novel approaches to QA verification, -lower novelty wrt ML elements) +significance 6/10 (+strong QA results, +demonstrates value in explicit verification/hierarchical processing in DL applications, -perhaps more suitable for an NLP-applications focused audience) +overall (5) + +Post-rebuttal: + +Authors, thank you for your feedback. The additional results around relative speed and performance have strengthened the paper. However, I still feel that the paper still needs significant polishing before final publication (figures, grammar, presentation), and that the paper is better suited for an NLP-focused conference, and so I have not updated my final score.",5,3.0,ICLR2021 +ryn4mW9ef,2,BkeqO7x0-,BkeqO7x0-,This paper proposed CipherGAN method addressing shift and Viegenere ciphers. The performance is better than CycleGAN and relatively stable under random initial weights. This well written paper adds value to decoding literature. ,"The paper proposed to replace the 2-dim convolutions in CycleGAN by one dimension variant and reduce the filter sizes to 1, while leave the generator convex embedding and using L2 loss function. + +The proposed simple change help with the dealing of discrete GAN. The benefit of increased stability by adding Jacobian norm regularization term to the discriminator's loss is nice. + +The paper is well written. A few minor ones to improve: +* The original GAN was proposed/stated as min_max, while in Equation 1 didn't defined F and was not clear about min_{F}. Similar for Equations 2 and 3. +* Define abbreviation when first appear, e.g. WGAN (Wasserstein ...). +* Clarify x- and y- axis label in Figure 3. ",8,4.0,ICLR2018 +ryePC-_y5r,3,rJe4_xSFDB,rJe4_xSFDB,Official Blind Review #2,"In this paper, the authors introduce a framework for computing upper bounds on the Lipschitz constant for neural nets. The main contribution of the paper is to leverage the sparsity properties of typical feed-forward neural nets to reduce the computational complexity of the algorithm. Through experiments, the authors show that the proposed algorithm computes tighter Lipschitz bounds compared to baselines. + +The approach proposed in the paper looks interesting. Although the presentation can be made clearer in places. For example, in equation (4), it would be helpful to explicitly state over which parameters the max is taken. There's also a number of small typos that need to be fixed. For example: ""We refer to d as the depth, and we we focus on the case where fd has a single real value as output."" on page 1. + +I found the proposed algorithm and the discussions in Section 2 and 3 interesting, although I am not familiar enough with the literature on polynomial optimization to evaluate whether there is any significantly new idea presented in these sections. I found section 4 very interesting too, and very important towards making the algorithm actually computationally tractable. I have a couple of concerns with the rest of the paper however, which `I describe below: + +1. It is nice that upper bounds for the local Lipschitz constant can be incorporated easily into the formulation. 
I would have liked to see some experiments on evaluating local Lipschitz constants though, and how they compare with other methods, since this is a very popular setting in which such techniques are used nowadays. + +2. The paper overall I think would benefit from a better experimental evaluation. It would be interesting to see how much the sparsity pattern in convnets affect results compared to other baselines. It would also be interesting to see how the bound degrades as the network grows bigger, and in particular as the depth increases. + +Given the lack of thorough experiments in the paper, I am giving the paper a borderline rating. I am however willing to increase my score based on discussions with the authors and other reviewers. + +=================================== + +Edit after rebuttal: +The latest draft addresses most of my concerns, and I am happy to recommend accepting this paper now.",6,,ICLR2020 +H1BZIo-Vl,3,r1VGvBcxl,r1VGvBcxl,Final Review,"The paper introduces GA3C, a GPU-based implementation of the A3C algorithm, which was originally designed for multi-core CPUs. The main innovation is the introduction of a system of queues. The queues are used for batching data for prediction and training in order to achieve high GPU occupancy. The system is compared to the authors' own implementation of A3C as well as to published reference scores. + +The paper introduces a very natural architecture for implementing A3C on GPUs. Batching requests for predictions and learning steps for multiple actors to maximize GPU occupancy seems like the right thing to do assuming that latency is not an issue. The automatic performance tuning strategy is also really nice to see. + +I appreciate the response showing that the throughput of GA3C is 20% higher than what is reported in the original A3C paper. What is still missing is a demonstration that the learning speed/data efficiency is in the right ballpark. Figure 3 of your paper is comparing scores under very different evaluation protocols. These numbers are just not comparable. The most convincing way to show that the learning speed is comparable would be time vs score plots or data vs score plots that show similar or improved speed to A3C. For example, this open source implementation seems to match the performance on Breakout: https://github.com/muupan/async-rl +One or two plots like that would complete this paper very nicely. + +----------------------------------- + +I appreciate the additional experiments included in the revised version of the paper. The learning speed comparison makes the paper more complete and I’m slightly revising my score to reflect that. Having said that, there is still no clear demonstration that the higher throughput of GA3C leads to consistently faster learning. With the exception of Pong, the training curves in Figure 6 seem to significantly underperform the original A3C results or even the open source implementation of A3C on Breakout and Space Invaders (https://github.com/muupan/async-rl).",5,5.0,ICLR2017 +4cP2YiYgay,1,_PzOsP37P4T,_PzOsP37P4T,"Marginal contribution, unclear what benefits are","This paper presents a new methodology for inference of discrete variables in computational graphs. + +Overall the paper represents a substantial amount of work, and it is clearly written. Nonetheless, I believe it is not significant enough so I can recommend acceptance. + +1)Theoretical analysis, while pertinent, is mostly elementary and unsurprising. 
I would have expected something closer to rates of convergence but the kind of analysis presented (almost sure convergence when the truncation size goes to infinity) is essentially obvious for anyone who was taken an elementary course on probability/measure theory. I would expect theoretical results to enlighten some aspects of their methods, but that is absent. + +2)I am in general unconvinced by the method. I don't think there is anything specially novel besides truncation. But just proposing truncation is not enough merit, I believe, as truncation is a natural choice. I would have expected an analysis of how truncation affects the results. + +3)I am not convinced by experimental results. Authors don't include Gumbel-Softmax as a baseline and I wonder whether any goodness in reported results (Besides the topic model) is a consequence of the action of Gumbel-Softmax instead of their method. + +4)The topic modeling experiment is interesting, since it involves a poisson distribution, with infinitely many possible values. However, there, the relevant comparison is missing. This alternative could be, for example, the detr method (Liu et al, 2019). Notice that I disagree with the authors claim that this method is doing something completely unique. The Liu et al paper can be used with similar purposes.",4,5.0,ICLR2021 +lx_jopuSfZN,2,jNTeYscgSw8,jNTeYscgSw8,"review of ""DEMYSTIFYING LOSS FUNCTIONS FOR CLASSIFICATION""","The paper attempts to analyze the softmax cross entropy loss and discuss useful properties it has in an intuitive and accessible way. The paper mainly presents comparisons between this loss function and two other (very standard as well) loss functions that are the ""sigmoid cross entropy"" and the ""squared error"" loss functions. + +It is very interesting to read the paper and realize that the topics discussed here have been written in well cited and well established statistics and ML books for decades. The softmax cross entropy loss is the multinomial regression loss while the sigmoid cross entropy loss is no other than the standard binary logistic regression loss and finally the ""squared error"" is the least squared loss. + +Authors compare the performance of these three on a few benchmarks. + +I think it's good to read the ""Elements of Statistical Learning"" and get a clear exposition to what science has been up to for decades before getting too deep into deep learning. ",3,5.0,ICLR2021 +rkhi37_gz,1,B1al7jg0b,B1al7jg0b,"This is an interesting method for continual learning. It relies mostly on conceptors, Linear Algebra method, for minimizing the interference of new task to the already learned tasks. ","This paper introduces a method for learning new tasks, without interfering previous tasks, using conceptors. This method originates from linear algebra, where a the network tries to algebraically infer the main subspace where previous tasks were learned, and make the network learn the new task in a new sub-space which is ""unused"" until the present task in hand. + +The paper starts with describing the method and giving some context for the method and previous methods that deal with the same problem. In Section 2 the authors review conceptors. This method is algebraic method closely related to spanning sub spaces and SVD. The main advantage of using conceptors is their trait of boolean logics: i.e., their ability to be added and multiplied naturally. 
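Since this Boolean algebra of conceptors carries most of the weight in the method under review, a small numpy sketch may help the unfamiliar reader; it follows Jaeger's standard definitions ($C = R(R + \alpha^{-2}I)^{-1}$ with aperture $\alpha$, and NOT/AND/OR as below), and the aperture value, dimensions, and variable names are illustrative assumptions rather than the settings used in the paper.

```python
import numpy as np

def conceptor(states, aperture=10.0):
    # states: (n_samples, n_units) collection of network activation vectors.
    n, d = states.shape
    R = states.T @ states / n                        # state correlation matrix
    return R @ np.linalg.inv(R + aperture ** -2 * np.eye(d))

def NOT(C):
    return np.eye(C.shape[0]) - C

def AND(C, B):
    # Assumes C and B are invertible (full-rank correlation matrices); Jaeger
    # gives a pseudo-inverse based generalization for the rank-deficient case.
    d = C.shape[0]
    return np.linalg.inv(np.linalg.inv(C) + np.linalg.inv(B) - np.eye(d))

def OR(C, B):
    # De Morgan: C OR B = NOT(NOT(C) AND NOT(B)).
    return NOT(AND(NOT(C), NOT(B)))

# Toy usage: the subspace still free after two tasks is roughly NOT(C1 OR C2),
# which is where a new task can be learned with little interference.
rng = np.random.default_rng(0)
C1 = conceptor(rng.standard_normal((500, 20)))
C2 = conceptor(rng.standard_normal((500, 20)))
free_space = NOT(OR(C1, C2))
```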
In Section 3, the authors elaborate on the reviewed conceptor method and show how to adapt this algorithm to SGD with back-propagation. The authors provide a version with batch SGD as well.

In Section 4, the authors show their method on permuted MNIST. They compare the method to EWC with the same architecture. They show that their method suffers less degradation on permuted MNIST. Also, they compared the method to EWC and IMM on disjoint MNIST and again got the best performance.

In general, unlike what the authors suggest, I do not believe this method is how biological agents perform their tasks in real life. Nevertheless, the authors show that their method indeed reduces the interference generated by a new task on the previously learned tasks.

I think that this work might interest the community, since such methods might be part of the tools that practitioners have in order to cope with learning new tasks without destroying the previous ones. What is missing is the following: I think that without any additional effort, a network can learn a new task in parallel to other tasks, or some other techniques may be used which are not bound to any algebraic methods. Therefore, my only concern is that this comparison is bounded to a very specific group of methods, and the question of what is the best method for continual learning remains open. ",7,3.0,ICLR2018
+p4w7XJw71W2,4,yT7-k6Q6gda,yT7-k6Q6gda,Misleading and Even Wrong Writing,"# Summary
This work claims that SGD regularizes the model by penalizing the trace of the Fisher information in the early phase. This claim is supported by the similar generalization behavior of SGD (with optimal learning rate) and Fisher penalization (for SGD with a small learning rate). A series of experiments is conducted to verify this understanding.

I find the paper's writing rather misleading. On the one hand, there are no mathematical justifications, and the logical chain is weak --- it is hard to claim which factor is the cause and which one is the effect. On the other hand, though written as the ""trace of Fisher"" in the paper, in the experiments they compute a different regularizer --- the norm of the expected gradient. Detailed questions follow.

# Questions

1. Abstract, ""We highlight that in the absence of implicit or explicit regularization..."". How do you get rid of the implicit regularization???

2. Eq. (2) is super misleading. The trace of the Fisher is the expected squared gradient norm, but in Eq. (2) you compute the squared norm of the expected gradient. Statistically the latter has nothing to do with the former: one is the second moment, and the other is the squared first moment. Please justify.

3. I have a very simple explanation for your observed phenomenon. In the beginning, the gradient is large, thus your version of the ""trace of Fisher"" is large; as the gradient decreases, the ""trace of Fisher"" decreases. Now let us look at large versus small learning rates. With a large learning rate, SGD converges faster, thus its gradients decrease faster, which causes the ""trace of Fisher"" to decrease faster. But with a small learning rate, SGD converges more slowly, thus its gradients decrease more slowly, and the ""trace of Fisher"" appears to be large in the early epochs.

Therefore I am not at all convinced by your arguments.

4. From the above discussion, at least you need to normalize the ""trace of Fisher"" by the gradient norm --- which then becomes a measure of gradient confusion. See the following for references.

5. FP/GP. 
If I understand correctly, you need to compute gradient of gradient, which expands your computation graph at least twice. Can you report the GPU memory consumption? + +6. Finally, let us take about the practical role of FP. According to Table 1, the improvement of FP over GP is marginal. I cannot see a potential of FP. Not only in theory, but also in practice this paper is not satisfactory. + + +# Missing Refs +Tons of theory paper should be discussed. A few of them come to my brain are listed in below. Please do a more complete literature investigation. + +Fisher +- Liang, Tengyuan, et al. ""Fisher-rao metric, geometry, and complexity of neural networks."" The 22nd International Conference on Artificial Intelligence and Statistics. 2019. +- Karakida, Ryo, Shotaro Akaho, and Shun-ichi Amari. ""Universal statistics of Fisher information in deep neural networks: Mean field approach."" The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 2019. + +SGD regularization mechanism +- Daneshmand, H., Kohler, J., Lucchi, A., and Hofmann, T. Escaping saddles with stochastic gradients. arXiv preprint arXiv:1803.05999, 2018. +- Zhu, Zhanxing, et al. ""The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects."" arXiv preprint arXiv:1803.00195 (2018). +- Wu, Jingfeng, et al. ""On the Noisy Gradient Descent that Generalizes as SGD."" arXiv preprint arXiv:1906.07405 (2019). + +Gradient confusion is highly related to the trace Fisher. See this one and its follow-ups. +- Sankararaman, Karthik A., et al. ""The impact of neural network overparameterization on gradient confusion and stochastic gradient descent."" arXiv preprint arXiv:1904.06963 (2019). + +Adversarial regularization is also related to GP/FP: +- Miyato, Takeru, et al. ""Virtual adversarial training: a regularization method for supervised and semi-supervised learning."" IEEE transactions on pattern analysis and machine intelligence 41.8 (2018): 1979-1993. + + + ",5,5.0,ICLR2021 +SklP0HP8oS,3,r1lGO0EKDH,r1lGO0EKDH,Official Blind Review #4,"The paper provides a multi-level graph-coarsening approach that can improve the predictive and computational performances of numerous existing unsupervised graph embedding models. The proposed approach is a pipeline consisting of 4 steps, viz: 1) Graph Fusion - that fuses attribute similarity graph with network topology, 2> Graph Coarsening - that reduces the graph size iteratively, 3> Graph embedding - using existing models and 4> Embedding refinement. While such a pipeline for scaling using a graph coarsening and refinement based approach is not new, the authors have carefully designed the pipeline to be effective and be scalable such as without any costly learning components (as in mile). The effectiveness of the proposed approach is evaluated with the node classification task on 6 datasets. + + +Strengths: +- The paper addresses a very important problem. The paper proposes a well-designed pipeline to scale existing embedding models. +- Experimental results support that the proposed approach is effective, especially in terms of reducing computation complexity. + +Weaknesses: +- While the experimental results are convincing on the computation front, I have few concerns on the performance front. + a) 'MILE with the fused graph' baseline is missing. It can been seen from Figure 3 that the incorporation of the attribute graph provides a significant performance benefit. 
Thus it is necessary to have this baseline to understand the improvement gap w.r.t to MILE. I believe this is a fair comparison to make as the graph fusion component is a commonly used technique in the last decade. + b) Improvements are inconclusive without additional results on other standard non-attributed graph datasets. In Figure 3, ignoring the model with the fused graph, MILE seems to be comparable to GraphZoom overall. As with the existing results, it's not conclusive whether GraphZoom is better than MILE. Also, add variance and report t-test results. + c) That said, it can be seen from Figure 2, that GraphZoom significantly outperforms both DW and MILE(DW) on a large non-attributed dataset. However, it is not clear where the significant increase in performance benefits stems from. More analysis is required here. +- Results on other unsupervised embedding task missings. It is important to evaluate the embeddings additionally for the link prediction task at the least. + +Additional comment: +- It would be helpful to incorporate one if not some of the attributed graph embedding model as a base model and baseline, such as Deep Graph Infomax (DGI). +- It should be easy to use a mini-batch version of GCN with MILE and use it for inductive learning. +- It would interesting to see what the performance will be without the refinement step. + +If my concerns regarding the experiments are positively addressed, I'm willing to improve the score. + +----------------- +After the rebuttal, I have updated my score from 3 to 8 as the authors have satisfactorily responded to the concerns raised. + +",8,,ICLR2020 +r1UheZ6gG,3,ry4S90l0b,ry4S90l0b,Interesting idea but limited novelty and impact,"This paper presents a self-training scheme for GANs and tests it on image (NIST) data. + +Self-training is a well-known and usually effective way to learn models in a semi-supervised setting. It makes a lot of sense to try this with GANs, which have also been shown to help train Deep Learning methods. + +The novelty seems quite limited, as both components (GANs and self-training) are well-known and their combination, given the context, is a fairly obvious baseline. The small changes described in Section 4 are not especially motivated and seem rather minor. [btw you have a repeated sentence at the end of that section] + +Experiments are also quite limited. An obvious baseline would be to try self-training on a non-GAN model, in order to determine the influence of both components on the performance. Results seem quite inconclusive: the variances are so large that all method perform essentially equivalently. On the other hand, starting with 10 labelled examples seems to work marginally better than 20. This is a bit weird and would justify at least a mention, and idealy some investigation. + +In summary, both novelty and impact seem limited. The idea makes a lot of sense though, so it would be great to expand on these preliminary results and explore the use of GANs in semi-supervised learning in a more thorough manner. + +[Response read -- thanks]",3,4.0,ICLR2018 +Clm8HLOAIB2,2,mxfRhLgLg_,mxfRhLgLg_,"This is an interesting and well-motivated paper that uses techniques from deep learning to generete approximations for the ecological inference problem on voting data. This paper also uses voter file data as an underlying data source, which allows for additional novel insights. 
","This paper proposes a deep learning framework for approximating ecological inference for estimating voting propensities based on demographic aggregates. This is an important problem, as EI has become a court standard for evaluating racially polarized voting in gerrymandering cases for the Gingles factors. Additionally, the increased attention on building coalition districts and availability of individual level data means that this is a problem that is likely to have a large impact in the next redistricting cycle that begins next year. + +The proposed methodologies seem natural once the approximation is constructed and this analysis explores some potential ways to incorporate it into various learning architectures. Additional work could be devoted to optimizing over the choices of hyperparameters and providing additional guidance about ways to choose which model would be appropriate based on available input data, since not all applications of these methods will have access to the full sets of surveys and validation measures that were available here. It would be nice to see the performance of these methods on some synthetic data as well and at least one comparison to one of the current state of the art methods on an aggregate version of the data would be useful. + +Overall, this paper is interesting and presents an approximation that is likely to be useful in practice for real world problems and given the space constraints appears to present sufficient work to be publishable. + + +A couple of typos: +Last sentence of paragraph 1 in Section 3 `not correlations' seems like a misnomer + +End of caption 1, missing close paren. ",7,3.0,ICLR2021 +ryBhOOXlM,1,HJXOfZ-AZ,HJXOfZ-AZ,"Hints of something interesting, but a bit over-simplistic and sloppy.","The authors ask when the hidden layer units of a multi-layer feed-forward neural network will display selectivity to object categories. They train 3-layer ANNs to categorize binary patterns, and find that typically at least some of the hidden layer units are category selective. The number of category selective (""localist"") units varies depending on the size of the hidden layer, the structure of the outputs the network is trained to return (i.e., one-hot vs distributed), the neurons' activation functions, and the level of dropout-induced noise in the training procedure. + +Overall, I find the work to hint at an interesting phenomenon. However, the paper as presented uses an overly-simplistic task for the ANNs, and the work is sloppily presented. These factors detract from my enthusiasm. My specific criticisms are as follows: + +1) The binary pattern classification seems overly simplistic a task for this study. If you want to compare to the medial temporal lobe's Jennifer Aniston cells (i.e., the Quiroga result), then an object recognition task seems much more meaningful, as does a deeper network structure. Likewise, to inform the representations we see in deep object recognition networks, it is better to just study those networks, instead of simple shallow binary classification networks. Or, at least show that the findings apply to those richer settings, where the networks do ""real"" tasks. + +2) The paper is somewhat sloppy, and could use a thorough proofreading. For example, what are ""figures 3, ?? and 6""? And which is Figure 3.3.1? + +3) What formula is used to quantify the selectivity? And do the results depend on the cut-off used to label units as ""selective"" or not (i.e., using a higher or lower cutoff than 0.05)? 
Given that the 0.05 number is somewhat arbitrary, this seems worth checking.

4) I don't think that very many people would argue that the presence of distributed representations strictly excludes the possibility of some of the units having some category selectivity. Consequently, I find the abstract and introduction to be a bit off-putting, coming off almost as a rant against PDP. This is a minor stylistic thing, but I'd encourage the authors to tone it down a bit.

5) The finding that more of the selective units arise in the hidden layer in the presence of higher levels of noise is interesting, and the authors provide some nice intuition for this phenomenon (i.e., getting redundant local representations makes the system robust to the dropout). This seems interesting in light of the Quiroga findings of Jennifer Aniston cells: the fact that the (small number of) units they happened to record from showed such selectivity suggests that many neurons in the brain would have this selectivity, so there must be a large number of category-selective units. Does that finding, coupled with the result from Fig. 6, imply that those ""grandmother cell"" observations might reflect an adaptation to increase robustness to noise?
",3,3.0,ICLR2018
H1e_OITis7,1,HkgEQnRqYQ,HkgEQnRqYQ,"Is it the RotatE scoring function or the adversarial sampling?","# Summary
This paper presents a neural link prediction scoring function that can infer symmetry, anti-symmetry, inversion and composition patterns of relations in a knowledge base, whereas previous methods were only able to support a subset. The method achieves state of the art on the FB15k-237, WN18RR and Countries benchmark knowledge bases. I think this will be interesting to the ICLR community. I particularly enjoyed the analysis of existing methods regarding the expressiveness of the relational patterns mentioned above.

# Strengths
- Improvements over prior neural link prediction methods
- Clearly written paper
- Interesting analysis of existing neural link prediction methods

# Weaknesses
- As the authors not only propose a new scoring function for neural link prediction but also an adversarial sampling mechanism for negative data, I believe a more careful ablation study should have been carried out. There is an ablation study showing the impact of the negative sampling on the baseline TransE, as well as another ablation in the appendix demonstrating the impact of negative sampling on TransE and the proposed method, RotatE, for FB15k-237. However, from Table 10 in the appendix, one can see that the two competing methods, TransE and RotatE, in fact perform fairly similarly once both use adversarial sampling, so it still remains unclear whether the gains observed in Tables 4 and 5 are due to adversarial sampling or to a better scoring function. In particular, I want to see results for a stronger baseline, ComplEx, equipped with the adversarial sampling approach. Ideally, I would also like to see multiple repeats of the experiments to get a sense of the variance of the results (as has been done for Countries in Table 6).

# Minor Comments
- Eq 5: Already introduce gamma (the fixed margin) here.
- While I understand that this paper focuses on knowledge graph embeddings, I believe the large body of other relational AI approaches should be mentioned, as some of them can also model symmetry, anti-symmetry, inversion and composition patterns of relations (though they might be less scalable and therefore of less practical relevance), e.g. 
the following come to mind:
 - Lao et al. (2011). Random walk inference and learning in a large scale knowledge base.
 - Neelakantan et al. (2015). Compositional vector space models for knowledge base completion.
 - Das et al. (2016). Chains of Reasoning over Entities, Relations, and Text using Recurrent Neural Networks.
 - Rocktaschel and Riedel (2017). End-to-end Differentiable Proving.
 - Yang et al. (2017). Differentiable Learning of Logical Rules for Knowledge Base Completion.
- Table 6: How many repeats were used for estimating the standard deviation?


Update: I thank the authors for their response and additional experiments. I am increasing my score to 7.",7,4.0,ICLR2019
GMj3B9nCFzC,2,c8P9NQVtmnO,c8P9NQVtmnO,"Review of ""Fourier Neural Operator for Parametric Partial Differential Equations""","Paper summary:

Building on previous work on neural operators, the paper introduces the Fourier neural operator, which uses a convolution operator defined in Fourier space in place of the usual kernel integral operator. Each step of the neural operator then amounts to applying a Fourier transform to a vector (or rather, a set of vectors on a mesh), performing a linear transform (learnt parameters in this model) on the transformed vector, before performing an inverse Fourier transform on the result, recombining it with a linear map of the original vector, and passing the total result through a non-linearity. The Fourier neural operator is by construction (like all neural operators) a map between function spaces, and invariance to discretization follows immediately from the nature of a Fourier transform (just project onto the usual basis). If the underlying domain has a uniform discretization, the fast Fourier transform (FFT) can be used, allowing for an O(n log n) evaluation of the aforementioned convolution operator, where n is the number of points in the discretization. Experiments demonstrate that the Fourier neural operator significantly outperforms other neural operators and other deep learning methods on Burgers' equation, Darcy Flow, and Navier-Stokes, and that it is also significantly faster than traditional PDE solvers.

------------------------------------------
Strengths and weaknesses:

Much of the theoretical legwork for this paper, namely, neural operators, was already carried out in previous papers (Li et al.). 
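For concreteness (this is just my own illustrative sketch, not the authors' code), a single Fourier layer of the kind described in the summary above could look roughly as follows, assuming a 1-D uniform grid and PyTorch-style APIs; all names here are invented for the example:

```python
# Illustrative sketch only (assumed PyTorch API), not the reviewed paper's implementation.
# One Fourier layer on a 1-D uniform grid: FFT, mode-wise linear transform, inverse FFT,
# recombination with a pointwise linear map of the input, then a non-linearity.
import torch

class FourierLayer1d(torch.nn.Module):
    def __init__(self, channels: int, n_modes: int):
        super().__init__()
        self.n_modes = n_modes  # how many low-frequency Fourier modes are kept
        # learned complex weights applied mode-wise in Fourier space (the linear transform)
        self.weights = torch.nn.Parameter(
            0.02 * torch.randn(channels, channels, n_modes, dtype=torch.cfloat))
        # learned pointwise linear map applied to the original, physical-space input
        self.w = torch.nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, v):              # v: (batch, channels, n_grid_points)
        v_hat = torch.fft.rfft(v)      # forward FFT along the grid dimension
        out_hat = torch.zeros_like(v_hat)
        k = min(self.n_modes, v_hat.shape[-1])
        # multiply the retained modes by the learned weights: a convolution in Fourier space
        out_hat[:, :, :k] = torch.einsum('bix,iox->box', v_hat[:, :, :k], self.weights[..., :k])
        spectral = torch.fft.irfft(out_hat, n=v.shape[-1])   # inverse FFT back to physical space
        # recombine with the linear map of the original input and apply the non-linearity
        return torch.nn.functional.gelu(spectral + self.w(v))
```

Stacking a few such layers, with lifting and projection maps around them, would give the kind of architecture the summary describes.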
The remaining theoretical work, namely writing down the Fourier integral operator and analysing the discrete case, was succinctly explained. The subsequent experimentation was extremely thorough (e.g. demonstrating that activation functions help in recovering high frequency modes) and, of course, the results were very impressive. I liked the paper a lot, and it’s definitely a big step-forward in neural operators. I’m assigning a score of 8 (a very good conference paper), and I think that the paper is more or less ready for publication as is. I’ve included a few questions below (to help my own understanding), as well as some typos I spotted whilst reading the paper. + +------------------------------------------ +Questions and clarification requests: + +1) Section 4, The Discrete Case and the FFT – could you explain the definition of bounds in the definition of Z_{k_{max}}? +2) Section 4, Parametrizations of R, sentence 2 – could you explain the definition R_{\phi}? At present I can’t see how the function signature of R matches the definition given. +------------------------------------------ +Typos and minor edits: +- Page 3, bullet point 3 – “solving Bayesian inference problem” -> “solving Bayesian inference problems” +- Section 1, final paragraph, sentence 2 - “approximate function with any boundary conditions” -> “approximate functions with any boundary conditions” +- Section 4, The discrete case and the FFT, final paragraph, last sentence - “all the task that we consider” -> “all the tasks that we consider” +- Section 4, Parametrizations of R, last sentence - “while neural networks have the worse performance” -> “while neural networks have the worst performance” +- Section 4, final sentence – “Generally, we have found using FFTs to be very efficient, however a uniform discretization if required.” -> “Generally, we have found using FFTs to be very efficient. However, a uniform discretization is required.” +- Section 5, final paragraph, sentence 2 – “FNO takes 0.005s to evaluate a single instances while the traditional solver” -> “FNO takes 0.005s to evaluate a single instance while the traditional solver” +- Section 6, final sentence – “Traditional Fourier methods work only with periodic boundary conditions, however, our Fourier neural operator does not have this limitation.” -> “Traditional Fourier methods work only with periodic boundary conditions. However, our Fourier neural operator does not have this limitation.”",8,4.0,ICLR2021 +SylMDPRYcr,3,rylMgCNYvS,rylMgCNYvS,Official Blind Review #6,"Summary +------- + +The authors investigate (subclasses of) generalized counter machines with respect to their weak generative capacity, their ability to represent structure, and several closure properties. This is motivated by recent indications that LSTMs have comparable expressivity to counter machines, so that the formal properties of these machines might provide indirect insights into the linguistic suitability of LSTMs. + + +Evaluation +---------- + +I also reviewed this paper for SCiL a few months ago. +While I had major reservations back then, I am happy to provide a more positive evaluation this time as the authors have done some revisions that clear up many points of confusion. +I have to add two caveats, though. +First, I am a bit disheartened that the authors chose not to adopt many of the excellent changes suggested by another SCiL reviewer (who went way beyond the call of duty with their multi-page review). 
+Second, I did not have sufficient time to check all proofs for their correctness. +In many cases the strategies strike me as intuitively sound, but my intuition tends to miss edge cases. +Nonetheless, I think that this paper, albeit a bit of a gamble, would make for an interesting addition to the program. + + +1) Weakness: Link to neural networks still unclear + +The central weakness of the paper is still the link between neural networks and counter automata. +Based on what is said in the paper, this is merely a conjecture at this point, not a well-established fact. +Without this link, the value of the paper is unclear. +If, however, this conjecture should turn out to be true, the paper would mark a very strong starting point for further exploration. +This makes it a gamble worth taking. + + +2) Strong results, but lack of examples + +The results are not trivial and provide deep insights into the inner workings of counter machines. +In particular the fact that counter machines cannot correctly represent Boolean expressions reveals key limitations on their representational power. +The semilinearity result is less impressive because of how limited the machines are that it applies to, and I'm not sure that the proof provides a good basis for generalization to more complex machines. +The authors might consider removing this part to clear some space for examples, which are sorely needed. +The formalism is abstract and unfamiliar to most readers, and a few concrete examples would greatly strengthen the readers' intuition. + + +3) No investigation of linguistically important string languages + +As the authors make claims about linguistic adequacy, it is surprising that there is no discussion of TALs, MCFLs or PMCLFs. +The grammar formalism of GPSG was abandoned because it was limited to context-free languages and could not handle those more complex language classes. +So if counter machines fail here, the issue of their linguistic adequacy is already decided without further probing semilinearity or representational power. +As far as I can tell, real-time counter machines cannot generate the PMCLF a^{2^n}, which is an abstract model of unbounded copying constructions in natural language (see Radzisnky on Chinese number names, Michaelis & Kracht on Old Georgian case stacking, and Kobele on Yoruba). +Nor is it obvious to me that counter machines can handle the copy language {ww | w \in \Sigma^*}, a model of crossing dependencies, although they can handle a^n b^n c^n (a TAL). +It should also be possible to generate the linguistically undesirable MIX language, which is a 2-MCFL but not a TAL. + + +Minor comments +-------------- + +- As noted in my SCiL review, your definitions still differ from those of Fischer et al. 1968. What is the reason for this? + +- Theorem 3.1: \subsetneq would be clearer than \subset + +- p4, typo: the the + +- Proof of Theorem 3.2: Unless I misunderstand your modulo construction, your ICL only has resolution up to mod n. For instance, with mod 2 it can distinguish 2 from 3, but not 2 from 4. The CL can do that. Don't you need a second counter c_i' for each c_i, then, to keep track of how often you have wrapped around modulo n in c_i? That would still be incremental as you can never wrap around by more than 1 in any given update. 
+ +- Sec 6.1: in all those definitions, if should be iff + + +References +---------- + +@ARTICLE{Radzinski91, + author = {Radzinski, Daniel}, + title = {Chinese Number Names, Tree Adjoining Languages, and Mild Context + Sensitivity}, + year = {1991}, + journal = {Computational Linguistics}, + volume = {17}, + pages = {277--300}, + url = {http://ucrel.lancs.ac.uk/acl/J/J91/J91-3002.pdf} +} + +@INPROCEEDINGS{MichaelisKracht97, + author = {Michaelis, Jens and Kracht, Marcus}, + title = {Semilinearity as a Syntactic Invariant}, + year = {1997}, + booktitle = {Logical Aspects of Computational Linguistics}, + pages = {329--345}, + editor = {Retor{\'e}, Christian}, + volume = {1328}, + series = {Lecture Notes in Artifical Intelligence}, + publisher = {Springer}, + doi = {10.1007/BFb0052165}, + url = {http://dx.doi.org/10.1007/BFb0052165} +} + +@PHDTHESIS{Kobele06, + author = {Kobele, Gregory M.}, + title = {Generating Copies: {A}n Investigation into Structural Identity in + Language and Grammar}, + year = {2006}, + school = {UCLA}, + url = {http://home.uchicago.edu/~gkobele/files/Kobele06GeneratingCopies.pdf} +}",6,,ICLR2020 +ByPKCgNgG,1,SkwAEQbAb,SkwAEQbAb,Poorly described minor variant of rank determination in SVD,"The manuscript proposes to estimate the number of components in SVD by comparing the eigenvalues to those obtained on bootstrapped version of the input. + +The paper has numerous flaws and is clearly below acceptance threshold for any scientific forum. Some of the more obvious issues, each alone sufficient for rejection, include: + +1. Discrepancy between motivation and actual work. The method is specifically about determining the rank of a matrix, but the authors motivate it with way too general and vague relationships, such as ""determining the number of nodes in neural networks"". Somewhat oddly, the problem is highlighted to be of interest in supervised problems even though one would expect it to be much more important in unsupervised ones. + +2. Complete lack of details for related work. Methods such as PA and MAP are described with vague one-sentences summaries that tell nothing about how they actually work. There would have been ample space to provide the mathematical formulations. + +3. No technical contribution. The proposed method is trivial variant of randomised testing, described with single sentence ""Bootstrapped samples R_B are simply generated through random sampling with replacement of the values of R."" with literally no attempt of providing any sort of justification why this kind of random sampling would be good for the proposed task or what kind of assumptions it builds on. + +4. Poor experiments using really tiny artificial data sets, reported in unprofessional manner (visual style in plots changes from figure to figure, tables report irrelevant numbers in hard-to-read format etc). No real improvement over the somewhat random choice of comparison methods that do not even represent the techniques people would typically use for this problem.",1,4.0,ICLR2018 +XEIkv9ga6ax,4,AVKFuhH1Fo4,AVKFuhH1Fo4,"My review of paper ""Transformers are Deep Infinite-Dimensional Non-Mercer Binary Kernel Machines""","In this paper, the authors treat a particular Transformer, ""dot-product attention"", as an RKBS kernel called ""exponentiated query-key kernel"". The explicit form of feature maps and Bach space are given. Moreover, authors term a binary kernel learning problems within the framework of regularized empirical risk minimization. 
The problem and the correponding representer theorem is new due to its extension to Banach space. A new approximation theorem is also proved and some experiements are done. +Pros: +The idea of understanding how Transformers work with the help of non-mercer binary kernel is interesting. +As for the theoretical side, authors provide representer theorem to binary kernel learning for Banach space rather than Hilbert space. + +Cons: +The experiment is insufficient because only one dataset is studied. +I think the proof is just a generalization of kernel learning problems on RKBS, without too much difficulty.",6,4.0,ICLR2021 +ByBJy2Oef,1,rJma2bZCW,rJma2bZCW,A stability analysis of local optima in constant-rate SGD,"The paper investigates how the learning rate and mini-batch size in SGD impacts the optima that the SGD algorithm finds. +Empirically, the authors argue that it was observed that larger learning rates converge to minima which are more wide, +and that smaller learning rates more often lead to convergence to minima which are narrower, i.e. where the Hessian has large Eigenvalues. In this paper, the authors derive an analytical theory that aims at explaining this phenomenon. + +Point of departure is an analytical theory proposed by Mandt et al., where SGD is analyzed in a continuous-time stochastic +formalism. In more detail, a stochastic differential equation is derived which mimicks the behavior of SGD. The advantage of +this theory is that under specific assumptions, analytic stationary distributions can be derived. While Mandt et al. focused +on the vicinity of a local optima, the authors of the present paper assumed white diagonal gradient noise, which allows to +derive an analytic, *global* stationary distribution (this is similar as in Langevin dynamics). + +Then, the authors focus again on individual local optima and ""integrate out"" the stationary distribution around a local optimum, using again a Gaussian assumption. As a result, the authors obtain un-normalized probabilities of getting trapped in a given local optimum. This un-normalized probability depends on the strength of the value of the loss function in the vicinity of the optimum, the gradient noise, and the width of the optima. In the end, these un-normalized probabilities are taken as +probabilities that the SGD algorithm will be trapped around the given optimum in finite time. + + +Overall assessment: +I find the analytical results of the paper very original and interesting. The experimental part has some weaknesses. The paper could be drastically improved when focusing on the experimental part. + +Detailed comments: + +Regarding the analytical part, I think this is all very nice and original. However, I have some comments/requests: + +1. Since the authors focus around Gaussian regions around the local minima, perhaps the diagonal white noise assumption could be weakened. This is again the multivariate Ornstein-Uhlenbeck setup examined in Mandt et al., and probably possesses an analytical solution for the un-normalized probabilities (even if the noise is multivariate Gaussian). Would the authors to consider generalizing the proof for the camera-ready version perhaps? + +2. It would be nice to sketch the proof of theorem 2 in the main paper, rather than to just refer to the appendix. In my opinion, the theorem results from a beautiful and instructive calculation that should provide the reader with some intuition. + +3. Would the authors comment on the underlying theoretical assumptions a bit more? 
In particular, the stationary distribution predicted by the Ornstein-Uhlenbeck formalism is never reached in practice. When using SGD in practice, one is in the initial mode-seeking phase. So, why is it a reasonable assumption to still use results obtained from the stationary (equilibrated) distribution which is never reached? + + +Regarding the experiments: here I see a few problems. First, the writing style drops in quality. Second, figures 2 and 3 are cryptic. Why do the authors focus on two manually selected optima? In which sense is this statistically significant? How often were the experiments repeated? The figures are furthermore hard to read. I would recommend overhauling the entire experiments section. + +Details: + +- Typo in Figure 2: ”with different with different”. +- “the endpoint of SGD with a learning rate schedule η → η/a, for some a > 0, and a constant batch size S, should be the same + as the endpoint of SGD with a constant learning rate and a batch size schedule S → aS.” This is clearly wrong as there are many local minima, and running teh algorithm twice results in different local optima. Maybe add something that this only true on average, like “the characteristics of these minima ... should be the same”.",6,4.0,ICLR2018 +wM0Cz3H5hlb,3,5jRVa89sZk,5jRVa89sZk,"Generally good, a few questions ","This paper investigates the unlabeled entity problem, which is generally observed in the manual annotation setting and distant supervision as well. The unlabeled problem is important and some existing works focus on solving the problems using partial CRF setting or data selector. The main observation of this paper lies in two aspects: 1) comparison between the reduction of annotated entities or treating unlabeled entities as negative instances. Most interestingly, the authors show the observed difference between pre-trained language models and LSTM-based models. Based on the observations, they propose a general approach to eliminate the misguidance brought by unlabeled entities and such a simple design shows good performances. + +The Paper is overall well written and easy to follow. But I still have a few questions and want to get answers from authors. + +Questions: + +1) The first question is about 4.2 Training via Negative sampling on page 5. I am not quite sure about the procedure. Negative instance candidates are randomly selected from original sentences. You use \hat{y}, which is a subset of randomly selected span to replace a missed entity set defined in Eq. (2)? + +2) Could you expand more about Equation 8 to add more details? + +3) The unlabeled entity problem is most serious in the distant supervision setting. However, the distant supervision setting suffers from entity ambiguation and unlabeled entity problem simultaneously. How do you think your design to tackle entity ambiguation problem? Moreover, in the distant supervision experiment in Table 3, how will you model compare with other distant supervision models like AutoNER? + +",5,4.0,ICLR2021 +BkrguybEe,1,BkbY4psgg,BkbY4psgg,Greatly improved training and analysis of NPI,"This paper improves significantly upon the original NPI work, showing that the model generalizes far better when trained on traces in recursive form. The authors show better sample complexity and generalization results for addition and bubblesort programs, and add two new and more interesting tasks - topological sort and quicksort (added based on reviewer discussion). 
Furthermore, they actually *prove* that the algorithms learned by the model generalize perfectly, which to my knowledge is the first time this has been done in neural program induction.",9,5.0,ICLR2017 +sgj36teIP7U,2,76M3pxkqRl,76M3pxkqRl,Blind review,"This paper presents a new method for improving coordination in MARL by using the idea of the status quo. The approach uses a status quo loss and a method for converting multi-step games into matrix games (called GameDistill). The methods are described and experiments are given comparing the methods to other related work. + +Coordination is a problem in MARL. When multiple agents are learning at the same time, they can get stuck in poor equilibria. Therefore, ideas such as the status quo may be helpful in escaping these poor solutions. + +The status quo loss is straightforward and of questionable use. The idea (in Equation 8) balances the regular RL loss witha status quo loss that repeats the interaction (i.e., agent actions) for k steps. k is sampled from a distribution. There are weights for each of these losses to balance them out. This idea makes sense as a way to reduce the nonstationarity of decentralized learning, but it doesn't promote coordination. Why is this the right thing to do? What can be said about it theoretically? More motivation is needed for the approach. + +The details of GameDistill are unclear and it also isn't clear how general it is. The method is described in text in the main paper and in pseudocode in the appendix, but each is high-level and not formal enough to understand the details. Furthermore, the approach clusters trajectories based on their rewards using random play. This seems unlikely to work well where exporation is an issue as random play may not be sufficient and in more complex games, you may need many clusters (and not know how many is needed). The paper should make it clear how general the method is. + +The experiments show the method outperforms an independent learning and LOLA, but more extensive comparisons are needed. For example, the focus is on cooperative games. As such, methods that promote cooperation should also be discussed and compared to. This includes optimistic methods such as hysteresis and leniency as discussed in the paper below: + +Wei, Ermo, and Sean Luke. ""Lenient learning in independent-learner stochastic cooperative games."" The Journal of Machine Learning Research 17.1 (2016): 2914-2955. + +Also, the results are a bit surprising. The status quo loss shouldn't favor one equilibrium over another so it isn't clear why the proposed method escapes the poor equilibrium for the good one (e.g., DD for CC). The paper should make it more clear what that is the case. Lastly, the results in Figure 4 are somewhat unfair since it appears that additional training was done by the proposed method for GameDistill before learning curve plot begins. + +The paper is generally well written, but some details can be more clear as mentioned above. ",4,4.0,ICLR2021 +cK_4v1WfYvQ,4,JCz05AtXO3y,JCz05AtXO3y,Interesting exploration of spatial resolution and structural resolution in Graph Neural Networks,"Summary: +Authors introduces two new concepts spatial resolution and structural resolution in regards to understanding the graph structured data which are quite interesting and enlightening. Idea about projecting graph information +into structural landmarks is intriguing. To help make stronger case, I would suggest to do proper ablation study as it is not clear how much gain is coming unsupervised learning. 
+ +Pros: +Overall I like intuition and the method about capturing the spatial resolution and structural resolution in a strategic manner. Author have some strong empirical performance especially on Protein, PTC and IMDB-M dataset. + +Cons: +Authors argue that the classic graph neural networks which employ graph pooling operations are the bottleneck in identifying necessary substructures (or their interactions) for yielding high discriminative performance. However, such statements are quite loose and need further theoretical justification given the fact graph pooling operations such as deepsets or sum-pooling are universal/injective functions in nature and thus can reflect any changes in the graph sub-structure (however their function smoothness or amount of representative power captured is entirely different issue). + +It not clear why right hand side spectrum in Figure 2 will lead to lower generalization performance or over-fitting. Most of the time motifs/graphlets act as an atomic structure of a graph and their frequency distribution drives the discriminative performance. As such identifying all such atomic structures should be helpful rather than harmful. It would be great if authors can expand on their explanation here and provide a real world example that would be more convincing to support their hypothesis. + +There are certain paragraphs which are hard to read. For instance, Figure 1 lacks detail description and it is not clear what ""all the nodes are mixed into one"" means in the context. Also, a general suggestion would be to add more descriptive caption for each Figures in the paper. + +I would suggest to provide compelling real-world examples (or do more qualitative analysis) besides the strong empirical performance in the main paper (there is some discussion in appendix but highlights can be included in the main context). + +Ablation study is missing and thus hard to answer questions such as , is unsupervised learning (i.e., learning in Equation 2) even needed for getting strong results? I would really like to see the performance gains due to unsupervised learning. + +Can authors discuss the computation complexity of their method? + +Typos: + +Variables $T$, $b$ in equation (1) are not defined. +",6,3.0,ICLR2021 +rkgu8hhqYr,1,rJlk71rYvH,rJlk71rYvH,Official Blind Review #2,"The paper presents regularization techniques for model based reinforcement learning which attempt to build counterfactual reasoning into the model. In particular, they present auxiliary loss terms which can be used in ""what if"" scenarios where the actual state is unknown. Given certain assumptions, they show that this added regularization can improve generalization to unseen problem settings. Specifically they propose two forms of regularization: (1) enforcing that for different actions the predicted next state should be different (action-control) and (2) enforcing that when certain parts of the low dimensional state are perturbed, over a model rollout the perturbation should only affect the perturbed parts of the state, essentially encouraging the latent space features to be independent (disentanglement). + +Overall the idea is well motivated - incorporating counterfactual reasoning into model based RL has potential to to improve generalization. Also, while the assumptions needed for the regularization to be correct are not always true, they do seem to hold in many cases. Lastly, the results do seem to indicate that generalization is slightly improved when using the proposed forms of regularization. 
+ +My criticisms are: + +(1) As mentioned in the paper Action-Control assumes that at every single timestep the agent has potential to change the state. However there may be settings where the agent can always change state, but only a small component of the state. In these cases the states should be quite similar. For example a robot only moving a single object when the state consists of many objects. Also as mentioned in the paper Disentanglement will not work in stochastic environments. One concern I have is that since different environments can violate the assumptions to varying degrees, it seems like actually using the regularization and picking the correct hyperparameter to weight it will be very challenging. + +(2) The current results are only demonstrated in a single, custom environment. Additionally performance is shown on only 2 test tasks, and in all cases in Table 2 it is unclear how to interpret the reward. Does this performance constitute completing the task? What is the best possible cumulative reward in this case? The performance improvement seems small, but it is difficult to judge without knowing the details of the task. + +I think the paper would be significantly improved by (1) adding experiments in more environments, especially standard model based RL environments where the performance of many existing methods is known and (2) adding comparisons to other forms of model regularization, for example using an ensemble of models. My current rating is Weak Accept. + +Some other questions: +- In Table 2 does MPC amount to PlaNet? +- How sensitive are the current numbers to planning parameters (horizon, num samples)? +- Can you provide error bars for the numbers in the tables? + +______________________________________________ + +After author responses and closer examination of the paper I have some additional concerns about experimental details. Changing my score from 'Weak Accept' to 'Weak Reject'",3,,ICLR2020 +jG0Z5eBi4L,2,9ITXiTrAoT,9ITXiTrAoT,Interesting approach for explicit control of LSTM unit timescale in natural language modeling,"## Summary + +This work investigates representational power of LSTM to model natural language, in particular how well it models temporal dependencies within text. They define a notion of timescale of each LSTM unit and analitycally show that LSTM memory exhibits exponential decay, while natural language tends (based on prior work) to decay following the power law. Based on this, they figure that LSTM memory may decay following the power law *if the timescales approximate samples from the particular Inverse Gamma Distribution*. To achieve that they propose the multi-timescale LSTM unit, where the desired timescale is explicitly controlled via the forget gate bias. + +Authors empirically validate their theoretical claims and show improvements in language modeling (PTB, Wikitext2) over the baseline LSTM using the proposed multi-timscale LSTM. Importantly, they show how multi-timescale LSTM gives improvement in modeling rare words, which are known to require longer temporal dependencies. + +## Strong points + +1. This work investigates the important (though not that popular) question of discrepancy between the temporal dependencies existing in natural language and the abilities of models we use to learn these dependencies in practice. +2. The idea of including explicit control of the timescale (i.e. temporal horizon) of each LSTM unit is interesting and well-motivated. +3. 
Experiments use a formal language too, in addition to natural language modeling, which allows one to check whether the proposed approach generalizes in the case of an exactly computable timescale distribution.

## Weak points

1. There is **no code** available. I was interested in how the test set bootstrapping was performed in the experiments (see comments below for details) and found out there is no code, which is really sad. I hope the authors will submit the reproducible code in the near future.
2. The theoretical part gives some essential quantities while no derivations are given. Given that there is some free space left in the paper and the fact that supplementary material is unlimited, I don't see any reason to omit derivations (I struggle with some transitions between equations, as you may find in the comments below).
3. I am not sure how useful the proposed approach is on new tasks, given the heavily tuned hyperparameters for the Inverse Gamma Distribution proposal and the LSTM model architecture for each task (more details in the comments below).

## Recommendation

I vote for accepting **upon fixing major weak points**: uploading reproducible code with experiments and adding all derivations necessary for the essential theoretical claims in this work. Overall this is decent work which will be useful for future research in studying the representational power of the models we are using to learn the complicated dependencies of natural language.

## Questions

All the questions below are welcome to be used as **suggestions** to provide more details in the manuscript.

### Theoretical part

1. Eq.3: why can we simply average forget gates? Outside the 'free input' regime, $c_t$ from eq.1 would have more dependencies. Could you elaborate on why we can estimate it like this?
2. How does solving Eq.4 lead to the Inverse Gamma Distribution? I struggle to find an obvious/trivial solution; please elaborate on this in the manuscript.

### Experiments
3. From 3.1.1: '*Training sequences were of length 70 with a probability of 0.95 and 35 with a probability of 0.05. During inference, all test sequences were length 70.*' Why do you use such an explicit schedule? Given that each training sequence from WT2 is some excerpt from Wikipedia (often longer than 70 words), how do you deal with the tails of sequences? A more detailed description of data loading would be helpful.
4. Table 1: from my understanding, the columns with rare words attract the most interest, and I wonder if you could add the variance among different training instances in Table 1, as you reported in the appendix (Table 2)? Or refactor Table 2 such that it has the same frequency-based columns.
5. Did you think of tokenization other than word-level? BPE, for example: it gives a more balanced token distribution for WT2 due to the absence of the UNK token there. I wonder if the improvements in terms of rare-word PPL will hold. What do you think (speculatively)?
6. PTB results: the >10K bin PPL got higher with your approach (the ratio also drops below 1 in the routing study). Why do you think this happens? Why do you think this does not happen with the WT2 task? As I see it, with PTB the fixed forget bias underestimates the true/gold high-frequency word probabilities on average, but why?

### Other

7. Is it possible to estimate/learn the IGD alpha parameter from the data of the task you work on itself? The grid search you provide makes me less convinced of how useful this approach is for a new task, where the $\alpha$ is not known.
8. How important is the tuning of model layers that you do? E.g. 
only specific layers have fixed forget bias, but others not, **why is that?** I am really interested in knowing if other ways of defining your model hurt the performance or keep it on the baseline level? I am sure this will be useful for all other readers too. +9. Is it possible to apply this timescale control for other units e.g. GRU (no explicit forget gate)?",6,3.0,ICLR2021 +Bky9cL_eG,1,HkinqfbAb,HkinqfbAb,Automatic Parameter Tying in Neural Networks,"Approach is interesting however my main reservation is with the data set used for experiments and making general (!) conclusions. MNIST, CIFAR-10 are too simple tasks perhaps suitable for debugging but not for a comprehensive validation of quantization/compression techniques. Looking at the results, I see a horrific degradation of 25-43% relative to DC baseline despite being told about only a minimal loss in accuracy. A number of general statements is made based on MNIST data, such as on page 3 when comparing GMM and k-means priors, on page 7 and 8 when claiming that parameter tying and sparsity do not act strongly to improve generalization. In addition, by making a list of all hyper parameters you tuned I am not confident that your claim that this approach requires less tuning. + +Additional comments: + +(a) you did not mention student-teacher training +(b) reference to previously not introduced K-means prior at the end of section 1 +(c) what is that special version of 1-D K-means? +(d) Beginning of section 4.1 is hard to follow as you are referring to some experiments not shown in the paper. +(e) Where is 8th cluster hiding in Figure 1b? +(f) Any comparison to a classic compression technique would be beneficial. +(g) You are referring to a sparsity at the end of page 8 without formally defining it. +(h) Can you label each subfigure in Figure 3 so I do not need to refer to the caption? Can you discuss this diagram in the main text, otherwise what is the point of dumping it in the appendix? +(i) I do not understand Figure 4 without explanation. ",6,5.0,ICLR2018 +r1knUinef,3,Sy-tszZRZ,Sy-tszZRZ,Review,"This is quite an interesting paper. Thank you. Here are a few comments: + +I think this style of writing theoretical papers is pretty good, where the main text aims of preserving a coherent story while the technicalities of the proofs are sent to the appendix. +However I would have appreciated a little bit more details about the proofs in the main text (maybe more details about the construct that is involved). I can appreciate though that this a fine line to walk. Also in the appendix, please restate the lemma that is being proven. Otherwise one will have to scroll up and down all the time to understand the proof. + +I think the paper could also discuss a bit more in detail the results provided. For example a discussion of how practical is the algorithm proposed for exact counting of linear regions would be nice. Though regardless, I think the findings speak for themselves and this seems an important step forward in understanding neural nets. + +**************** +I had reduced my score based on the observation made by Reviewer 1 regarding the talk Montufar at SampTA. Could the authors prioritize clarification to that point ! + - Thanks for the clarification and adding this citation. 
",6,3.0,ICLR2018 +EdUrRBJMYFl,1,8W7LTo_zxdE,8W7LTo_zxdE,Interesting idea that needs polishing in terms of presentation and empirical evaluation ,"Variational deterministic uncertainty quantification + +Summary: +The paper proposes a method for out-of-distribution detection by combining deep kernel learning and Gaussian processes. Using neural networks as a kernel for the GP as well as inducing point approximation alleviates the scalability issues of GP. The idea itself has merits, however, the presentation and experiments are not convincing. + +Strengths: The idea of using deep kernels within GP is a good solution that allows benefiting from both the expressiveness of the kernels and uncertainty estimates for GP. Additionally, using the uncertainty estimates for causal inference is a nice application. + +Weaknesses: Although the approach is interesting it needs to be further developed and evaluated in multiple setups. I find it limiting that it relies on the residual connection, making it unsuitable for other NN architectures, which means it will apply to only a limited number of tasks. + + +The presentation of the method should be better structured. I appreciate the background on deep kernels and how it helps to overcome the limits of GP, however, there is a lack of presentation of the method itself. A description, algorithmic listing or even an equation for the uncertainty score proposed is missing in the current version of the text. + +In the introduction vUQD is presented as favorable wrt UQD due to its rigorous probabilistic interpretation, however, this was never further analyzed in the text. Also, seems that the method is concerned only with the epistemic uncertainty in the data? In general, the whole presentation of related work and positioning of this paper in the uncertainty literature is not clear. What source of uncertainty does the method address? There is much to be elaborated on this topic and I believe the discussion on this will significantly improve the paper. + +The discussion on spectral-normalization and bi-Lipschitz in 3.1 +Please clarify it or explain it better, in the current writing it is contradicting the proposed method: +“A complete, rigorous theory of why the spectral normalization as used in this and previous work is a useful regularization scheme is not remains an open question” + +Experiments: + +Toy examples: +Figure 1 - on regression, I do not find this example motivating, first, why choosing noiseless data? Second, why is the vUQD increasing in reasons where there is data (such as the peaks?) Why does it compare only to deep ensembles? +Figure 2 - Why choosing a toy example where a linear classifier works in the original space? + +What is the sensitivity to the number of inducing points for the GP? An ablation study at least for the toy data sets can help. + +Why were standard datasets such as MNIST and fashion MNIST not included? +The empirical evaluation should be extended with more baselines and datasets. + + +Minor: +The manuscript needs proofreading, language errors increase increasingly towards the conclusion. + +------------- Update after reading authors response ------------- + +I thank the authors for their detailed responses, they have answered most of my concerns and I raise my score to 5. I am still not convinced about the method covering both the aleatoric and epistemic uncertainties, without any theoretical or intuitive justification, and without any discussion/clarification on that part. 
If indeed this is the case, then additional experiments should be included, for example for a regression task, the standard UCI datasets [1]. + +[1] Hernandez-Lobato, J M and Adams, R P. Probabilistic ´ +backpropagation for scalable learning of bayesian neural networks. In ICML-15, 2015.",5,4.0,ICLR2021 +B1gdgfOFhX,2,HkxWrsC5FQ,HkxWrsC5FQ,A generative model for images,"After the rebuttal: I appreciate the authors' effort to revise the paper. The revision made clear that the data produced by the proposed generative model is not linearly separable in general while the theory (Theorem 2) still holds. + +I am keeping my original evaluation as there still seems to be a lack of stronger experimental evidence. The fact that the classification algorithm motivated by the generative model can do as well as a similar-sized ConvNet does not quite support that the generative model itself is good -- getting a good classifier is still an easier task than getting a good generative model. + +===================== + +This paper proposes a new generative model for natural images. Based on the architecture of the generative model, a “layer-wise clustering” algorithm for image classification is proposed and theoretically shown to converge to an optimal classifier. Experimentally, the algorithm is shown to have similar performances as a baseline CNN on CIFAR-10. + +The main novelty of this paper is the proposed hierarchical generative model and the associated algorithm. One interesting feature is that the network obtained by this algorithm is entirely linear except for the ReLU-pool part. However, the ReLU-pool does not serve as a typical nonlinearity / pooling I believe; rather it sounds to me like a specially tailored step for the theoretical results, which under the “patch orthonormality” assumption is guaranteed to recover the previous layer. Therefore, it surprises me a little bit that the algorithm actually works reasonably well on CIFAR-10. However, as the baseline it compares with is still below ""typical"", I do want to see if this algorithm can be scaled up to match the performance of more complicated (at least pre-ResNet) models such as VGG. + +The theoretical result looks appealing, but I feel like the magic more or less comes from the strong assumptions. In particular, in expectation the output image is just a *linear* operator on the initial (m_0 x m_0 x C_0) one-hot semantic variable. Also, the patch orthonormality assumption implies that intermediate semantics can be perfectly recovered by the (clustering + conv with centroids + ReLU-pool) step, as we are just recovering a partition of a group of orthonormal vectors.",6,3.0,ICLR2019 +Q6pPx7-CYF,3,KWToR-Phbrz,KWToR-Phbrz,A method that counterfactually explains predictions of classifiers by exploiting perturbations on data samples.," + + +In this paper, the authors present a method based on counterfactuals that learns a perturbation using constraints to ensure diversity in explanations. + +The authors argue that explanations produced by their method are more “actionable, diverse, valuable and proximal than the previous literature”. However, +it is unclear how they quantitatively measure these attributes, given that FID scores only captures the similarity of generated images to real ones. + +I would like to understand the motivation on using the perceptual reconstruction loss. The authors should clarify the usage of this loss in their method and +highlight its importance on their explanatory method. 
The author briefly mentioned the gains in terms of image quality, when compared with GANs in PE. +However, I would like to see a more deeper discussion. + +Since interpretability is closely related to users/humans, it is difficult to assess the quality of the generated explanations without human evaluations. +An initial setup could be the one used in PE. + +Overall, assuming the above limitations, the experiments help to understand the contributions of the article. + +Typos: + +- Sec. 3.3: “Since these mask are …” -> “Since these masks are…” +- Sec. 4.2: “In Figure 2b, …” -> “In Figure 2, …”",7,3.0,ICLR2021 +r1z2QSOlz,2,HytSvlWRZ,HytSvlWRZ,"The authors propose a DNN, called subspace network, for nonlinear multi-task censored regression problem. The writing needs more elaboration. The experiments are unconvincing.","The authors propose a DNN, called subspace network, for nonlinear multi-task censored regression problem. The topic is important. Experiments on real data show improvements compared to several traditional approaches. + +My major concerns are as follows. + +1. The paper is not self-contained. The authors claim that they establish both asymptotic and non-asymptotic convergence properties for Algorithm 1. However, for some key steps in the proof, they refer to other references. If this is due to space limitation in the main text, they may want to provide a complete proof in the appendix. + +2. The experiments are unconvincing. They compare the proposed SN with other traditional approaches on a very small data set with 670 samples and 138 features. A major merit of DNN is that it can automatically extract useful features. However, in this experiment, the features are handcrafted before they are fed into the models. Thus, I would like to see a comparison between SN with vanilla DNN. ",5,3.0,ICLR2018 +HJx8NdWaKB,1,BJxH22EKPS,BJxH22EKPS,Official Blind Review #1,"Summary: + +This paper tries to understand the characteristics of the architectures found by common NAS methods in the cell-search space. Specifically it characterizes the cell-search space used by DARTS, SNAS, AmeobaNet and finds that a most of these search methods find cells which are wide and shallow in depth (they give a specific definition of width and depth for characterizing cells). In fact these cells are usually the widest and shallowest architectures in their search space. The author empirically find that because these kinds of topologies converge faster during training and inevitably every NAS algorithm during search don't train upto convergence but only up to a bit and make decisions based on partially converged statistics there is a bias in selection towards these topologies. They also provide theoretical intuition to back-up these empirical findings. + +Secondly they analyze the generalization performance of such wide and shallow cell structures accidentally emphasized by search procedures. They take the common cell structures found by common NAS algorithms (NASNet AmoebaNet, ENAS, DARTS, SNAS) and make them the widest and shallowest possible in the search space (following the SNAS cell connection pattern) while keeping number of parameters as constant as possible. They find that on cifar10 the test error of the adapted architectures usually increase a bit while on cifar100 the adapted architectures decrease a bit. + +Comments: + +- Overall the paper is interesting and well-written. Definitely liked the fact that wide and shallow networks are being accidentally biased towards during search. 
Liked the empirical analysis and theoretical insights backing it up. + +- The generalization experiments suggest to me that on bigger datasets wider and shallower networks might be better for generalization actually. Can we take the cell architectures found by various algorithms and 'scale-up' to ImageNet by doing the usual trick of replicating more of the cells together and training? At least going by Table 1 I find myself not agreeing with the statement ""The results above have shown that architectures with the common connection pattern may not generalize better despite of a faster convergence."" On cifar100 wider and shallower is better. Perhaps on ImageNet they will be even better? So NAS algorithms' strategy of training partially may be exactly the right thing to do? Any thoughts? + +- Any idea about if this pattern extends to RNN space as well or only limited to CNNs? + +- Overall my main gripe is that while it is interesting findings but I am not sure I understood the main takeaway or significance of these results especially the generalization ones and how it informs search algorithm design.",3,,ICLR2020 +LdvS6MrHsNz,1,6puCSjH3hwA,6puCSjH3hwA,Review on Paper751 ,"Summary + +This paper proposes a method to disentangle content and motion from videos for high-resolution video synthesis. The proposed method consists of a motion generator, pre-trained generator, image discriminator, and video discriminator. The motion generator predicts the latent motion trajectory z, which is residually updated over time. Then the image generator produces each individual frame from the motion trajectory. For training, five types of loss functions are combined. In experiments, video generation by the proposed method is performed on UCF-101, FaceForensics, and Sky time-Lapse datasets. Also, cross-domain video generation and more ablation studies were conducted to show the effectiveness of the proposed method. + +Overall, about this paper, I am leaning on the positive side. I summarize both the strength and weakness of this paper that I felt. + +Strength + +The main benefits of this paper are covering high-resolution video synthesis and disentangling motion and contents. The proposed method was tested on the various datasets and showed experiments about cross-domain video synthesis. Also, ablation studies were performed to check the effects of each loss function. + + +Weakness + + +As mentioned in the Introduction, the second desired property for generated video is temporal coherency. The motion generator may be helpful to find a motion trajectory that makes a temporally consistent video. However, it is unclear whether the meaning of ""temporarily constant video"" means that the content of the video is consistent over frames in video or there is no flickering effect in the video. Also, there seems to be a lack of experiments to directly verify the temporal coherency. + +About the paragraph “Motion Disentanglement” in section 3.1, how to decide variable “m” for PCA? Also, what is the reason why using motion residual is helpful to motion disentanglement? + + + +For evaluation, is there a reason for using different evaluation metrics for each dataset? Also, for each dataset, methods for comparison are different from each other. Is there a special reason to use different methods for comparisons? For, UCF101, IS and FVD were used while FVD and ACD were used for the FaceForensis dataset. Also, FVD, PSNR, and SSIM were used for Sky Time-lapse dataset. 
It is better to conduct a comparative experiment with the same metric and the same compared methods for all datasets. + + +There are qualitative results in section4.2 about cross-domain video generation. Are there results that have been verified quantitatively? +",6,3.0,ICLR2021 +rJxjxNSatH,2,S1lslCEYPB,S1lslCEYPB,Official Blind Review #3,"The paper presents to use \eta-trick for log(.) in Donsker-Varadhan representation of the KL divergence. The paper discusses its dual form in RKHS and claims it is better than the dual form of f-divergence (while I don't think it's a correct claim). Nevertheless, in experiments, we see that it outperforms the original neural estimation of mutual information. + +[Overall] +Pros: +1. The idea of avoiding the log-sum-exp computation in the DV representation of KL is good. One of the main reasons is to get rid of biased estimation. This idea may not be too novel, but definitely is useful. + +Cons: +1. I don't agree with some claims in the paper. Nevertheless, these claims are some of the main stories supporting the paper. +2. The presentation of the paper should be improved. Including the presentation flow between sections, and also misleading part in experiments. +3. There are TOO MANY typos in equations. + +[Cons In Details] + +The paper spends huge paragraphs discussing the role of the dual formulation. And it also introduces \eta-DV-Fisher and \eta-DV-Sobolev, which can be seen as extensions of the proposed method. +Nevertheless, the author doesn't present an evaluation using the dual form. It's a pity that this part is missing. Having Section 3 makes the paper contain several sporadic arguments unrelating to the research questions. Re-organize the paragraphs/ presentation is suggested. +Another example is introducing perturbed loss function into the proposed loss function to make it strictly convex. This paragraph is misleading and can be moved entirely to the Appendix. + + +The author emphasizes (in multiple places) that the f-divergence biases the estimate. And it exemplifies from eq. (15) to eq. (16). Nevertheless, in eq. (16), when r* is optimum, there should be no second term. The author's claim is based on the comparisons between eq. (13) and eq. (15) when assuming only the MMD term reaches zero. The statement may not be fair. +
+Similar to Section 3, this section is cluttered. I don't get the reason why the author specifically come up with a new section comparing only to one mutual information estimation approach (a-MINE). Another irrelevant part of the research question is point 4 under page 7. Why discussing the extension of the a-MINE to conditional MI estimation? +Some Typos: In equation (21) and (22), \eta is missing. In Algorithm 1, there are too many typos such as missing \theta under f, entirely wrong equation for the Output. + + +Can the author discuss the standard deviation for various MI estimation approaches? The large standard deviation for MINE seems unusual. + + +The author mentioned that the network considering only the first two convolutional layers, followed by a linear classifier, leads to the best performance. Is there no other layer in between? Also, in Figure 4 (a) and (c), is the purple-color layer means the final classification layer? It is a bit confusing. + + +I understand most of the people would not read the Appendix, but I do. Missing brackets, grammatic errors, missing integrals, wrong notations, missing punctuation marks, ill-structured presentations, etc., are the problems in the Appendix. I would greatly appreciate the author also spend some time in the Appendix. + +[Summary] +I would vote for a score with 4 or 5 to this paper. +Regarding there're only 3/6, I'm proposing a score of 6 now. But I look forward to the authors' response and then addressing the problems that I identified. I feel the paper should be a strong paper after a good amount of revision.",3,,ICLR2020 +HklhT8o3KH,1,S1eSoeSYwr,S1eSoeSYwr,Official Blind Review #1,"This paper proposes a novel approach to estimate the confidence of predictions in a regression setting. The approach starts from the standard modelling assuming iid samples from a Gaussian distribution with unknown mean and variances and places evidential priors (relying on the Dempster-Shafer Theory of Evidence [1] /subjective logic [2]) on those quantities to model uncertainty in a deterministic fashion, i.e. without relying on sampling as most previous approaches. This opens the door to online applications with fully integrated uncertainty estimates. +This is a very relevant topic in deep learning, as deep learning methods are increasingly deployed in safety-critical domains, and I think that this works deserves its place at ICLR. + +Pros: +1. Novel approach to regression (a similar work has been published at NeurIPS last year for classification [3]), but the extension of the work to regression is important. +2. The experimental results show consistent improvement in performance over a wide base of benchmarks, scales to large vision problems and behaves robustly against adversarial examples. +3. The presentation of the paper is overall nice, and the Figures are very useful to the general comprehension of the article. +Cons: +1. The theory of evidence, which is not widely known in the ML community, is not clearly introduced. +I think that the authors should consider adding a section similar to Section 3 of Sensoy et al. [3] should be considered. Currently, the only step explaining the evidential approach that I found was in section 3.1, in a very small paragraph (between “the mean of […] to \lambda + 2\alpha.”). I believe that the article would greatly benefit from a more thorough introduction of concepts linked to the theory of evidence. +2. 
The authors briefly mention that KL is not well defined between some NIG distributions (p.5) and propose a custom evidence regularizer, but there’s very little insight given on how this connects to/departs from the ELBO approach. + +Other comments/questions: +1. (p.1) I’m not sure to fully understand what’s meant by higher-order/lower-order distributions, could you clarify? +2. (p.3) In section 3.1, the term in the total evidence \phi_j is not defined. +3. (p.3) Could you comment on the implications of assuming that the estimated distribution can be factorized? +4. (p.4) Could you comment on the difference that there is between NLL_ML and NLL_SOS from a modelling perspective? +5. (p.4) The ELBO loss (6) is unclearly defined, and not connected to the direct context. I would suggest moving this to the section 3.3, where the prior p(\theta) used in eq. (6) is actually defined. +6. (p.4) In equation (6), p_m(y|\theta) isn’t defined, and q(\theta|y) is already parameterized on y if I understand that q(\theta)=p(t\heta|y1,…,yN). Making the conditioning explicit in equation (6) might make the connection to the ELBO clearer. +7. (p.7) I’m not sure to understand how the calibration of the predictive uncertainty can be tested by the ROC curves if both the uncertainty and estimates error are normalized. Could you also define more clearly what you mean by an “error at a given pixel”? +8. Spelling & typos: +- (p.4) There are several typos in equation (8), where tau should be replaced with 1/\sigma^2. +- (p.8) In the last sentence, there is “ntwork” instead of network. +- (p.9) There is a typo in the name of Jøsang in the references. +- (p.10) In equation (13), due to the change of variable, there should be a +-(1/\tau^2) added; +- (p.10) In equation (14), the \exp(-\lambda*\pi*(…)) should be replaced with \exp(-\lambda*\tau*(…)). + +[1] Bahador Khaleghi, Alaa Khamis, Fakhreddine O Karray, and Saiedeh N Razavi. Multisensor data fusion: A review of the state-of-the-art. Information fusion, 14(1):28–44, 2013. +[2] Audun Jøsang. Subjective Logic: A formalism for reasoning under uncertainty. Springer Publishing Company, Incorporated, 2018. +[3] Sensoy, Murat, Lance Kaplan, and Melih Kandemir. ""Evidential deep learning to quantify classification uncertainty."" Advances in Neural Information Processing Systems. 2018. +",6,,ICLR2020 +HyeGN5C85B,2,ByxODxHYwB,ByxODxHYwB,Official Blind Review #3,"The paper proposes a multi-source and multi-view transfer learning for neural topic modelling with the pre-trained topic and word embedding. The method is based on NEURAL AUTOREGRESSIVE TOPIC MODELs --- DocNADE (Larochelle&Lauly,2012). DocNADE learns topics using language modelling framework. DocNADEe (Gupta et al., 2019) extended DocNADE by incorporating word embeddings, the approach the authors described as a single source extension of the existing method. + +In this paper, the proposed method adds a regularizer term to the DocNADE loss function to minimize the overall loss whereas keeping the existing single-source extension. The authors claimed that incorporating the regularizer will facilitate learning the (latent) topic features in the trainable parameters simultaneously and inherit relevant topical features from each of the source domains and generate meaningful representations for the target domain. The analysis and evaluation were presented to show the effectiveness of the proposed method. However, the results are not significantly improved than the based line model DocNADE. 
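+To state explicitly how I read the method (my own shorthand; none of these symbols are from the paper): the training objective appears to be roughly L(\theta) = L_DocNADE(D_target; \theta) + \sum_k \lambda_k \Omega(\theta, \theta_k^{src}), where \theta_k^{src} are the pre-trained topic/word representations of the k-th source and \Omega is a penalty pulling the target parameters toward them. If this reading is inaccurate, it would help to spell the objective out in one place, because the contribution rests on this regularization term.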
+ +Overall, the paper is written well. However, it is not clear to me that the improved results are resulted due to multi-source multi-view transfer learning or for the better leaning of the single-source model due to the incorporation of the regularizer. + + +",3,,ICLR2020 +EdiKydx5iVM,1,1yXhko8GZEE,1yXhko8GZEE,"Good paper with clear motivation, nice approach, and good results","Update: +Thank the authors for the detailed feedback. I decide to keep the score. +--- +--- +The paper shows that the failure mode of spectral normalization (SN) is often accompanied by large condition numbers in the discriminator layers. Motivated from this observation, the paper proposes to control the condition numbers of the discriminator layers, by adding preconditioning to the weights. The results show that the proposed approach makes the training more stable and achieves better sample quality on several datasets. + +Overall, I enjoy reading this paper. The motivation is clearly explained, and the approach is simple yet effective. I would recommend an accept. I do find that the writing needs to be improved (e.g. there are many typos in both the main paper and the appendix), and more experiments can be explored. However, I think these are relatively minor issues that do not diminish the overall quality of the paper. + +The minor issues and suggestions are listed below. + +Missing details: +* Section 3.4 mentioned that ""the best preconditioner varies for different datasets"". It is fine, but you should show the FPC results with all different degrees, so that it is clearer how sensitive FPC is to the degree. I only see the results with degree 3 and 7 in the main text and the appendix. +* APC needs to use the singular values of the weights. How do you compute the singular values, by exact computation or estimation? If you do the exact computation, then I have a question about the scalability of the approach. For convolutional layers (which are the dominant components in the network architectures you experimented on), the computation is cheap because the kernels are small. But it might computationally expensive for fully connected layers. +* How do you compute the actual true spectral norm in Appendix E.1? By exact computation or estimation? + +Suggested experiments/discussions/related work: +* From the appendix, I understand that for convolution layers you conduct the preconditioning on the reshaped kernels. However, the condition numbers/singular values of this reshaped kernel are NOT the same as those of the convolution layer (see the discussions in [1]). I wonder how your theorems and results would change considering this difference (e.g. how the theorems would change if you are controlling the condition numbers of the reshaped kernels instead of the layers, and how the results would be if you strictly controlling the condition number of convolutional layers)? +* The spectral normalization paper [2] shows that spectral normalization is robust across different learning rates and betas in the Adam optimizer. You only tried one learning rate and one beta for each experiment. How sensitive are FPC and APC to the learning rates and betas? +* I realize that this paper is very relevant to [1] (which is online recently so I understand that you are not required to know it). That paper also found out the instability issue in different variants of spectral normalization, and proposed an improved version (which are different from yours but could be relevant). 
Their results on CIFAR-10 and STL-10 seem to be better than FPC and APC you proposed in Table 4. You might want to add the discussions and/or experimental comparisons to it. +* Another missing related work is [3], which discussed the importance of condition numbers in the generators of GANs. They proposed Jacobian Clamping for controlling the condition numbers by regularization. Although they focused on generators, nothing stops it from applying Jacobian Clamping on discriminators. You might want to add the discussions and/or experimental comparisons to this approach. + +Writing: +* It is weird to discuss the scaling trick twice in Section 3.1 and in the ""Choice of desirable range"" paragraph of Section 3.3. +* Before the ""Choice of target function"" paragraph of Section 3.3, the scaling trick is already discussed, and the previous paragraph ends with [\gamma_L,\gamma_U]=[0,1.1]. It is weird to go back to [\gamma_L,\gamma_U]=[\lambda_1,\lambda_m] again in this paragraph. It is better to stick to the scaled version of singular values in this paragraph. +* It is better the discuss briefly how SVD works in the main text as you are comparing with it. + +Typos: +* 6th line in ""Choice of desirable range"" paragraph in Section 3.3: \sigma_{min}(A) -> \sigma_{min}(A)/\sigma_{max}(A) +* The end of ""Search space of preconditioning polynomial"" paragraph in Section 3.3: wrong latex code for the section reference? +* Last line before Section 4: compputation -> computation +* Paragraph ""Failure mode: large condition numbers."" in Section 4: missing a period at the end +* Header of table 1: there should be a bar on top of each W_l +* Caption of table 1: there should be a bar on top of the W_l +* Header of table 2: there should be a bar on top of the W_l +* The last two paragraphs on page 18 & the first paragraph on page 19 (appendix): Table -> Algorithm +* The first paragraph in Appendix F.1: deg-9 -> deg-7 +* The last paragraph on page 28 (appendix): mojority -> majority + +[1] Lin, Zinan, Vyas Sekar, and Giulia Fanti. ""Why Spectral Normalization Stabilizes GANs: Analysis and Improvements."" arXiv e-prints (2020). + +[2] Miyato, Takeru, et al. ""Spectral normalization for generative adversarial networks."" arXiv preprint arXiv:1802.05957 (2018). + +[3] Odena, Augustus, et al. ""Is generator conditioning causally related to gan performance?."" arXiv preprint arXiv:1802.08768 (2018).",7,4.0,ICLR2021 +EauVzx4Nz65,4,ascdLuNQY4J,ascdLuNQY4J,An interesting paper,"The paper claims to perform neural operator search on a search space defined by a family of Kaleidoscope operations. The paper address the computation challenges in the search using a supernet, and performs variable ablation studies to show that the searched K-Op can slightly outperforms existing convolution operators. + +Here I appreciate if the authors can clarify the following points: + +a) ""Each butterfly matrix of dimension n x n is itself a product of log n sparse matrices with a special fixed sparsity pattern, which encodes the recursive divide-and-conquer algorithms such as the FFT."" + +Here the author deliberately selects a family of operations that contains FFT and convolution. It is unclear how you constrain the search space of K-matrices; I get you searched K operations, but I did not find it's structure and how does that different from FFT. I'm also confused about the point of fig.2, what's the point you want to say about these feature map? + +b) Ablation studies. 
+ +There are a lot of ablation studies show that the searched K-operations are better than convolution. However, none of them show they actually pushed SoTA results. Considering the fact that there are so many tricks (https://github.com/facebookresearch/LaMCTS/tree/master/LaNAS/LaNet) in boosting the performance of a CNN, it is more convincing to see the searched K operators can actually push the boundary. The ablation studies in the current experiments are not enough to convince me, especially they are focusing on relatively simple tasks, e.g. CIFAR-10, MNIST. + +c) Another thought, Tensorized Neural Network has also tried to replace of current operators, and their hyper-parameters can also formulate a search space; My main concern about this line of work is ""are we really making progress here""? + +Here we're building something based one prior knowledge; if we will end up someting similar to convolution, so what's the point of doing it? However, the paper lacks a strong evidence that they invented a new operators that actually work. This is my main concern of this paper. + + + + +",5,5.0,ICLR2021 +BJeh3af0FH,2,r1eU1gHFvH,r1eU1gHFvH,Official Blind Review #1,"The authors studied the local codes in neural networks through a set of controlled experiments (by controlling the invariance in the input data, dimensions of the input and the hidden layers, etc.), and identified some common conditions under which local codes are more likely to emerge. + +The fact that local codes tend to emerge as a response to invariance is interesting but not surprising, especially given that convolution operations are designed to capture location invariance. It would be useful if the authors can clarify their contributions and compare against existing works in the literature. + +Experiments are conducted at a relatively small scale: On a synthetic dataset with binarized vectors and on MNIST, which a predefined rule for noise injection (Figure 1). The controlled experiments conducted in the paper are still informative, but the overall message would be much stronger if the empirical analysis can be extended to common benchmarks such as CIFAR and/or ImageNet. + +All of the experiments are based on very shallow networks (3-4 layers), and as the result, the study ignores batch normalization and skip connections which are common ingredients in state-of-the-art convolutional networks. It remains unclear whether the presence of those components would change the emergence behavior of local codes, and hence affect some of the conclusions in the paper.",3,,ICLR2020 +rJgIEz285r,3,BJl8ZlHFwr,BJl8ZlHFwr,Official Blind Review #2," +The main topic of this paper is generalized zero-shot learning. This paper modifies traditional VAE method with attribute matching prior to release the hidden features from original regularization. This paper also proposes a domain discriminator to enhance class-separability of learned features to avoid unseen classes to be covered by seen classes. Experiment results show their efficiency under relation-based setting. + +Pros: +1.This paper proposes an important insight that in generalized ZSL, the unseen classes may be dominated by seen classes in the feature space. +2.An easy but efficient domain discriminator method is proposed to separate different classes to avoid domination. +3.Even without large synthetic learning architecture, the proposed method gets comparable results. + +Comments: +1.The proposed MCMVAE is No-longer a VAE but an AE with attribute matching loss. 
Except that a new theory of MCMVAE is proposed, it is not rigorous to relate MCMVAE to VAE. +2.Add results using synthetic architecture to get a better result will make this method more reliable. +3.Why discriminator is harmful for PSE method? +",6,,ICLR2020 +rkxX9dh3YB,2,BJg9hTNKPH,BJg9hTNKPH,Official Blind Review #3,"The paper introduces a general framework for behavior regularized actor-critic methods, and empirically evaluates recent offline RL algorithms and different design choices. + +Overall, the paper is well written and easy to follow. I appreciate the authors for their careful empirical study. I am leaning to accept the paper because (1) the experimental design is rigorous and the results provide several insights into how to design a behavior regularized algorithm for offline RL. + +There are some comments for the experiments. +1. Are the results significant (e.g. Figure 3 and 4)? Have you checked the error bars? +2. Missing numbers: trained_alpha in dataset 0 of Hopper-v2 in Figure 1 and SAC in dataset 0 of Hopper-v2 in Figure 6? Are they negative so not reported in the figure or just missing? +3. Do you think the conclusion will change if you use training datasets of different size (e.g. much less than 1 million)? ",6,,ICLR2020 +LEP5quICJSi,4,dOcQK-f4byz,dOcQK-f4byz,Teaching Temporal Logics to Neural Networks,"This paper applies transformer models to learning linear-time temporal logic (LTL) and propositional logic. The results show that, when trained on large datasets of random formulas, transformer models can perform quite well on within-distribution held out test tasks, and when equipped with tree-positional encodings, they exhibit generalization to longer formulas. + +Strengths: +-The paper is very clearly written +-Novel domain for neural approaches +-The results on the provided datasets are strong + +Weaknesses: +I think the main weaknesses fall into three categories: novelty of the approach, analysis and comparison to baselines, and use of purely synthetic test data. + +Novelty of the approach: +The approach is not novel, using a transformer model and a previously proposed tree-positional encoding scheme. +Novelty in the approach is not required, as long as the paper also contains experimental analysis which provides insight into either the technique or the problem studied, and the evaluation is thorough. However, I find that there are weaknesses in both these aspects. + +Analysis and baselines: +The main paper reports results of a single model, and does not report results of any baselines, besides the length generalization results. If the paper claims that transformers specifically perform well on these problems, then comparing with alternate baselines (such as sequence or tree RNNs) is necessary. If the claim is instead that high capacity models in general can perform well on important LTL tasks, then I think that the use of only synthetic test data is a weakness (see below). In either case, I think further analysis of the conditions under which the model succeeds vs fails is warranted. + +Synthetic data: +This work evaluates models only on randomly generated synthetic test data, which I view as a disadvantage. Although the paper demonstrates that training models on a large synthetic corpus provides good within-distribution test performance (as well as length generalization), it’s unclear how it would perform on natural data. I also don’t have a clear sense of how difficult the training and testing problems are relative to tasks relevant to people. 
Is it possible to collect a small non-synthetic test corpus of LTL formulas, to verify that the trained models can generalize to tasks relevant to people? The LTLPattern126 dataset is constructed from formulas from 55 LTL specification patterns identified from the literature. Can the models trained on these patterns generalize to other patterns from the literature? Similarly, could models trained on a subset of the 55 patterns generalize to the held-out patterns? + +Because of the lack of baselines and analysis, and the use of purely synthetic test data, it’s difficult to evaluate the results of the paper. For this reason, coupled with the fact that the approach is not novel, I recommend a weak reject. However, I would be willing to raise my score if concerns about baselines, analysis, or synthetic data were addressed. In particular, evaluation on non-synthetic data would be very valuable. + +Additional suggestions: +As stated above, I think more detailed experimental analysis would be very helpful, and could greatly strengthen the paper. +-How do transformers compare to other models, such as tree or sequence RNNs? +-What qualitatively (or quantitatively) distinguishes those formulas which a transformer can solve from those it can’t? Could insights here lead to proposed changes in architecture or data generation? +-What qualitatively (or quantitatively) distinguishes those formulas for which a transformer achieves syntactic accuracy from those for which it achieves semantic accuracy but not syntactic accuracy? Again, could insights here lead to proposed changes in architecture or data generation? +-Could a model trained on LTLRandom35 generalize to LTLPattern126 and visa versa? +-How do models with the sequence positional encoder perform on the LTLpattern126 and LTLUnsolved254 datasets? + +Minor comments: +-I’d definitely recommend moving figure 9 to the main text, as it’s the only comparison between the model and a baseline. + + +",5,3.0,ICLR2021 +jCBAeT1NWC8,3,CGQ6ENUMX6,CGQ6ENUMX6,Interesting research in evolving robot morphologies,"The paper presents a method to evolve morphologies of a robotic system without explicit reward and using an empowerment-like quantity. The thus obtained morphologies can get higher rewards in task settings when RL algorithms are applied. + +The story and presentation of the paper are clear. I also like the results and think it is interesting, maybe more for a conference like ALife than ICLR though. + +Strengths: +- Interesting way to estimate the information criterion using a GNN +- formulation of a morphology-empowerment +- analysis and ablations justify the design choices and the method +- good results (on self-given tasks) + +Weaknesses: +- the mathematical formulation is sloppy in many places +- the morphologies are compared on different sequences. It might be interesting to know how much variance the estimation of Eq 1 has when sampling a different batch of action-sequences + +I think when the problems with the formulation are fixed and the typos are removed the paper can be much stronger. + +Related work: As you quantity is very close to the Empowerment definition, I think the original work by the Polani group should be cited, e.g. ""Empowerment: A universal agent-centric measure of control"", Klyubin, Polani, Nehaniv +An interesting combination could be to use task agnostic live-time adaptation, such as predictive information maximization (Information Driven Self-Organization of Complex Robotic Behaviors, PLoSOne) or related work could be combined. 
+ +Problems in the mathematical formulation: +Page 3, first equation (before (1)): What is the expectation taken over? +You write about q_theta being a classifier, but most of the paper reads like you have continuous actions. This should be clarified early on that you assume discrete actions/ action-primitives for finding the morphologies. +Eq 1: j appears in the right but without any specification. +You write that you take expectations of joints, but what is p(j), so the probability of a joint. I think you want to add a sum over the joints $j$ or something. What is |J|? + + +Details: +- Sec 1: Second, Using (capital) +- Sec 3.1: Notice that left untouched, the... +- Sec 3.1: By assuming actions are uniform A_j? +- Action Distr: The appendix should contain information about the action sequences you are using. +- GNN Classifier: the the +- Sec 3.3: slowest step in for simulation.... +- Sec 4.4: ""meta"" action: use the same label as in Fig 5 (global action) +- Table 3: ..TAME and a similar +- Sec 5: You write randomly sampled actions, but you have highly structured action-primitives +",7,4.0,ICLR2021 +ryx7EiaBKr,1,BkgzqRVFDr,BkgzqRVFDr,Official Blind Review #2,"Reinforcement Learning with Probabilistically Complete Exploration +========================================================== + +This paper proposes an exploration technique for planning from simulator. +Roughly speaking, the algorithm uses some initial budget to sample random states and generate some effective demonstration trajectory. +Once this trajectory is found, it can be used to form an initialization for a policy gradient method. +This leads to improved performance on some mujoco tasks. + + +There are several things to like about this paper: +- The problem of efficient exploration in large-scale RL is a big outstanding hole. Particularly finding methods that are compatible with state of the art policy gradient approaches. +- The proposed algorithm is sensible, and seems to use a reasonable heuristic from planning to generate good ""kickstart"" for policy gradient methods. +- The general quality of the writing and presentation is pretty good. +- It's great to see code released. + +However, there are some places where this paper falls down: +- Assumptions 1 (and particularly 2) are *not* part of the standard RL problem... but actually show that this is a proposal for planning given a simulator. This is a different problem setting and the distinction is really not clear from the first (several) pages. Further, although the assumptions are stated clearly, I think that this leads to some unfair comparisons (even if the sampled states are taken from the X-axis budget in plots). + - Note that this paper is far from the only one that is a bit sloppy on this distinction... and, of course, you can still use an RL *algorithm* to solve the planning *problem*... but it's not clear you can use a planning algorithm to solve the RL problem... and that's what this paper claims... but then Assumption 2 essentially just reduces the RL problem to the planning problem! +- The claims of ""probabilistic completeness"" are not particularly insightful, in fact, the same is also true of Q-learning with epsilon-greedy dithering! The point of efficient exploration would be that you find this stuff quickly... and I'm not really convinced that this method always would. The quality of \hat{a}, \hat{b} seems like it should be very important... but I don't get much insight to that spelled out in the paper. 
+- The computational evaluations are not particularly insightful, in that they seems to not give much insight into exactly what is happening. I also wonder whether they are really ""fair"" comparisons given the planning vs RL distinction. + + +Overall, I think that there are interesting pieces to the paper, and the underlying algorithm is also interesting. +For me, the confusion between the planning and RL setting is impossible to move past... particularly since the exploration challenges can be distinct in these domains. +For this reason, I don't think the paper is ready for publication. +",3,,ICLR2020 +SJxfjMU9YB,2,S1lNWertDr,S1lNWertDr,Official Blind Review #1,"The proposed work investigates the problem of learning hierarchy in RNNs. Authors note that different layers of the hierarchy are trained in ""sync"". The proposed paper suggests to decouple the different layers of hierarchy using auxiliary losses.  The form of auxiliary losses used in the paper are of the form of local losses, where there is a decoder, which is used to decode past inputs to each level from the hidden state that is sent up the hierarchy, therebyforcing this hidden state to contain all relevant information.  + +Clarity of the paper: The paper is clearly written. + +Method: The proposed method  ignores the gradients from higher to lower levelsin the backward pass,  (because of this, the authors can also save some memory). In order to compensate for the lost gradients, authors propose to use local losses, and we introduce an auxiliary loss term to force this hidden state to contain all information aboutthe last k inputs. The authors note that the hidden state from the lower level (to the higher level) should contain the summary of the past, and hence use a decoder network (which is simply parameterized) as a feedforward network which is used to decoder a ""past"" hidden state.  + +Related Work Section: The related work section is nicely written. The authors have covered mostly everything. These 3 papers may still be relevant. (a), (b), (c).   (b) could be relevant for mitigating the parameter update lock problem as mentioned by authors in the introduction of the paper. (c) is also relevant as authors in (c) also consider using auxiliary losses for learning long term dependencies.  +(a) SkipRNN: https://arxiv.org/abs/1708.06834(b) Sparse Attentive Backtracking: http://papers.nips.cc/paper/7991-sparse-attentive-backtracking-temporal-credit-assignment-through-reminding +(c)  Learning long term dependencies in RNNs using auxiliary losses https://arxiv.org/abs/1803.00144 +Experiment Section: In order to validate the proposed method, authors evaluate it on copying task, pixel MNIST classification, permutedpixel MNIST classification, and character-level language modeling.  +a) Copying results show that the decoder network are essential to achieve decent results. This task though does not show the strength of the proposed method though as baseline also solves the problem completely. It might be interesting to actually scale the ""gap"" time in copying time step to something larger like T = 1000 or something. +b) PIXEL MNIST classification: Authors use the pixel by pixel classification task to test the proposed method. Here, the proposed method performs comparable to the hierarchical RNN (but without using too much memory).  +c) Character level modelling: Authors demonstrate the performance of the proposed method on language modelling task (PTB). 
These results are particularly not interesting, as the performance gain is very marginal. Also, may be using other language modelling datasets like Wikitest103 or Text8 might be more useful.  As for the results, even unregularized LSTM performs better than the baseline in this paper. (For reference, see https://arxiv.org/abs/1606.01305)  + +What authors can do to improve paper: +- The problem considered in the proposed paper is very interesting to me. Though, the results are not (yet) convincing. It might be interesting to think about a task, where there are really long term dependencies like reading CIFAR10 digit pixel by pixel and then doing classification, where the authors can actually show the promise of the proposed method.  +- It might also be interesting to know how are the original training cost objective is weighed against the auxiliary loss. Have authors tried any search over what kind of auxiliary loss performs well ? ",1,,ICLR2020 +Skl0FlWAtS,2,r1egIyBFPS,r1egIyBFPS,Official Blind Review #2,"The authors present a framework for symbolic superoptimization using methods from deep learning. A deep learning approach operating on the expression tree structures is proposed based on a combination of subtree embeddings, LSTM RNN structures, and an attention mechanism. + +The approach avoids the exploitation of human-generated equivalence pairs thus avoiding human interaction and corresponding bias. Instead, the approach is trained using random generated data. It remains somewhat unclear how the corresponding random data generation influences general applicability w.r.t. other tasks, as the authors apply constraints on the generation process for complexity reasons. A corresponding discussion would be valuable here. + +In Secs. 3 & 4, the authors present their specific modeling and learning approach. However, they do not report on modeling or learning alternatives. It would be interesting for the audience to understand, how the authors reached these specific choices, and how (some of) these choice influence performance and learning stability. For example, in Sec. 4.1, an additional loss term is introduced to further support the learning of embeddings. However, it might interesting to see comparative results quantitatively investigating the effect of this additional loss term. Also, as far as I can see, no information on the choice of hyperparameters (e.g. LSTM dimensions) are provided or analyzed w.r.t. their effect on the performance of the proposed approach.",3,,ICLR2020 +Sye-0-Kt37,1,S1lPShAqFm,S1lPShAqFm,Mostly descriptive experimental analysis,"This paper presents an empirical analysis of the convergence of deep NN training (in particular in language models and speech). + +Studying the effect of various hyperparameters on the convergence is certainly of great interest. However, the issue with this paper is that its analyses are mostly *descriptive*, rather than conclusive or even suggestive. For example, in Figure 2, it is shown that the convergence slope of Adam is steeper than that of SGD, when the x-axis is the model size. Very naturally I would be interested in a hypothesis like “Adam converges quicker than SGD as we increase the model size”, but there is no discussion like that. Throughout the paper there are many experimental results, but results are presented one after another, without many conclusions or suggestions made for practice. I don’t have a good take-away after reading it. + +The writing of this paper also needs to be improved significantly. 
In particular, lots of statements are made casually without justification. For example, + +“If hidden dimension is wide enough to absorb all the information within the input data, increasing width obviously would not affect convergence” -- Not so obvious to me, any reference? + +“Figure 4 shows a sketch of a model’s convergence curve ...” -- it’s not a fact but only a hypothesis. For example, what if for super large models the convergence gets slow and the curve gets back up again? + +In general, I think the paper is asking an interesting, important question, but more developments are needed from these initial experimental results.",3,4.0,ICLR2019 +rkSREOYgM,3,Hkc-TeZ0W,Hkc-TeZ0W,Elegant method with impressive results,"This paper proposes a device placement algorithm to place operations of tensorflow on devices. + +Pros: + +1. It is a novel approach which trains the placement end to end. +2. The experiments are solid to demonstrate this method works very well. +3. The writing is easy to follow. +4. This would be a very useful tool for the community if open sourced. + +Cons: + +1. It is not very clear in the paper whether the training happens for each model yielding separate agents, or a shared agent is trained and used for all kinds of models. The latter would be more exciting. The adjacency matrix varies size for different graphs, so I guess a separate agent is trained for each graph? However, if the agent is not shared, why not just use integer to represent each operation in the graph, since overfitting would be more desirable in this case. +2. Averaging the embedding is hard to understand especially for the output sizes and number of outputs. +3. It is not clear how the adjacency information is used. +",8,5.0,ICLR2018 +Sye656FTFr,2,ryxdEkHtPS,ryxdEkHtPS,Official Blind Review #3," +[Summary] +This paper empirically studies the behavior of deep policy gradient algorithms during the optimization. The conclusion is that, while these methods generally improve the policy, their behavior does not comply with the underlying theoretical framework. First, sample gradients obtained with a reasonable batch size have little correlation with each other and with the true gradient. Second, a larger batch size requires a smaller step-size. Third, the value baseline is far from true values and only marginally reduces variance, yet it considerably helps with optimization. Finally, the optimization landscape highly varies with the choice of objective function and the number of samples used to estimate it. + +[Decision] +I vote for acceptance. To the best of my knowledge, the findings of this paper are new and not predictable by the current theory. These negative results have some merit as they call for theory that explains the behavior of these algorithms, or an algorithm whose behavior is predictable by the current theory. The paper is well-written, with a few small issues in presentation that should to be addressed in the final revision. + +[Comments] +In Fig. 4 (b) it does not look like that the value error is high. It is said that ""the learned value function is off by about 50% w.r.t. the underlying true value function."" This sentence should be clarified or visualized. + +What is \pi in Eq (13) in A1? If it is the agent's current policy, how is it different than \pi_\theta? If \pi corresponds to the distribution of state-action pairs in the replay buffer, how can one obtain a policy \pi that has led to this distribution of states in order to construct the importance sampling ratio? 
+ +In 2.2, the claim that a learned value baseline results in significant improvement in performance should be supported by results or reference to previous work. + +Figs. 6 and 7 compare the loss surface with different objectives and sample regimes. Do these factors (objective and sample size) affect the part of the parameter space that is visualized (by changing the origin and the update direction), or are they only used to evaluate the values on the z-axis for the same area in the parameter space? Observing a different landscape in a different part of the parameter space is not surprising. + +[Minor comments] +- Is V_\theta_{t-1} in Eq (4) a function of state? If so, a (s_t) is missing before the plus sign.",6,,ICLR2020 +wr02ZBmizNL,1,Siwm2BaNiG,Siwm2BaNiG,"Extension of conditional VAE for uncertainty estimation, but need polishing","The paper proposes a conditional VAE like framework to learn the one-to-many mappings between input and output, leading to an application of uncertainty estimation. Technically, the novel part is to utilize a deterministic (delta) distribution for approximate posterior. + +Some flaws may need future attention: + +1. In the introduction paragraph staring with ""Let us recall that one key ingredient of the VAE framework"", the main idea is understandable: discrete latent code has definitely advantages in coping with multimodal distributions than a Gaussian distribution which is in nature single mode. However, its explanation is confusing as following: + In VAE, let's say we minimize KL divergence KL(P(c|x,y)|P(c|x)), where P(c|x,y) has two modes, P(c|x) has single mode. When minimization is successful, p(c|x) will spread out like figure 1(a), rather than 1(b). This is related to the difference between forward KL and reverse KL. However, it seems this paragraph suggests 1(b) as posterior collapse. + +2. Using discrete latent code is not new in VAE community. There are previous works (e.g. https://arxiv.org/pdf/1804.08069.pdf) noting that naively learned discrete code c for p(y|x,c) can not be interpreted alone, but need interpreted together with input x. Such statement argues this paper's novelty and contribution. + +3. Some typos and minor flaws, such as unclear references of figure rows in figure 3. + +Overall, the reviewer thinks this paper need some revisions for it to be more shining.",5,4.0,ICLR2021 +Fk8xgggCKGE,2,_CrmWaJ2uvP,_CrmWaJ2uvP,overall this is a good submission,"This paper aims at proposing Dynamic Recurrent Network to understand the underlying system properties of RNNs. By first showing five basic linear transfer functions in dynamic systems theory, the paper formulates DYRNN units. To solve the increasing number of layers issue, they concatenate inputs and intermediate results before passing into an FC layer. It is interesting to see how adjusting +\deta t is related to the model’s robustness. Though not fully explained, this paper provides a method to partially explain the RNN insights through FC layers learnt weights. + +The paper is well written to convey the central ideas. The overall idea is interesting, and experiments are clear to demonstrate the proposed method. It will be better to test the method on some benchmark datasets so that it will be easy to compare with the state-of-the-art. 
+ +A small advice: do you mean varying instead of variing in section 6?",6,4.0,ICLR2021 +DhE_7lTQwHb,3,kW_zpEmMLdP,kW_zpEmMLdP,A nice extension to Neural ODEs,"This work provides an extension to the neural ODEs framework to include discrete changes (i.e. switching) in continuous-time dynamics. The authors provide a few examples of such systems (bouncing balls, collisions of particles, discrete control systems) and derive formally the gradients with respect to the unknown switching time (which is a solution to the so-called event function), where a discontinuity (the switch) happens. The authors implement their method in the torchdiffeq library of Chen et al. 2018, and provide an extensive experimental evaluation in this manuscript. + +The paper is well written, with very few typos and amazing attention to detail (nice colors, thought-through figures, great content organization). Also, the math is well-explained. Overall, a very good read and a nice contribution. I recommend acceptance. + +Just a few points/questions for the authors: +- Q1: How does your method scale as the dimension increases? I am thinking about for instance the experiment of section 4.1: if the dynamics is higher-dimensional (with still with 3 discontinuities in the vector field), is the method still able to infer the correct switching times? +- Q2: Still about Figure 2.. how is it possible in panel (d) that the inferred discontinuity in the flow is not aligned with the discontinuity in the field? +- Q3: I am a bit confused about how you infer successive switching times t*1, t*2 etc.. do you have to specify the number of switching times at the beginning of the inference procedure? Do you take gradients with respect to all the switching times together? Could you please add an explanation in the rebuttal and in your revised version of the paper? +- Clarifications: I would add a few lines explaining what the “adjoint state” is before formula 5. +- typos: Dots at the end of formulas are sometimes missing. End of page 4, “SLDS dynamics..”",7,3.0,ICLR2021 +BrYJ5b4xwse,4,X5ivSy4AHx,X5ivSy4AHx,"The paper is technical, further explanation would be helpful for better understanding of the result","The paper proposes a variant SREDA-Boost of the variance reduction method SEDRA for solving nonconvex-strongly-concave min-max problem. The first contribution of the paper is to relax the conditions on the initialization of SEDRA and moreover enable larger stepsizes ($\epsilon$-independent stepsizes). As SEDRA is already optimal, such modification does not improve the theoretical convergence rate, but it is beneficial from the practical perspective. The second contribution is to adapt the method to zero order oracle, achieving the state-of-the-art convergence rate. + +The result are presented in a clear and technical manner. My major concern is the lack of high level explanation on why a larger stepsize can be applied. I understand that the paper introduces a novel way for the complexity analysis, however the proof is very long and not easy to check. Hence, it would be helpful to explain in words the key aspects that allows an $\epsilon$-independent stepsize (in a non-technical manner). + +Another question I would like to ask is whether the Boost version is necessary to deduce the zeroth order variant. In other words, what would be the complexity of ZO-SREDA. I would expect that it shares the same complexity as ZO-SREDA-BOOST, even though the stepsize are smaller. Please clarify this point. 
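+As a reference point for the zeroth-order part (this is the standard two-point estimator and may differ from the exact construction in the paper): \hat{g}(x) = (d / (2\mu)) [ f(x + \mu u) - f(x - \mu u) ] u with u drawn uniformly from the unit sphere, whose variance grows with the dimension d; this is typically where the extra d-dependent factors in zeroth-order complexity bounds come from, so stating explicitly which estimator, smoothing radius \mu, and batch sizes drive the claimed rate would make the comparison with the first-order result much easier to follow.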
+ +The experiments are rather limited but it is fine for a theoretical paper. I would expect including more comparisons on competitive methods in Figure 1, as in the current version the only baseline is ZO-SREDA. ",6,3.0,ICLR2021 +sdI5n-45XLQ,4,4NNQ3l2hbN0,4NNQ3l2hbN0,The paper needs to justify the new metric.,"In this paper, the authors proposed Search Data Structure Learning (SDSL), which they claim to be a generalization of the standard Search Data Structure. They also present a new metric called Sequential Search Work Ratio (SSWR) to evaluate the quality and efficiency of the search. They introduced a new loss called F-beta Loss, showing their algorithm is better than two previous results, MIHash (Cakir et al. 2017) and HashNet (Cao et al. 2017). + +I appreciate the key message the paper is trying to convey: we need the formal definition or mutual agreement on the problem as well as the correct evaluation metrics to push forward a research area. However, I have several major concerns about the contribution of this paper. + +1. I do not see a formal definition of SDSL. Definition 3.1 is just a definition of matching and non-matching relations. +2. What is new in the metric defined in Definition 3.4? The denominator is constant for all search methods. C(.,.,.) is the cost of re-ranking, and w0 is the cost of searching (filtering the candidate). +3. There is no theoretical or empirical comparison/evaluation of the proposed metric. The calculations in Appendix A are very standard calculations. It is unclear about the innovation of this metric from a theory perspective. In experiments, the authors directly use SSWR, and there is no justification on why this is the right metric. + +Before proposing the new loss, etc., I suggest the authors could go back and justify the metric's effectiveness. Otherwise, it is hard to conclude if this paper has made any progress. + +========================================= + +Thanks for the rebuttal and revision. My first concern has been addressed. However, I still found the proposal lack empirical or theoretical proof, so I am not convinced the contribution is principle enough. I decide to keep my original score. +",4,4.0,ICLR2021 +r1esT_GTKr,2,HklliySFDS,HklliySFDS,Official Blind Review #2,"The goal of this work is to best understand the performance and benchmarking of continual learning algorithms when applied to sequential data processing problems like language or sequence data sets. The contributions of the paper are 3 fold - new benchmarks for CL with sequential data for RNN processing, new architecture introduced for more effective processing and a thorough empirical evaluation. + +Introduction: +I think a little more insight into why the sequential data processing CL scenario is any different than the vision scenario would be quite helpful. Specifically, it would be quite impactful to tell us more about what the additional challenges with RNNs for CL vs feedforward for CL are in the intro. + +The paper is written as if the benchmark is the main contribution and the architecture improvement is just a delta on top of this, but it gets confusing when the methods section starts off with just directly stating the new architecture. + +The algorithm seems like a straightforward combination of recurrent progressive nets and gated autoencoders for CL. Can the authors provide more justification if that is the contribution or there is more to the insight than has been previously suggested in prior work? + +Figure 1 has a very uninformative caption. 
It also doesn’t show how modules feed into one another properly. + +The motivation for why one needs GIM after one already has A-LSTM or A-LMN is not very clear? + +Overall the contribution does seem a bit incremental based on prior work and the description lacks enough detail to properly indicate why this is a very important contribution? + +Experiments: +What does it mean to be application agnostic but restricted to particular datasets and losses? This doesn’t quite parse to me. + +The description of the tasks is very informal and hard to follow. It’s not clear what exactly the tasks and datasets look like + +“using morehidden units can bridge this gap” -> why not just do it? Its a benchmark after all. + +Overall the task descriptions should be in a separate section where the setup is described in a lot of detail and motivated properly. + +The results in the experiments section are very hard to parse. The captions need much more detail for eg Table 2. + +Could we also possibly have more baselines from continual learning? For instance EWC (Kirkpatrick) or generative replay might be competitive baselines. + +Overall I think that the GIM and A-LMN and A-LSTM methods are reasonable although somewhat incremental. But the proposed benchmarks are pretty unclear and the results are a bit hard to really interpret well. It would also be important to run comparisons with more baselines and to provide more ablation/analysis experiments to really see the benefit of GIM/A-LMN or A-LSTM. I also think that the task descriptions should be much earlier in the paper and desribed in much more rigorous detail. +",3,,ICLR2020 +LwKvsrK7R-A,4,5B8YAz6W3eX,5B8YAz6W3eX,Review for Paper88,"This paper presents the optimization method Apollo, a quasi-Newton method that relies on a parameter-wise version of the weak secant condition to allow for a diagonal approximation of the Hessian. Additionally, the issue of a potentially non-PSD approximation is addressed by replacing the approximation with a rectified absolute value. While the combination of techniques is interesting, my main hesitation comes from the limited discussion concerning other quasi-Newton methods for the same problem setting. + +To begin, a much more significant overview of the distinctions between this work and those of AdaQN and SdLBFGS is certainly warranted, as few details are provided to explain how they differ from the methods in this paper. For example, the comment made about AdaQN is that it ""shares a similar idea but specifically designed for RNNs."" First, I am unsure why the implication is that this somehow weakens the merit of AdaQN as compared to the Apollo method, especially since the authors themselves evaluate their Apollo method on RNNs for language modeling in Section 4.2. In fact, the authors do not even compare with AdaQN in the RNN experiments, choosing instead only to run against Adam and RAdam. This brings us to a key issue with the paper: why are there no comparisons to any other quasi-Newton methods, for any setting (RNN or otherwise)? Since AdaQN is designed for RNNs, it is perfectly suited as a method to compare with in the language modeling tasks, which exhibit the most notable claimed improvement for Apollo over the adaptive first-order methods. + +As for other related methods such as AdaHessian, I agree that there is a distinction between quasi-Newton methods and second-order Hessian-free methods in terms of the information that is accessed. 
However, just because second-order information is invoked (through Hessian vector products) does not mean by default that the method is ""significantly more costly"" than these quasi-Newton methods, as is claimed in the paper. Hessian vector product-based methods are desirable precisely because the computational cost is comparable to first-order methods, and here too additional comparison is needed. + +Overall, the previous works on quasi-Newton methods for stochastic non-convex optimization have not been sufficiently addressed or compared to, particularly given how those works may also handle the issue of preserving positive-definiteness of B_t. + +Small comments: +.- ""we demonstrate that Apollo significantly outperforms SGD and variants of Adam"" +This is overstated, as the only notable improvement claimed by the paper is for language modeling (the others, particularly for image classification tasks, are modest improvements at best). + +.- ""Newton's method usually employs the following updates to solve (1)"" +It should be clarified that convexity is important when trying to use (plain) Newton's method to solve problems such as (1). + +.- ""unnecessarily"" -> ""not necessarily"" + +.- Related work on Hessian-free methods that consider absolute value-based transformations of the Hessian: + +Dauphin, Yann N., Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. ""Identifying and attacking the saddle point problem in high-dimensional non-convex optimization."" In Advances in Neural Information Processing Systems, pp. 2933-2941. 2014. + +.- Related works, in addition to Gupta et al. (2018), in terms of memory-efficient adaptive first-order methods: + +Naman Agarwal, Brian Bullins, Xinyi Chen, Elad Hazan, Karan Singh, Cyril Zhang, and Yi Zhang. ""Efficient full-matrix adaptive regularization."" In International Conference on Machine Learning, pp. 102-110. 2019. + +Rohan Anil, Vineet Gupta, Tomer Koren, and Yoram Singer. ""Memory Efficient Adaptive Optimization."" In Advances in Neural Information Processing Systems, pp. 9749-9758. 2019. + +Xinyi Chen, Naman Agarwal, Elad Hazan, Cyril Zhang, and Yi Zhang. ""Extreme Tensoring for Low-Memory Preconditioning."" In International Conference on Learning Representations. 2020.",4,4.0,ICLR2021 +SkFemC-lz,1,Sy0GnUxCb,Sy0GnUxCb,This paper showing nice results lacks a serious scientific analysis and contains several issues,"In this paper, the authors produced quite cool videos showing the acquisition of highly complex skills, and they are happy about it. If you read the conclusion, this is the only message they put forward, and to me this is not a scientific message. + +A more classical summary is that the authors use PPO, a state-of-the-art deep RL method, in a context where two agents are trained to perform competitive games against each other. They reuse a very recent ""dense reward"" technique to bootstrap the agent skills, and then anneal it to zero so that the competitive rewards obtained from defeating the opponent takes the lead. They study the effect of this annealing process (considered as a curriculum) and of various strategies for sampling the opponents. The main outcome is the acquisition of a large variety of useful skills, just observed from videos of the competitions. + +The main issue with this paper is the lack of scientific analysis of the results, together with many local issues in the presentation of these results. +Below, I talk directly to the authors. 
+ +--------------------------------- + +The related work subsection is just a list of works, it should explain how the proposed work position itself with respect to these works. + + +In Section 5.2, you are just describing ""cool"" behaviors observed from your videos. +Science is about producing quantitative results, analyzing them and discussing them. +I would be glad to read more science about these cool behaviors. Can you define a repertoire of such behaviors? +Determine how often they are discovered? Study how the are represented in the networks? +Anything beyond ""look, that's great!"" would make the paper better... + +By the end of Section 5.2, you allude to transfer learning phenomena. +It would be nice to study these transfer effects in your results with a quantitative methodology. + +Section 5.3 is more scientific, but it has serious issues. + +In all subfigures in Figure 3, the performance of opponents should be symmetric around 50%. This is not the case for subfigures (b) and (c-1). Why? +Do they correspond to non-zero sum game? The x-label is ""version"". Don't you mean ""number of epochs"", or something like this? Why do the last 2 images +share the same caption? + +I had a hard time understanding the message from Table 1. It really needs a line before the last row and a more explicative caption. + +Still in 5.3, ""These results echo""...: can you characterize this echo? What is the relationship to this other work? + +Again, ""These results shed further light"": further with respect to what? Can you be more explicit about what we learn? + +Also, I find that annealing a kind of reward with respect to another is a weak form of curriculum learning. This should be further discussed. + +In Section 5.4, the idea of using many opponents from many stages of learning in not new. +If I'm correct, the same was done in evolutionary method to escape the ""arms race"" dead-end in prey-predator races quite a while ago (see e.g. ""Coevolving predator and prey robots: Do “arms races” arise in artificial evolution?"" Nolfi and Floreano, 1998) + +Section 5.5.1 would deserve a more quantitative presentation of the effect of randomization. +Actually, in Fig5: the axes are not labelled. I don't believe it shows a win-rate. So probably the caption (or the image) is wrong. + +In Section 5.5.2, you ""suspect this is because..."". +The role of a scientific paper is to clearly establish results and explanation from solid quantitative analysis. + +------------------------------------------- +More local comments: + +Abstract: + +""Normally, the complexity of the trained agent is closely related to the complexity of the environment."" Here you could cite Herbert Simon (1962). + +""In this paper, we point out that a competitive multi-agent environment trained with self-play can produce behaviors that are far more complex than the environment itself."" +Well, for an agent, the other agent(s) are part of its environment, aren't they? So I don't like this perspective that the environment itself is ""simple"". + +Intro: + +""RL is exciting because good RL exists."" I don't believe this is a strong argument. There are many good things that exist which are not exciting. + +""In general, training an agent to perform a highly complex task requires a highly complex environment, and these can be difficult to create."" Well, the standard perspective is the other way round: in general, you face a complex problem, then you need to design a complex agent to solve it, and this is difficult. 
+ +""This happens because no matter how weak or strong an agent is, an environment populated with other agents of comparable strength provides the right challenge to the agent, facilitating maximally rapid learning and avoiding getting stuck."" This is not always true. The literature is full of examples where two-players competition end-up with oscillations between to solutions rather than ever-increasing skill performance. See the prey-predator literature pointed above. + +""in the domain of continuous control, where balance, dexterity, and manipulation are the key skills."" In robotics, dexterity, and manipulation usually refer to using the robot's hand(s), a capability which is not shown here. + +In preliminaries, notation, what you describe corresponds to the framework of Dec-POMDPs, you should position yourself with respect to this framework (see e.g. Memory-Bounded Dynamic Programming for DEC-POMDPs. S Seuken, S Zilberstein) + +In PPO description : Let l_t(\theta) ... denote the likelihood ratio: of what? + +p5: +would train on the dense reward for about 10-15% of the trainig epochs. So how much is \alpha_t? How did you tune it? Was it hard? + +p6: + +you give to the agent the mass: does the mass change over time??? + +In observations: Are both agents given different observations? Could you specify which is given what? + +In Algorithms parameters: why do you have to anneal longer for kick-and-defend? What is the underlying phenomenon? + +In Section 5, the text mentions Fig5 before Fig4. + +------------------------------------------------- +Typos: + +p4: +research(Andrychowicz => missing space +straight forward => straightforward + +p5: +agent like humanoid(s) +from exi(s)ting work + +p6: +eq. 1 => Eq. (1) (you should use \eqref{}) +In section 4.1 => In Section 4.1 (same p7 for Section 4.2) + +""One question that arises is the extent to which the outcome of learning is affected by this exploration reward and to explore the benefit of this exploration reward. As already argued, we found the exploration reward to be crucial for learning as otherwise the agents are unable to explore the sparse competition reward."" => One question that arises is the extent to which the outcome of learning is affected by this exploration reward and to explore its benefit. As already argued, we found it to be crucial for learning as otherwise the agents are unable to explore the sparse competition reward. + +p8: +in a local minima => minimum + +p9: +in references, you have Jakob Foerster and Jakob N Foerster => try to be more consistent. + +p10, In Laetitia Matignon et al. ... markov => Markov + +p11, I would rename C_{alive} as C_{standing}",3,3.0,ICLR2018 +HkxnE68RYH,3,rygG4AVFvH,rygG4AVFvH,Official Blind Review #3,"This paper proposes an optimizing compiler for DNN's based on adaptive sampling and reinforcement learning, to drive the search of optimal code in order to reduce compilation time as well as potentially improve the efficiency of the code produced. In particular, the paper proposes to use PPO to optimize a code optimization ""search"" policy, and then use K-mean clustering over a set of different proposed compilation proposals, from which to perform adaptive sampling to reduce compilation time while still keeping a high diversity of the proposed solution pools during exploration. 
At the same time the authors claim that using RL will learn a better search strategy compared to random search - such as simulated annealing which is used by competing methods - thus producing faster and better solutions. +The paper show results of up to 4x speedup in compilation time (autotuning time) while obtaining a slightly better or similar efficiency of the generated code (in term of execution time). This is a well written extensive research with good results. The authors mention it is (will be) integrated in the open source code of TVM. However I could find no mention in the paper of whether the code will be released with this publication, and I would like to solicit the authors to clarify their code release strategy and timing. +My other question pertains to whether or not compilation time is an key metric to target. It is important to some extent, but I would say that aside from exponential / super-polynomial behaviour of auto-tuning algorithms, a multiple hours / days process to create the best optimized code for a certain network / hardware platform might not be such a big hurdle for a community already used to multiple days / weeks / months to train the same models. I believe that focusing on the efficiency of the optimized code produced would probably be a better metric of success.",3,,ICLR2020 +H1gaakiptr,1,SJgs8TVtvr,SJgs8TVtvr,Official Blind Review #1,"The proposed method of mixture-of-experts variational autoencoders +is valuable and insightful. +On the other hand the work could be improved and clarified at some points: + +- in the abstract it is claimed that the method works for high-dimensional data.However, it should be better explained why this is the case. The method is largely based on density estimation with a mixture of Gaussians which is known to have limitations in higher dimensions (see e.g. classical textbooks like Bishop 1995) + +- the similarity matrix and the similarity values should be carefully defined. Is there also an underlying similarity function assumed? + +- a main shortcoming is that there is no discussion or experimental comparison with methods like spectral clustering and kernel spectral clustering. Given that the paper and the proposed method relates to similarity-based representations it would be important to know how it compares to such methods. Though e.g. in Table 1 the authors compare with about 10 other methods it would be more relevant that among some of these would have been spectral clustering and kernel spectral clustering, because of the similarity-based representations. + +- in section 4.1 the MNIST data are taken with k=10. Though it is nicely explained and illustrated on this data set, it is possibly somewhat misleading as an example. The reason is that this is a classification problem with 10 classes, therefore the choice k=10 is obvious. It would be more important to consider benchmark problems for clustering, instead of classification, for which the choice of k is also an important model selection issue and for which k is unknown (how should k be selected then?). + +- is each cluster always be assumed to be a Gaussian (which seems to be a strong assumption in general, and possibly not always realistic)? Could other components be used in the mixture?",3,,ICLR2020 +ryetm_kR3Q,2,ryfcCo0ctQ,ryfcCo0ctQ,Incremental theoretical result on Q-learning/actor-critic algorithms; Experiments are quite small-scale,"In this paper the authors studied reinforcement learning algorithms with nonlinear function approximation. 
By formulating the problems of value function estimation and policy learning as a bilevel optimization problems, the authors proposed +Q-learning and actor-critic algorithms that also contains convergence properties, even when nonlinear function approximations are used. Similar to the stochastic approximation approach adopted by many previous work such as Borkar (https://core.ac.uk/download/pdf/148488247.pdf), they analyze the convergence properties by drawing connections to stability of a two-timescale ODE. Furthermore they also evaluated the effectiveness of the modified Q-learning/actor-critic algorithms on two toy examples. + +In general I find this paper interesting in terms of addressing a long-standing open question of convergence analysis of actor-critic/Q-learning algorithms, when general nonlinear function approximations are used. Through reformulating the problem of value estimation and policy improvement as a bilevel optimization problem, they proposed modifications of Q-learning and actor-critic algorithms, and under certain assumptions they showed that these algorithms converge, which is a non-trivial contribution. + +While I appreciate the effort of extending existing analysis of these RL algorithms to general nonlinear function approximation, I find the result of this paper rather incremental. While convergence results are provided, I am not sure how practical are the assumptions listed in the paper. Correct me if i am wrong, it seems that the assumptions are stated for the sake of proving the theoretical results without much practical justifications (especially Assumption 4.3). Furthermore how can one ensure that these assumptions hold (for example Assumption 4.3 (i) and (ii), especially on the existence of locally stable equilibrium point) ? Unfortunately I haven't had a chance to go over all the proof details, it seems to me the analysis is built upon two-time scale stochastic approximation theory, which is a standard tool in convergence analysis of actor-critic. Since the contribution of this paper is mostly theoretical, can the authors highlight the novel contribution (such as proof techniques used here that are different than that in standard actor-critic analysis from e.g. https://www.semanticscholar.org/paper/Natural-actor-critic-algorithms-Bhatnagar-Sutton/6a40ffc156aea0c9abbd92294d6b729d2e5d5797) in the main paper? + +My other concern is on the scale of the experiments. While this paper focused on nonlinear function approximation, the examples chosen to evaluate these algorithms are rather small-scale. For example the domains to test Q-learning are standard in RL, and they were previously used to test algorithms with linear function approximation. Can the author compare their results with other existing baselines?",6,4.0,ICLR2019 +es7eCmturDX,2,dyjPVUc2KB,dyjPVUc2KB,Review for Spectral RL,"## Summary + +This paper details the problems that might arise in value-based reinforcement learning methods in domains where reward progressivity is present. To show that current methods do not handle reward progressivity, the authors introduce two domains, _Exponential Pong_ and _Ponglantis_. + +After showing that current methods do not work well in these domains, the paper then goes on to propose a solution, spectral decomposition of rewards. The paper shows that returns learned separately on decomposed rewards can be composed to get the original return. 
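To make the two-timescale stochastic-approximation scheme mentioned in the actor-critic review above concrete, here is a heavily simplified sketch on a two-armed bandit. The bandit, the softmax "actor", the scalar average-reward "critic", and the step-size exponents are all illustrative assumptions; this is not the construction analyzed in the paper under review.

```python
# Toy two-timescale update: the "critic" (a scalar average-reward tracker) uses a
# faster step-size schedule than the "actor" (softmax preferences), so the critic
# looks quasi-stationary from the actor's point of view. All choices are illustrative.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.0, 1.0])   # two-armed Gaussian bandit
theta = np.zeros(2)                 # actor: softmax preferences
v = 0.0                             # critic: average-reward baseline

for t in range(1, 20001):
    a_fast = 1.0 / t ** 0.6         # critic step size (fast timescale)
    b_slow = 1.0 / t ** 0.9         # actor step size; b_slow / a_fast -> 0

    probs = np.exp(theta - theta.max()); probs /= probs.sum()
    arm = rng.choice(2, p=probs)
    r = rng.normal(true_means[arm], 1.0)

    v += a_fast * (r - v)                     # fast update
    grad_log = -probs; grad_log[arm] += 1.0   # d log pi(arm) / d theta
    theta += b_slow * (r - v) * grad_log      # slow update

print(probs, v)  # the policy should favor arm 1 and v should track its mean
```

The only point of the sketch is the step-size separation that the two-timescale ODE analysis relies on; the convergence arguments in the paper concern far more general function approximators than this scalar baseline.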
The paper then presents spectral Q-learning and spectral DQN, with experiments on the domains presented earlier as well as experiments on 6 Atari games. + +On the two domains where progressive rewards were shown to be problematic, Spectral DQN is shown to work better than current approaches. On Atari games, spectral does as well as current approaches, doing better in 3 out of 6 domains. + +## Positives ++ The setting of progressive rewards is interesting. ++ The domains proposed for testing these rewards are clear. ++ The spectral reward decomposition is a simple idea and is explained well. ++ The effectiveness of spectral DQN on the two domains introduced in the paper is clear. ++ Experimental details are clear and additional steps taken to stabilize learning are included. + +## Negatives/ Questions +- One question that does not seem to be satisfactorily addressed is whether there is a commonly used benchmark domain or one which was not specifically engineered for progressive rewards where value based deep reinforcement learning would fail or not perform well unless it was using Spectral DQN. +- The spectral DQN objective (eqn. 8) includes target compression. It is unclear how much benefit the spectral decomposition is having in the presence of target compression. While it is fine to require target compression for the full benefit of spectral DQN, it would be good to see an ablation of how well spectral DQN does without target compression. +- Another useful ablation could be to remove the monte carlo mixing and show how unstable the updates get. +- How does the van Seijen et al. paper on logarithmic mappings (Using a Logarithmic Mapping to Enable Lower Discount Factors in Reinforcement Learning) compare to the related work? It is motivated by a different problem, but since a possible logarithmic mapping might mitigate the problems that come up with progressive rewards their method might be a possible solution to be compared against. + +## Other Comments +* Figure 2 has a typo (Fiilled instead of filled). + +## Summary +Overall, I find the idea in this paper clear, simple, and effective. There are some additional questions and comments that if addressed would make the paper a more well-rounded submission. + +",7,3.0,ICLR2021 +D8RuTeBT1H_,4,99M-4QlinPr,99M-4QlinPr,Interesting paper," +1. Summary: + +This paper addressed an interesting topic on competitive self-play reinforcement learning on two-player zero-sum games. Typically, many self-play methods are only average-case optimal and self-play with latest agent fails to converge. I believe this is a big issue for many self-play because averaging past policies for each player during self-play iterations is very difficult especially for large-scale games. +The author tried to solve this problem by perturbation-based subgradient method because of its three advantages: +(1) it only requires to know the subgradients rather than the game dynamics; +(2) it's guaranteed to converge in its last iterate rather than the average iterate for convex-concave functions; +(3) it's very simple to select the opponent in the self-play training. +The authors evaluate this method on some small games and the experimental settings/results are well. +Overall, I like this paper. However, there are some issues or weakness in the current version, I tend to vote minor reject. +Note, if the author can address the key issues well, i can increase my score. + +2. 
Some Concerns/weakness: + +(1) For convex-concave functions, the perturbation-based subgradient method is guaranteed to converge in its last iterate. Does its convergence hold for deep learning settings? + +(2) I'd like to see the performance on large-scale settings. +I think the strength of this method is to solve large-scale game, because it's expensive to average policies for the past iterations for many self-play methods. The author only evaluates the method on small settings, it's not very convincing because when solving small games, self-play with average policies is not expensive. + +(3) The author missed some important related papers, such as neural counterfactual regret minimization [1, 2] and exploitability descent [3]. These methods also address the problem of average-case optimal. It will benefit readers if you can talk about these methods. + +(4) the baseline is so weak, it will be better to compare against counterfactual regret minimization method and Johannes Heinrich's deep fictitious self-play method. + +3. Questions: + +(1) For large-scale games, does it need large population size $n$? + +(2) Does it time-consuming to solve perturbed $v^{k}_i$ and $u^{k}_i$ for step 6 in Algorithm 1? + +(3) Why not use exploitability to evaluate different methods? It's a standard metric in imperfect information games. It will be good if you can evaluate the method on one of poker game, such as leduc. + +(4) ""Four colors correspond to the 4 agents in the population with 4 initial points"" in Figure 1, do you mean there are the population size is 4 and there are 4 different initial policies? + +(5) what's exact gradient in ""On the other hand, our method enjoys approximate last-iterate convergence with both exact and policy gradients."" + +(6) page 4, Algorithm 1, $N^0$ iterations, $N^0$ inner updates. $N^0$ is a typo? + + +Some key references: + +[1] Brown, N., Lerer, A., Gross, S. and Sandholm, T., 2019, May. Deep counterfactual regret minimization. In International conference on machine learning (pp. 793-802). + +[2] Li, H., Hu, K., Zhang, S., Qi, Y. and Song, L., 2018. Double neural counterfactual regret minimization. ICLR, 2020. https://openreview.net/pdf?id=ByedzkrKvH + +[3] Lockhart, E., Lanctot, M., Pérolat, J., Lespiau, J.B., Morrill, D., Timbers, F. and Tuyls, K., 2019. Computing approximate equilibria in sequential adversarial games by exploitability descent. arXiv preprint arXiv:1903.05614. + + +",5,4.0,ICLR2021 +yNb-ihZOfJj,4,5g5x0eVdRg,5g5x0eVdRg,MI based unsupervised representation learning,"Authors introduce a method which is tailored to unsupervised representation learning via minimizing mutual information. Authors support their method with an experimental study. The proposed method has some merits like finding a representation that has higher mutual information compared to baselines. +I have several concerns about the paper: +a) Authors state that some greedy optimization may end up lower MI optima. Since deep neural networks are not convex, is this a surprising point or is this expected one? +b) I think empirical comparison needs significant improvement. Authors mention the proposed method was outperforming the state of the art “at the time of writing”. However, currently I believe this is not the case. For example, GATCluster achieves 28.1% on Cifar100-20. +c) I see the proposed method as shown in Fig 1, is novel. As far as I understand it is adding some extra blocks to already existing method and calculates mutual information between heads. 
I would like to point out that hierarchical ordering is an interesting idea however the impact of hierarchical ordering is not very clear. Why does architecture need extra heads i.e. h_4 to h_8. Would you please emphasize the novelty of the idea bit more? +Although I have some concerns about the paper, I would like to be extremely clear that I am open to change my view if more explanation and/or evidence supplied. +",4,4.0,ICLR2021 +FJdzhbR4yOE,5,_ptUyYP19mP,_ptUyYP19mP,"Simple approach, more extensive experiments on other domains will be more interesting and convincing","Summary: This paper focuses on exploration with intrinsic rewards and proposes to use the difference of inverse visitation count as the intrinsic rewards. The method outperforms some existing baselines with intrinsic rewards in the procedurally-generated tasks (MiniGrid and NetHack). + +Clarity: This paper introduces the method and discusses its advantage over the previous work clearly, but some small points are vague and need more clarification (please see the section 'Cons' for the details.) + +Originality: As far as I know, the proposed formulation of intrinsic reward is novel, though it is closely related to prior works. For example, in ""NEVER GIVE UP: LEARNING DIRECTED EXPLORATION STRATEGIES"", the intrinsic reward is also a combination of visitation count in the current episode and lifelong prediction error. + +Significance: The proposed idea itself is interesting, and it has shown SoTA results in some challenging tasks. This work will be more significant if there are positive results on other hard exploration tasks (not only procedurally generated tasks), e.g. Montezuma's Revenge. + +Pros: +* The proposed method is simple and effective in some challenging procedurally-generated tasks. +* This paper discusses the weakness of the previous work and explains why the proposed method helps address these issues. +* This paper visualizes the change of visitation count during training and thus explicitly demonstrates the behavior of the learning policy. +* This paper conducts an ablative study to investigate the role of different components in the proposed intrinsic reward. + +Cons: +* In section 3, ""inverse visitation counts as prediction difference"". Could you please clarify why the prediction error in RND method can approximate the inverse visitation count? This unclear point seemingly weakens the ground of the proposed method. +* In Figure 3, are some baselines methods missing in some tasks. For example, on ""Medium: KeyCorridorS5R3"", are there curves for the baselines? +* In section 4, the authors discuss the weakness of the count-based exploration method: detachment. Go-explore tries to solve the detachment problem and shows super good performance on Atari hard-exploration tasks. It will be quite interesting if the proposed method could also significantly outperform the count-based exploration of the Atari games. Currently, the experiments are all conducted on procedurally-generated tasks. Does the proposed method could also work well on other types of domains? +* In section 1, it is mentioned that the curiosity-driven intrinsic reward suffers from the noisy TV problem. The proposed method will also suffer from this problem or not? In other words, is there any failure case that the proposed method will not work? +",5,4.0,ICLR2021 +Q9s2S1Cdnoh,1,jpDaS6jQvcr,jpDaS6jQvcr,Weak contribution,"This paper presents a Robust Collaborative Autoencoder (RCA) for unsupervised anomaly detection. 
The authors focused on the overparameterization of existing NN-based unsupervised anomaly detection methods, and the proposed method aims to overcome the overparameterization problem. The main contibutinos of the proposed method are that (1) it uses two autoencoders, each of which is trained using only selected data points and (2) monte carlo (MC) dropout was used for inference. + +Although this paper has an interesting idea, i have doubt about the contributions. My comments are as below. + +1) First of all, to me it was very difficult to read this paper. The notations are very confusing. + +2) In Introduction section, it is confusing what the main focus of this paper is. They mentioned like ""unlike previous studies, our goal is tho learn the weights in an unsupervised learning fashion"". But because it seems the topic of this paper belongs to ""unsupervised anomaly detection"" (the labels indicating whether anomaly or not are assumed available in the training data), the point that your method is in an unsupervised learning fashion is pretty obvious. You don't need to discuss about ""supervised approachs"" throughout the paper, but please clearly mention that at the beginning of the introduction section, and only discuss your method and other ""unsupervised"" anomaly detection methods. + +3) If the overparameterization is the problem when we build a NN for unsupervised anomaly segmentation (e.g. autoencoder-AE), we can simply think about various well-known NN regulaization techniques for the AE as remedy. I also think the two parts of the proposed method (corresponding to the contrbutions (1) and (2) )work as regularization for the AE. I'm curious if there's any reason to prefer the proposed method to other regularization techniques? + +4) The proposed RCA method involves an ensemble prediction by using MC-dropout (described in section 3.2). The authors metioned this is one of their research contribution, but the use of MC-dropout is quite general in neural network research. Also, while the proposed method definitely benifits from the use of MC-dropout, other unsupervised anomaly detections based on neural networks (e.g. AE, VAE,...) can also improve by employing MC-dropout. The ablation study in Table 1 showed that RCA significantly outperforms RCA-E (RCA without ensembling). The authors can implement MC-dropout-based ensemble versions of AE, VAE, and other nn-based methods like Deep SVDD, and check whether the proposed only benefits from MC-dropout or RCA just outperforms others regardless of MC-dropout. +",3,3.0,ICLR2021 +aGqrDNXRRGf,2,n1HD8M6WGn,n1HD8M6WGn,Comments on the paper,"The paper proposes a *fine-grained layer attention* (FGLA) to analyze Encoder Fusion method. + +While the introduction of FGLA allows for the analysis of the contribution of individual encoder layers, it changed the model’s architecture. Thus, when a model with FGLA is trained, it’s optimized differently such that the use of each separated layer in the encoder is modeled explicitly through attention weight. As a result, the contribution of each layer found in Seq2seq with FGLA might not be the same as in the standard Seq2seq. Therefore, I think the conclusion might not be appropriate for the standard Seq2Seq model. In other words, the source representation bottleneck hypothesis might not hold true for standard Seq2seq. + +Compared to the previous fusion method that uses a scalar per layer as attention weight (i.e., *Transparent Attention*) in the work of Bapna et. 
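Since contribution (2) noted above is Monte Carlo dropout at inference time, a generic sketch of that technique follows; the small reconstruction network, the dropout rate, and the 30 stochastic passes are placeholder assumptions and not the RCA architecture or its settings.

```python
# Generic MC-dropout inference: keep dropout active at test time and average several
# stochastic forward passes. The tiny reconstruction-style scorer below is an
# illustrative stand-in, not the RCA model from the paper.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Dropout(p=0.2), nn.Linear(64, 20))

def mc_dropout_scores(x, n_passes=30):
    net.train()  # keeps dropout stochastic; with batch norm, only dropout layers should be in train mode
    with torch.no_grad():
        recons = torch.stack([net(x) for _ in range(n_passes)])
    errors = ((recons - x) ** 2).mean(dim=-1)     # per-pass reconstruction error
    return errors.mean(dim=0), errors.std(dim=0)  # anomaly score and its spread

x = torch.randn(8, 20)
score, spread = mc_dropout_scores(x)
print(score.shape, spread.shape)
```

Averaging the passes gives the anomaly score, and the spread across passes is the quantity that the ensembling described in Section 3.2 appears to exploit; the same wrapper could be put around an AE or VAE baseline, which is the comparison the review asks for.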
al., 2018, FGLA generalizes it by making a vector per layer. This generalization is straightforward and simple. The later analysis in the paper is carried out on FGLA. However, I find that the same analysis can be performed with Transparent Attention. The introduction of FGLA seems not well-motivated in this regard. In terms of interpretability, Transparent Attention is more interpretable since there is one scalar per layer, whereas FGLA has to report average weight per layer (Figure 1). + +For experimental results, FGLA is not compared with other fusion methods. Thus it’s unclear how the proposed method positions itself in the previous work on Encoder Fusion. + +A previous work of ([Nguyen and Chiang, 2018](https://www.aclweb.org/anthology/N18-1031/)) proposed a similar way of fusing source embeddings directly to the output of the decoder. Their work can be seen as Surface fusion proposed in this paper. + +While the motivation for this kind of fusion might be obvious for machine translation, it’s less so for summarization and grammatical error correction. I think it’s worth spending some text/examples on why this is needed for those tasks. + + +**References** +Improving Lexical Choice in Neural Machine Translation. Nguyen and Chiang, NAACL 2018 +",5,4.0,ICLR2021 +rJxxBuwItB,1,HJlxIJBFDr,HJlxIJBFDr,Official Blind Review #1,"Summary: + +The paper proposed a policy gradient method called SRVR-PG, which based on stochastic recursive gradient estimator. It shows that the complexity is better than that of SVRPG. Some experiments on standard environments are provided to show the efficiency of the algorithm over GPOMDP and SVRPG. + +Comments: + +1) I think you may ignore the highly-related work: Yang and Zhang, ""Policy Optimization with Stochastic Mirror Descent"", June 2019. Since both papers are highly-related, I would suggest the author(s) have some discussions to differentiate two papers. + +2) Could you please provide the reasons why you are only choosing two methods (GPOMDP and SVRPG) to compare? There are many policy gradient algorithms such as A3C, A2C, ... which have not mentioned or discussed in this paper. Are they completely different here? Is there no way to compare the performance among them with SRVR-PG? + +3) I am not sure if you are using the right word ""novel"" to describe your method. Basically, you adopt the existing estimator based on SARAH/SPIDER in optimization algorithms into RL problems. I do not think the word ""novel"" is proper here since it is not something total new. Notice that, the complexity result achieved in this paper is also matched the one for SARAH/SPIDER/SpiderBoost in nonconvex optimization. + +4) There are some discussion on choosing a batch-size in the experimental part. However, I do not see the discussion on the learning rate. The choice of parameters for GPOMDP and SVRPG may also need to be discussed. + +5) In Papini et al., 2018, for the experiment part, they use a snapshot policy to sample in the inner loop, and use early stopping inner loop. Moreover, they also check variance to recover the backup policy when it is blowup. Do you apply any trick to your experiments? I wonder if your numerical experiments are totally followed on your theory. + +Minor comments: +- Redundancy "")"" in \eta*S*m in Theorem 4 +",6,,ICLR2020 +HyeO3O55nQ,2,SyfXKoRqFQ,SyfXKoRqFQ,An adaptive batch normalization approach with limited technical novelty. ,"The paper introduces an adaptive importance sampling strategy, as opposed to uniform sampling, for batch normalization. 
The key idea is to assign higher importance to those correctly classified training samples with relatively smaller soft-max prediction variance, hopefully to push the deep nets to learn faster from uncertain samples near the decision boundary. Experimental results on several benchmark datasets (MNIST, CIFAR-10) and commonly used deep nets (LeNet, ResNet) are reported to show the power of boundary batch selection in improving the overall training efficiency. + +The paper is clearly presented and the numerical results are mostly easy to access. My main concern is about the novelty of technical contribution which is mainly composed by two: 1) a prediction variance based importance sampling strategy for batch selection and 2) an empirical study the show the merits of approach. Concerning the first contribution, the idea of defining boundary samples according to prediction variance looks fairly common, if not superficial, in modern machine learning. The way of defining the sampling probability (see Eq. 4 & 5) follows largely the rank-based method (Loshchilov and Hutter 2016) with slight modifications. The numerical study shows some promise of the proposal on several relatively easy data sets. However, as a practical paper, the numerical results could be much more supportive if more challenging data sets (e.g., ImageNet) are included for evaluation. + +Pros: + +-The method is well motivated and clearly presented. +- The paper is easy to follow. + + +Cons: + +- The overall contribution is incremental with limited novelty. +- As a practical paper, the numerical study falls short in evaluation on large-scale data. +",5,3.0,ICLR2019 +4yFfqozK15s,4,17VnwXYZyhH,17VnwXYZyhH,Review,"This paper proposes probing BERT representations by projecting them into a Poincare subspace. The proposed approach is used to probe ELMO and BERT for both syntax and sentiment in comparison with the conventional Euclidean probes. + +I am ambivalent about this paper. On the positive side, I think that it is a quite solid work, with extensive experimentation, additional supporting results in the appendix, and an accompanying code that can be used to reproduce results and obtain additional visualizations. The paper is also well written and the authors are rigorous when discussing their results rather than trying to oversell. + +On the negative side, I have some reservations about the relevance of this study. What do we learn from it? It is true that the Poincare probes obtain generally higher scores than the Euclidean probes, but it doesn't look like they lead to any new insight about how BERT works. If the message here is that Poincare probes are more appropriate than their Euclidean counterparts, I would have liked to see instances were Euclidean probes lead to erroneous or at least different conclusions when compared to Poincare probes. In the absence of that, we can expect that practitioners will stick with Euclidean probes given that they are simply easier to use. + +Moreover, I am not sure if the comparison with Euclidean probes is entirely fair. If my understanding is correct, the Poincare probes learn two linear transformations (P and Q), whereas Euclidean probes learn a single one. Unless I am missing something, it could be that the Poincare probes obtain higher scores simply because the transformation they are learning is more expressive, and not because of the underlying geometric space. 
In order to test this hypothesis, I think that the authors should try learning two linear transformations for Euclidean probes, with a non-linearity like ReLU in between. + +Finally, I feel that some of the analyses did not follow a systematic methodology and some of the interpretations seem subjective and possibly questionable. For instance, I don't see any clear difference between the Poincare and Euclidean probes when it comes to the sentence length (Figure 2C), except for very short sentences. For sentence length > 12 the curves look very similar to me, except that the absolute values for Poincare are higher. Similarly, the visualizations, although interesting, provide a rather anecdotal evidence, in particular since the examples seem to be cherry-picked.",6,3.0,ICLR2021 +SkgbODq_3X,1,HJfSEnRqKQ,HJfSEnRqKQ,"An interesting setting combining active learning and learning with partial labesl. Nice experimental contribution, lack of conceptual insights. ","The paper considers a multiclass classification problem in which labels are grouped in a given number M of subsets c_j, which contain all individual labels as singletons. Training takes place through an active learning setting in which all training examples x_i are initially provided without their ground truth labels y_i. The learner issues queries of the form (x_i,c_j) where c_j is one of the given subsets of labels. The annotator only replies yes/no according to whether the true label y_i of x_i belongs to c_j or not. Hence, for each training example the learner maintains a ""version space"" containing all labels that are consistent with the answers received so far for that example. The active learning process consists of the following steps: (1) use the current learning model to score queries (x_i,c_j); (2) query the best (x_i,c_j); (3) update the model. +In their experiments, the authors use a mini-batched version, where queries are issued and re-ranked several times before updating the model. Assuming the learner generates predictive models which map examples to probability distributions over the class labels, several uncertainty measures can be used to score queries: expected info gain, expected remaining classes, expected decrease in remaining classes. Experiments are run using the Res-18 neural network architecture over CIFAR10, CIFAR100, and Tiny ImageNet, with training sets of 50k, 50k, and 100k examples. The subsets c_j are computed using the Wordnet hierarchy on the label names resulting in 27, 261, and 304 subsets for the three datasets. The experiments show the advantage of performing adaptive queries as opposed to several baselines: random example selection with binary search over labels, active learning over the examples with binary search over the labels, and others. + +This paper develops a natural learning strategy combining two known approaches: active learning and learning with partial labels. The main idea is to exploit adaptation in both choosing examples and queries. The experimental approach is sound and the results are informative. In general, a good experimental paper with a somewhat incremental conceptual contribution. + +In (2) there is t+1 on the left-hand side and t on the right-hand side, as if it were an update. Is it a typo? + +In 3.1, how is the standard multiclass classifier making use of the partially labeled examples during training? + +How are the number of questions required to exactly label all training examples computed? Why does this number vary across the different methods? 
+ +What specific partial feedback strategies are used by AQ for labeling examples? + +EDC seems to consistently outperform ERC for small annotation budgets. Any intuition why this happens?",7,4.0,ICLR2019 +rJhtcsdxf,1,Hyp3i2xRb,Hyp3i2xRb,"confusing analysis, little novelty","Here are my main critics of the papers: + +1. Equation (1), (2), (3) are those expectations w.r.t. the data distribution (otherwise I can't think of any other stochasticity)? If so your phrase ""is zero given a sequence of inputs X1, ...,T"" is misleading. +2. Lack of motivation for IE or UIE. Where is your background material? I do not understand why we would like to assume (1), (2), (3). Why the same intuition of UIE can be applied to RNNs? +3. The paper proposed the new architecture RIN, but it is not much different than a simple RNN with identity initialization. Not much novelty. +4. The experimental results are not convincing. It's not compared against any previous published results. E.g. the addition tasks and sMNIST tasks are not as good as those reported in [1]. Also it only has been tested on very simple datasets. + + +[1] Path-Normalized Optimization of Recurrent Neural Networks with ReLU Activations. Behnam Neyshabur, Yuhuai Wu, Ruslan Salakhutdinov, Nathan Srebro.",2,4.0,ICLR2018 +wA3w_pMvSTs,4,H92-E4kFwbR,H92-E4kFwbR,[Official Review] ,"#### Summary #### +This paper tackles the problem of adversarial training for the image classification task. It proposed a novel adversarial training method called composite adversarial training (CAT) against combined attacks constructed by multiple perturbations. First, CAT is based on the composite adversarial attacks, in which the attackers explore different sources of perturbations. Second, CAT leverages the composite adversarial attacks as the inner loop for optimization during the training. The experimental evaluations have been focused on comparing the proposed CAT with existing robust training methods including adversarial training with PGD attacks, AVG, MAX (Tramer and Boneh, 2019), and MSD (Maini et al. 2020) on MNIST and CIFAR-10 classification benchmarks. + +#### Comments #### +This paper studies an important problem in adversarial machine learning. The paper is well-motivated with novel technical contributions (Section 3.1) supported by reasonably designed experiments. However, reviewer feels the submission in the current form is a borderline case mainly due to mixed or inconclusive experimental results. + +W1: The clean accuracy of CAT (Table 1 - 4, first row, last column) is significantly worse than methods such as AVG & MAX and MSD, especially on CIFAR-10 where the accuracy drops 20+% (I assume the state-of-the-art model has 90+% accuracy for the 10-way classification on CIFAR-10). This seems to be a major weakness of the proposed method. Reviewer understands the tradeoff between clean accuracy and accuracy under attack, but not sure how much value it is given the proposed defense method sacrifices too much on the clean accuracy. What makes it worse, this is just the performance drop of 10-way classification on CIFAR dataset. Reviewer is worried if this gap is even more significant on CIFAR-100 or ImageNet (w/ 1000 classes). It would be good to have some ablation studies. + +W2: Besides the drop on clean accuracy, reviewer fails to see a clear winner between MSD and CAT (see the last two columns in Table 1 and Table 2). CAT seems to be more robust to composite attacks but not as robust as MSD on other attacks. 
Such comparisons are missing in Section 4.2 (pixel perturbation and spatial transformations). It would be good to comment on this. + +W3: It would be good to report the computational cost (e.g., number of iterations in optimization, training time) of the proposed composite training method and explain how it is compared to the existing methods. + +Minor1 (applied for all the tables): it would be good to mention each row is a different attack method and each column is a different defense (robust training) method. It is not crystal clear at the first glance. +",5,4.0,ICLR2021 +BypRYU5xM,3,BJcAWaeCW,BJcAWaeCW,This paper presents a highly engineered approach for learning topological features of an input graph with GANs. It is not clear why the approach works and under which conditions it could fail.,"The proposed approach, GTI, has many free parameters: number of layers L, number of communities in each layer, number of non-overlapping subgraphs M, number of nodes in each subgraph k, etc. No analysis is reported on how these affect the performance of GTI. + +GTI uses the Louvain hierarchical community detection method to identify the hierarchy in the graph and METIS to partition the communities. How important are these two methods to the success of GTI? + +Why is it reasonable to restore a k-by-k adjacency matrix from the standard uniform distribution (as stated in Section 2.1)? + +Why is the stride for the convolutional/deconvoluational layers set to 2 (as stated in Section 2.1)? + +Equation 1 has a symbol E in it. E is defined (in Section 2.2) to be ""all the inter-subgraph (community) edges identified by the Louvain method for each hierarchy."" However, E can be intra-community because communities are partitioned by METIS. More discussion is needed about the role of edges in E. + +Equation 3 sparsifies (i.e. prunes the edges) of a graph -- namely $re_{G}$. However, it is not clear how one selects a $re^{i}{G}$ from among the various i values. The symbol i is an index into $CV_{i}$, the cut-value of the i-th largest unique weight-value. + +Was the edge-importance reported in Section 2.3 checked against various measures of edge importance such as edge betweenness? + +Table 1 needs more discussion in terms of retained edge percentage for ordered stages. Should one expect a certain trend in these sequences? + +Almost all of the experiments are qualitative and can be easily made quantitive by comparing PageRank or degree of nodes. + +The discussion on graph sampling does not include how much of the graph was sampled. Thus, the comparisons in Tables 2 and 3 are not fair. + +The most realistic graph generator is the BTER model. See http://www.sandia.gov/~tgkolda/bter_supplement/ and http://www.sandia.gov/~tgkolda/feastpack/doc_bter_match.html. + +A minor point: The acronym GTI is never defined.",4,5.0,ICLR2018 +DzD5Lil_BJT,3,tL89RnzIiCd,tL89RnzIiCd,Interesting results but some questions arise,"This paper considers a continuous version of the classical Hopfield network (HN) model.In contrast to well studied discrete models where the patterns (vectors) that are +stored are discrete, this paper studied continuous vectors and a new continuous energy function. +Convergence results to a fixed point are proven for the new rule, and it is shown that for the case of random patterns, the Hopfield network can memorize exponentially many patterns (with high probability).  
Finally several implementations are given showing how incorporating the new Hopfield net in classification tasks can improve classification accuracy in regimes where +data is scarce and where neural networks do not fare well. + +The paper is rather long and I did not verify all results. The description appears sound.The proofs appear non-trivial and rather technical. While the results here are nontrivial I was left me wondering about the +added value of this new model. One of the biggest advantages of HN was its simplicity and elegance. More recent results of Hopfield and others with higher degree energy functions managed to maintain this clarity and brevity. The new model however is significantly more involved. It was not clear to me what is gained by this greater complexity and whether the gains +justify the larger complexity. In actual implementations very limited precision is often necessary.How does this discretization influence the continuous model? How robust is it to rounding errors? Don't we get ""old"" discrete models in disguise? + +The (impressive) empirical results raise similar questions. Can't we use old discrete HN instead of the new model and achieve similar results? It would be perhaps more informative to compare different HN to the new model presented in this paper. It seems a bit strange that previous uses of HN (discrete ) did not achieve such an improvement in previous studies. It would be beneficial to add more on related work in this area. + + The authors might consider breaking their long paper to two different sections, one presenting the theoretical advantages of their new model and the other focusing on practical benefits. + +Finally, the nature of convergence to a fixed point wasn't clear to me. It seems likely that if patterns are not random convergence can take a long time as is the case for discrete HN. +Some recent work about the complexity of finding fixed points of continuous functions may be relevant here:A converse to Banach's fixed point theorem and its CLS-completeness. +More specific comments: +1) The paper starts with a rather lengthy discussion of previous work. +I would recommend outlining the contributions of this paper earlier on. +2) ""converge in one update step with exponentially low error and have storage capacity proportional to..."" It was not clear to me that random patterns are considered here. +3) ""proven for c= 1.37andc= 3.15 in Theorem 3"" for what c exactly is the result proven? +4) ""Furthermore, with a single update, the fixed point recovered with high probability""I presume this is true for random patterns? +5) Is beta>0?",7,3.0,ICLR2021 +S1qQ8ZFlf,1,B1nLkl-0Z,B1nLkl-0Z,"seems to be a good paper, however, I do not even see an exact algorithm formulation","I think I should understand the gist of the paper, which is very interesting, where the action of \tilde Q(s,a) is drawn from a distribution. The author also explains in detail the relation with PGQ/Soft Q learning, and the recent paper ""expected policy gradient"" by Ciosek & Whiteson. All these seems very sound and interesting. + +Weakness: +1. The major weakness is that throughout the paper, I do not see an algorithm formulation of the Smoothie algorithm, which is the major algorithmic contribution of the paper (I think the major contribution of the paper is on the algorithmic side instead of theoretical). Such representation style is highly discouraging and brings about un-necessary readability difficulties. + +2. Sec. 
3.3 and 3.4 is a little bit abbreviated from the major focus of the paper, and I guess they are not very important and novel (just educational guess, because I can only guess what the whole algorithm Smoothie is). So I suggest moving them to the Appendix and make the major focus more narrowed down.",6,4.0,ICLR2018 +ooxIeD-GI5o,3,HNA0kUAFdbv,HNA0kUAFdbv,A decent work on slide layout representation learning but evaluation can be improved,"This paper applies state-of-the-art transformer-based neural networks to layout representation learning of slides. The most notable contribution of this paper is the construction of large-scale parsed slide layout dataset. This paper proposes to pre-train the network on this large-scale dataset without masked reconstruction strategy and verifies it with several subtasks including element role labeling, image captioning, auto-completion and layout retrieval, with a comparison to a decision-tree based method as baseline. + ++Most of previous layout learning works only show experimental results on small labeled datasets (a few thousands), partially due to the scarcity nature of layout data. This paper looks at slide layout data and constructs a large-scale (>1m) dataset with parsed element properties. ++The chosen network design and training strategies all make sense. + +-It is pity that this paper didn’t disclose sufficient details of how the large-scale dataset was constructed and of data statistics, e.g. how many elements in each slide, templates, completeness of properties, etc. How are the properties parsed, fully automatic? Is the role labeling dataset part of the pretraining dataset? +-Pretraining. The proposed evaluation tasks all seem to be sub-tasks of pre-training and it doesn’t look falling into the classic scheme of unsupervised pretraining + supervised fine-tuning. Dataset differentiation is another issue. For example, in the role labeling experiment, is this targeted dataset a subset of the large-scale one? And is the only difference that the training loss? +-Evaluation. Evaluating graphic layout can be a hard problem and this paper tried to propose several small tasks as probes into the learned network. However, it will be more convincing to have a systematic design of experiments. First of all, in addition to type properties, how about geometric property and color property prediction? Second, any experiments would benefit from both quantitative and qualitative results. Especially for layout design, visualization is very important. +-Layout retrieval is an interesting experiment, but manual scoring seems to be arbitrary. +-Baselines. Neural design network by Lee et al. in ECCV2020 and LayoutGAN by Li et al. in ICLR 2019 seem to be good baseline network architectures to compare, although they are trained in different ways. +",5,4.0,ICLR2021 +BygZMqts3Q,2,BygANjA5FX,BygANjA5FX,Issues of clarity and comparison,"This work proposes an ensemble method for convolutional neural networks wherein each convolutional layer is replicated m times and the resulting activations are averaged layerwise. + +There are a few issues that undermine the conclusion that this simple method is an improvement over full-model ensembles: + 1. Equation (1) is unclear on the definition of C_layer, a critical detail. In the context, C_layer could be weights, activations before the nonlinearity/pooling/batch-norm, or activations after the nonlinearity/pooling/batch-norm. 
Averaging only makes sense after some form of non-linearity, otherwise the “ensemble” is merely a linear operation, so hopefully it’s the latter. + 2. The headings in the results tables could be clarified. To be sure that I am understanding them correctly, I’ll propose a new notation here. Please note in the comments if I’ve misunderstood! Since “m” is used to represent the number of convolutional layer replications, let’s use “k” to represent the number of full model replications. So, instead of “CNL” and “IEA (ours)” in Table 1 and “Ensemble of models using CNL” and “Ensemble of models using IEA (ours)” in Table 2, I would recommend a single table with these headings: “(m=1, k=1)”, “(m=3, k=1)”, “(m=1, k=3)”, and “(m=3, k=3)”, corresponding to the columns in Tables 1 and 2 in order. Likewise for Tables 3-6. + 3. Under this interpretation of the tables---again, correct me if I’m wrong---the proper comparison would be “IEA (ours)” versus “Ensemble of models using CNL”, or “(m=3, k=1)” versus “(m=1, k=3)” in my notation. This pair share a similar amount of computation and a similar number of parameters. (The k=3 model would be slightly larger on account of any fully-connected layers.) In this case, the “outer ensemble” wins handily in 4 of 5 cases for CIFAR-10. + 4. The CNL results, or “(k=1,m=1)”, seem to not be state-of-the-art, adding more uncertainty to the evaluation. See, for instance, https://www.github.com/kuangliu/pytorch-cifar. Apologies that this isn’t a published table. A quick scan of the DenseNets paper and another didn’t yield a matching set of models. In any case, the lack of data augmentation may account for this disparity, but can easily be remedied. + +Given the above issues of clarity and that this simple method seems to not make a favorable comparison to the comparable ensemble baseline (significance), I can’t recommend acceptance at this time. + +Other notes: + * The wrong LaTeX citation function is used, yielding the “author (year)” form (produced by \citet), instead of “(author, year)” (produced by \citep), which seems to be intended. It’s possible that \cite defaults to \citet. + * The acronyms CNL and FCL hurt the readability a bit. Since there is ample space available, spelling out “convolutional layer” and “fully-connected layer” would be preferred. + * Other additions to the evaluation could or should include: a plot of test error vs. number of parameters/FLOPS/inference time; additional challenging datasets including CIFAR-100, SVHN, and ImageNet; and consideration of other ways to use additional parameters or computation, such as increased depth or width (perhaps the various depths of ResNet would be useful here). +",2,4.0,ICLR2019 +awlOJvlPzF,3,fAbkE6ant2,fAbkE6ant2,Review,"# Summary + +The paper analyzes the pitfalls of locally supervised learning from the point of view of information propagation and proposes a new auxiliary loss that can facilitate locally supervised learning. The proposed loss, ""infopro loss"", is then relaxed to a tractable upper bound, which is then used instead. To implement the loss, mutual information is approximated with a decoder, as well as a classifier. The authors further introduce now contrastive learning fits in the framework as a lower bound maximization process regarding mutual information. The experimental results on standard datasets demonstrate the efficacy of the proposed method. + +I have enjoyed reading the paper quite a lot, and therefore recommend accepting the paper. 
Still, I have some reservations that I would love the authors to clarify via the rebuttal. + +# Strength + +The paper is well written. There are some issues (which I detail in the weaknesses section), but most are very clear and easy to follow. There are some parts that are repetitive, but it does allow readers to scheme through without missing the important point. + +The part that I like most about the paper is how proposition 1 and appendix D is presented. They are theoretically well-motivated and gladly seems to work, despite the relaxation. I personally think appendix D deserves more attention in the main text, but this is my personal preference. + +The results clearly show that the method improves over simple greedy locally supervised learning, as well as other attempts at this problem. + +# Weaknesses + +## Section 3.1, regarding equation (1) + +I am not 100% sure on this, but shouldn't the third last sentence of Section 3.1 read ""...under the goal of retaining as much information of the input as possible""? I am not sure this is actually a constraint, and the I(h,x) term is not explicitly task-relevant information. + +## Section 3.3, estimating I(h,y) + +It is not clear to me how I(h,y) is finally approximated. Shouldn't p(y|h) disappear after approximation? In the provided approximation it still exists. If p(y|h) is somehow directly used, isn't q not needed at all? + +## Section 3.3 final equation + +In my opinion, even when it is not referred to in text, equations should have numbers so that future readers can refer to it. + +## Regarding Asy-InfoPro + +It is somewhat unclear whether the dynamic caching was used at the end. Are the experiments in Table 2 with dynamic caching? From my current understanding, it does not seem to be the case, which leads to my second issue. + +This is assuming that the results are without dynamic caching as this seems most logical. The explanation in the ""Asynchronous and parallel training"" paragraph was not obvious to me during the first read. The second sentence could be rephrased and split so that it becomes clear that the distinction between the two modes is that transient feature maps are seen/unseen and that this has a regularizing effect. This then brings up the important question, whether the dynamic caching version suffers from the same fate---it should not. Having this experimental verification would greatly strengthen the observations in this paper. + +## Softmax vs Contrastive + +I am curious as to why contrastive works better. Could it be a coincidence of better hyperparameter tuning? Because the softmax version is a direct estimate on mutual information, whereas the contrastive one optimizes the lower bound. It would be nice if this was discussed in more detail (I do understand that there is already very little space though!) + +## Computational overhead + +Is the computational overhead including the memory transfer cost? Are the results the actual physical measures observed via monitoring the resource usage? Or are they theoretical? I might have missed it, but this is not very clear to me. + +## Typos and grammar errors + +There are quite a few grammatical mistakes throughout the paper. For example, at the beginning of Section 2, it is more natural to write ""we start by considering"" than ""we start with considering"". In the italic question in the second paragraph of Section 2, ""..., even the former..."" should be ""..., even though the former..."". 
While I generally did not find these errors to be critical, I would suggest a thorough proofread. + +I also found a typo in Appendix B. The last sentence should refer to equation (10) not (9). + +## Early stopping, and the choice of the number of epochs + +The training process in this paper does not utilize early stopping. While this is somewhat mitigated by the fact that multiple runs are performed, this is in fact another source of overfitting to the dataset and is strictly speaking tuning hyperparameters on the test set. This is a practice that should be avoided.",7,3.0,ICLR2021 +H1x1dumo3Q,2,BJgnmhA5KQ,BJgnmhA5KQ,Review for Diverse MT with a Single Multinomial Latent Variable,"This paper studies the diverse text generation problem, specifically on machine translation problem. The authors use a simple method, which just using a single multinomial latent variable compared with previous approaches that using multi latent variables. They named the approach: Hard-MoE. They use parallel greedy decoding to generate the diverse translations and the experiments on three WMT datasets show the approach make a trade-off between diversity and quality. +In general, I think generating the diverse translations for machine translation problem may not so important and piratically in actual scenarios. In fact, how to generate fluent and correct translations is more important. + +For the details, there are some problems. 1) The only modification for this work is to make the soft probability of p(z|x) to be 1/K. The others are several experimental studies. To be an formal ICLR paper, this may not be interesting enough to draw my attention. 2) In case of the results, though the authors claimed they achieved better trade-off between diversity and quality, in my opinion, the beam original beam search is good enough from the results in Table 1. 3) In table 2, what means k=0 for the BLEU score? 4) I want to indicate that the purpose of VAE approach related to this work is to increase the model performance w.r.t. the BLEU score instead of the diversity, same as the original MoE method. 5) There are some related works to this work, but their methods are also very effective in terms of the BLEU score, e.g., the author can check this one in EMNLP this year: “Sequence to Sequence Mixture Model for Diverse Machine Translation”. Authors may need a more discussion between those works and this work. +",5,4.0,ICLR2019 +tYxwypDbtUc,2,D4A-v0kltaX,D4A-v0kltaX,Seems similar to previous works. More experiments maybe needed.,"Summary: the work proposes to use neural networks to learn a kernel C(x,p) for PDEs. It embeds the neural network into iterative solvers and trains it with the Adjoint method. + +The writing is clear in general but notation-heavy. For example, it could be better to define notations such as $A$ before using it. + +Strong points: +- Clear, +- Well-motivated, +- Theoretically solid, +- The example given on page 3 is quite helpful. + +Concerns: +- The idea seems similar to the previous works. +- Lack of comparison and benchmarking. + +I am not very certain about evaluating the novelty of this work. The idea to approximate the kernel with the neural networks has been proposed and studied in (https://arxiv.org/abs/2003.03485, https://arxiv.org/abs/2010.08895). The major difference seems to be that they directly learn solution operators, while in this work we embed it into iterative solvers. +It will be great if the authors can help me understand the contribution beyond the previous works. 
+ +Another concern is about the experiments. The test equations (Poisson, Helmholtz, Wave Equation) presented in the paper seems fairly simple. I wonder how other methods such as PINN, or the numerical solver perform on these equations. It will be great to have some benchmarks and comparisons with existing works. It can help me better evaluate the performance of the method. + +Also, I found the scale for the neural network is very small. The authors claim larger networks are not needed, but it's better to have some justification or experiments. + +Questions: +1. how does it differ from the existing work? +2. how does it empirically compare with numerical solvers or other deep learning-based methods? +3. is it possible to try large networks in the experiments? + +Recommendation: +In general, I found this work interesting and concrete, but I am not certain about the novelty. Therefore, I would like to put this paper on margin. + +--- +_Updated review:_ + +>The updated manuscript has some substantial improvements. +> +> I feel the biggest problem is that the authors didn't clearly state the problem settings. If I understand correctly, in their framework the equation is fixed but unknown. The training data are several points in the domain (with parameters input) and testing data are other points. So basically, we doing interpolations. But even the PDE is unknown, they do assume some structure of the PDE, I think. +> +> Other PDEs frameworks are either 1. solver-type: the equation is known and fixed, they directly solve for the solutions. 2. operator-type, the equations are unknown and changing. Train on inputs-outputs for several equations, and test on others. Their setting is quite different. I guess it's the reason their performance is much better in the updated comparison. On the other hand, it's also hard to evaluate their performance since there are no fair benchmarks. + +> In general, I feel this paper is novel and concrete, while it's not very complete and well-presented. I agree with other reviewers that this paper is not ready to publish. +",5,4.0,ICLR2021 +JGdZm3Mw_Ba,3,Au1gNqq4brw,Au1gNqq4brw,An interesting attempt to improve theoretical understanding on gated RNNs.,"This paper attempts to add a contribution on understanding how gated recurrent neural networks like GRUs and LSTMs can learn the representation of n-grams. The authors expand the sigmoid function and the hyperbolic tangent function using Taylor series to obtain approximated closed-form mathematical expression of hidden representation when using the GRU or the LSTM as the update rules. The approximated hidden representation of the GRU and the LSTM update rules can be separated into two terms, (1) the current token-level input feature and (2) the sequence-level feature, which is a weighted sum of all previous tokens. As the hidden representation consists of two feature terms, one can take each feature (either token-level or sequence-level) separately for a downstream task, e.g., evaluate how good when sequence-level feature is used for predicting polarity score in sentiment analysis. + +The idea of improving theoretical understanding on how n-grams are modelled by gated recurrent activation functions is sound. However, I am not entirely satisfied with what has been investigated after obtaining the approximated closed-form expression of gated recurrent activation functions. The tasks that were used in the experiments are sentiment analysis and language modelling. 
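To fix notation for the comments that follow (this is my own shorthand, not the exact expression derived in the paper): the kind of approximation I have in mind is $h_t \approx \phi(x_t) + \sum_{k<t} w_{t,k}\,\psi(x_k)$, where the first term is the token-level feature coming from the current input alone and the second term is the sequence-level feature, a weighted sum over all previous tokens, and either term can be fed to a downstream predictor on its own.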
In sentiment analysis, most of the plots were there to show how token-level features or sequence-level features align with the polarity score, and we can observe some sort of individual implication from each term. However, it is predictable that sequence-level feature should be meaningful. I don't see much of insights by showing that the polarity score from sequence-level features indeed align with this prediction. If we can to apply Taylor expansion to simple recurrent neural networks (RNNs), such that we can expand the hidden representation of a standard RNN into two terms: the current token-level input feature and sequence-level feature, how would the results look like and how can we relate them with what were reported in this paper? Is this paper particularly showing how gated RNNs are modelling n-grams or RNNs in general? A comparison would be nice to show how sequence features get improved in gated RNNs. + +It is interesting to see that the approximated versions of GRUs and LSTMs can perform on a par with the original models on language modelling tasks, however, these results don't necessarily improve our understanding on how gated RNNs are capable of learning good representations of n-grams. They confirm that sequence features are indeed helpful though. + +In Section 5.1, if there were multiple trials of experiments on the same task, why not report the average and the variance of the results instead of one set out of multiple results? + +In Section 5.2, Adpative softmax (Joulin et al., 2017) was used for Wikitext-130. -> Adpative softmax (Joulin et al., 2017) was used for +Wikitext-103.",5,4.0,ICLR2021 +S1eEwK92Fr,2,HkePOCNtPH,HkePOCNtPH,Official Blind Review #1,"This paper describes a (DC)GAN architecture for modeling folk song melodies (Irish reels). +The main idea of this work is to exploit the rigid form of this type of music --- bar structure and repetition --- to enable 2-dimensional convolutional modeling rather than the purely sequential modeling that is commonly used in recent (melodic) music generation architectures. +The ideas in this paper seem sound, though they primarily consist of recombining known techniques for a specific application. +The main weaknesses of this paper are in the evaluation (see below), and while I understand that evaluating generate models for creative applications is difficult and fraught territory, I don't think the efforts taken here are sufficiently convincing. + + + +Strengths: + +- The paper is clearly written, and the authors have taken great care to describe the unique structures of the data they are modeling. +- The proposed architecture seems well motivated, and matches to the structure of the data. + + +Weaknesses: + +There are three components to the evaluation, and each of them are problematic: + +- The first evaluation (Fig 2) compares the average Frechet distance between phrases generated by different models, and within the original dataset. Some brief argument is given for why Frechet is a good choice here, but it still seems quite tenuous: what does this distance intuitively mean in terms of the data? How should the scale of these distances be interpreted / what's a meaningfully large difference? How concentrated are these average distances (ie, please show error bars, variance estimates, or some notion of spread)? + +- The second evaluation (Fig 3) uses t-SNE to embed the generated melodies into a 2D space to allow visual inspection of the differences between distributions produced by each model. 
While this might be a reasonable qualitative gut-check, t-SNE is by no means an appropriate tool for quantitative evaluation. The authors at least did multiple runs of t-SNE, but this hardly amounts to compelling evidence. Moreover, combining all data sources into one sample prior to running t-SNE induces dependencies between the point-wise neighbor selection distributions, which seems undesirable if the eventual goal is to determine how similar each model's distribution is to the source data. A better approach might be to create independent plots for each model's output (with the original data), but I'd generally advise against using t-SNE for this kind of analysis altogether. + +- The third evaluation (Fig 4) measures the amount of divergence from the key (D) in terms of note unigrams. This evaluation is done qualitatively, and the histogram is difficult to read --- it may be easier to read if the octave content was collapsed out to produce pitch classes rather than pitches. If, however, the goal is to actually measure distance from the target key, one could do this quantitatively by comparing histograms to a probe tone profile (or otherwise constructed unigram note model) to more clearly characterise the behaviors of the various models in question. + + +At a higher level, there is no error analysis provided for the model, nor any ablation study to measure the impact of the various design choices taken here (eg dilation patterns in Figure 1). +The authors seem to argue that these choices are the main contribution of this work, so they should be explicitly evaluated in a controlled setting. +",3,,ICLR2020 +H12VRW9gM,1,BkfEzz-0-,BkfEzz-0-,Good work. Presentation needs further polishing.,"In this paper, the authors present a novel way to look at a neural network such that each neuron (node) in the network is an agent working to optimize its reward. The paper shows that by appropriately defining the neuron level reward function, the model can learn a better policy in different tasks. For example, if a classification task is formulated as reinforcement learning where the ultimate reward depends on the batch likelihood, the presented formulation (called Adaptive DropConnect in this context) does better on standard datasets when compared with a strong baseline. + +The idea proposed in the paper is quite interesting, but the presentation is severely lacking. In a work that relies heavily on precise mathematical formulation, there are several instances when the details are not addressed leading to ample confusion making it hard to fully comprehend how the idea works. For example, in section 5.1, notations are presented and defined much later or not at all (g_{jit} and d_{it}). Many equations were unclear to me for similar reasons to the point I decided to only skim those parts. Even the definition of external vs. internal environment (section 4) was unclear which is used a few times later. Like, what does it mean when we say, “environment that the multi-agent system itself touches”? + +Overall, I think the idea presented in the paper has merit, but without a thorough rewriting of the mathematical sections, it is difficult to fully comprehend its potential and applications.",7,3.0,ICLR2018 +Hyg0dv2n2Q,3,HJxwDiActX,HJxwDiActX,Review,"Revision: + +The addition of new datasets and the qualitative demonstration of latent space interpolations and algebra are quite convincing. Interpolations from raster-based generative models such as the original VAE tend to be blurry and not semantic. 
The interpolations in this paper do a good job of demonstrating the usefulness of structure. + +The classification metric is reasonable, but there is no comparison with SPIRAL, and only a comparison with ablated versions of the StrokeNet agent. I see no reason why the comparison with SPIRAL was removed for this metric. + +Figure 11 does a good job of showing the usefulness of gradients over reinforcement learning, but should have a better x range so that one of the curves doesn't just become a vertical line, which is bad for stylistic reasons. + +The writing has improved, but still has stylistic and grammatical issues. A few examples, ""there’re"", ""the network could be more aware of what it’s exactly doing"", ""discriminator loss given its popularity and mightiness to achieve adversarial learning"". A full enumeration would be out of scope of this review. I encourage the authors to iterate more on the writing, and get the paper proofread by more people. + +In summary, the paper's quality has significantly improved, but some presentation issues keep it from being a great paper. The idea presented in the paper is however interesting and timely and deserves to be shared with the wider generative models community, which makes me lean towards an accept. + +Original Review: + +This paper deals with the problem of strokes-based image generation (in contrast to raster-based). The authors define strokes as a list of coordinates and pressure values along with the color and brush radius of a stroke. Then the authors investigate whether an agent can learn to produce the stroke corresponding to a given target image. The authors show that they were able to do so for the MNIST and OMNIGLOT datasets. This is done by first training an encoder-decoder pair of neural networks where the latent variable is the stroke, and the encoder and decoder have specific structure which takes advantage of the known stroke structure of the latent variable. + +The paper contains no quantitative evaluation, either with existing methods or with any baselines. No ablations are conducted to understand which techniques provide value and which don't. The paper does present some qualitative examples of rendered strokes but it's not clear whether these are from the training set or an unseen test set. It's not clear whether the model is generalizing or not. + +The writing is also very unclear. I had to fill in the blanks a lot. It isn't clear what the objective of the paper is. Why are we generating strokes? What use is the software for rendering images from strokes? Is it differentiable? Apparently not. The authors talk about differentiable rendering engines, but ultimately we learn that a learnt neural network decoder is the differentiable renderer. + +To improve this paper and make it acceptable, I recommend the following: + +1. Improve the presentation so that it's very clear what's being contributed. Instead of writing the chronological story of what you did, instead you should explain the problem, explain why current solutions are lacking, and then present your own solutions, and then quantify the improvements from your solution. + +2. Avoid casual language such as ""Reason may be"", ""The agent is just a plain"", ""since neural nets are famouse for their ability to approximate all sorts of functions"". + +3. Show that strokes-based generation enables capabilities that raster-based generation doesn't. For instance, you could show that the agent is able to systematically generalize to very different types of images. 
I'd also recommend presenting results on datasets more complex than MNIST and OMNIGLOT.",7,4.0,ICLR2019 +rJ2zSBlGJgc,2,#NAME?,#NAME?,RETHINKING CONVOLUTION: TOWARDS AN OPTIMAL EFFICIENCY,"## Summary +The paper presents a new convolution structure which tries to achieve a better balance between the efficiency and accuracy. The proposed approach is well motivated and theoretically proved. Reasonable experiments have been provided to validate the proposed algorithm. + +## Pros +1. The paper is well presented and the motivation of the paper is clear. +2. The proposed convolution structure has theoretical small FLOPs and well justified based on the proof. +3. Reasonable experiments have been reported to valiate the the performance gain over the baselines. + +## Cons +1. Besides from the FLOPs, is it possible to provide the computational cost for the proposed algorithm, e.g., including the inference speed in the experiments like Table 3. From the engineering implementation, the proposed structure may not be hardware-friendly. +2. For the experiments on Full Imagenet, what about the experiments for the comparison with the resnet baseline. Also, I would suggest to include the comparison with the baseline with depthwise convolution. + +## Reasons for the rating +The exploration of the structure of the convolution is challenging but important to the community. The discussion of the convolution based on balance of the efficiency and effectiveness is meaningful. Although the experiments do not cover all of my concerns, I would rate it as marginally above the acceptance threshold. + +## Suggestions +Please provde the the comparison of inference speed for the proposed structure. Also, it would be better to report the baseline with depthwise convolution. ",6,3.0,ICLR2021 +BJeOO--12Q,1,HyxpNnRcFX,HyxpNnRcFX,"Gradient-base few-shot learning. Extends MAML to a mixture distribution, to allow for internal task clustering. Falls short of recent state-of-art results, while being even a lot slower than MAML","Summary: + +This work tackles few-shot (or meta) learning, providing an extension of the gradient-based MAML method to using a mixture over global hyperparameters. Each task stochastically picks a mixture component, giving rise to task clustering. Stochastic EM is used for end-to-end learning, an algorithm that is L times more expensive than MAML, where L is the number of mixture components. There is also a nonparametric version, based on Dirichlet process mixtures, but a large number of approximations render this somewhat heuristic. + +Comparative results are presented on miniImageNet (5-way, 1-shot). These results are not near the state-of-the art anymore, and some of the state-of-art methods are simpler and faster than even MAML. If expensive gradient-based meta-learning methods are to be consider in the future, the authors have to provide compelling arguments why the additional computations pay off. + +- Quality: Paper is technically complex, but based on simple ideas. In the case of + infinite mixtures, it is not clear what is done in the end in the experiments. + Experimental results are rather poor, given state-of-the-art. +- Clarity: The paper is not hard to understand. What is done, is done cleanly. +- Originality: The idea of putting a mixture model on the global parameters is not + surprising. Important questions, such as how to make this faster, are not + addressed. 
+- Significance: The only comparative results on miniImageNet are worse than the + state-of-the-art by quite a margin (admittedly, the field moves fast here, but it + is also likely these benchmarks are not all that hard). This is even though better + performing methods, like Versa, are much cheaper to run + +While the idea of task clustering is potentially useful, and may be important in practical use cases, I feel the proposed method is simply just too expensive to run in order to justify mild gains. The experiments do not show benefits of the idea. + +State of the art results on miniImageNet 5-way, 1-shot, the only experiments here which compare to others, show accuracies better than 53: +- Versa: https://arxiv.org/abs/1805.09921. + Importantly, this method uses a simpler model (logistic regression head models) + and is quite a bit faster than MAML, so much faster than what is proposed here +- BMAML: https://arxiv.org/abs/1806.03836. + This is also quite complex and expensive, compared to Versa, but provides good + results. + +Other points: +- You use a set of size N+M per task update. In your 5-way, 1-shot experiments, + what is N and M? I'd guess N=5 (1 shot per class), but what is M? If N+M > 5, + then I wonder why results are branded as 5-way, 1-shot, which to mean means + that each update can use exactly 5 labeled points. + Please just be exact in the main paper about what you do, and what main + competitors do, in particular about the number of points to use in each task + update. +- Nonparametric extension via Dirichlet process mixture. This is quite elaborate, and + uses further approximations (ICM, instead of Gibbs sampling). + Can be seen as a heuristic to evolve the number of components. + What is given in Algorithm 2, is not compatible with Section 4. How do you merge + your Section 4 algorithm with stochastic EM? In Algorithm 2, how do you avoid + that there is always one more (L -> L+1) components? Some threshold must be + applied somewhere. + An alternative would be to use split&merge heuristics for EM. +- Results reported in Section 5 are potentially interesting, but entirely lack a + reference point. The first is artificial, and surely does not need an algorithm of this + complexity. The setup in Section 5.2 is potentially interesting, but needs more + work, in particular a proper comparison to related work. + This type of effort is needed to motivate an extension of MAML which makes + everything quite a bit more expensive, and lacks behind the state-of-art, which + uses amortized inference networks (Versa, neural processes) rather than + gradient-based. +",4,4.0,ICLR2019 +VAg5rkJ55Fj,3,TVjLza1t4hI,TVjLza1t4hI,"An extensive and rigorous validation of the power of autoencoders for EEG based classification, but with little novelty","Summary + +The authors are concerned with the classification of EEG signals, in order to predict age, gender, depression and Axis 1 disorder diagonosis from EEG signals. After standard preprocessing and optionnal averaging to obtain evoked responses, the authors feed the samples into a $\beta$-VAE , and then either use a standard classification algorithm or the SCAN method to predict the labels. +The authors report better results than the usual methods based on the late positive potential. They also show that their method can be trained with non-averaged EEG data and still yield good results when tested on ERP , and conversely. Finally, the authors inspect the learned representations. 
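To make the questions below concrete, let me write down my understanding of the two building blocks in my own notation (not the paper's): asymmetric transfer in the sense of Lee et al. expresses each task parameter as a sparse combination of the others, $w_t \approx \sum_{s \neq t} b_{ts} w_s$, while the subspace view of Kumar & Daume III factorizes the stacked parameter matrix as $W \approx L S$, with $L \in \mathbb{R}^{d \times k}$ holding a small number of shared latent basis vectors and the columns of $S$ being sparse task-specific combination weights.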
+ +Major comments + +- The paper is very well written, easy and pleasant to read, and well structured. +- The automation of EEG pipelines, like this paper does, is extremely important. +- The SCAN + $\beta$-VAE or SCAN+ VAE method does not seem to perform much better than LR +LPP. Even though SCAN allows for interpretable components, it is arguably much less interpretable than LPP. +- The article validates carefully a machine learning pipeline on a specific task and dataset, but there is little contribution in terms of machine learning, so I'm wondering whether ICLR is a good fit for this paper, rather than a more neuroscience-oriented conference. + +Minor comments +- The authors propose an original pipeline, yet the dataset and the code to reproduce the results are not provided, which hinders reproducibility and the potential impact of this work. +- In my understanding, the LPP seems to only use one feature for classification: the average amplitude difference between waveforms. Could the authors also consider methods using more hand-crafted features? +- The EEG signals are only acquired with 3 sensors, it would be interesting to add a word about how the method scales to datasets with more sensors. +- It would be interesting to add some ROC curves for the logistic regressions, which would complement nicely the summary statistic used by the authors. + +Misc. + +- The software used to perform the study should be acknowledged. +- The ERP are normalized to [0, 1] before going in the VAE. The authors could be more accurate: is each channel normalized individually? +- The references that state that EEG contains important biomarkers of clinical disorders, and that averaging trials yields ERP could also point to more historic papers. +",7,3.0,ICLR2021 +S1T4ik9ef,3,SyVVXngRW,SyVVXngRW,"This paper addresses a multi-task feature learning setting, but the objective lacks clarity and the experiments are not convincing.","This paper addresses multi-task feature learning, i.e. learning representations that are common across multiple related supervised learning tasks. The paper is not clearly written, so I outline my interpretation on what is the main idea of the manuscript. + +The authors rely on two prior works in multi-task learning that explore parameter sharing (Lee et al, 2016) and subspace learning (Kumar & Daume III 2012) for multi-task learning. +1) The work of Lee et al 2016 is based on the idea of transferring information through weight vectors, where each task parameter can be represented as a sparse combination of other related task parameters. The interpretation is that negative transfer is avoided because only subset of relevant tasks is considered for transfer. The drawback is the scalability of this approach. +2) The second prior work is Kumar & Daume III 2012 (and also an early work of Argyrio et al 2008) that is based on learning a common feature representation. Specifically, the main assumption is that tasks parameters lie in a low-dimensional subspace, and parameters of related tasks can be represented as linear combinations of a small number of common/shared latent basis vectors in such subspace. Subspace learning could help to scale up to many tasks. + +The authors try to combine together the ideas/principles in these previous works and propose a sparse auto encoder model for multi-task feature learning with (6) (and (7)) as the main learning objectives for training an autoencoder. 
+ +- I couldn’t fully understand the objective in (6) and how exactly it is related to the previous works, i.e. how the relatedness and easyness/hardness of tasks is measured; where does f enter in the autoencoder network structure? +- The empirical evaluations are not convincing. In the real experiments with image data, only decaf features were used as input to the autoencoder model. Why not using raw input image? Moreover all input features where projected to a lower dimensional space using PCA before inputing to the autoencoder. Why? In fact, linear PCA can be viewed as an autoencoder model with linear encoder and decoder (so that the squared error reconstruction loss between a given sample and the sample reconstructed by the autoencoder is minimal (Bishop, 2006)). Then doing PCA before training an autoencoder is not motivated. + +-Writing can be improved. The introduction primarily criticizes the approach of Lee et al, 2016 called Assymetric Multi-task Learning. It would be nicer if the introduction sets the background and covers different approaches/aspects/conditions of negative transfer in transfer learning/multi-task learning setting. The main learning objective (6) should be better explained. + +-Conceptual picture is a bit lacking. Striped hyena is used as an example of unreliable noisy data (source of negative transfer) when learning the attribute classifier ""stripes"". One might argue that visually, striped hyena is as informative as white tigers. Perhaps one could use a different (less striped) animal, e.g. raccoon. +",5,4.0,ICLR2018 +QCk_fNmpO92,3,VYfotZsQV5S,VYfotZsQV5S,A good paper,"This paper propose a doubly stochastic MM method based on Monte Carlo approximation of these stochastic surrogates for solving nonconvex and nonsmooth optimization problems. The proposed method iteratively selects a batch of functions +at random at each iteration and minimize the accumulated surrogate functions (which are expressed as an expectation). They establish asymptotic and non-asymptotic convergence of the proposed algorithm. They apply their method for inference of logistic regression model and for variational inference of Bayesian CNN on the real-word data sets. + +Weak Points. +W1. The authors do not discuss the connections with state-of-the-art second-order optimization algorithms such as K-FAC. +W2. The proposed algorithm still falls into the framework of MM algorithm and a simple convex quadratic surrogate function is considered. The convergence rate of the algorithm is expected. + +Strong Points. +S1. The proposed method can be viewed as a combination of MM and stochastic gradient method with variance reduction, which explains its good performance. +S2. The paper contains sufficient details of the choice of the surrogate function and all the compared methods in the experiments. +S3. The authors establish asymptotic and non-asymptotic convergence of the proposed algorithm. I found the technical quality is very high. +S4. Extensive experiments on binary logistic regression with missing values and Bayesian CNN have been conducted. + + + +",7,3.0,ICLR2021 +S18iNb5lG,3,ByhthReRb,ByhthReRb,"Some good ideas, but lack of detailed explanation impacts understanding","Properly capturing named entities for goal oriented dialog is essential, for instance location, time and cuisine for restaurant reservation. Mots successful approaches have argued for separate mechanism for NE captures, that rely on various hacks and tricks. 
This paper attempt to propose a comprehensive approach offers intriguing new ideas, but is too preliminary, both in the descriptions and experiments. + +The proposed methods and experiments are not understandable in the current way the paper is written: there is not a single equation, pseudo-code algorithm or pointer to real code to enable the reader to get a detailed understanding of the process. All we have a besides text is a small figure (figure 1). Then we have to trust the authors that on their modified dataset, the accuracies of the proposed method is around 100% while not using this method yields 0% accuracies? + +The initial description (section 2) leaves way too many unanswered questions: +- What embeddings are used for words detected as NE? Is it the same as the generated representation? +- What is the exact mechanism of generating a representation for NE EECS545? (end of page 2) +- Is it correct that the same representation stored in the NE table is used twice? (a) To retrieve the key (a vector) given the value (a string) as the encoder input. (b) To find the value that best matches a key at the decoder stage? +- Exact description of the column attention mechanism: some similarity between a key embedding and embeddings representing each column? Multiplicative? Additive? +- How is the system supervised? Do we need to give the name of the column the Attention-Column-Query attention should focus on? Because of this unknown, I could not understand the experiment setup and data formatting! + +The list goes on... + +For such a complex architecture, the authors must try to analyze separate modules as much as possible. As neither the QA and the Babi tasks use the RNN dialog manager, while not start with something that only works at the sentence level + +The Q&A task could be used to describe a simpler system with only a decoder accessing the DB table. Complexity for solving the Babi tasks could be added later. +",3,3.0,ICLR2018 +H1llHZtRYB,3,SJx37TEtDH,SJx37TEtDH,Official Blind Review #2,"This paper demonstrates empirically that the gradient noises of SGD with ResNet and Adam with Bert are different: one is well-concentrated, while the other one is heavy-tailed. The paper claims that this difference costs the failure of SGD on training Bert. Furthermore, the authors proposes gradient clipped SGD and its adaptive version ACClip. Experiments show that ACClip outperforms Adam on training Bert. + +In general, the paper is well-written and has addressed an important practical and theoretical problem of why SGD fails to train Bert and how to fix this problem. The theory appears to be solid. My only concern is how generalizable ACClip is. Experiments show that it outperforms Adam on training Bert. How about the other architectures where Adam is usually applied? Is ACClip competitive to Adam in those applications? What’s the performance of ACClip on DL applications where SGD + momentum works well, such as ResNet on the ImageNet dataset? + +What is exactly \delta f(x)? Is this the full batch gradient over all training examples? + +Some typos: +1. Page 1: thereby providing a explanation +2. Page 4: at most af factor of 2 and Adam +",6,,ICLR2020 +QdaHHLQmoTJ,2,S9MPX7ejmv,S9MPX7ejmv,Interesting idea but many details are missing,"The paper proposes a robust multi-objective RL approach and a non-linear utility metric to enforce an accurate and evenly distributed representation of the Pareto frontier. Robustness is obtained by formulating the problem as a two-player zero-sum game. 
The goal of the main agent is thus to learn the policies on the Pareto frontier under attacks from the adversary. This is achieved by training a single network to generate approximate Pareto optimal policies for any provided preference. To train this network, they introduce a new metric for Pareto frontier evaluation based on hypervolume and entropy (to force evenly distributed solutions). The resulting algorithm has the classical structure of an actor-critic algorithm where the critic provides an estimate of the Q-function and the actor updates the policies of the protagonist and adversary through alternate optimization. + +Could you give more motivations why a robust approach is needed for MORL? The motivation now seems simply to be that the literature didn't take into account robustness. + +I think the idea of the paper is quite interesting but it is not well written/explained. I think a few details are missing. + +For example, in section 4.3.1 it is not clear to me how the loss is constructed. You mentioned that the objective is to learn the Q-function of the protagonist but the target value $y$ is built using the mix policy and Q-function. Could you clarify the meaning of this loss? +There are also two different definitions of $y$ ($s'$ vs $s$ and $\omega'$ vs $\omega$). + +What is the meaning of $\mathbb{E}^{\pi^{mix}}$? Does it mean expectation wrt to the stationary distribution induced by the policy? +It is important for understanding equations 8 and 9. Why there is an approximation in the definition of the gradients? Shouldn't be $\nabla \mathbb{E}^{\pi}[]$? + +Concerning section 4.4, you compare with the metric proposed in (Xu et al. 2020) without explaining it. It is hard for the reader to understand why it is not a good metric without knowing the metric. In general, as you acknowledged, a good approximate Pareto frontier should be accurate, evenly distributed and have a covering similar to the one of the true Pareto frontier. These properties are not equally important. This is to say that I think the example in figure 3 has a major drawback. Frontier 2 and 3 are not Pareto frontier since are dominated by 3. Not sure that everyone will agree that 2 is better than 3. I suggest you change the example. + +To overcome this, you introduced an entropic measure based on a partitioning of the Pareto frontier. There is no mention of how to do that in practice. Could you explain it? Are you partitioning the space of preferences? +A standard measure to enforce spread solutions is the crowding distance (eg in your reference Parisi et al. 2017 and many more), could you apply the same idea of interval partition on this? +Could you explain why you introduced evenness as a multiplicative factor rather than an additive one in $I(P)$? + + +Experiments. There is no information about the implementation and parameters/configurations used. It is thus very difficult to parse the results. For example, how are preferences selected for standard methods (ie all expect BRMORL)? +Why is the comparison with (Yang et al. 2019) missing? I think this is a relevant algorithm for the setting. +You should add an explicit reference to the papers introducing the methods in Table 3, ie RA, PFA, MOEA/D and META. + +I think the paper contains an interesting idea but, given the mentioned concerns, I think this is a borderline paper. I'm looking forward to the authors' feedback. + +Minor issues +-I think it's more precise to define \mathcal{A}^{mix} as the space of possible combinations of actions. 
The probability $\alpha$ should be associated with $\pi^{mix}$ rather than with the action space. +-Reference missing to SUMO +-You use both $\mathbb{E}_\pi$ and $\mathbb{E}^{\pi, \pi'}$ but these symbols are never explained +-Figure 2: ""OA can not parallel to OB"" +-You use often the term convergence in the description of the algorithm. How do you evaluate convergence? +-In figure 4 what is $\alpha(\Omega)$? Is it related to $\alpha$ used in the definition of the mix policy? +",5,4.0,ICLR2021 +ByxyYw-t37,2,B1x5KiCcFX,B1x5KiCcFX,"Interesting analysis of the generalization performance of GANs, lacks strong experimental evidence","The paper provides some bounds on the generalization performance of GANs for approximating distributions with discontinuous support. This work relies heavily on the results shown in [1] and [2] on the approximation power of Deep networks for non-smooth functions. The paper is globally well written and the proof seems sound. However, the experiments could be more convincing and the relevance of the result is questionable: + +- By choosing the function class F to the be L_1-lipschitz, the resulting error bound loses it’s dependence on the smoothness beta and becomes slightly worse than the classical methods (equation 7 with kappa = 2+2D). Is this an artifact of the proof? if that is the case, it would be good to have a tighter bound: [3] might be a good starting point. +- Neural networks used in practice are continuous usually, but it seems that all the analysis is all based on the fact that distributions with disjoint support require discontinuous networks. Can similar results be obtained in the more realistic case of continuous networks? Also what network architecture was used in the experiments? +- Although the bound in eq (5) clearly shows a tradeoff for S_g it only says that S_f should be as small as possible. Of course, if S_f =0 there is no discriminative power, but it’s unclear to me how the expression for S_f in eq (6) can be obtained from (5) and why it would keep the discriminative power (in what sense?). Again, this tradeoff was discussed in prior work [3], so it might be worth looking into that direction. +- The discussion right after lemma 1 doesn't seem to be true: a distribution might have disjoint support and still have a density (i.e.: absolutely continuous with respect to the Lebesgue measure). It can even have a smooth density. +- The experiment doesn’t use the same metric to compare GANs method with other methods, so it is unclear how these methods compare. Moreover, figure 6 seems to show that other methods are also able to get the support right (Kernel E). Based on what could we claim that one method is better than the other? + +Revision: +Thank you for your response. +> In fact, our estimator in the theoretical and experimental analysis employs a continuous (ReLU) network. Though discontinuous networks are necessary for our setting (Lemma 2), we show that (continuous) ReLU networks can approximate the discontinuous network effectively (Lemma 3), hence the effectiveness of GANs is proved (Theorem 1). + +-That clarifies things, however I find that the discussion after lemma 2 rather missleading, if in the end the result ends up using continuous generator: +""Because of the discontinuity, generative models with smooth functions, such as an +adversarial generative model with kernel generators (Sinn & Rawat, 2018), cannot work well with +disconnected supports."" + +- It is still unclear to me how the optimal value of S_f is obtained from eq (5). 
The author points out the work by Zhang+ (2018), but this should be clarified in the current version of the paper: What result in Zhang+(2018) do you use to get this value? + +- I find the experiments not very convincing. I understand that the point is not to show that GANs are better than other methods but it is important to be make meaningful compairisons (use comparable scores) otherwise there is little scientific value in figure 5 especially. + +- As reviewer 1 mentions, lemma 3 is supposed to be one of the main theoretical contributions of the paper, however, the proof seems very similar to the one in ([2], appendix B.1). Although the authors mention lemma 1 of [2] in the proof of lemma 3, it seems like the whole section in ([2] appendix B.1) is dedicated to show the very same result. + +For all these reasons I still wouldn't recommend accepting this paper. + + + + + + +[1]: Yarotsky. Error bounds for approximation with deep relu networks. +[2]: Massaki Imaizumi, Kenji Fukumizu. Deep neural networks learn non-smooth functions effectively. +[3]: P. Zhang, Q. Liu, D. Zhou, T. Xu, and X. He. On the Discrimination-Generalization Tradeoff in GANs. +",5,4.0,ICLR2019 +HkgaxjChYH,2,r1glygHtDB,r1glygHtDB,Official Blind Review #1,"This paper proposes a method for semantic segmentation using ""lazy"" segmentation labels. Lazy labels are defined as coarse labels of the segmented objects. The proposed method is a UNET trained in a multitask fashion whit 3 tasks: object detection, object separation, and object segmentation. The method is trained on 2 datasets: air bubbles, and ice crystals. The proposed method performs better than the same method using only the weakly supervised labels and the one that only uses the sparse labels. + +The novelty of the method is very limited. It is a multitask UNET. The method is compared with one method using pseudo labels. However, this method is not SOTA. Authors should compare with current methods such as: + - Where are the Masks: Instance Segmentation with Image-level Supervision + - Instance Segmentation with Point Supervision + - Object counting and instance segmentation with image-level supervision + - Weakly supervised instance segmentation using class peak response + - Soft proposal networks for weakly supervised object localization + - Learning Pixel-level Semantic Affinity with Image-level Supervision for Weakly Supervised Semantic Segmentation +These methods can use much less supervision (point-level, count-level or image-level) and may work even better. + +The method should be compared on standard and challenging datasets like Cityscapes, PASCAL VOC 2012, COCO, KITTI... + +",1,,ICLR2020 +HJxtbjQUcS,3,Byg9A24tvB,Byg9A24tvB,Official Blind Review #2,"This paper first shows some potential issues of softmax loss (i.e., cross-entropy loss with softmax function) and then propose the Max-Mahalanobis center (MMC) loss to encourge the intra-class compactness for better adversarial robustness. + +The MMC loss is essentially minimizing the distance between the feature and the pre-fixed class center. Different from center loss, these centers are determined by minimizing the maximum inner product between any two class centers. Since the norm of these class centers are normalized to a constant. It is equivalent to angles. This acutally reminds me of a number of works in angular margin-based softmax loss. 
Just to name a few: + +[1] Large-Margin Softmax Loss for Convolutional Neural Networks, ICML 2016 +[2] SphereFace: Deep Hypersphere Embedding for Face Recognition, CVPR 2017 +[3] Soft-margin softmax for deep classification, ICNIP 2017 +[4] CosFace: Large Margin Cosine Loss for Deep Face Recognition, CVPR 2018 +[5] ArcFace: Additive Angular Margin Loss for Deep Face Recognition, CVPR 2019 + +I think these works are closely related to what the authors aim to do, and therefore they should be discussed methodologically and compared empirically. + +Besides that, I think it is also worth conducting an ablation study for how to determine these class centers. This paper considers to minimize the maximum inner product. There are a few papers listed below that explicitly discusses how to make the class centers uniformly spaced. The authors may consider to compare these methods for determining the class centers. + +[1] Learning towards Minimum Hyperspherical Energy, NeurIPS 2018 +[2] UniformFace: Learning Deep Equidistributed Representation for Face Recognition, CVPR 2019 + +For the experiments, the MMC loss indeed shows some advantages over the softmax loss. I am basically convinced by the experiments, although it can further strengthen the paper if the authors can conduct some evaluations on large-scale datasets like ImageNet. + +I appreciate the authors provide many theoretical justifications, which is inspiring. Intuitively speaking, I can understand that shrinking the feature space (i.e., make feature distribution more compact) can improve the adversarial robustness. As a result, I think this paper is naturally motivated and is also theoretically sound. The experiments can be further improved.",6,,ICLR2020 +Hkx6OwB0YB,2,S1efxTVYDr,S1efxTVYDr,Official Blind Review #3,"This paper proposes to add a prior/objective to the standard MLE objective for training text generation models. The prior penalizes incorrect generations/predictions when they are close to the reference; thus, in contrast with standard MLE alone, the training objective does not equally penalize all incorrect predictions. For the experiments, the authors use cosine similarity between fastText embeddings to determine the similarity of a predicted word and the target word. The method is tested on a comprehensive set of text generation tasks: machine translation, unsupervised machine translation, summarization, storytelling, and image captioning. In all cases, simply adding the proposed prior improves over a state-of-the-art model. The results are remarkable, as the proposed prior is useful despite the variety of architectures, tasks (including multi-modal ones), and models with/without pre-training. + +In general, it is promising to pursue work in altering the standard MLE objective; changes to learning objective seem orthogonal to the modeling gains made in many papers (as evidenced by the gains the authors show across diverse models). This paper opens up several new directions, i.e., how can we impose even more effective priors? The authors show that it's effective to use a relatively simple fastText-based prior, but it's possible to consider other priors based on large-scale pre-trained language models or learned models. In this vein, a concurrent paper ""Neural Text Generation with Unlikelihood Training"" has also shown it effective to alter the standard MLE objective. I think it would be nice to discuss this paper and related works. Overall, I think the approach is quite general and elegant. 
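To be concrete about the construction I am referring to, below is a rough sketch of one way such pre-fixed, unit-norm centers could be obtained by pushing down the largest pairwise inner product; this is purely illustrative on my part (the function and parameter names are made up), not the authors' actual procedure:

```python
import numpy as np

def prefixed_centers(num_classes, dim, steps=2000, lr=0.1, seed=0):
    # Unit-norm class centers obtained by minimizing a log-sum-exp
    # surrogate of the maximum pairwise inner product, re-normalizing
    # the rows after every gradient step.
    rng = np.random.default_rng(seed)
    C = rng.normal(size=(num_classes, dim))
    C /= np.linalg.norm(C, axis=1, keepdims=True)
    for _ in range(steps):
        G = C @ C.T
        np.fill_diagonal(G, -np.inf)   # ignore self inner products
        W = np.exp(G - np.max(G))      # soft weights peaking at the worst pair
        np.fill_diagonal(W, 0.0)
        W /= W.sum()
        C -= lr * (W + W.T) @ C        # approximate gradient of the surrogate
        C /= np.linalg.norm(C, axis=1, keepdims=True)
    return C

centers = prefixed_centers(num_classes=10, dim=64)
G = centers @ centers.T
np.fill_diagonal(G, -1.0)
print(G.max())  # largest pairwise cosine; ideally close to -1/(num_classes-1)
```

The resulting configuration is essentially a set of maximally separated directions on the sphere, which is the same kind of angular geometry that the margin-based losses above encourage, hence my request for a direct comparison.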
+ +My main criticism is that the writing was unfocused or unclear at times. The intro discusses a variety of problems in generation, before explaining that the authors only intend to tackle one (""negative diversity ignorance""). It would have been more helpful to read more text in the intro that motivated the problem of negative diversity ignorance and the proposed solution. The second paragraph in the Discussion in Section 4 is rather ambiguous and hand-wavy. It would be nice to see the authors' intuition described more rigorously (i.e., explicitly describing in math how the cosine similarity score is used in the Gaussian prior, or describing in math how the central limit theorem is used). Some of the existing mathematical explanation in section 4 could be made simpler or more clear (the description of f(y) seems to be a distraction since it doesn't end up in the final loss). + +I would have also appreciated more analysis. After reading the paper, I have the following questions (which the authors may be able to address in the rebuttal): +* Do off-the-shelf fastText embeddings work well? How important is it to train fastText embeddings on the data itself? If off-the-shelf embeddings worked well, that could make the method easier to use for others in practice. +* How does the gain in performance with D2GPo vary based on the number of training examples? Priors are generally more helpful in low-data regimes. If that is the case here as well, you might get even more compelling results on low-data tasks (all tasks attempted here are somewhat large-scale, as I understand) +* Qualitatively, do you notice any difference in the generations? How does the model make mistakes (are these more ""semantic"" somehow, i.e. swapping a different synonym in). Perhaps the gaussian prior has some failure modes, i.e., where it increases the probability of very incorrect/opposite words because they have a similar fastText representation. These kinds of intuitions would be useful to know + +I also have one technical question: +* When you compare against MASS (Song et al. 2019), do you use the same code and/or pre-trained weights from MASS, or do you pre-train from scratch using the procedure from MASS? (The wording in the text is somewhat ambiguous.) I'm just wondering how comparable the results are vs. MASS, or if it would be useful to know how your version of the pre-trained model does. + + +Despite my above questions/concerns, I think the proposed method or its predecessors could provide improvements across a variety of text generation tasks, so I overall highly recommend this paper for acceptance. +",8,,ICLR2020 +HyeDT0djn7,3,SJxFN3RcFX,SJxFN3RcFX,"An interesting idea plagued by flaws in presentation, inconsistent notation, and lack of critical experiments","The authors propose an approximate MCMC method for sampling a posterior distribution of weights in a Bayesian neural network. They claim that existing MCMC methods are limited by poor scaling with dimensionality of the weights, and they propose a method inspired by HMC on finite-dimensional approximations of measures on an infinite-dimensional Hilbert space (Beskos et al, 2011). In short, the idea is to use a low dimensional approximation to the parameters (i.e. weights) of the neural network, representing them instead as a weighted combination of basis functions in neural network parameter space. Then the authors propose to use HMC on this lower dimensional representation. 
While the idea is intriguing, there are a number of flaws in the presentation, notational inconsistencies, and missing experiments that prohibit acceptance in the current form. + +The authors define a functional, f: \theta -> [0, 1], that maps neural network parameters \theta to the unit interval. They claim that this function defines a probability distribution on \theta, but this not warranted. First, \theta is a continuous random variable and its probability density need not be bounded above by one; second, the authors have made no constraints on f actually being normalized. + +The second flaw is that the authors equate a posterior on f given the data with a posterior on the parameters \theta themselves. Cf. Eq 4 and paragraph above. There is a big difference between a posterior on parameters and a posterior on distributions over parameters. Moreover, Eq. 5 doesn't make sense: there is only one posterior f; there are no samples of the posterior. + +The third problem appears in the start of Section 3, where the authors now call the posterior U(theta) instead of f. They make a finite approximation of posterior U(\theta) = \sum_i \lambda_i u_i, which is inconsistent with Beskos et al. I believe the authors intend to use a low dimensional approximation to \theta rather than its posterior U(\theta). For example, if \theta = \sum_i \lambda_i u_i for fixed basis functions u_i, then you can approximate a posterior on \theta with a posterior on \lambda. + +The fourth, and most important problem, is that the basis functions u_i are never defined. How are these chosen? Beskos et al use the eigenfunctions of the Gaussian base measure \pi_0, but no such measure exists here. Moreover, this choice will have a substantial impact on the approximation quality. + +There are more inconsistencies and notational problems throughout the paper. Section 4.1 begins with a mean field approximation that seems out of place. Section 3 clearly states that the posterior on theta is approximated with a posterior on lambda, and this cannot factorize over the dimensions of theta. Finally, the authors again confuse the posterior on weights with a posterior on distributions of weights in Eq 11. \tilde{U} is introduced as a function of lambda in Eq 14 and then called with f in line 4 of Alg. 1. These two types are not interchangeable. + +These inconsistencies cast doubt on the subsequent experiments. Assuming the algorithm is correct, a fundamental experiment is still missing. +To justify this approach, the authors should show how the posterior approximation quality varies as a function of the size of the low dimensional approximation, D. + +I reiterate that the idea of approximating the posterior distribution over neural network weights with a posterior distribution over a lower dimensional representation of weights is interesting. Unfortunately, the abundance of errors in presentation cloud the positive contributions of this paper.",3,3.0,ICLR2019 +S1l8cmXhYB,1,rJx9vaVtDS,rJx9vaVtDS,Official Blind Review #3,"The paper introduces Dose Response Generative Adversarial Network (DRGAN) that is aimed at generating entire dose-response curve from observational data with single dose treatments. This work is an extension of GANITE (Yoon et al., 2018) for the case of real-valued treatments (i.e., dosage). The proposed model consists of 3 blocks: (1) a generator, (2) a discriminator, and (3) an inference block. In this paper, GANITE’s generator and discriminator architectures are modified to be able to handle real-valued treatments. 
+ +This paper should be rejected due to the following arguments: + - Since the paper is based on GANITE, I’m going to start my review by pointing out the problems/questions I have about GANITE. First, let me briefly summarize my understanding of this method: + + Given (x, t, yf), the generator G tries to estimate counterfactual outcomes (ycf). + + Given (x, (yf, ycf)), the discriminator D tries to figure out which outcome was factual. + + Yoon et al. (2018) claims that optimizing a GAN with this G and D, the generator will become better and better at estimating accurate counterfactual outcomes. +Looking at the objective function in Eq. (4), it is not clear why D should learn to distinguish factual from counterfactual outcome as opposed to learning the treatment selection bias as t.log(D(x ,y))+... would suggest. Yoon et al. (2018) assume the former while the latter seems more plausible. Of course, a by-product of learning the treatment selection bias is distinguishing yf from ycf, however, this doesn’t mean that G did a good job at generating an accurate estimation of ycf. +In summary, it seems that the adversarial training designed in GANITE and consequently DRGAN provide no advantage in terms of accurate estimation of counterfactuals, which in turn, nullifies the entire claimed contribution of these two works. I will, however, read the rebuttal carefully and am willing to modify the score if the authors address this major concern. + +References: + - Yoon, J., Jordon, J., & van der Schaar, M. (2018). GANITE: Estimation of individualized treatment effects using generative adversarial nets. + + +********UPDATE after reading the rebuttal******** +Thank you for your response. +Still, there are multiple issues with this reasoning that you need to address. + +$X$ is way more expressive than the added binary input that you described in the example. While the binary input is sampled from a Uniform distribution (either ½ or ⅔), $Pr(X|t=1)$ and $Pr(X|t=1)$ distributions are often Gaussian (certainly not Uniform) and therefore, embed a lot more information about which instances should receive which treatment. Therefore, $X$ is way more informative about selection bias than your binary input is about the image being fake/real. Do not underestimate this!! + +It is true that the discriminator reaches its maximum loss when the generator’s estimation of the counterfactual is spot on. That is, it simply predicts the administered treatment according to the treatment selection bias. You need to show, however, that this maxima is global. Otherwise, there could be infinite maximas with the same loss; e.g., why not simply $Y^{factual} = Y^{counterfactual}$ does not reach this maxima? + +You also need to show that the added information to $X$ by concatenating it with the factual outcome plus the generator’s estimation of the counterfactual outcome do improve prediction of the administered treatment -- i.e., minimizing the discriminator loss. + +To summarize, I am not yet convinced that either GANITE or DRGAN do what they claim. + +",1,,ICLR2020 +Byg9XiEs3X,3,ryxDjjCqtQ,ryxDjjCqtQ,interesting problem,"I have read the discussion from the authors. my evaluation stays the same. +-------- +this paper studies an interesting question of how to learn causal effects from observational data generated from reinforcement learning. they work with a very challenging setting where an unobserved confounder exists at each time step that affects actions, rewards and the confounder at next time step. 
+ +the authors fit latent variables models to the observational data and perform experiments. + +the major concern is on the causal inference side, where it is not easy to claim anything causal in such a complicated system with unobserved confounders. causal inference with unobserved confounders cannot be simply solved by fitting a latent variable model. there exists negative examples even in the simplest setting that two distinct causal structure can lead to the same observational distribution. for example here, https://www.alexdamour.com/blog/public/2018/05/18/non-identification-in-latent-confounder-models/ + +it could be helpful if the authors can lay out the identification assumptions for causal effects. before claiming anything causal and justifying experimental results.",4,3.0,ICLR2019 +ux8v9DFX05N,3,3Aoft6NWFej,3Aoft6NWFej,"Clear empirical gains and well written, but somewhat incremental contribution overall","Summary: + +The paper proposes a variant on the MLM training objective which uses PMI in order to determine which spans to mask. The idea is related to recently-proposed Whole Word Masking and Entity-based masking, but the authors argue the PMI-based approach is more principled. The method is straightforward--it involves computing PMIs for ngrams (in this case, up to length 5) over the training corpus, and then preferring to mask entire collocational phrases rather than single words during training. The intuition is that masking single words allows models to exploit simple collocations, thus optimizing their training objective without learning longer-range dependencies or higher level semantic features of the sentences, and this makes training less efficient than it could be. One contribution of the paper is a variant on the PMI metric that performs better for longer phrases by reducing the scores of phrases that happen to contain high-PMI subphrases, e.g. ""George Washington is"" should not have a high score despite the fact that ""George Washington"" does have a high score. + +The authors compare their method against vanilla BERT with random masking, as well as against recently proposed variants such as SpanBERT and AMBERT, and show consistent improvements in terms of final performance as well as better efficiency during training. By way of analysis, the authors also make an argument that token-level perplexity is not correlated with downstream performance. This is an interesting point to make, though they do not expound upon it in this paper. + +Strengths: + +* The proposed method is simple and principled +* The empirical results show consistent improvement on standard benchmark tasks +* The proposed variation the PMI metric is a nice sub-contribution + +Weaknesses: + +* A somewhat marginal contribution, its not significantly different from the variants proposed previously (e.g., SpanBERT, entity masking) +* The evaluation focuses purely on benchmark tasks which are known to have flaws (e.g., the current ""superhuman"" performance on these tasks already makes gains on them suspect). I'd have liked to some more analysis/discussion of the linguistic consequences of this new objective. See more specific comments below. + +Additional Comments/Questions: + +I am curious about the more general effect of this training objective on the models linguistic (and particularly syntactic) knowledge. E.g., can you say more about how often the model sees unigrams being masked and how the distribution of these unigrams differs from what would be seen if we did random masking? 
I ask because I could imagine that this objective has a noticeable effect on the masking of function words (e.g., prepositions occurring more often in collocations, pronouns and determiners maybe less often), and thus the model might get differing access to these words in isolation. Since function words carry a lot more signal about syntactic structure than do content words and phrases (of the type you are capturing in your PMI metric), I'm very curious whether there are some tradeoffs (or, possibly, additional advantages) that come with your method that are not reflected by the benchmark performance metrics. SQuAD and GLUE are going to favor knowledge of things like entities and events, and capture very little about more nuanced linguistic reasoning, so reporting performance on some more recently released challenge sets, or using some probing studies, or at least just giving some analysis of win/loss patterns, would be very informative for assessing the contribution of this paper to NLP more generally. ",6,4.0,ICLR2021
+wH5qmWlxrY,3,FUdBF49WRV1,FUdBF49WRV1,The algorithm is not clearly described and may not be scalable.,"This paper provides a theoretical framework that allows directional convolutional kernels to be defined on any graph, e.g., generalizing CNNs from an n-dimensional grid to arbitrary graphs. In the framework, gradients of the eigenvectors of the graph Laplacian are used to define "directions" on the graph.

The proposed method is well-motivated and seems to be theoretically justified (I did not fully understand the details or check the proofs). My main concerns are:

1. The theoretical development is difficult to follow. The proposed method is not clearly described. 
It is hard for readers to find in the paper what the steps of the proposed algorithm are and to understand how it works. It would be better to describe the algorithm step by step.

2. I have some doubts about the practical value of the proposed method because it requires an eigen-decomposition of the Laplacian matrices. Though the authors provide a complexity analysis in the appendices, it would be more informative to provide a runtime comparison with SOTA methods such as the vanilla GCN.

================ Post Rebuttal =============================================================

I thank the authors for the updates.

In the latest version, the algorithm flow is clearly stated in Figure 1, and now I can understand how the algorithm works. The authors also reported additional results on running time in the latest version, which are informative.

Here is what I think after reading the paper again.

1. This paper proposes a novel idea. Defining directions on graphs is not a well-addressed problem in current GNN models, and using the gradients of the low-frequency eigenvectors of the Laplacian to define directions seems novel and interesting to me.

2. The insight and analysis are not clear. Section 2.4 is still difficult to follow after the updates. More importantly, I am not sure about the correctness of the theorems and corollaries.

 The K-walk distance is supposed to reflect the difficulty of passing information between two nodes, and a larger distance means more difficulty. In the paper, the K-walk distance is defined as the average number of times that a K-step random walk from one node hits another (formal definition given on Page 18), which really puzzles me, because frequent visits indicate ease of message passing. Did the authors confuse hitting probabilities with hitting times?

",5,2.0,ICLR2021
+boK0P0c-ADa,3,WW8VEE7gjx,WW8VEE7gjx,Reformulation of unsupervised dimensionality reduction problem,"The paper considers the unsupervised dimension reduction problem. That is, given a set of points in R^n, find a low-dimensional affine subspace that approximates the support of the distribution that generated the points. More specifically, the paper considers the empirical probability density function p_emp of a dataset, which is the average of \delta^n(x-x_i), where the x_i's are the points of the dataset and \delta^n is the n-dimensional version of the Dirac function. The goal is then to find a distribution q such that its density is supported in a k-dimensional affine space and it minimizes a certain loss D(p_emp,q), where D is a measure of distance between two distributions.
The paper then presents 4 examples of problems that can be formulated in this framework: 1) maximum mean discrepancy; 2) distance based on the higher moments; 3) Wasserstein distance; and 4) sufficient dimension reduction.
Finally, the paper proposes an alternating optimization scheme to solve this optimization problem and presents experiments that compare the accuracy of the proposed method with other dimensionality reduction methods like PCA.

I think the paper is not well-motivated, and it is not clear what the novelties of the paper are. Please explicitly state what the contributions of the paper are. The experiments are also very inconclusive. Table 1 reports the accuracy of KNN on 2 and 3 dimensions. First, I think it is better to report the reconstruction error of PCA and the other methods instead of this. Moreover, it is better to test the projection on a bit higher dimensions as well. 
For example what happens for k=10?",4,4.0,ICLR2021 +8TRH1_BOYDy,4,27acGyyI1BY,27acGyyI1BY,Review of Neural ODE Processes,"This work presents a new method that combines neural ODEs, which uses neural networks to flexibly describe a non-linear dynamical system, and neural processes, which uses neural networks to parameterize a class of functions. By combining these two approaches, neural ODE processes can now adapt to incoming data-points as the latent representation parameterizes a distribution over ODEs due to the NP component of the model. The authors use different variants of the model (first order ODE, second order ODE, linear latent readout) to infer dynamics from data in the experiments section. I find this work to be an interesting and important extension to the existing literature on neural processes. + +My primary qualms with the manuscript is that I found it difficult to glean some of the details about the model(s) and, in particular, the details of the inference procedure. I assume many of the same details in the original NP paper apply here, but it is not clear to what extent, and exactly how. Many of these important inference and model details seem to be missing from both the main text and the supplemental material. + +In particular, when you first discuss the aggregator some key details are missing. You mention that the mapping from r to z could be a NN but you are not clear on when/if the encoder is a neural network in the actual experiments. Also, is it the case that data from multiple contexts are trained in parallel? It is important to specify all of the details for the encoder for each of the experimental sections. The decoder and the other pieces of the model are clear. + +Moreover, how exactly is this trained? SGD? Adam? Is it end to end? I assume you are optimizing an ELBO and the inference methods are akin to a VAE (or the original NP paper), but it is not explicitly said anywhere. Stepping through or at least explaining the primary details of training the model and the training objective will be useful. + +Finally, it is unclear how long this inference takes or what kind of computing resources are needed. Though there are some comparisons of training different versions of the model in the appendix, there is no sense of how long an 'epoch' is. Because there was no code that I could see with the submission, this is doubly difficult to glean. + +I think the proofs of secion 3.2 could be moved to an appendix. Additionally, a lot of space is devoted to the discussion and the conclusion; I would rather see more clarity provided to the implementation of NDPs and their differences at every stage of the model across the experiments. + +I am excited about the work and it does seem to be a useful extension of existing methods, and I think there are details that need to be clarified in order for this to be publishable. + +Minor details: +Bottom of page three ""the the"" + + + +%%%%% EDIT %%%%%% + +I am satisfied with the author's response and given the proposed changes will raise my score to a 7. + +",7,3.0,ICLR2021 +SJg5q-L5Yr,1,HJgLZR4KvH,HJgLZR4KvH,Official Blind Review #1,"This paper proposes a novel approach to learn a continuous set of skills (where a skill is associated with a latent vector and the skill policy network takes that vector as an extra input) by pure unsupervised exploration using as intrinsic reward a proxy for the mutual information between next states and the skill (given the previous state). 
These skills can be used in a model-based planning (model-predictive control) with zero 'supervised' training data (for which the rewards are given), but using calls to the reward function to evaluate candidate sequences of skills and actions. The proposed approach is convincingly compared in several ways to both previous model-based approaches and model-free approaches. + +This is a very interesting approach, and although I can already think of ways to improve, it seems like an exciting step in the right direction to develop more autonomous and sample efficient learning systems. I suggest to rate this submission as 'accept'. + +Regarding the comparison to model-free RL: although it is true that no task-specific training is needed, a possibly large number of calls to the reward functions are needed during planning. It would be good to compare those numbers with the number of rewarded trajectories needed for the model-free approaches. + +My main concern with the proposed method is how it would scale when the state-space becomes substantially larger (than the 2 dimensions x and y used in the experiments). The reason I am concerned is that the proposed method uses brutal sampling to search for good trajectories in z-space and action-space. It looks like the curse of dimensionality will quickly make this approach unfeasible. Also, it would be nice to have the learning system discover the important dimensions in which to plan (the x and y in the experiments), rather than having to provide them by hand. + +A minor concern is the following: is it possible that the optimization could end up discovering a large number of highly predictable (and diverse) but useless skills? + +In the related work section, 1st paragraph, in the list of citations, it might be good to also include the work on maximizing mutual information between representation of the next state and representation of the skill (Thomas et al, arXiv:1802.09484). + +The definition of Delta (page 9) is strange: it is said that Delta should be minimized but Delta is defined as proportional to the rewards (which should be maximized). Maybe a sign is missing. Also, why not simply define the rewards as being normalized in the first place, so that the metric IS the accumulated reward rather than this unusual normalized version of it.",8,,ICLR2020 +LeFq8XDkxgW,1,UfJn-cstSF,UfJn-cstSF,Interesting idea but theoretical results and evaluation are not satisfactory,"### Strengh + +- The idea makes sense to adapt the thresholding mechanism to an input distribution with various reconstruction error. It might bring a much better empirical performance compared to thresholds fixed globally and it seems to be adapted in a denoising setting. + + +### Weakness + +- The motivation for this work is not sufficiently furnished. The authors claim that it is useful when there is a discrepancy between train/test distribution but do not provide reference to realistic situation where such a problem arise. +- Also, this method cannot really be used for real sparse coding problems as the network needs to be trained with the ground truth which is often not known in practice. +- The theoretical contribution is marginal as it is almost straightforwardly adapted from Chen et al. (2018) and Liu et al. (2019). +- The theoretical results are not precise enough (`s small enough`) while this assumption might very well render all the results only applicable to toyish case. Moreover, these assumptions are stronger than the one in `Chen et al. (2018)` and `Liu et al. 
(2019)` for $x_s$ realizing the sup in the expression of $b^t$ (Eq.7). +- Almost all the experiments use the same setting as previous work, failing to highlight the advantage of the proposed method. +- The performance advantage over LISTA seems to be minor from Figure.2. + + +## Extra remarks + + +- The proposed goal is to adapt LISTA in a setting where the input training distribution is *different* from the testing one. However, in this setting, using an algorithm like LISTA does not make sense as the learned weights have no reason to be adapted to the new distribution if the distribution do not overlap at all. There might be some degree to which it is possible to adapt but this should be made more explicit and better discussed. In particular, what type of distribution shift are considered and would make sense -- sparsity is mentionned but it is unclear. + +- p.3: `normal training of LISTA leads to it.` I don't think there is any results showing that SGD over LISTA achieves such threshold in theory and I haven't seen any proper empirical validation. If it exists, a proper citation is needed. Else, the statement should be updated. + +- p.3: `According to some prior works, we also know that U(t)∈ W(A)`: in the three cited papers, there seems to be no results showing that the learned `U(t)` verifies this. The statement is once again too strong. + +- Eq.(8): Would it be interesting to evaluate the usage of $\rho^{(t)} = \mu(A)$? + +- p.4: `the main results are obtained under a mild assumption of the ground-truth sparse code` -> The assumption is not mild. For instance, it seems to never be verified in any of the experiments. The statement $\mu(A)s \ll 1$ seems not backed by any experimental evidence and I don't think this is true. + +- p.4: `the above assumption gives a more detailed description of the distribution for $x_s$` while it is true that it sets a distribution on the space $\mathcal X$, it is not more precise as in the assumptions by `Chen et al. (2018)`, I believe no distribution is mentionned. so overall it constrains the type of distribution while it is not required in `Chen et al. (2018)` + + +## Minor comments, nitpicks and typos + +- citations in () could use `citealt` to remove the extra parenthesis. +- p.1: Lasso can also be solved using CD algorithms, which are typically state of the art. +- p.1: In Gregor&LeCun (2010), the thresholding is not modified compared to ISTA. +- p.2: `with W(t)=I−U(t)A holds for any layer` -> `in the case where W(t)=I−U(t)A holds for any layer`. +- Eq.(7): +- `Liu et al. (2018)`: The proper citation is `Liu, J., Chen, X., Wang, Z. & Yin, W. ALISTA: Analytic Weights are as good as Learned weigths in LISTA. in International Conference on Learning Representation (ICLR) 1113–1117 (2019).` +- p.4: `a truncated distribution` -> I assume the authors mean `truncated gaussian distribution`? +",5,4.0,ICLR2021 +B1er_zja3m,2,H1gZV30qKQ,H1gZV30qKQ,"Overall interesting, but concerns about the key idea and the applicability of the method","The paper considers the problem of transfer in continuous-action deep RL. In particular, the authors consider the setting where the dynamics of the task change slightly, but the effect on the policy is significant. They suggest that values are better suited for transfer and suggest learning a model to obtain these values. + +Overall, there are interesting ideas here, but I am concerned about whether the proposed approach actually solves the problem the authors consider and its general applicability. 
+ +The point about value functions being better suited for transfer than policies is indeed true for greedy policies: it is well-known that they are discontinuous, and small differences in value can result in large differences in policy. This point is hence relevant in continuous control, where deterministic policies are considered. + +But I am a bit confused as to why the proposed approach is better though. Eq. (4) still takes a max w.r.t. the estimated dynamics, etc. So even if the value function is continuous, by taking the max, we get a deterministic policy which has the same problem! That is probably why the performance is quite similar to DDPG. Considering a softer policy parameterization (a continuous softmax analogue) would be more in line with the authors’ motivation. + +The proposed method itself doesn’t seem generally practical unfortunately, as it is suggested to learn the *model* of the environment for with a high-dimensional state space and a continuous action space, and do value iteration. In other words, if Property 2 was easy to satisfy, we wouldn’t be struggling with model-based methods as much as we are! However, I do appreciate that the authors illustrate the model loss curves in their considered domains. This raises a question of when are dynamics “easy”. + +The theoretical justification is quite weak, since the bound in Proposition 2 is too loose to be meaningful (as the authors themselves acknowledge). One way to mitigate this would be to support it empirically, by considering a range of disturbances of the specified form, and showing the shape of the bound on a small domain. The same thing can be done for the parametric modifications considered in the experiments -- instead of considering a set of instances, consider the performance as a function of the range of disturbances to the same dynamics parameter. + +Minor comments: +* The italicization of certain keywords in the intro is confusing, in particular precise, imprecise -- these aren’t well-defined terms, and don’t make sense to me in the mentioned context. The policy function isn’t more “precise” than the value. +* I suggest including the statements of the propositions in the main text",4,4.0,ICLR2019 +S1leviI0YB,3,BJlqYlrtPB,BJlqYlrtPB,Official Blind Review #2,"Summary: +The authors propose augmenting VAEs with an additional latent variable to allow them to detect out-of-distribution (OOD) data. They propose several measures based on this model to distinguish between inliers and outliers, and evaluate the model empirically, finding it successful. + +Unfortunately, the method in this paper is developed unclearly and incorrectly. Although their experiments are somewhat successful, the problems with the text and method are severe enough to justify rejection. + +Specifically, the authors' method proposes adding a term to the loss of the VAE that encourages the variational posterior (q) to distribute latent codes (z) for inliers and outliers differently. The equation which defines their new objective is unclear -- specifically, it is not clear whether the added KL term is computed for inliers and outliers both, or whether it is only computed for outliers. If it is the former, then the method does not make sense. If it is the latter, then the equation is incorrect or at the very least not clear in the extreme. + +Furthermore, the term is added without consideration of whether or not the method is still optimizing a sensible variational lower bound. 
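To make the ambiguity concrete, the kind of augmented objective I have in mind is, in my own notation (a reconstruction for illustration, not an equation copied from the paper),
$$\mathcal{L}(x) = \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big] - \mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big) - \lambda\,\mathrm{KL}\big(q(z \mid x)\,\|\,p_{\mathrm{out}}(z)\big),$$
where $p_{\mathrm{out}}$ is a second prior component intended to absorb outlier codes. The unresolved question is whether the last term is applied to every input or only to inputs known to be outliers; as argued above, the first reading makes the method hard to justify, and the second does not match the equation as written.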
The authors attempt to justify the objective by writing out a variational lower bound for a VAE with a mixture prior where inliers and outliers are generated from different mixture components. However, their equations are incorrect -- the equation that is called the log likelihood is not the log likelihood, and the ELBO is similarly wrong. + +Their empirical evaluation is reasonable, although the measures they propose to distinguish between inliers and outliers (i.e. the kl from the approximate posterior to the prior) is not thoroughly justified. +",3,,ICLR2020 +BJl5tkoN2X,2,ryxnHhRqFm,ryxnHhRqFm,End-to-end task oriented system: An encoder-decoder approach with a shared external knowledge base,"This is, in general, a well-written paper with extensive experimentation. + +The authors tried to describe their architecture both with equations as well as graphically. However, I would like to mention the following: + +In Section 2.1 I am not sure all the symbols are clearly defined. For example, I could not locate the definitions of n, l etc. Even if they are easy to assume, I am fond of appropriate definitions. Also, I suspect that some symbols, like n, are not used consistently across the manuscript. + +I am also confused about the loss function. Which loss function is used when? + +I am missing one more figure. From Fig 2 it's not so straightforward to see how the encoder/decoder along with the shared KB work at the same time (i.e. not independently) + +In Section 2.3, it's not clear to me how the expected output word will be picked up from the local memory pointer. Same goes for the entity table. + +How can you guarantee that that position n+l+1 is a null token? + +What was the initial query vector and how did you initialise that? Did different initialisations had any effect on performance? + +If you can please provide an example of a memory position. + +Also, i would like to see a description of how the OOV tasks are handled. + +Finally, your method is a NN end-to-end one and I was wondering how do you compare not with other end-to-end approaches, but with a traditional approach, such as pydial? + + +And some minor suggestions: + +Not all the abbreviations are defined. For example QRN, GMN, KVR. It would also be nice to have the references of the respective methods included in the Tables or their captions. + +Parts of Figs. 1&2 are pixelised. It would be nice to have everything vectorised. + + I would prefer to see the training details (in fact, I would even be favorable of having more of those) in the main body of the manuscript, rather than in the appendix. + +There are some minor typos, such as ""our approach that utilizing the recurrent"" or ""in each datasets""",8,2.0,ICLR2019 +Vi3v___WWM,1,Utc4Yd1RD_s,Utc4Yd1RD_s,"Interesting results, novelty slightly limited – would benefit from further exploration","### Summary + +The authors propose to improve the robustness against adversarial attacks in deep networks via the use of a customized normalization strategy. Their central methodological suggestion is called gated batch normalization (GBN), and involves a (soft, but some other variants are briefly explored as well) gating mechanism that maps to distinct normalization branches. Their idea is evaluated on a set of benchmarks and robustness is measured threefold: in terms of $L_1$, $L_2$ and $L_\infty$ perturbations. 
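To fix ideas, here is a minimal sketch of the kind of soft-gated, multi-branch normalization layer described in the summary above. This is my own simplified rendering: the class name, the gate architecture, and the purely unsupervised soft gate are all assumptions on my part (the paper, as discussed below, passes ground-truth perturbation information to the gating during training, which this sketch omits), so it should be read as an illustration rather than as the authors' implementation.

```python
import torch
import torch.nn as nn


class GatedBatchNorm2d(nn.Module):
    """Soft-gated multi-branch batch norm: each branch keeps its own
    normalization statistics, and a small gate mixes the branch outputs
    per sample."""

    def __init__(self, num_features: int, num_branches: int = 3):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.BatchNorm2d(num_features) for _ in range(num_branches)]
        )
        # Gate: global-average-pooled features -> soft weights over branches.
        self.gate = nn.Sequential(
            nn.Linear(num_features, num_branches),
            nn.Softmax(dim=-1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W)
        pooled = x.mean(dim=(2, 3))                                   # (N, C)
        weights = self.gate(pooled)                                   # (N, B)
        outs = torch.stack([bn(x) for bn in self.branches], dim=1)    # (N, B, C, H, W)
        weights = weights.view(x.size(0), -1, 1, 1, 1)                # (N, B, 1, 1, 1)
        return (weights * outs).sum(dim=1)                            # (N, C, H, W)
```

Replacing the softmax with a hard argmax or Gumbel gate, or feeding the gate ground-truth perturbation labels during training, would be drop-in variants of this skeleton.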
+ +### Strengths + +The authors formulate an intellectually compelling underlying hypothesis, namely that adversarial perturbations can be viewed as coming from distinct domains, special in character since they may be positioned very closely to the original domain of natural images of the unperturbed distribution. This is an interesting viewpoint (although I am not sure why we would want to call these entities domains, a ""mode"" or ""manifold"" would seem more appropriate), and this aspect is in my opinion the strong point of the paper! + +The experimental section evaluates GBN against competitor approaches, either directly purposed for adversarial robustness (e.g. MSD) and conceptually similar normalization techniques (MN, MBN). CGN achieves convincing performance, and the underlying idea of separating BN statistics seems to improve adversarial robustness in a substantial way. + +### Weaknesses + +While the methodology of several gated BN units is well motivated, the technical novelty is somewhat limited. In particular, the method proposed here is very similar to the work of Deecke et al. (ICLR, 2019), with its main differentiation being that the gating mechanism is computed in a round-robin fashion (i.e. parameters are fixed when updating the gates, and vice versa) and with ground-truth information passed to the normalization units. + +That being said, introducing supervision (with respect to the initially applied perturbation) seems to be crucial to be effective when defending against adversarial examples, and increases performance against a range of competing approaches. Because of this, I would argue that the limited novelty in terms of the normalization is offset somewhat by the idea to try this out in an adversarial setting – and getting it to work there. + +### Suggestions + +There are a few suggestions that I would have liked to see included in the manuscript to increase its strength: +* further study around what constitutes a good gating mechanism, in particular (i) what is the performance for a Gumbel-based approach; (ii) since the gates are computed offline, is there some sort of non-local parameter-sharing that (e.g. via a regularizer) benefits performance? This would also help in increasing the novelty of this work respective to the various competitive normalization approaches introduced in earlier publications. +* Additional analysis around Figure 1 (c) and (d). Why was this particular layer chosen? Does this trend hold in other layers as well? What about deep lower versus higher layers. There are a number of interesting questions to be explored around this theme, for example is this trend a function of the semanticity? For some background reading on the depth of networks and their semantic function, I recommend looking into Asano et al. (ICLR, 2020). + +### Minor suggestions + +Good job on Figure 1 (a) and (b), this helped tremendously with quickly understanding the proposed mechanism! However, Figure 2 could use some improvement (in fact, maybe this can be removed in favor of additional discussions as outlined above). + +There seems to be some issue with the citations, e.g. Kurakin et al. (ICLR, 2017) and Deecke et al. 
(ICLR, 2019) are cited with first & last names in the wrong order.",6,4.0,ICLR2021 +HkgJzPJP2X,2,SkxxIs0qY7,SkxxIs0qY7,Interesting and promising method for generative modelling of sequence data without policy gradient,"The paper proposes an interesting method, where the discriminator is replaced by a component that estimates the density that is the mixture of the data and the generator's distributions. In a sense, that component is only a device that allows estimating a Jensen-Shannon divergence for the generator to then be optimized against. Other GAN papers have replaced their discriminator by a similar device (e.g., WGANs, ..), but the present formulation seems novel. The numerical experiments presented on a synthetic Turing test and text generation from EMNLP's 2017 news dataset appear promising. + +Overall, the mediator seems to allow to achieve lower Jensen-Shannon (JS) divergence values in the experiments (and is kind of designed for that). Although this may be an improvement with respect to existing methods for discrete sequential data, it may also be limited in that it may not easily extend to other types of divergences that have proved superior to JS in some continuous settings. + +The paper is rather clear, although there are lots of small grammatical errors as well as odd formulations which end up being distracting or confusing. The language should be proof-read carefully. + +Pros: +- Generative modeling of sequence data still in its infancy +- Potentially lower variance than policy gradient approaches +- Experiments are promising + +Cons: +- Lots of grammatical errors and odd formulations + +Questions: +- Equation 14: what does it mean to find the ""maximum entropy solution"" for the given optimization problem? +- Figure 2: how do (b) and (c) relate to each other? + +Remarks, small typos and odd formulations: +- ""for measuring M_\/phi"": what does measuring mean in this context? +- What does small m refer to? Algorithm 1 says the total number of steps but it is also used in the main text as an index for J and \pi (for mediator?) +- Equation block 8: J_m has not been defined yet +- ""the supports of distributions G and P""... -> G without subscript has now been defined in this context +- ""if the training being perfect"" +- ""tend to get stuck in some sub-optimals"" +- the learned distribution ""collapseS"" +- ""since the data distribution is, thus ..."" +- ""that measures a"" -> ""that estimates a ...""? +- ""a predictive module"": a bit unclear - generative v. discriminative is more usual terminology +- ""is well ensured"" +- ""with the cost of diversity"" -> ""at the cost of diversity""? +- ""has theoretical guarantee"" +- in the references: ""ALIAS PARTH GOYAL"" (all caps) +- ""let p denote the intermediate states"": I don't understand what this is. Where is ""p"" used? (proof of Theorem 3) +- ""CoT theoretically guarantees the training effectiveness"": what does that mean? +- Figure 3: ""epochs"" -> ""Epochs"" +- Algorithm 1: what does ""mixed balanced samples"" mean? Make this more precise +- ""wide-ranged"" +- Equation 10 is too long and equation number is not properly formatted +- Figures hard to read in black & white +- Figure 2 doesn't use the same limits for the Y axis of the two NLL plots, making comparisons difficult. The two NLL plots are also not side-by-side",7,2.0,ICLR2019 +lNlT_ly4yr,1,U_mat0b9iv,U_mat0b9iv,"Good motivation and findings with seem-solid but actually weak theory. 
unclear paper writing, difficult to follow","The paper proposes an innovate method based on lottery ticket hypothesis to prune a BNN (parameters are only -1(0) and +1, it can be viewed as an extreme case of quantization) from a dense NN. It focuses on learning a mask to prune the NN instead of the traditional method (pruning on an already trained network). In addition, not only experiments but theortical proof are given and have a highly brief result. + +Pro: +The way to find the mask iteratively is innovate and has a mathematical support. +The result of MPT is amazing because the untrained network can be pruned to a BNN with comparable accrancy of some trained SOTA NN on CIFAR dataset. +The experiments show it can be generized to deeper and wider network. +It has better accurancy than other BNN methods but network parameters still high. +Con: +The main article spends little word to describe how to find the mask, and it is not a trivial way. +The experiments of generization are only done on the very small NN (e.g. Conv2/4/6/8). + +Clarity: Very low. Pros: The authors try to express in a way that every step of logical connections in this paper can be clearly understood by readers. To reduce complexity, many parts are settled in the appendix. Cons: Many sentences in this paper are quite long and sometimes using nesting clauses, which makes the text to be obscure. Besides, since many parts are moved into the appendix, the whole structure of the main body is kind of empty and shallow. Some summative and conclusive paragraphs are the simple repetition of previous “claims” since the demonstrations are in the appendix. All of these make a lower clarity. +Finally, I find that this paper has narrower page margins, which means each line can contain more characters. Besides, the header “Under review as a conference paper at …” is missing. These modifications of the submit template may volatile the conference rule and should be considered a cheating behavior. So I give the “very low” score on clarity. + +Originality: Medium. Pros: They apply the Lottery Ticket Hypothesis to quantization/BNN. They give proof of their rationality. Cons: The success essentials of their algorithm “biprop” contains two-part: edge-popup and gradient estimator. Both of them are take-away from other works. I regard this work as a new application of LTH to binary neural networks. +",4,4.0,ICLR2021 +HjtMqmtlS9P,4,M71R_ivbTQP,M71R_ivbTQP,Interesting idea and missing validations for the regions around a sample and for non-high confidence samples,"content: +It is about pruning for explanation. The goal of the methods presented is, given a sample x, to extract a network, which is +an unmodified subset of the original network, +and has similar predictions to the original network in a region around x. + +The authors derive a gradient-based optimization procedure and do a heuristic threshold-cutoff postprocessing to remove layers and filter channels after the optimization. + +The authors evaluate faithfulness in the sample itself. +Faithfulness for the region is evaluated using the nearest neighbors from the dataset and in a high level feature map metric. + +strength: +paper concept is well explained. Clarity of the idea. +The outcome are sample-dependent very small sub-networks. + + + +weaknesses: + +--The validation of the method. + +First of all, they do not validate the impact on the region of a sample x sufficiently, and that must be done because it is a central claim of the paper. 
The not satisfactorily attendance to that claim is the main reason to reject this paper at the moment. page3: ""(3) it is for data from a local region instead of the whole data distribution."" + +That is relevant even more so as the input space has usually some adversarial samples nearby with respect to a metric in the input space. + + +Figure 4 gives a partial result - but for a very high level feature space notion of neighbors rather than with respect to a metric in the input space. + +Using the last layer for defining the metric to obtain the nearest neighbors may result in rather ""semantic"" neighbors with similar high level structure, but very different low level structure, which does not conform to the idea of a \eps-lp-metric Ball around x in input space. It is not local with respect to the input space. + +In that sense the evaluation of local faithfulness is not complete. + +Furthermore using nearest neighbors from the dataset also does not guarantee that they are locally close to x (for regions with low data density this may not be guaranteed).. + + +--one needs to perform evaluation with some kind of sampling of points within a ball around a sample, close to the local ReLU linearity zone and evaluate faithfulness for those sampled points -- adressing the notion of points being close wrt.~a metric in the input space. + +The reviewer would be satisfied, if that would be done for a few hundred test points if 1500 x2 networks costs too much time. + +--one needs to perform evaluation on what happens with predicted labels of adversarial samples close to x. At least to consider how likely do they switch back to the original, non-adversarial label when looking at the extracted subnet. + +The reviewer would be satisfied, if that would be done for one adversarial per test point but over hundreds of different test points. +Optionally to consider how likely do they switch to another wrong label (this relevant for targeted attacks only, thus optional). + +-- any test to be done also for both nets + + +Secondly, + +they do evaluate faithfulness in the sample point. The reviewer is not fully satisfied with the metric. + +Fig 3 decrease in probability for predicted class does not tell if it changes the predicted label. They circumvent this problem by using only high confidence samples, however this creates a biased or limited evaluation (namely how this method works on the most confident samples). + +it would be better to measure two things: + +-- do these evaluations for all samples on their predicted label (a few hundred ...), not only the very confident ones. + +--the change in difference to the highest scoring other class ( this gets negative if the predicted label switches ) when comparing original and extracted net. + +For example by A quotient of ""diff to highest scoring other class (extracted)"" / ""diff to highest scoring other class (original)"" - this is signed and gets negative if on the extracted net the highest scoring other class is the flipped predicted label + +--the probability that the predicted label switches when looking at the subnet + + + +central suggestions for improvement: + +run experiments: + +--one needs to perform evaluation with random sampling of points within a ball around a sample, close to the ReLU linearity zone and evaluate faithfulness for those sampled points -- adressing the notion of points being close wrt.~a metric in the input space. + +--one needs to perform evaluation on what happens with predicted labels of adversarial samples close to x. 
At least to consider how likely do they switch to the original, non-adversarial label. + + +Fig3 with: +--the change in difference to the highest scoring other class ( this gets negative if the predicted label switches ) when comparing original and extracted net. + +For example by A quotient of ""diff to highest scoring other class (extracted)"" / ""diff to highest scoring other class (original)"" - this is signed and gets negative if on the extracted net the highest scoring other class is the flipped predicted label (as it works on the predicted class on the original net, the denominator is always positive) + +--the probability that the predicted label switches when switching at the subnet + +-- do not perform the experiments only on high confidence samples but on all samples. This may constitute a bias towards ""easy"" samples otherwise. + +This further also allows to answer the question: are difficult samples with lower confidence less sparsely represented than high confidence samples ? + + +technical problem: + +eq (7) seems to have a typo. The idea is understood, to consider to drop a layer and use the output of the layer before, if input and output size matches, which may make sense for NNs with residual connections. + +However if one looks at the case otherwise in eq.7, then it is F^l(x,W^{1:l}) without any filter masks S. In accordance to eq5 this is the mask-less original NN, which is likely a mistake. + +It seems that for both cases (incl. the first case) F^l(x,W^{1:l}) needs to be changed to include the masks S +F^l(x,(W^{l'} \odot S^{l'}, \alpha)^{l'})_{l'=1:l} + +That is a fixable problem that does not affect the paper rating. + +Technical Questions to the authors: + +Section 3.2 algorithms: +-- if the loss is above a certain threshold, it is understood that then you perform the finetuning on the S values. In that case, do you roll back then the filter removal using tau ? In that case, do you roll back then the filter removal by G ? +-- how do you estimate \tau, \lambda and \lambda_g ? + +General questions to the authors: + +The authors contribution in Fig5 and Fig6 seems to be the sparse network extraction, as the SMOE visualization and the filter activation maximization are known. It seems that these visualizations, while nice to show, are not really central to the questions raised by the authors. However, this should not be misunderstood as a wish from the reviewer to remove them. Rather to remark that the paper novelty is the left side in these figures. + +The reviewer takes the value of the method so far by: ""When applied to samples from a local region in data space, it is plausible that its inference process mainly relies on a small subset of layers/neurons/filters."" + +--Do the authors have a suggestion how the sparse network extraction can be used for any kind of analysis or insight beyond showing the sparse network itself ? + +-- SMOE can also be applied on the original, large network. What is the difference (or value) between applying SMOE visualization on the original network to at first pruning and then applying SMOE on the pruned net? + +typos: +--three quantitative analysis over --> three quantitative analyses over (Plural) + +--Fig2 is a table + +minor suggestions: + +--""we remove layer-$\ell$ if G^l < 0.5"" is simpler +--remove the top subfig in figure 1, as it is anyway shown in Fig 5 and Fig6 in better resolution. I think you need the space for more interesting content. 
+ +Post review: The reviewer thinks that the authors did a thorough job of addressing the reviews. The results are interesting in several aspects, for example Fig 5 and 8. That said, regarding the question ""Do the authors have a suggestion how the sparse network extraction can be used for any kind of analysis or insight beyond showing the sparse network itself?"" The reviewer has doubts that non-ML expert end users (e.g. M.D.s) could make use of a sparse network as explanation mode (as this would assume that they can make sense of what a neuron has learnt). The reviewer updated his rating upwards. +",6,4.0,ICLR2021 +5QX06tv9adT,3,o3iritJHLfO,o3iritJHLfO,"Potentially valuable contribution to parallel TTS, with some concerns. ","Summary: +This paper presents BVAE-TTS, which applies hierarchical VAEs (using an approach motivated by NVAE and Ladder VAEs) to the problem of parallel TTS. The main components of the system are a dot product-based attention mechanism that is used during training to produce phoneme duration targets for the parallel duration predictor (that is used during synthesis) and the hierarchical VAE that converts duration-replicated phoneme features into mel spectrogram frames (which are converted to waveform samples using a pre-trained WaveGlow vocoder). The system is compared to Glow-TTS (a similar parallel system that uses flows instead of VAEs) and Tacotron 2 (a non-parallel autoregressive system) in terms of MOS naturalness, synthesis speed, and parameter efficiency. + + +Reasons for score: +Overall, I think the system presented in this paper could be a valuable contribution to the field of end-to-end TTS; however, from a machine learning perspective, the contributions are incremental and quite specific to TTS. In addition, I have some slight concerns about the clarity of the presentation that made it harder to understand the (fairly simple) approach and its motivation than I’d expect from an ICLR paper. Finally, the quality of the speech produced by the system is only evaluated on a single dataset and uses only 50 synthesized examples in the subjective ratings. For these reasons, I feel this paper would be a better fit for a speech conference or journal after addressing the evaluation and presentation issues, but I would still support acceptance if other reviewers push for it and my concerns are addressed. + + +High-level Comments: +* The speed, parameter efficiency, and MOS results are quite promising. However, when considering the Glow-TTS paper (which this seems like a direct followup to), the system improvements seem quite incremental (replace flows with HVAEs and replace the monotonic alignment search with soft attention plus argmax). +* Incremental system improvements are great if they result in significant improvements that are demonstrated through rigorous experiments, however, compared to Glow-TTS, the experiments are not nearly as comprehensive and convincing. Listening to a few of the audio examples provided in the supplemental materials, I don’t get the sense that the audio quality is significantly better than that of Glow-TTS as is suggested by the MOS numbers (BVAE-TTS sounds a bit muffled to my ears relative to Glow-TTS). +* Since this system uses the same deterministic duration prediction paradigm as Glow-TTS (and other parallel TTS systems), it suffers from the same duration averaging effects and inability to sample from the full distribution of prosodic realizations. 
+* The motivation would be made clearer if you were more specific early on about the potential advantage of VAE's relative to flows however you want to describe it (parameter efficiency, more flexible layer architectures, more powerful transformations per layer, etc.). +* I'd recommend providing similar motivation for using dot-product soft attention plus straight-through argmax instead of Glow-TTS's alignment search or other competing approaches. Is it because it's a superior approach or just because it's different from existing approaches? + +Detailed Comments: +* Section 2: I don’t believe Tacotron is actually the *first* end-to-end TTS system. Maybe it was the first to gain widespread attention, but I know that char2wav (if you count that as e2e TTS) preceded it chronologically in terms of first arxiv submission date. +* Section 2: The Related Work section is fairly redundant with information that is already presented in the introduction. It might be worth combining the two sections. This should free up space for additional experiments, explanations, or analysis. +* Section 4.1: The first paragraph here was quite confusing upon a first reading. I had to read the second sentence (“Via the attention network…”) many times to understand what was being described. +* Section 5.2: I’m curious how you arrived at a sample temperature of 0.333. Was this empirically tuned for BVAE-TTS or in response to Glow-TTS’s findings? +* Section 5.2, “Inference Time”: It seems important to include details about the hardware platform used to gather the speed results. +* There are minor English style and grammar issues throughout the paper that make the paper slightly more difficult to read. Please have the paper proofread to improve readability. + +Update (Nov 24, 2020): +After reading through the author responses and the updated version of the paper, I feel like a sufficient number of my concerns have been addressed to increase my score to 6. Specifically, the motivation has been made clearer, the related work section is no longer redundant with the intro, and the authors gave an adequate explanation about the necessity of their attention-based alignment method. ",6,5.0,ICLR2021 +ryecO-JoYS,1,rkeYL1SFvH,rkeYL1SFvH,Official Blind Review #1,"The paper presents WikiMatrix, an approach to automatically extract parallel sentences from the free text content of Wikipedia. The paper considers 1620 languages and the final dataset contains 135M parallel sentences. The language pairs are general and therefore the data does not require the use of English as a common language between two other languages. + +To evaluate the quality of the extracted pairs, a neural machine translation system has been trained on them and tested on the TED dataset, obtaining good results in terms of BLEU score. + +The article provides information on the system used to extract parallel sentences and opens up different directions for future investigations. + +The dataset seems, from the given details, useful. However, without access to the data and, more importantly, extensive testing of it, it is difficult to say how and how much it would help the advancement of the field. For the moment it seems to be good. However, I am not really sure that this paper could be of interest to a wide audience, except for those involved in machine translation. + +In general, the article describes everything at a high level, without going into the real details. 
+An example of this is on page 6, section 4.2, where the article says that its purpose is to compare different mining parameters, but I do not see any real comparison. Some words are spent for the mining threshold, but there is no real comparison, while other possible parameters are not considered at all. + +For this reason, I would tend to give a low score, which does not mean that the dataset is not good. It means that the real content of the paper seems to me to be too little to be published at ICLR, since the paper only informs about the presence of this new dataset, saying that it contains a large number of sentences and seems to allow good translations based on the results of a preliminary test. + +Typos: +- on page 9 ""Aragonse"" +- on page 9, end penultimate line, the word ""for"" is repeated.",6,,ICLR2020 +vI378HWJ0fQ,4,ECuvULjFQia,ECuvULjFQia,A novel approach to predict labels of dynamical systems,"This paper proposes a learning framework for predicting the labels of dynamic systems. Unlike existing model-based approaches and model-free approaches, the proposed model takes a middle ground and uses a knowledge distillation-based framework. It uses a teacher model to learn to interpret a trajectory of the dynamic system, and distills target activations for a student model to learn to predict the system label based only on the current observation. + +Experimental results on both synthetic and simulated datasets confirm the effectiveness of the proposed framework. + +Pros: +1. The paper studies an important problem. Predicting the behavior of a dynamic system has many applications. + +2. The proposed model is interesting and may lead to a series of follow-up studies that leverage the strengths of both model-free and model-based methods using knowledge distillation techniques. + +Cons: +1. The baseline models are quite simple. There are stronger baselines as noted by the authors. While the proposal is a learning framework, it might still be worth customizing and comparing it with state-of-the-art models in specific problems. + +2. The presentation of the paper can be improved. It would be good to add a running example to explain the various concepts and definitions used in the paper. + +Additional comments: +Typo: ""a teacher networks"" => ""a teacher network""; ""using only using"" => ""using only"" + +**Update after author response:** I appreciate the authors' efforts to address my comments. The new version reads better. However, I am still not entirely convinced by the choice of the simple baselines. Since a positive rating is already given, I would keep it unchanged. ",6,3.0,ICLR2021 +Bkxb6KTpYH,2,B1xpI1BFDS,B1xpI1BFDS,Official Blind Review #1,"Summary +======== +This paper tackles the more realistic variant of few-shot classification where a large portion of the available data is unlabeled, both at meta-training time in order to meta-learn a learner capable of fast adaptation, as well as meta-test time for adapting said learner to each individual held-out task. + +Their approach is based on TapNet, a model which constructs a task-specific projection based on the support set of each episode, and then classifies each query according to its distance to each class prototype in the projected space. The projection is computed so that the class prototypes of the episode (averaged support examples) are well-aligned with a set of meta-learned references. 
Intuitively, those references learn to be far from each other, so that aligning the prototypes with them leads to a space where the episode’s classes are well separated, allowing for easier classification between them. + +They then extend this model to incorporate unlabeled examples by performing task projection as follows: 1) the projection is computed so as to align the initial prototypes (computed only using labeled examples) to the meta-learned references. 2) In that projected space, each unlabeled example is assigned a predicted label based on its proximity to the projected prototypes. 3) Then, back at the original space, those predicted labels of the unlabeled examples are used to refine the class prototypes (weighted average as in Ren et al). 4) The projected space so that the *refined* prototypes are best aligned with the meta-learned references. 5) Possibly repeat 2-4 (at meta-test time). + +Experimentally, they outperform recent approaches to semi-supervied few-shot classification on standard benchmarks, though not by far. Perhaps more interestingly, their performance degrades less than that of Ren et al as the distractor ratio increases, and they show that their method benefits from additional steps of task adaptation, whereas that of Ren et al reaches its performance limits after the first step of soft clustering. + +High-level comments +================== +A) Ablation: An interesting ablation would be: instead of going back and forth between the embedding space and the projected space, the task adaptation happens only in the initially-computed projection space (i.e. the one computed based on the labeled data only). This would amount to: computing the projection space, and then performing a few steps of soft clustering, similar to Ren et al. in that space. This would help determine how beneficial it is to re-compute that projection space according to the current ‘best guess’ of where the class prototypes lie at each iteration. The way I see it, it is an empirical question whether the initially computed projection space already sufficiently separates the classes or not. I assume this would also lead to a more computationally efficient solution? + +B) Handling distractors: In the case of distractors, they use an additional centroid (and a corresponding additional reference vector) for the purpose of ‘capturing’ the unlabeled examples that don’t belong to candidate classes. I find the initialization of this strange: this additional centroid is computed as the mean of all unlabeled examples, and the initial projection construction is influenced by a term that matches this centroid to a corresponding reference. This would mean, however, that even the unlabeled examples which do indeed belong to one of the candidate classes end up far from those classes in the projected space, in order to be close to the designated extra reference in that space. We know that this is not ideal, since we assume that some unlabeled examples do belong to the same classes as the labeled ones. Is there a way to quantify how severely this affects the quality of this initial projection? I would also be curious about the meta-learned location of the extra reference. Does it end up being roughly in the center of the references corresponding to the labeled classes? + +C) Inference-only baselines. Ren et al. experimented with inference-only baselines: meta-learning happens only using the labeled subset, and the proposed clustering approach only takes place at meta-test time. 
In this case this would amount to meta-training a standard TapNet and then performing the proposed refinement only in test episodes. This is interesting as it allows to understand the importance of learning an embedding end-to-end for being more appropriate for unlabeled example refinement. It is not obvious that this is required, so I would be curious to see these results too. (This differs from the reported TapNet baseline in that at meta-test time it would make use of the proposed semi-supervised refinement). + +Clarity / quality of presentation: +============================ +D) A lot of emphasis is placed on the ability of the proposed method to control the degree of task conditioning. I would like to emphasize that this is not something that previous methods lack. Ren et al.’s approach could also perform multiple steps of clustering for example. Whether or not this is beneficial is an empirical question, but I wouldn’t say that the proposed method does something special to “control” how much adaptation is performed. + +E) It would have been useful to have a separate section or subsection that explains TapNet, since this is the model that the proposed method is built upon. Instead, the authors blend the explanation of TapNet in the same section as their method which makes it hard to understand the approach and to separate the contribution of this method from TapNet. + +G) The length of the paper is over the recommended pages, and I did feel that a few parts were too lengthy or repetitive (e.g. the last paragraph of the introduction described the model in detail. I felt that it would have been more appropriate to give the higher level intuition there and leave the detailed description for the next section). +",6,,ICLR2020 +daIPJiREk6T,1,IbFcpYnwCvd,IbFcpYnwCvd,"The paper presents a good idea towards using RL to learn sub-policies which can be composed to accomplish high-level tasks specified by Finite State Automata (FSA). However, the technical components of the paper need be clarified, and in some places fixed.","Summary of the work: +The authors propose the Logical Options Framework (LOF) --- a framework for reasoning over high-level plans and learning low-level control policies. This framework uses Linear Temporal Logic (LTL) to specify properties (high-level tasks) in terms of propositions. The authors propose a framework in which a separate sub-task policy is learnt to accomplish each such sub-task proposition. These low-level control policies may be reused, without further training, to accomplish new high-level tasks by performing value-iteration in the proposed Hierarchical SMDP. Experimental results demonstrate the method’s effectiveness for several tasks and experimental domains. + +Quality and Clarity: +The paper does a good job at intuitively describing the value that the proposed “LOF” framework provides (satisfaction, optimality, and composability). In general, the language used in the paper is clear and easy to read. +However, some of the technical aspects in the paper are difficult to follow concretely. There are various vague statements and definitions throughout the paper. Typos, inconsistencies, and missing explanations/discussion of seemingly important notions make some of the presented ideas imprecise. This raised several questions from me on the general applicability of the framework (as it is presented in this paper) beyond the presented experimental tasks, and on some of the specifics of Theorem 3.1. 
Given that the focus of this paper is on the development of a new framework for RL, fixing these issues is crucial. I have included a more detailed list of feedback for the authors at the end of the review.

Originality and Significance:
The proposed method defines “logical options” – sub-task policies whose goal is to trigger propositions that cause transitions in the automata-based description of the high-level task. Temporally extended tasks may then be accomplished by deploying the appropriate sub-task policy at the appropriate FSA state. This idea seems very similar to learning with Reward Machines. This similarity is acknowledged in the paper. The main conceptual difference between LOF and RMs is that LOF proposes to learn policies that trigger transitions in an automaton, whereas learning with RMs instead learns separate policies for each state of the automaton. This difference means that in the LOF framework, previously learnt sub-task policies can be reused by composing them to complete new high-level tasks. This is not true for RMs. However, the policies learnt by LOF are not guaranteed to be optimal unless certain conditions are met.

This is a very good general idea, and as far as I am aware other works have not studied how one might compose previously learned sub-task policies to accomplish new tasks described by automata in this way. However, I am concerned that the significance of the work might be limited to the specific examples of the paper.

For example, by associating a cost with each safety proposition, instead of forming the safety FSA associated with the safety property, it seems that the only safety properties that can be expressed by LOF are those of the form: “avoid these states”. Because this is a significant limitation in comparison with all possible safety properties, an explicit discussion of the safety properties that can be represented by LOF would help the reader to understand exactly what problems LOF can solve.

Furthermore, it is unclear how the composability results change in the presence of new and/or changed safety properties. I elaborate further on this point at the end of the next section of the review.

Questions surrounding theorem 3.1
The main theoretical result of this paper is that under appropriate conditions, the hierarchical policy learned through the LOF framework will be equivalent to the optimal policy (optimal when planning is allowed with respect to low-level actions in the FSA-MDP product).

The proof follows the logic that given the appropriate reward functions, the option policies will always reach the states corresponding to their sub-goals, and that the meta-policy will select a sequence of sub-goals that always reaches the FSA goal state.
The proposed method assumes that if an option corresponding to a particular sub-goal is selected by the meta-policy, then no other sub-goal proposition will be triggered before that option is complete. My concern with this assumption is as follows: what happens in scenarios in which an option policy passes through a state associated with a different sub-goal before reaching its own sub-goal? Would it then be possible for the meta-policy to select an option corresponding to a particular sub-goal, but for a different sub-goal proposition to be triggered first, causing an unwanted transition in the FSA?

For example, suppose in the delivery domain we could complete a task by either of the following sequences of sub-goals: “a then b”, or “b then a then c”.
If “a then b” is less costly than “b then a then c”, the optimal meta-policy should result in “a then b”. Furthermore, assume that “b” lies directly between “a” and the initial state of the agent. The optimal policy for sub-goal proposition “a” would be to move directly to “a”, which would cause it to first pass through “b”. This would cause the agent to be forced to follow “b then a then c” even if the meta-policy first chose proposition “a”. Meanwhile, the optimal policy allowing low-level planning in the FSA-MDP product space would be to move to “a” while actively avoiding “b”, and then to proceed to “b”. + +Also, it is unclear whether consideration of obstacles and safety propositions are included in the consideration of theorem 3.1. It is unclear what the reward functions corresponding to the “goal-conditioned policy” is. If R_S is included in the reward function used to train sub-task policies, then these policies will learn to avoid obstacles. When we compose these sub-task policies for a new task with potentially different safety properties, the policies we learned previously could be sub-optimal in the new scenario. + +Experimental results: +The experimental results demonstrate the effectiveness of the framework for the chosen experimental tasks. I like the videos provided in the supplementary materials; they are nice visualizations of how the sub-policies are strung together to complete the overall task. + +From the paper’s description, it is unclear how LOF-QL is implemented. How does this algorithm make use of the liveness FSA if it does not have access to the FSA’s transition function? + +Also, I am concerned that the comparison against RMs may not be fair. As stated in the paper, the RM-based method is only rewarded when the final RM state has been reached. Conversely, the reward for the LOF method is much denser. I do not see why the RM-based method could not also be rewarded for each “good” transition. + +Finally, the experiments include the “can” proposition which represents when one of the sub-goals has been canceled. It is unclear how the labeling function, which is defined to map specific MDP states to individual propositions, could return this proposition. If the proposition is returned randomly, the formalisms of the labeling function need to be altered to include these types of randomly occurring propositions. + +Pros + +> The general idea of the paper is good. I like the idea of pre-learning sub-task policies, and then of using automata-based task descriptions to find meta-policies that accomplish new tasks without additional training. This is the “composability” described in the paper. + +>The authors do a good job of intuitively describing the benefits that this type of method could provide over competing algorithms. + +>The experimental tasks (particularly the videos of the tasks being completed) provide a nice visualization/demonstration of the ideas of the paper. + +Cons + +>The technical aspects of the paper are hard to follow concretely. + +>Some of the theory seems vague and the paper has various typos and inconsistencies. This could lead to reader misinterpretations and/or author mistakes. + +>The method’s applicability seems to be somewhat limited to specific types of tasks, without explicit discussion of exactly what types of tasks can be solved. + +>The questions raised surrounding theorem 3.1 need to be clarified. + +Further detailed feedback: + +>How are the logical options learned? In particular, how are T_{o_p}(s’|s) and R_{o_p}(s) learned. 
These are the components of the options that are subsequently used to find the metapolicy. Are they estimated by averaging values over rollouts of the learned policy? What if the environment is highly stochastic and their values vary greatly across different rollouts? + +>The assignment for T_{o_p}(s’|s) on line 11 of algorithm 1 seems to assume there is a fixed k in which p will be reached. Would this always be the case if the environment is stochastic? This also seems to be a different definition than is given in equation 2. + +>In equation 1, R_{o}(s) is defined as the expected cumulative reward of options given that the option is initiated in state s at time t and ends after k time steps; shouldn’t this make R_{o}(s) a function of k as well? + +>In the definition of T_{o}(s’|s), what is p(s’,k)? I assume this is the probability, under the option’s policy, of reaching state s’ after k timesteps from state s. If this is the case, p(s’,k) should be defined. + +>In line 9 of the Algorithm 1, T_{P}(s,p) = 1 is written. On line 11 of Algorithm 1, T_{P}(s) = p is written instead, but both have the same meaning: proposition p is returned by the labeling function from state s. + +>The liveness FSA’s transition function is defined and treated as a probability distribution. However, the automaton’s transitions appear to be deterministic in the presented examples. If there is a particular reason for it to be defined as a transition distribution, some discussion would be helpful for the reader. + +>In the definition of a Hierarchical SMDP, the transition function is defined as a cartesian product of the environment transition distribution, the labeling function, and the FSA transition function. The precise meaning of this notation is unclear. A brief description of how a transition proceeds would be very helpful for the reader to understand the order in which arguments are passed to these three functions. Does the environment transition first, and the proposition of the new environment state cause the FSA to transition? + +>In section 3.1 and in appendix A.1, the author writes that “if the optimal option policy equals the goal-conditioned policy, the policy will always reach the sub-goal”. This will not be the case if the environment is stochastic and it is possible for the agent, under ANY policy, to slip into a state from which the sub-goal is not reachable. + +>Point 2 of the contributions outlined in section 1.1 states that the options can be composed to solve new tasks with only 10-50 retraining steps. This wording is at odds with the abstract, which states that LOF’s learned policies can be composed to satisfy unseen tasks with no additional learning. + +>In Section 3 “These may only appear in the liveness property and there may only be one instance of each subgoal.” The second half of this sentence is unclear. Does this mean that the inverse image of each sub-goal proposition through the labeling function is a singleton set containing only one state? + +>In section 3 “Safety propositions are propositions such as ‘the agent received a call’ that the agent has no control over or does not seek out.” This sentence is unclear. It would help if you could express what it means for the agent to “have control over” or to “seek out” a proposition in terms of the MDP states, actions, and proposition labeling function. 
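For reference, the standard option-model quantities from the options framework (Sutton et al., 1999) — which I assume Eqs. 1-2 are meant to instantiate, though the paper should state this explicitly — are

$$R_{o}(s) = \mathbb{E}\left[\, r_{1} + \gamma r_{2} + \dots + \gamma^{k-1} r_{k} \,\middle|\, \text{option } o \text{ initiated in } s \,\right], \qquad T_{o}(s' \mid s) = \sum_{k=1}^{\infty} \gamma^{k}\, p(s', k \mid s),$$

where $k$ is the (random) number of steps until the option terminates and $p(s', k \mid s)$ is the probability, under the option's policy, that it terminates in state $s'$ exactly $k$ steps after being initiated in $s$. If the paper intends different (e.g. undiscounted or fixed-horizon) models, that should be made explicit, since several of my questions above hinge on it.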
+",4,5.0,ICLR2021 +HJ6BKc6Qg,2,ryF7rTqgl,ryF7rTqgl,"Interesting problem, but technically and experimentally not solid enough","This paper proposes to use a linear classifier as the probe for the informativeness of the hidden activations from different neural network layers. The training of the linear classifier does not affect the training of the neural network. + +The paper is well motivated for investigating how much useful information (or how good the representations are) for each layer. The observations in this paper agrees with existing insights, such as, 1) (Fig 5a) too many random layers are harmful. 2) (Fig 5b) training is helpful. 3) (Fig 7) lower layers converge faster than higher layer. 4) (Fig 8) too deep network is hard to train, and skip link can remedy this problem. + +However, this paper has following problems: + +1. It is not sufficiently justified why the linear classifier is a good probe. It is not crystal clear why good intermediate features need to show high linear classification accuracy. More theoretical analysis and/or intuition will be helpful. +2. This paper does not provide much insight on how to design better networks based on the observations. Designing a better network is also the best way to justify the usefulness of the analysis. + +Overall, this paper is tackling an interesting problem, but the technique (the linear classifier as the probe) is not novel and more importantly need to be better justified. Moreover, it is important to show how to design better neural networks using the observations in this paper. + +",4,4.0,ICLR2017 +u5VDDAi_qVt,1,ZcKPWuhG6wy,ZcKPWuhG6wy,"Relevant topic, interesting ideas; proposal and methods can be polished","## Edit after authors' responses + +I have upgraded my score (from 4 to 5) based on the clarifications provided by the authors and the updated manuscript. Please see the details in my extended comments: https://openreview.net/forum?id=ZcKPWuhG6wy¬eId=V7Wy0Mpsz7Q + +## Summary of the paper + +This paper proposes two metrics, affinity and diversity, for assessing the value and contribution of data augmentation strategies (single transformation or combinations of them). Given a model trained on a clean data set (without data augmentation), the affinity of an augmentation strategy is defined as the difference between the accuracy of the model on augmented validation data, minus the accuracy on clean validation data. The diversity of an augmentation strategy is defined as the final training loss of a model trained with data augmented according to that strategy. The paper presents an empirical analysis of the affinity and diversity of a set of image augmentation strategies evaluated on a network architecture trained on CIFAR-10 and one trained on ImageNet. The main conclusion is that the contribution to a model's performance of an augmentation strategy is predicted by its joint affinity and diversity, but not separately. + +## Summary of merits and concerns + +### Merits + ++ The paper's overarching motivation of quantifying the usefulness of data augmentation strategies is interesting and definitely important, given the renewed interest in data augmentation by the machine learning community. ++ The proposal of quantifying the data distribution shift and complexity (affinity and diversity, respectively) introduced by an augmentation strategy is reasonable, interesting and well motivated. 
++ The introduction of the problem and motivation for the proposal (Introduction) is comprehensible and interesting and the review of related work is exhaustive and relevant. This part of the paper is very well written and I quite enjoyed reading it. ++ In the rest of the paper, the concepts and definitions introduced are easy to understand and the methods employed for empirically assessing the affinity and diversity are generally clear. + +### Concerns + +- I see some issues in the specific definition of affinity and diversity. Summarised (extended below), first, while the affinity can be easily computed for any augmentation strategy and pre-trained model, the diversity is computationally costly as it requires re-training the model; second, in their current definition, the dependence on the specific model and data set makes the metrics hard to compare across models and data sets; third, affinity and diversity are defined in very different ways, one in terms of the accuracy of a pre-trained model, the other in terms of the training loss. +- Although generally clear, the methodology employed falls short at demonstrating the contributions stated in the introduction and portraying a complete picture of how affinity and diversity can be used to assess the value of data augmentation strategies. We do gain some insights, but I have some concerns about the methodology and some important questions remain open. +- One motivation for the introduction of metrics to quantify the merit and mechanisms of data augmentation strategies is that the reasons why cutout, SpecAugment and mixup work so well are not well understood. Another one is that (so-called) automatic data augmentation strategies, such as AutoAugment, are hugely computationally expensive. However, the paper does not really discuss how affinity and diversity explain the mechanisms of these augmentation strategies and how they can be used to discover new strategies more efficiently. +- The presentation of the results, especially in the figures, can be improved, in my opinion, and hinders the clarity of the paper. + +## Evaluation and justification + +While acknowledging the merits and contributions of the paper, my concerns outweigh the positive aspects of the paper and hence my recommendation of rejection. I will discuss next in more depth these concerns in order to better justify my recommendation and with the intention to provide constructive feedback for potential subsequent work on the paper. + +### Definitions of affinity and diversity + +The definitions of affinity and diversity proposed in this paper are intuitive, easy to understand and reasonable. Furthermore, as stated above, I agree that quantifying the concepts and intuitions behind these metrics is an important contribution. In fact, the main result that the value of an augmentation strategy depends on both its affinity and diversity matches my (and the author's) intuitions and expectations. However, I have several concerns about the specific way these quantities are defined and computed in the paper. + +First of all, since one of the motivations for proposing such metrics is to more efficiently discover new augmentation strategies and assess the existing ones, I think that an important feature of the metrics should be the efficiency and ease of computation. However, computing the diversity of an augmentation strategy for a model requires training the model end-to-end. 
This is not the case of the affinity, which can be computed for any pre-trained model, and I think a useful definition of the diversity should achieve the same goal. + +Second, in the way affinity and diversity are defined in the paper, it is hard to compare augmentation strategies across models and data sets, even though they are identical, that is an image rotation is applied in the same way on CIFAR-10 and on ImageNet; on ResNet and on DenseNet. Thus, it would be desirable if comparisons across models and data sets would be possible. This mismatch is in fact reflected in the presentation of the results, hindering the clarity. For example, in Figures 3b and 3c, the colour codes represent very different things as the range varies in one case from -50 to +7 and in the other case from -70 to +0.6. Green dots in Figure 3b (CIFAR-10) represent augmentation strategies that improve the accuracy with respect to the baseline, while green dots in Figure 3c (ImageNet) correspond to strategies that hinder the performance. Furthermore, the range of affinity and diversity also differs greatly between the two plots. This has to do in part with a likely suboptimal way of presenting the results (more about this below), but also with the fact that the affinity and diversity in ways that are not comparable across models and data sets. + +One reason why the metrics are not comparable is that they are defined in absolute terms, without any normalisation that cancels the dependence on specific aspects of the model and data set, such as the loss, the number of classes (which determines the loss and accuracy), etc. Moreover, the authors chose to define the affinity and diversity in terms of the (top-1) accuracy and the (cross-entropy) loss, respectively. However, other researchers that may wish to use these metrics in the future might prefer to assess their models using an alternative metric, such as the top-5 accuracy, commonly used for ImageNet, or trained their models with a different loss. Such choices would also affect the interpretation of affinity and diversity and complicate even further the comparisons. + +A suggestion would be to define the metrics in relative terms instead. Without claiming that these suggestions would be optimal, the affinity could be computed, for instance, as the accuracy on the augmented data divided by the baseline accuracy on the clean data (and optionally multiplied by 100 to turn it into a percentage): $\mathcal{T} = A(m, D') / A(m, D) * 100$. This would represent the fraction of the percentage of the accuracy obtained by testing on augmented images. These would reduce the dependence on the specific metric (accuracy) and on the characteristics of the model and data set. This has been used for instance in [[1]](#references) to also compared the contribution to performance of models trained with different data augmentation strategies. + +Third, affinity and diversity are defined in very different ways. Intuitively, the concepts that these quantities aim to represent are both related to the data distribution: affinity is related to the shift in the data distribution introduced by an augmentation strategy; diversity is related to the complexity in the distribution introduced by the augmentation. However, the former is defined in terms of the accuracy of a pre-trained model with respect to a baseline, the other in terms of the absolute training loss achieved by training with the augmentation. These are very different, unrelated quantities. 
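To make the asymmetry concrete, here is a minimal sketch of how I understand the two metrics are computed (my own pseudo-code, not the authors'; `model_accuracy`, `train_from_scratch` and `augment` are placeholder callables that the reader would supply):

```python
def affinity(model_accuracy, clean_val, augment):
    """Distribution-shift metric: difference in validation accuracy of a FIXED,
    pre-trained model when evaluated on augmented vs. clean data.
    Cost: two evaluation passes, no training."""
    return model_accuracy(augment(clean_val)) - model_accuracy(clean_val)


def diversity(train_from_scratch, clean_train, augment):
    """Complexity metric: final training loss of a model trained end-to-end
    with the augmentation. Cost: one full training run per augmentation."""
    return train_from_scratch(augment(clean_train))
```

Written this way, the first point above (cost asymmetry) and the third one (an accuracy difference vs. an absolute loss value) are immediately visible, and they also illustrate why the two numbers are hard to compare across models and data sets.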
Again, without claiming that the following suggestion is optimal, one idea would be to define the diversity as measure of spread (standard deviation, variance). Have the authors considered defining the diversity along the lines of the variance of the affinity or of a related quantity? This would quantify the diversity of an augmentation strategy and at the same time remove the need to train a model end-to-end, as discussed in the first point. + +The authors briefly discuss entropy as an alternative measure of diversity, despite the problem of computing it for continuously-varying transformations. This is also an interesting direction which could be worth exploring. + +### Results do not demonstrate all contributions + +The authors list four main contributions of their paper in the introduction. The first one is the introduction of affinity and diversity as ""interpretable, easy-to-compute metrics for parametrizing augmentation performance"". This is satisfied, although I have discussed above some concerns about the definitions. + +The second contribution is that ""performance is dependent on both metrics"". This is indeed reflected by the results in Figures 3b and 3c. However, I should also note in this regard that despite the large number of augmentation strategies analysed, only two network architectures, each trained on one data set, are included in the experimental setup, and some differences between the two plots could already be discussed. The claim that the performance gain introduced by an augmentation strategy increases when both the affinity and the diversity increase is well supported by the results in Figure 3b (WRN on CIFAR-10), but this is not so clear in Figure 3c (ResNet on ImageNet). As a matter of fact, the relative test accuracy in Figure 3c seems to be higher (more yellow) as diversity decreases, rather than increases, and affinity increases. It would be desirable to obtain similar results on other architectures and data sets in order to gain more evidence to support this claim. + +Related to this point, I would like to note that having two sets of results (WRN on CIFAR-10 and ResNet on ImageNet), presented in Figure 3, the authors select one of them (WRN on CIFAR-10, Figure 3b), the one, out of two, that best supports the claim, for Figure 1 on the first page of the paper. This clearly introduces a selection bias that distorts the actual data (i.e. why not showing in Figure 1 the results on ImageNet?). Moreover, the range of the axes differs between Figure 1 and Figure 3b, and there is even a third version in Figure 7, at the supplementary material. I would appreciate it if the authors can comment on/clarify this. + +The third contribution claims to ""connect augmentation to other familiar forms of regularization"". This point is addressed in Section 4.2, where the authors evaluate the performance of the models trained with data augmentation after turning off the augmentation partway through training. Although the phenomenon that performance sharply improves after turning off regularisers partway is interesting, this has been observed in previous works (reviewed by the authors) and the analysis in this paper, through the lens of affinity and diversity, does not provide new, significant insights, in my opinion. 
Further, taking into account the claim in the list of contributions, the analysis offers little insight about the connection of data augmentation with other forms of regularisation, beyond noting that the ""slingshot effect"" occurs also with data augmentation, which had been observed before. + +I would like to draw the attention to certain aspects of the methodology in this section that might distort the conclusions. For example, the authors conclude from Figure 5b that ""For some poor-performing augmentations, [switching off the augmentation] can actually bring the test accuracy above the baseline"". However, I would like to note that the authors report that in order to obtain these results, they tested multiple switch-off points and select the one that yields the best accuracy. This introduces a clear bias in Figure 5b towards best-case scenarios, which might give the impression that the switch-off lift is a general effect or, in the best case, the magnitude of the effect will be magnified. In order to gain a more accurate picture of the phenomenon, all the available data should be considered and ideally a statistical analysis should be carried out. Another source of bias in the visualisation of the results is that in Figure 5c, ""Where Switch-off Lift is negative, it is mapped to 0 on the color scale"". + +On the other hand, the switch-off lift effect could be explained in simpler terms, at least partially. For example, the authors write ""Bad augmentations can become helpful if switched off"". First of all, this could occur by pure chance and be reflected in the reported results as an artifact of the selection bias pointed out in the previous paragraph. Second, we should simply think that it is expected that turning off a bad augmentation should improve the accuracy. As a matter of fact, the augmentation strategies used to illustrate this effect are `FlipUD(100%)` (I will assume this means vertical flip) and `Rotate(fixed, 20deg,100%)`. In both cases, with the augmentation on, the model does not see the original images, so an improvement is expected if suddenly the model does see them. If the final accuracy is actually above the baseline should be analysed through a statistical analysis, rather than focusing on the best case, which might be due to pure chance. Moreover, it is also worth questioning the accuracy on clean images is actually a good baseline, since this model sees fewer different images. + +Finally, the authors claim that ""Switch-off Lift varies with Affinity and Diversity"", from the results in Figure 5c. However, from this figure we observe that mainly, it varies with affinity. This makes sense: very low affinity is indicative of unrealistic or at least odd augmentations, which are detrimental if performed during the whole training procedure. If turned off, the model has time to fine tune on the actual images. + +The temporal dynamics of training neural networks and its relation to regularisation is an interesting, active topic of research, but due to its complexity it should be analysed very rigorously in order to minimise the risk of leading ourselves astray. + +The fourth contribution claims that affinity and diversity informs that ""performance is only improved when a transform increases the total number of unique training examples"". This aspect is addressed in Section 4.3. The authors ""seek to discriminate this increase in effective dataset size from other effects"". 
Again, this is an important and interesting topic, as well as hard to analyse, but in my opinion the methods fall short at justifying the conclusions. Here, the authors simply trained models with static augmentation and found that ""For almost all tested augmentations, using static augmentation yields lower test accuracy than the clean baseline"", which is not surprising because the augmented images differ from the original validation/test distribution on which the model is evaluated. However, this observation does not prove that the gain provided by data augmentation, when it does improve the performance, is due to the increase in the effective training set size. To be clear, this is likely to be the case, intuitively, but should be proven differently. + +### Cutout, SpecAugment, mixup are not discussed in terms of affinity and diversity + +Gaining insights about the mechanisms that make some data augmentation techniques (Cutout, SpecAugment, mixup) work better than others would be an interesting contribution. In the introduction, the authors mention this as a motivation for proposing the affinity and diversity metrics. However, although (some of) these augmentation strategies are included in the experiments, there is no specific discussion about them anywhere in the paper. Having mentioned this in the introduction as a motivation for the paper, I did miss a discussion that provided new insights about how these methods work and when. + +Similarly, the authors motivate the proposal of affinity and diversity as a way to better understand the mechanisms of data augmentation and discovering new techniques more efficiently. However, beyond presenting the empirical results in Section 4, the paper does not further discuss use cases of affinity and diversity to efficiently assess the value of new techniques, or analyse commonly used strategies in terms of these metrics. For example, a widely used data augmentation strategy in computer vision is the combination of horizontal flips and vertical and horizontal translations of about 10 % of the height and width. This has been found to provide large performance gains [[1, 2, 3]](#references), while additional transformations only marginally improved the accuracy. Knowing the effectiveness of this simple augmentation strategy, it would also be interesting to analyse its affinity and diversity. + +### Visualisation of results + +Although this aspect has not been decisive in my evaluation of the paper, I think there is room for improvement regarding the visual presentation of the results and hopefully the following feedback, from a careful read of the article, may help make the paper stronger. + +- Figure 3a: it would help to more clearly specify, perhaps directly in the plots, that the top row corresponds to CIFAR-10 and the bottom row to ImageNet. +- Figures 3b and 3c: I have commented above on this figure specifically, about the possibility of changing the definitions of affinity and diversity that would improve the comparison across models and data sets and hence the interpretability of these figures. In any case, a confusing aspect of these figures is that the colour codes define different ranges of values in each plot, which have semantically very different interpretations. For instance, light green values on 3b correspond to augmentations that improve the accuracy with respect to the baseline, while the same colour on 3c correspond to augmentations that perform worse than the baseline. 
Given that there is a clear central point in the colour code, zero, where the semantic interpretation changes (positive vs. negative), I would strongly suggest to use a [perceptually uniform diverging colour palette](https://seaborn.pydata.org/tutorial/color_palettes.html#perceptually-uniform-divering-palettes). +- Figure 4: I would suggest to colour-code the dots to reflect the probability of rotation. It would reduce the cognitive load to interpret the figure. +- Figure 5a: the legends could be placed inside the axes to make space for larger figures +- Figure 5b: indicate what the dash line represents + +## Questions + +The questions I list below are mainly intended to raise awareness about relatively minor aspects of the paper that remained unclear to me while reading it, with the aim of providing constructive feedback to potentially improve the manuscript. The aspects that have been more decisive in my decision have been already commented above. + +- ""Random crop was also applied on all ImageNet models."": Why is not random crop considered data augmentation in this paper? +- How would the authors explain the strange variation of diversity with respect to the probability of rotation in Figure 4 centre? Is this a general behaviour in other augmentation strategies? Is the affinity as clear in other cases? +- Figure 5c: how is it possible to get an improvement (Switch-Off Lift) of 50 %, while in Figure 5b all cases show smaller variation? +- The specific version of the wide residual network reported in the paper is ""WRN 28-2"" However, 28-2 is not described in the original WRN paper, and is also not described in the github repository of AutoAugment. I suppose that it should instead read WRN 28-10. Assuming this is the case, I have an additional question: WRN 28-10 achieves, according to my own implementation, around 91.5 % accuracy on CIFAR-10 without any data augmentation, using the hyperparameters and regularisation of the original paper. If I interpret correctly Figure 5b, the baseline accuracy achieved by the authors is 89.7, which is significantly lower. I would appreciate it if the authors could clarify what I may have misunderstood. +- In Section 4.3, what is the goal of comparing models trained with static vs dynamic data augmentation? How would models trained with static augmentation be better or have larger diversity? +- ""transforms and hyperparameters from the AutoAugment search space [...] implicitly have high Affinity"": this is not what we see in the Figure 3b and Figure 7. Could the authors clarify this statement? + +## Minor comments and potential typos identified + +- ""some have proposed that augmentation strategies are effective because they increase the diversity of images seen by the model"": any reference where this claim is made? +- Make sure that ""dataset""/""data set"" are spelled consistency throughout the article. +- Would the authors venture any guess about the affinity and diversity of strategies in other data domains? ++ I appreciate that the authors report (Section 3) details about how the test results were computed, including the standard errors +- ""static training"" (Section 3): Do the authors mean ""static augmentation""? Also, consider giving an example for better illustration. +- It is slightly confusing that the augmentation function is denoted by $a$ in Definition 1, which is an unusual choice for a function. Would a capital letter be a better choice, since augmentations are generally stochastic functions? 
+- ""KL divergence of the shifted data with respect to the original data"": consider specifying this mathematically too +- What is the reason for the capitalisation of _Affinity_ and _Diversity_? +- Typo: ""model-dependant"" (Section 3.1) +- The gap between Figure 5's caption and the paragraph seems to have been manually reduced. If this is the case, it may be against the formatting guidelines and, especially, it hinders the readability. +- Some data augmentation strategies are mentioned without previously introducing them. For example, `FlipUD` in Section 4.2 + +## References + +[1] Hernández-García, Alex, and Peter König. ""Data augmentation instead of explicit regularization."" arXiv preprint arXiv:1806.03852, 2018. + +[2] Goodfellow, Ian, et al. ""Maxout networks."" International conference on machine learning. PMLR, 2013. + +[3] Springenberg, Jost Tobias, et al. ""Striving for simplicity: The all convolutional net."" arXiv preprint arXiv:1412.6806, 2014.",5,5.0,ICLR2021 +qSfQeDT7Mj9,3,6DOZ8XNNfGN,6DOZ8XNNfGN,Good contribution but experiments are not convincing ,"This paper proposes a new graph traversal method by introducing a compact adjacency matrix and applying stochastic methods to sample nodes.  The proposed graph traversal can be applied in conjunction with graph neural networks such as GCN, Deepwalk,..etc, and ideas proposed in the paper would interest and influence other researchers in the community. An advantage is that the proposed method shows improvements in speed and memory usage. Overall, The paper is well written with both theoretical and experimental evaluations. + +Pros: +a.) The proposed method well-developed and supported with theoretical analysis + +b.) The proposed sampling method can be combined with other existing methods. The authors discuss the applicability of the proposed method with existing methods. + +Cons: +a.) One limitation is that experiments are not very convincing since there are only a few baseline methods that are compared with the proposed method.  For instance, It would be better to show performances of methods like GraphSAGE, cluster GCN, and other methods (at least method in the table in Section 2) applied to all datasets. Showing different selections of baseline methods make it hard to understand the true performances of the proposed method. Can the authors provide more detailed comparisons with different baseline methods? + +b.) The performances (e.g. semisupervised node-classification) shown are not as competitive as performances achieved by other recent methods (e.g. GCNII). Can the authors add more experimental results to the paper? + +I raise my rating based on the additional experimental results given.",7,2.0,ICLR2021 +S1IjqJvxf,2,SkPoRg10b,SkPoRg10b,"Interesting set of ideas and direction, but lack of quantitative analysis supporting the results."," +This papers provides an interesting set of ideas related to theoretical understanding generalization properties of multilayer neural networks. It puts forward a qualitative analogy between some recently observed behaviours in deep learning and results stemming from previous quantitative statistical physics analysis of single and two-layer neural networks. The paper serves as a nice highlight into the not-so recent progress made in statistical physics for understanding of various models of neural networks. 
I agree with the authors that this line of work, that is not very well known in the current machine learning community, includes a number of ideas that should be able to shed light on some of the currently open theoretical questions. As such the paper would be a nice contribution to ICLR. + +On the negative side, the paper is only qualitative. The Very Simple Deep Learning model that it introduces is not even a model in the physics or statistics sense, since it cannot be fit on data, it does not specify any macroscopic details. I only saw something like that to be called a *model* in experimental biology papers ... The models that are reviewed in the appendix, i.e. the continuous and Ising perceptron and the committee machine are more relevant. However, the present paper only reviews existing results about them. And even in that there are flaws, because it is not always clear from what previous works are the results taken nor is it clear how exactly they were obtained (e.g. Fig. 2 (a) is for Ising or continuous weights? How was it computed? Why in Fig. 3(a) the training and generalization error is the same while in Fig. 3(c) they are different? What exact formulas were evaluated to obtain these figures?). + +Concerning the lack of mathematical rigour in the statistical physics literature on which the authors comment, they might want to relate to a very recent work https://arxiv.org/pdf/1708.03395.pdf work that sets all the past statistical physics results on optimal generalization in single layer neural networks on fully rigorous basis by proving that the corresponding formulas stemming from the replica method are indeed correct. +",6,5.0,ICLR2018 +fnOc03ACQbH,2,xYJpCgSZff,xYJpCgSZff,Interesting idea with comprehensive experiments,"This paper proposes a novel model integrating both causal inference and structure-aware counterfactual training to enhance the long-tail performances of information extraction. The causal mechanism considers a structured causal model that takes into account all possible cause-effect relations for the final predictions, including contexts, target representations, POS tags, NERs, etc. They also implement counterfactual training strategy by selecting the most important factors and wipe off the side effects to enhance the long-tail situations. + +The strengths of the paper includes: +1. In general, this paper is well-written and easy to follow. The motivation and the structure are clear. +2. The ideas of both structured causal model and structure-aware counterfactual training are interesting. +3. Extensive experiments are conducted to demonstrate the effect of the whole model and each component. It is interesting to see how different generation of counterfactual examples using dependency structure affect the final performance. + +Some improvements could be made: +1. If structure is considered, why not try to mask on some dependency relations? It will be interesting to see the difference between masking words and relations. +2. What is the effect of using (5) instead of (4) in terms of the experimental result? Have you conduct such comparison experiment? And how sensitive is $\alpha$ to the final performance? +3. 
It is also better to demonstrate some qualitative examples on which factors are most important for NER, RE and ED.",6,4.0,ICLR2021 +Tu1a3WkxQ14,3,#NAME?,#NAME?,Relevant and interesting focus on TM use in NMT; a number of issues to clarify,"This paper describes several improvements on using information from a Translation Memory (TM) in Neural Machine Translation (NMT). In the spirit of several prior work, the approach relies on 1) a retrieval step to obtain TM content that is related to the current source sentence to translate, 2) an encoder combining the source sentence with retrieved TM content, and 3) a decoder using the joint encoded information to produce a (target) translation. Experiments are conducted on benchmark French-English data, showing consistent improvement over classical baselines. + +Translation Memories are important Computer-Aided Translation (CAT) tools, likely the most widely used CAT tools by translation professionals and agencies. As such, it is important to study how they can be used to improve translation quality for example through inclusion in NMT. This study is therefore a welcome addition to the relatively limited work investigating this topic. I personally wish there was more work on integrating existing translation resources in MT. However, the experimental setup does not really correspond to a typical TM use. It essentially leverages the idea of reusing close matches to a source sentence in the training data. The idea is interesting, but only loosely related to TM. On the other hand, the experiments in Section 5.4 are much closer to a TM, it is unfortunate that these experiments are quite limited. + +Despite the general relevant and interesting focus of this work, there are a number of issues discussed below, related mainly to modeling and to the experimental evaluation. + +MODELLING: +Each of the three components on the method (retrieval, encoding, and decoding) introduces some novelty. There are also a number of issues to resolve. + +1) The retrieval (sec. 4.1) uses an n-gram matching technique, which is contrasted with the usual computation of sentence similarities (edit distance, fuzzy match or idf-based score). I basically don’t buy the advantages put forward in the paper: +- The cost of retrieval in a TM is dominated by the requirement to go over the entire memory for each source sentence, not by the computation of the score. The n-gram matching would still incur that cost, unless some smart way to retrieve matching sentences (such as an inverted n-gram index) is implemented. Unfortunately there is no detail at all in the paper about how the ngram matching is performed. +- The fact that one can retrieve N matched sentence pairs for each (x,y) pair in the training data is no different from retrieving the top-N sentence pairs using any of the usual similarity metrics. + +Additionally, when retrieving N>1 sentence pairs from the TM, it is not entirely clear how the N pairs are used. One interpretation of the second paragraph of 5.1 is that this actually yields N different training instances at training time, while one match is randomly picked at prediction time. This should be clarified. In addition, this would introduce additional randomness at prediction time, producing possibly different target translations. It would be good to assess the impact of this choice on the performance, and compare against the obvious choice of picking the closest match. + +2) The encoding is straightforward but clever. 
It is not entirely clear how the encoder keeps track of the split of the context into I+M+N (one assumes here that N is no longer the number of matched pairs, but the length of the TM-target context) — is it through propagating the separation marker, hardcoded in the encoder, through some other way? + +3) The decoding is done in four different manners, offering various ways to integrate the TM-target information into the prediction. However, the description of « TM-pointer Decoder » in Section 4.3 seems faulty: Eq. 11 shows how to get S_x^l through the self-attention mechanism and Eq. 10 illustrates the concatenation mechanism in the TM-concat decoder, they can’t help get g_t and the attention distribution vector a. + +EVALUATION: + +Strengths: ++ Shows consistent non trivial gains ++ Uses a large corpus, with several domains ++ Interesting « domain adaptation » mode shows good results [Sec. 5.4] + +Weaknesses: +- Limited to French-English (close languages with lots of cognates, lots of resources, high performance) — it would be interesting to show how this works on radically different languages, especially in a lower resource setting where an existing TM may greatly help. +- Limited comparison to a couple baselines. None of the methods cited in the related work is tested against. +- It is unclear what significance test was used, if any, to back the claim of « significantly outperforms » (e.g. end Section 5.4). + +Small typos and clarifications: +« by reuse existing » (Sec 1) +« nmt » (Sec 2) -> NMT +«  Formally, We » (Sec 4.2) +What is the sentence after Eq. 15 (« The p_copy… ») trying to tell us. This is not clear. +« we set the N to 10 » (Sec 5.1) +« set the size … is » (Sec 5.2) +« we valid the translation performance » (Sec 5.4) +",4,4.0,ICLR2021 +SJeqwyyuhX,1,B1e0KsRcYQ,B1e0KsRcYQ,The proposed representation method is tested on only one task and considers only one evaluation metric,"The paper proposes a second order method to represent images. More exactly, multiple (low-dimensional) projections of Kronecker products of low-dimensional representations are used to represent a limited set of dimensions of second-order representations. It is an extension of HPBP (Kim et al., ICLR 2017) but with codebook assigment. + +The main advantage of the method is that, if the number of projection dimensions is small enough, the number of learned parameters is small and the learning process is fast. The method can be easily used as last layers of a neural network. Although the derivations of the method are straightforward, I think the paper is of interest for the computer vision community. + +Nonetheless, I think that the experimental evaluation is weak. Indeed, the article only considers the specific problem of transfer learning and considers only one evaluation metric (recall@k). However, recent papers that evaluate their method for that task also use the Normalized Mutual Information (NMI) (e.g. [A,B]) or the F1-score [B] as evaluation metrics. +The paper does not compare the same task and datasets as (Kim et al., ICLR 2017) either. +It is then difficult to evaluate whether the proposed representation is useful only for the considered task. Other tasks and evaluation metrics should be considered. +Moreover, only the case where D=32 and R=8 are evaluated. It would be useful to observe the behavior of the approaches for different values of R. +In Section 3.2, it is mentioned that feature maps become rapidly intractable if the dimension of z is above 10. Other Factorizations are then proposed. 
How do these factorizations affect the second order nature of the representation of z? Is the proposed projection in Eq. (10) still a good approximation of the second order information induced by the features x? + + +The paper says that the method is efficient but does not mention training times. How does the method compare in terms of clockwork times compared to other approaches (on machines with similar architecture)? + +In conclusion, the experimental evaluation of the method is currently too weak. + + +[A] Hyun Oh Song, Stefanie Jegelka, Vivek Rathod, Kevin Murphy: Deep Metric Learning via Facility Location. CVPR 2017 +[B] Wang et al., Deep Metric Learning with Angular Loss, ICCV 2017 + +after rebuttal: +The authors still did not address my concern about testing on only one task with only one evaluation metric.",5,2.0,ICLR2019 +Bkg6NF0Wqr,3,rkgQL6VFwr,rkgQL6VFwr,Official Blind Review #3,"1. The paper aims to train a model to move objects in an image using language. For instance, an image with a red cube and blue ball needs to be turned into an image with a red cube and red ball if asked to ""replace the red cube with a blue ball"". The task itself is interesting as it aims to modify system behavior through language. + +The approach the authors take is to encode an image with a CNN, encode the sentence with an RNN and use both representations to reconstruct the frame (via a relational network and decoder) that solves this task. This process as described was already done in (Santoro 2017). The idea of using a CNN feature map and LSTM embedding to solve spatial reasoning tasks is not new. + +The main contribution is to add a discriminative loss to turn the problem into a ""is this solution correct or not."" This is interesting but does not perform much better than the baseline of not using the GAN loss (as suggested by the results in Table 2). This suggests that the GAN term is not adding as much value as the authors claim. + +2. Reject +- Reason 1: The results in Table 2 show that the GAN does slightly better (0.0134 vs 0.0144) in RMSE against the non-GAN version. This improvement does not seem statistically significant enough to warrant the added GAN complexity. +- Reason 2: Other baselines need to be considered, AE, VAE or other variations. +- Reason 3: No ablations on the impact of the parameters to eq 1. + +3. To improve the paper I suggest adding other baselines such as VAE, AE. In addition, consider using more negative samples instead of the single negative image.",1,,ICLR2020 +rke4auot2Q,1,S1EERs09YQ,S1EERs09YQ,"This paper proposes an interpretation to the activation values of hidden layer units of convolutional neural networks trained on language tasks, aligning those units with natural language concepts. The work is novel and interesting to the NLP community.","The paper is well written and structured, presenting the problem clearly and accurately. It contains considerable relevant references and enough background knowledge. It nicely motivates the proposed approach, locates the contributions in the state-of-the-art and reviews related work. It is also very honest in terms of how it differs on the technical level from existing approaches. +The paper presents interesting and novel findings to further state-of-the-art’s understanding on how language concepts are represented in the intermediate layers of deep convolutional neural networks, showing that channels in convolutional representations are selectively sensitive to specific natural language concepts. 
It also nicely discusses how concepts granularity evolves with layers’ deepness in the case of natural language tasks. +What I am missing, however, is an empirical study of concepts coverage over multiple layers, studying the multiple occurrences of single concepts at different layers, and a deeper dive on the rather noisy elements of natural language and the layers’ activation dynamics towards such elements. +Overall, however, the ideas presented in the paper are interesting and original, and the experimental section is convincing. My recommendation is to accept this submission. +",6,3.0,ICLR2019 +0D35vqc5NS_,4,P3WG6p6Jnb,P3WG6p6Jnb,"The paper proposes an interesting idea, but it needs revision.","Disclaimer: this paper was assigned to me as an emergency review paper, so I might be missing something important. + +### Evaluation +I recommend rejection. I think the paper presents an interesting idea to regularize policy updates by the variance of a return. Besides, it proves that the return-variance-regularized policy update leads to a monotonic policy improvement, which I think is novel. That being said, the paper seems to have several issues that decreased my evaluation. First, the clarity of the paper is really low: many typos, ambiguous notations, and confusing presentation of a new algorithm. Second, I could not understand some parts of the paper, especially a proof of the monotonic policy improvement theorem. Third, it is unclear why the proposed regularizer is suitable for offline RL. + +### Paper Summary +Offline RL is (said to be) a key to enable RL applications to real world problems. The paper proposes a new algorithm, which regularizes policy updates by the variance of return, for offline RL. The paper proves a monotonic policy improvement when the return-variance-regularized policy update is used. Experiments show moderate performance gain by the regularization. + +### Strong Points +1. Interesting idea to regularize policy updates by the variance of a return +2. Moderate performance gain, especially when offline dataset is obtained by a suboptimal policy. + +### Weak Points +1. The paper is not well-written. It contains typos, unclear sentences, and ambiguous notations. +2. Some parts of the paper, like the proof of the monotonic policy improvement theorem, seem to contain mistakes. (I might be wrong, though.) +3. The performance gain seems to be moderate, despite a high complexity of the computation of the proposed regularizer. + +### Comments to the Authors + +If I am missing something, please feel free to let me know. As noted above, I could not spare much time to review the paper. + +1. The paper is not well-written. Please revise it again. (I don't point out all ambiguous notations and typos, but later I point out some serious ones.) Besides, please revise the references section. It refers to arxiv versions of papers that were accepted to conferences. +2. In page 3, $\omega_{\pi / \mathcal{D}}$ suddenly appears. What does it mean? Maybe $\omega_{\pi / \mu}$? +3. What does $s \sim \mathcal{D}$ mean? Does it mean $s \sim d_\mu$? Since the dataset $\mathcal{D}$ contains states visited by $\mu$, simply drawing states from $\mathcal{D}$ will be different from $d_\mu$. +4. What is $d_{\mathcal{D}}$ in, for example, Equation 4? +5. The idea to use the Fenchel duality to avoid double-sampling problem seems to be not new (cf SBEED paper). While the paper mentions AlgaeDICE as an algorithm using a similar technique, it does not mention SBEED about the use of Fenchel duality. Why? 
+6. In the beginning of Section 3.4, the min-max problem is being solved by repeating the inner minimization and outer maximization. As far as I remember (I might be wrong, though!), this way of solving a min-max problem might not find the exact solution. Isn't it a problem? +7. In Equation 6, it seems that $Q^\pi (s, a)$ is rewritten as $E_{(s, a) \sim d_\mathcal{D}} [ \omega (s, a) r (s, a)]$. According to the notation of the paper, isn't $E_{(s, a) \sim d_\mathcal{D}} [ \omega (s, a) r (s, a)]$ be $(1-\gamma) E_{s_0 \sim \beta, a_t \sim \pi} [\sum_{t=0}^\infty \gamma^t r(s_t, a_t)] \neq Q^\pi(s, a)$? +8. Is Lemma 1 different from Lemma 1 of [Bisi et al, 2019](https://arxiv.org/pdf/1912.03193.pdf)? Also what is its meaning? Why is is useful to understand the variance regularizer? +9. How is the beginning of Section 4.2 related to Theorem 1? Theorem 1 seems to be derived based on a different inequality in its proof. +10. As far as I remember, the dual form of the total variation is $\sup_{f \in C_`1} E_{A \sim P} f(A) - E_{B \sim Q} f(B)$, where $C_1$ is the space of all continuous functions bounded by $1$. Therefore, we don't need $\phi$ and can explicitly state the space of $f$ in Equation 12. Am I wrong or missing something? +11. Why is there no sup over f in Theorem 1? +12. As for Equation 59, how do you get the first line? In addition, do you need $E_{s \sim d_\pi}$? In the second line, the first $E_{s \sim d_{\pi'}, a \sim \pi}$ is sampling an action from $\pi$. Isn't it $\pi'$? In the fourth line, $d_{\pi'}$ changed to $d_\pi$. How is it possible? +13. I don't fully understand what Section 4.3 does. In addition, what are random variables in Theorem 2? +14. I don't understand what the final objective to be optimized is. In Algorithm 1, $J(\theta, \phi, \psi, \nu)$ appears. Is it the same as Equation 6?",4,3.0,ICLR2021 +mMhsFT_vAAw,2,wZ4yWvQ_g2y,wZ4yWvQ_g2y,Interesting work on NAS for BERT,"Summary: +This paper proposes to search architectures of BERT model under various memory and latency contraints. The search algorithm is conducted by pretraining a big supernet that contains the all the sub-network structures, where the optimal models for different requirements are selected from it. Once an architecture is found, it is re-trained through pretraining-finetuning or two-stage distillation for each specific task. Several approaches (block-wise training and search, progressive shrinking, performance approximation) are proposed to improve the search efficiency. Experiments on GLUE benchmark shows the models found by proposed methods can achieve better accuracy than some of the previous compressed BERT models. The paper (together with the appendix) is clearly presented, and the idea is new and interesting to me. The experiments are detailed and comprehensive. + +Pros: +The paper is well presented. The architecture of the superent and the candidate operations are carefully designed and selected. It seems that the SpeConv operation is particularly effective when the model size is small. The search algorithm including the block-wise training, progressive shrinking can remove less-optimal structures quickly and significantly reduce the search space. The performance of NAS-BERT models are generally better than those of the compressed BERT models with similar model size, although the comparisons may not be completely fair. + +Concerns: +1. The organization of the paper can be further improved. 
The paper may not be easy to follow if the appendix is skipped, especially for readers who are not familiar with NAS or related work. Much of the important information can only be found in the appendix.
+2. The novelty of the paper is unclear to me. Although this work may be new in searching BERT-like language models, it seems many of the ideas, such as block-wise search and distillation, are borrowed from existing work. Could the authors please clarify the main novelties and technical contributions of this work, especially to the field of neural architecture search or, more broadly, AutoML. Moreover, some of the proposed techniques, such as progressive shrinking, are merely empirical practices and lack theory or insight showing how accurate the approximation would be.
+3. It is usually more illustrative (and also space-saving) to plot accuracy versus latency/#parameters of different models in the same figure. Some of the well-noted models such as MobileBERT and TinyBERT are not included in the comparison. For DynaBERT, there are multiple configurations but only one is included. AdaBERT, which adopts NAS for each specific task, should also be included if possible. Again, since there are many models with different sizes and latencies, it may be better to have a plot for clear comparison.
+4. HAT (Wang et al. HAT: Hardware-Aware Transformers for Efficient Natural Language Processing. ACL 2020.) is not mentioned in the paper, which shares similarities (training a supernet) and differences (search algorithm) with this work from a technical point of view. It would be better if the authors could explain and compare the proposed search algorithm to evolutionary search.
+",7,3.0,ICLR2021
+2CSPQtGSJsB,3,UOOmHiXetC,UOOmHiXetC,The paper presents a simple extension to MCTS search by choosing multiple actions in each call to 'expansion' phase. The main concern with the paper is the number of simulations for MCTS.,"**Summary**
+This paper presents a new planning algorithm, called Shoot Tree Search (STS), to control the trade-off between depth and breadth of the search. STS modifies the expansion phase of tree search by choosing multiple actions (e.g. $\gt$ 1) instead of a one-level expansion. The presented idea is simple and straightforward and seems to provide an improvement over existing tree-based planning algorithms. The presented detailed ablation studies provide insights into the choices made in the paper.
+
+**Reasons for score**
+Overall, I liked the paper and the simplicity of the idea. However, my major concern is the comparison with MCTS. I am not convinced that STS would outperform vanilla MCTS when the number of simulations is on the order of thousands (e.g. the number of simulations in the AlphaGo paper is around 1600).
+
+**Strengths**
++ The idea is simple and seems to outperform the vanilla MCTS implementation in environments with a large action space.
+
+**Weaknesses**
++ The comparison with related work is not thorough, which makes it hard to come to a decisive conclusion about the performance of the proposed method.
++ There is some missing related work, e.g. using a policy network for multiple rounds of simulations.
+
+**Questions**
++ What would the benefits be if we had a policy network to perform the rollouts (e.g. a method similar to [1])?
++ In general, the benefit of the MCTS algorithm (like AlphaGo, which performs around 1600 simulations) presents itself when the number of simulations is large. Can you compare running MCTS with a larger number of simulations (e.g. large C) and STS? 
+ Can you please provide some insights on why, in 'Corner', STS underperforms compared to random shooting?
+
+[1] https://cs.brown.edu/people/gdk/pubs/analysis_mcts.pdf",6,2.0,ICLR2021
+HkSWIWqez,2,rkQu4Wb0Z,rkQu4Wb0Z,Paper proposed new regularization schemes,"1. Summary
+The authors of the paper compare the learning of representations in DNNs with Shannon's channel coding theory, which deals with reliably sending information through channels. In channel coding theory, the statistical properties of the coding of the information can be designed to fit the task at hand. With DNNs, the representations cannot be designed in the same way. But the representations learned by DNNs can be affected indirectly by applying regularization. Regularizers can be designed to affect statistical properties of the representations, such as sparsity, variance, or covariance. The paper extends the regularizers to perform per-class regularization. This makes sense because, for example, forcing the variance of a representation to go towards zero is undesirable, as it would mean that the unit always has the same output no matter the input. On the other hand, having zero variance for a class is desirable, as it means that the unit has a consistent activation for all samples of the same class. The paper compares different regularization techniques regarding their error performance. They find that applying representation regularization outperforms classical approaches such as L1 and L2 weight regularization. They also find that performing representation regularization on the last layer achieves the best performance. Class-wise methods generally outperform methods that apply regularization on all classes.
+
+2. Remarks
+Shannon's channel coding theory was used by the authors to derive regularizers that manipulate certain statistical properties of representations learned by DNNs. In the reviewer's opinion, there is no theoretical connection between DNNs and channel theory. For one, DNNs are not channels in the sense of transmitting information. DNNs are rather pipes that transform information from one domain to another, where representations are learned as an intermediate model as the information is being transformed. Noise introduced in the process is not due to a faulty channel but due to the quality of the learned representations themselves. The paper falls short in explaining how DNNs and Shannon's channel coding theory fit together theoretically and how this was used to derive the proposed regularizers. Although the theoretical gap between the two was not properly bridged by the authors, channel coding theory is still a good metaphor for what they were trying to achieve.
+The authors recognize that there is similar research being done independently by Belharbi et al. (2017). The similarities and differences between the proposed work and Belharbi et al. should be discussed in more detail.
+The authors conclude that it is unclear which statistical properties of representations are generally helpful when being strengthened. It would be nice if they had derived at least a set of rules of thumb, especially because none of the regularizers described in the paper targets only one specific statistical property, but multiple. One good example that was provided is that L1-rep consistently failed to train on CIFAR-100, because too much sparsity can hurt performance when there are many different classes (100 in this case). These kinds of conclusions will make it easier to transfer the presented theory into practice.
+
+3. 
Conclusion +The comparison between DNNs and Shannons channel coding theory stands on shaky ground. The proposed regularizes are rather simple, but perform well in the experiments. The effect of each regularizer on the statistical properties of the representation and the relations to previous work (especially Belharbi et al. (2017)) should be discussed in more detail. ",5,3.0,ICLR2018 +jSUZUzWLka,1,gSJTgko59MC,gSJTgko59MC,Interesting theory on finite width DNNs but not clearly explained,"It has been show that practical performance of finite width DNNs deviates from the infinite limiting cases of both the NTK and GP. The reason behind this is still not well understood. This papers develops theory trying to explain this by showing the finite width correction (FWC) for DNNs. The results look very interesting. However, as an educational guest on this topic, I find the paper written in a way that causes a lot of confusions, making me not fully understand the results. + +First of all, the paper is talking about the FWC. Although it has been used in previous work and I can get some sense of the meaning, I think a formal definition is needed in order to make the paper more readable. + +Regarding the technical details: + +1. What is Df in eq.4? And why can it be eliminated? And it is not clear to me about the sentence below eq.4: why the rhs of eq.4 equals the kernel of the NNGP only in highly over-parameterized DNNs? +2. I am not sure how eq.5 is obtained? +3. I think the description below eq.5 can be made more formal because this gives the definition of NNSP. Based on the current description, I still cannot get why NNSP is special? Is it still a GP and why? +4. Second line below section 3.1, it says e^{-L[f]/2\sigma^2} is independent of the DNNs, but why? I think f corresponds to the DNN? +5. In Section 3, the paper derives complicated formulations for the posterior mean and variance. I am not sure how these results are useful? It seems to say that these results describe how the finite width DNNs are deviated from the NTK and NNGP, but by looking at these formulas, it is hard for me to figure out why NNSP is better than NTK and NNGP? +6. Paragraph above Figure 1: it says ""the above cubic term would lose its explicit dependence on n"", but I think there is still a quadratic term, which still depends on n. How can it say that the FWC is negligible in the large n regime? + +Overall, I found this paper addressing an important and interesting problem, but the current presentation cannot convince me a pass.",4,2.0,ICLR2021 +r1ef7Qn-jr,3,SyevYxHtDB,SyevYxHtDB,Official Blind Review #4,"The paper proposes a new method for defending against stealing attacks. + +Positives: +1) The paper was very readable and clear. +2) The proposed method is straightforward and well motivated. +3) The authors included a good amount of experimental results. + + +Concerns: +1) You note that the random perturbation to the outputs performs poorly compared to your method, but this performance gap seems to decrease as the dataset becomes more difficult (i.e. CIFAR100). I’m concerned that this may indicate that the attackers are generally weak and this threat model may not be very serious. Overall, I’m skeptical of this threat model - the attackers require a very large number of queries, and don’t achieve great results on difficult datasets. Including results on a dataset like ImageNet would be nice. +2) How long does this optimization procedure take? 
It seems possibly unreasonable for the victim to implement this defense if it significantly lengthens the time to return outputs of queries. +3) Although this is a defense paper, it would be nice if the attacks were explained a bit more. Specifically, how are these attacks tested? You use the validation set, but does the attacker have knowledge about the class-label space of the victim? If the attacker trained with some synthetic data/other dataset, do you then freeze the feature extractor and train a linear layer to validate on the victim’s test set? It seems like this is discussed in the context of the victim in the “Attack Models” subsection, but it’s unclear what’s happening with the attacker. +4) It would be nice to see an angular histogram plot for a model where the perturbed labels were not crafted with knowledge of this model’s parameters - i.e. transfer the proposed defense to a blackbox attacker and produce this same plot. This would motivate the defense more. + + +",3,,ICLR2020 +v5U9qJMdjT,4,oXQxan1BWgU,oXQxan1BWgU,"While appropriately taking into account task similarity & structure in meta-learning and related fields is an important open problem, the scope of the paper is limited to a specific combination of existing algorithms on synthetic datasets. The submission also needs some work to make its methodology clearer.","The submission proposes a meta-learning algorithm attuned to the hierarchical structure of a dataset of tasks. Hierarchy is enforced in a set of synthetically-generated regression tasks via the data-sampling procedure, which is modified from the task-sampling procedure of [1] to include an additional source of randomness corresponding to which of a set of cluster components task parameters are generated from. The authors propose to adapt the model-agnostic meta-learning algorithm (MAML) of [1] to reflect this hierarchical structure by either observing (Section 4.1, FixedTree MAML) or inferring (Section 4.2, LearnedTree MAML) an assignment of tasks to clusters at each step of the inner loop (task-specific adaptation phase) of MAML; if tasks belong to the same cluster, the correspond task-parameters receive the same update at that step (in particular, the update direction is averaged). It is assumed that there are increasingly many clusters at each step, so that task-specific parameter updates are increasingly granular. + +##### Strengths: +1) **Clarity**: The experimental setting and exactly how the data-generating process relates to the proposed algorithms are clearly described. +2) **Significance**: Results on the hierarchically structured synthetic regression task datasets demonstrate that {Fixed|Learned}Tree MAML: is at least as good as MAML, and often outperforms MAML; that it learns more efficiently than a MAML in terms of the cumulative number of datapoints observed; and that both MAML and {Fixed|Learned}Tree MAML outperform a naive baseline. + +##### Weaknesses: +1) **Significance**: Since the evidence provided in favor of the proposed algorithm is in the form of an empirical evaluation on a synthetically generated dataset, the present impact of the algorithm is limited. In particular, there is no evidence that (i) the algorithm works for larger and/or more complex datasets; and (ii) that natural datasets of interest to the community exhibit a hierarchical structure analogous to the synthetic datasets presented in the submission. 
+2) **Novelty**: The algorithm modifies and combines previously introduced components: the MAML algorithm of [1]; the online top-down clustering algorithm of [2], and the task-similarity-as-gradient-similarity approach of [3]. +3) **Clarity**: Specific details surrounding the relationship between Algorithm 1 and Algorithm 2 are insufficiently discussed: + i) Algorithm 2 as it appears in the text is very similar to Algorithm 1 (The OTD algorithm) in [2] with the exception of the new hyperparameter $\xi$, and introduces new symbols that do not appear elsewhere in the text. It is therefore not sufficiently adapted for clarity in the context of this work. + ii) Whether Algorithm 2 acts as a strict subroutine of Algorithm 2 is not stated. I believe it is not because the clustering decision for a new task relies on tree structures that are ""generated for a training batch,"" although what a ""training batch"" refers to is not clear. Similarly, how the ""online""/""offline"" distinction in the context of the clustering algorithm fits into the training/evaluating setup borrowed from [1] is not made clear. + iii) How exactly the task-similarity approach of [3] is employed in Algorithm 2 is not made clear. The only mention of the use of [3] is briefly around Eq. (8) before the main algorithm (Algorithm 1) is introduced, and Algorithm 2 only refers to a generic ""similarity metric"" (as in the original work, [1]). + +##### References + +[1] [Finn, Chelsea, Pieter Abbeel, and Sergey Levine. ""Model-agnostic meta-learning for fast adaptation of deep networks."" arXiv preprint arXiv:1703.03400 (2017).](https://arxiv.org/abs/1703.03400) + +[2] [Menon, Aditya Krishna, Anand Rajagopalan, Baris Sumengen, Gui Citovsky, Qin Cao, and Sanjiv Kumar. ""Online Hierarchical Clustering Approximations."" arXiv preprint arXiv:1909.09667 (2019).](https://arxiv.org/abs/1909.09667) + +[3] [Achille, Alessandro, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji, Charless C. Fowlkes, Stefano Soatto, and Pietro Perona. ""Task2vec: Task embedding for meta-learning."" In Proceedings of the IEEE International Conference on Computer Vision, pp. 6430-6439. 2019.](https://arxiv.org/abs/1902.03545)",3,5.0,ICLR2021 +B1evrRiTKS,2,r1eCukHYDH,r1eCukHYDH,Official Blind Review #3,"The paper suggests performing simultaneous manifold learning and alignment by multiple generators that share common weight matrices and a constructed inverse map that is instantiated by a single encoder. The method utilizes a special regularizer to guide the training. It has been empirically well tested on a multi-manifold learning task, manifold alignment, feature disentanglement, and style transfer. +Overall, this is an interesting idea with a motivated approach, however, I would like several points to be addressed before I could increase a score. +1. It seems that the method makes direct use of the number of classes in the datasets used. How would it fare compared to other models when the number of manifolds is not known (e.g. CelebA dataset)? +2. In MADGAN (Ghosh et al., 2017) generators share first layers which possibly makes them not independent as claimed in the paper, thus it is worth checking if MADGAN exhibits any kind of manifold alignment and could be a baseline for disentanglement with multiple generators. +3. There are hyperparameters \lambda and \mu for the regularizers in the model. It would be helpful to study their effect of different values on the training and encoding. +4. 
Is there a reason for DMWGAN/InfoGAN scores being omitted in Table 1? + +Minor remark - there are a number of typos in the text.",6,,ICLR2020 +eyt6C-TqAg-,3,E3Ys6a1NTGT,E3Ys6a1NTGT,Review,"**Summary:** + +This paper attempts to unify prior work on fixed-dataset (aka ""batch"" or ""offline"") reinforcement learning. Specifically, it emphasizes the importance of pessimism to account for faulty over-estimation from finite datasets. The paper shows that naive algorithms (with no pessimism) can recover the optimal policy with enough data, but do so more efficiently. The pessimistic algorithms are divided into ""uncertainty-aware"" and ""proximal"" algorithms where the uncertainty-aware algorithms are shown to be more principled, but most prior work falls into the computationally easier proximal family of algorithms that is closer to imitation learning. These insights are proven both theoretically and with some small experiments. + +-------------------------------------------------------------------- + +**Strengths:** + +1. A nice decomposition of suboptimality. The main workhorse of the paper is the decomposition provided in Lemma 1 which is novel and can provide some good intuition about the necessity of pessimism (although the intuition is only given in appendix G.3, which should definitely find it's way into the main text). The Lemma cleanly and formally demonstrates why we may expect over-estimation to be more damaging than under-estimation. +2. A clear framework to examine prior work. The paper does well to capture the majority of recent work into a few broad families of algorithms: naive, proximal pessimistic, and uncertainty-aware pessimistic. The bound derived from the main Lemma for each algorithm family provide evidence to prefer uncertainty-aware algorithms. This is supported by the tabular experiments. +3. The formal statements of Lemmas and Theorems seem to be correct and experimental methodology seems sound. + +-------------------------------------------------------------------- + +**Weaknesses:** + +1. I am wary of the comparison of upper bounds done in the paper. Just because one algorithm has a lower upper bound does not prove superior performance. I agree that since all the proofs are derived from Lemma 1 and are very similar, the differences are indeed suggestive. However, the bound in Theorem 3 seems to be more loose than the others. For example, when $\alpha = 0$ it does not recover the bound for the naive algorithm as would be expected. A more measured tone and careful description of these comparisons is needed. Claims like ""uncertainty-aware algorithms are strictly better than proximal algorithms"" in the conclusion are not substantiated. +2. Lack of discussion of issues with implementation and function approximation. As the authors get into in Appendix G.6 and Appendix F.2 and briefly in the paper it is not clear how to implement the uncertainty-aware family of algorithms in a scalable way. I am not saying that this paper needs to resolve this issue (it is clearly hard), but this drawback needs to be made more clear in the main text of the paper, so as to not mislead the reader. +3. Notation is heavy and sometimes nonstandard. I understand that the nature of this paper will lead to a lot of notation, but I think the paper could be made more accessible if the authors go back through the paper and remove notation that may only be needed in the proofs and may be unnecessary to present the main results. 
For example, the several different notions of uncertainty funtions might be useful in the appendix, but do not seem to all be necessary to present the main results. Similarly, the notion of decomposability is introduced and then largely forgotten for the rest of the paper. Some notation is nonstandard. For example: $d$ is used for number of datapoints (usually it would be dimension) and $ \Phi$ is used as the data distribution (usually if would be a feature generating function or feature matrix). +4. Abuse of the appendix. While I understant that the 8 page limit can be difficult, this paper especially abuses the appendix often sending important parts of the discussion and intuition for the results into appendix G. The paper would be stronger with some editing of the notation and organization of the main text to make room for more of the needed discussion and intuition in main body of the paper. + +-------------------------------------------------------------------- + +**Recommendation:** + +I gave the paper a score of 7, and recommend acceptance. The paper provides a nice framing of prior work on fixed-dataset RL. While it leaves some things to be desired in terms of carefulness, scalability, and clarity, I think it provides a solid contribution that will be useful to researchers in the field. + +If the authors are able to sufficiently improve the clarity of presentation as discussed in the weaknesses section, I could consider raising my score. + +-------------------------------------------------------------------- + +**Questions for the authors:** + +1. It is natural to think that a practical proximal pessimistic algorithm would reduce the level of pessimism with the dataset size (so that it approaches the naive algorithm with infinite data). Do approaches like this resolve many of the issues that you bring up with proximal pessimistic algorithms (albeit by introducing another hyperparameter to tune)? + +-------------------------------------------------------------------- + +**Additional feedback:** + +Typos: + +- The first sentence on page 4 is not grammatically correct. +- In the statements of Lemma 1 and Theorem 1, $ \pi^*_D$ is defined and never used. +- In the statement of Theorem 1 $ u_{D,\delta}^\pi$ is defined but then only $ \mu_{D,\delta}^\pi$ is used without being defined.",7,4.0,ICLR2021 +HkeQCAq6hQ,3,S1xcx3C5FX,S1xcx3C5FX,Ok to accept after discussion,"Verifying the properties of neural networks can be very difficult. Instead of +finding a formal proof for a property that gives a True/False answer, this +paper proposes to take a sufficiently large number of samples around the input +point point and estimate the probability that a violation can be found. Naive +Monte-Carlo (MC) sampling is not effective especially when the dimension is +high, so the author proposes to use adaptive multi-level splitting (AMLS) as a +sampling scheme. This is a good application of AMLS method. + +Experiments show that AMLS can make a good estimate (similar quality as naive +MC with a large number of samples) while using much less samples than MC, on +both small and relatively larger models. Additionally, the authors conduct +sensitivity analysis and run the proposed algorithm with many different +parameters (M, N, pho, etc), which is good to see. + + +I have some concerns on this paper: + +I have doubts on applying the proposed method to higher dimensional inputs. 
In +section 6.3, the authors show an experiments in this case, but only on a dense +ReLU network with 2 hidden layers, and it is unknown if it works in general. +How does the number of required samples increases when the dimension of input +(x) increases? + +Formally, if there exists a violation (counter-example) for a certain property, +and given a failure probability p, what is the upper bound of number of samples +(in terms of input dimension, and other factors) required so that the +probability we cannot detect this violation with probability less than p? +Without such a guarantee, the proposed method is not very useful because we +have no idea how confident the sampling based result is. Verification needs +something that is either deterministic, or a probabilistic result with a small +and bounded failure rate, otherwise it is not really a verification method. + +The experiments of this paper lack comparisons to certified verification +methods. There are some scalable property verification methods that can give a +lower bound on the input perturbation (see [1][2][3]). These methods can +guarantee that when epsilon is smaller than a threshold, no violations can be +found. On the other hand, adversarial attacks give an upper bound of input +perturbation by providing a counter-example (violation). The authors should +compare the sampling based method with these lower and upper bounds. For +example, what is log(I) for epsilon larger than upper bound? + +Additionally, in section 6.4, the results in Figure 2 also does not look very +positive - it unlikely to be true that an undefended network is predominantly +robust to perturbation of size epsilon = 0.1. Without any adversarial training, +adversarial examples (or counter-examples for property verification) with L_inf +distortion less than 0.1 (at least on some images) should be able to find. It +is better to conduct strong adversarial attacks after each epoch and see what +are the epsilons of adversarial examples. + +Ideas on further improvement: + +The proposed method can become more useful if it is not a point-wise method. +If given a point, current formal verification method can tell if a property is +hold or not. However, most formal verification method cannot deal with a input +drawn from a distribution randomly (for example, an unseen test example). This +is the place where we really need a probabilistic verification method. The +setting in the current paper is not ideal because a probabilistic estimate of +violation of a single point is not very useful, especially without a guarantee +of failure rates. + +For finding counter-examples for a property, using gradient based methods might +be a better way. The authors can consider adding Hamiltonian Monte Carlo to +this framework (See [4]). + +References: +There are some papers from the same group of authors, and I merged them to one. +Some of these papers are very recent, and should be helpful for the authors +to further improve their work. + +[1] ""AI2: Safety and Robustness Certification of Neural Networks with Abstract +Interpretation"", IEEE S&P 2018 by Timon Gehr, Matthew Mirman, Dana +Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, Martin Vechev + +(see also ""Differentiable Abstract Interpretation for Provably Robust Neural +Networks"", ICML 2018. by Matthew Mirman, Timon Gehr, Martin Vechev. 
They also +have a new NIPS 2018 paper ""Fast and Effective Robustness Certification"" but is +not on arxiv yet) + +[2] ""Efficient Neural Network Robustness Certification with General Activation +Functions"", NIPS 2018. by Huan Zhang, Tsui-Wei Weng, Pin-Yu Chen, Cho-Jui +Hsieh, Luca Daniel. + +(see also ""Towards Fast Computation of Certified Robustness for ReLU Networks"", +ICML 2018 by Tsui-Wei Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, +Duane Boning, Inderjit S. Dhillon, Luca Danie.) + +[3] Provable defenses against adversarial examples via the convex outer +adversarial polytope, NIPS 2018. by Eric Wong, J. Zico Kolter. + +(see also ""Scaling provable adversarial defenses"", NIPS 2018 by the same authors) + +[4] ""Stochastic gradient hamiltonian monte carlo."" ICML 2014. by Tianqi Chen, +Emily Fox, and Carlos Guestrin. + +============================================ + +After discussions with the authors, they agree to revise the paper according to our discussions and my primary concerns of this paper have been resolved. Thus I increased my rating. +",6,5.0,ICLR2019 +41ejbm5DVG2,1,_lV1OrJIgiG,_lV1OrJIgiG,"Interesting contribution to zero-shot navigation, falls a bit short on generalization","[EDIT AFTER DICUSSIONS] I thank the authors for their answer to my comments. I agree with the summary of the Area Chair and do not wish to modify my score. +[/EDIT] + +########################################################################## +Summary: +This paper presents addresses the problem of zero-shot navigation in environments with novel layouts.  It introduces two approaches (MMN a model-based approach based on Monte-Carlo Tree Search, and MAH, a model-free approach based on Deep-Q networks). The paper also introduces n-step relabelling as a way to leverage failed trials and make learning more efficient. Experiments on the DeepMind Lab environment show that both methods perform well against a random baseline and that MAH extends better to larger maps. + +########################################################################## +Reasons for score:  +This paper presents a novel approach to an interesting problem. The method is sound and the approach rigorous. On the downside, an important claim of the paper is that the method is more generalizable than the latest work on this topic (Brunner et al. 2018) but this claim would have been stronger if supported with evidence, in particular with stronger baselines and more tasks. + + ########################################################################## +Pros: + +- Clarity: the paper is well-structured, clear, and easy to read + +- Impact: the paper addresses an interesting problem, in particular trying to use a general approach that is not specific to map-based navigation + +- Rigor: the work presented in this paper is detailed and follows a clear methodology. Equations seem correct. The experimentation study is fairly detailed and the appendix provides significant details about the methods. + +########################################################################## +Cons:  + +- No code is provided with the paper, making it hard for future work to use it as a baseline + +- The paper mentions Brunner et al. 2018 as reference work. This work seems to achieve significantly better performance, at the ""cost"" of using a more task-specific, map-based navigation approach.This raises a few questions: +1. Could the authors have used Brunner et al. as a baseline for this work? Was the code available? +2. 
The authors (rightfully) claim that their approach is more generic and may be more readily generalizable to other problems. This statement would be a lot stronger if they actually proved it in the paper, i.e. if they used the same technic to solve a different problem. +3. The methodology used in Brunner et al. uses a larger variety of map sizes. The authors could have used this approach too to better evaluate their method. + +- The paper provides a simple external baseline (random).  Could other (external) stronger baselines have been used, such as asynchronous advantage actor-critic ((Mnih et al.,2016) or model-free episodic control (Blundell et al., 2016)? + +- The paper evaluates on a single task and could have been evaluated on more tasks to illustrate robustness to domain shifts, in particular in the light of the comment above regarding the generalization of the method. Examples include Jaco Arm, CoinRun, or the Surreal framework. + +######################################################################### +Some typos:  + +- Page 2: ""betweend"" in the first paragraph +- Figures on page 6 are hardly legible on a printed version of the paper.  Try to make them larger maybe? +- missing upper cases in some references (e.g. POMDPs) +",6,4.0,ICLR2021 +ewg_yz1Zjx,1,ZVqZIA1GA_,ZVqZIA1GA_,A good subminssion that can be improved further,"Pros: +It is indeed that capsules have many promising attributes but they are not quite easy to be exploited in object detectors. This paper has addressed many problems existed when applying capsule architectures for the object detection task. In particular, deformable capsules, SplitCaps, and SE-Routing are respectively introduced to help tackle the object detection with capsules. I believe the techniques of this paper will attract interests from researchers in the corresponding areas. The writing is also good. + +Cons: +There are several points that prevent me from giving a higher rating: +1) I feel that the descriptions or presentations of introduced techniques are not quite sufficient. In particular, I am confused about the details of deformable capsules. What are the exact operations performed by the deformable capsules? Is it to make parent capsules only aggregate information from a smaller set of their children and the sampling of such a smaller set is implemented by deformable operations? I believe it will be much better to present some detailed formulations or equations to illustrate this operation. Similarly, I also recommend adding detailed equations to describe SE-Routing. + +2) Without sufficient information about operations, I found that the descriptions of motivations are also quite weak, especially in the introductory part. It seems that the authors only mention the problems they will address and the techniques they proposed to address problems. There is not sufficient content briefly explaining why the proposed techniques can tackle the mentioned problems, or at least what are the advantages of the proposed techniques for tackling the problems. + +3) The figures also have many confusing points. For example, in Figure 1, the authors marked that solid red arrows represent 'up' but I can only find dotted red arrows in the figure. Moreover, in what place the modules inside the big blue box is implemented in the detector? After reading, I think it should represent what's inside the SplitCaps, but I do recommend the authors to add some indicative symbols to corresponding related modules. 
+ +4) In addition to the reviewed literature, I found that there are some missing articles that also study how to borrow the capsule concepts for various computer vision tasks. For example: +[1] Vijayakumar T. Comparative study of capsule neural network in various applications[J]. Journal of Artificial Intelligence, 2019, 1(01): 19-27. +[2] Zhao Y, Birdal T, Deng H, et al. 3D point capsule networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2019: 1009-1018. +[3] Chen Z, Zhang J, Tao D. Recursive Context Routing for Object Detection[J]. International Journal of Computer Vision, 2020: 1-19. + +Overall, I would like to have some feedback from the authors regarding the above issues before making my final decision. ",6,4.0,ICLR2021 +SJxIHvP15H,2,HJx7uJStPH,HJx7uJStPH,Official Blind Review #1,"This paper suggests using a Unet type architecture to perform end-to-end source separation. They are reporting performance improvement over another architecture which uses Unets architecture in conjunction with Wavenet decoders. They also report a very marginal performance improvement over an STFT based model (open unmix) The improment is 0.02 dB (table 1), and I am not sure if it is statistically significant. + + +Also, there is no mention of algorithms which adaptively learn the basis and then do masking similar to what we do in the STFT domain. A very popular example for this is tasnet, which performs well on speech source separation tasks. I would like to see comparisons with this model. If you think there is a specific reason why not to use adaptive basis approaches such as TASNET, please do let me know. ",3,,ICLR2020 +Byn8p7V-z,3,HJIhGXWCZ,HJIhGXWCZ,"Interesting direction, but alternatives not fully explored and not sure what I learn from experiments.","Summary: + +I like the general idea of learning ""output stochastic"" noise models in the paper, but the idea is not fully explored (in terms of reasonable variations and their comparative performance). I don't fully understand the rationale for the experiments: I cannot speak to the reasons for the GAN's failure (GANs are not easy to train and this seems to be reflected in the results); the newly proposed model seems to improve with samples simply because the evaluation seems to reward the best sample. I.e., with enough throws, I can always hit the bullseye with a dart even when blindfolded. + +Comments: + +The model proposes to learn a conditional stochastic deep model by training an output noise model on the input x_i and the residual y_i - g(x_i). The trained residual function can be used to predict a residual z_i for x_i. Then for out-of-sample prediction for x*, the paper appears to propose sampling a z uniformly from the training data {z_i}_i (it is not clear from the description on page 3 that this uniformly sampled z* = z_i depends on the actual x* -- as far as I can tell it does not). The paper does suggest learning a p(z|x) but does not provide implementation details nor experiment with this approach. + +I like the idea of learning an ""output stochastic"" model -- it is much simpler to train than an ""input stochastic"" model that is more standard in the literature (VAE, GAN) and there are many cases where I think it could be quite reasonable. However, I don't think the authors explore the idea well enough -- they simply appear to propose a non-parametric way of learning the stochastic model (sampling from the training data z_i's) and do not compare to reasonable alternative approaches. 
To start, why not plot the empirical histogram of p(z|x) (for some fixed x's) to get a sense of how well-behaved it is as a distribution. Second, why not simply propose learning exponential family models where the parameters of these models are (deep nets) conditioned on the input? One could even start with a simple Gaussian and linear parameterization of the mean and variance in terms of x. If the contribution of the paper is the ""output stochastic"" noise model, I think it is worth experimenting with the design options one has with such a model. + +The experiments range over 4 video datasets. PSNR is evaluated on predicted frames -- PSNR does not appear to be explicitly defined but I am taking it to be the metric defined in the 2nd paragraph from the bottom on page 7. The new model ""EEN"" is compared to a deterministic model and conditional GAN. The GAN never seems to perform well -- the authors claim mode collapse, but I wonder if the GAN was simply hard to train in the first place and this is the key reason? Unsurprisingly (since the EEN noise does not seem to be conditioned on the input), the baseline deterministic model performs quite well. If I understand what is being evaluated correctly (i.e., best random guess) then I am not surprised the EEN can perform better with enough random samples. Have we learned anything? +",5,2.0,ICLR2018 +H1l81UwCYS,2,H1gax6VtDB,H1gax6VtDB,Official Blind Review #2,"This paper tackles the problem of learning an encoder and transition model of an environment, such that the representation learnt uses an object-centric representation which could favor compositionality and generalisation. This is trained using a contrastive max-margin loss, instead of a generative loss as previously explored. They do not consider RL or follow-up tasks leveraging these representations and transition models yet. +They perform an extensive assessment of their model, with many ablations, on 2 gridworld environments, one physical domain, and on Atari. + +The paper is very well motivated, easy to follow, and most of its assumptions and decisions are sensible and well supported. They also provide interesting assessments and insights into the evaluation scheme of such transition models, which would be of interest to many practitioners of this field. + +Apart from some issues presented below, I feel that this work is of good quality and would recommend it for acceptance. + +1. The model is introduced in a very clear way, and most decisions seem particularly fair. I found the presentation of the contrastive loss with margin to be clear, and the GraphNet is also well supported (although see question below). 
However, two choices are surprising to me and would deserve some clarification and more space in the main text, instead of the Appendix: + a. Why does the object extractor only output a scalar mask? This was not extremely clear from reading the main text (and confused me when I first saw Figure 1 and 3a), but as explained in the Appendix, the CNN is forced to output a sigmoid logit between [0, 1] per object channel.
 + This seems overly constraining to me, as this restricts the network to only output “1 bit” of information per “object”. + However, maybe being able to represent other factors of these objects might be necessary to make better predictions? 
 + This also requires the user to select the number of output channels precisely, or the model might fail. This is visible in the Atari results, where the “objectness” is much less clear. + 
Did you try allowing the encoder to output more features per objects? + Obviously this would be more complicated and would place you closer to a setting similar to MONet (Burgess et al. 2019) or IODINE (Greff et al. 2019), but this might help a lot. + b. It was hard to find the dimensionality D of the abstract representation $z_t$. It is only reported in the Appendix, and is set to $D=2$ for the 2D gridworld tasks and $D=4$ for Atari and the physics environments. 
These are quite small, and the fact that they exactly coincide with your assumed sufficient statistics is a bit unfortunate.
 + What happens if D is larger? Could you find the optimal D by some means? +2. The GraphNet makes sense to me, but I wondered why you did not provide $a_t^j$ to $e_t^{(i, j)}$ as well? I could imagine situations where one would need the action to know if an interaction between two slots is required. +3. Similarly, the fact that the action was directly partitioned per object (except in Atari where it was replicated), seemed slightly odd. Would it still work if it was not directly pre-aligned for the network? I.e. provide $a_t$ as conditioning for the global() module of the GraphNet, and let the network learn which nodes/edges it actually affects. +4. In your multi-object contrastive loss, how is the mapping between slot k in $z_t$ and $\tilde{z}_t$ performed? Do you assume that a given object (say the red cube) is placed in the same $k$ slot across different scenes/timesteps?
This may actually be harder to enforce by the network than expected (e.g. with MONet, there is no such “slot stability”, see [1] for a discussion). +5. It was unclear to me if the “grid” shown in Figure 3 (b) and 5 is “real”? I.e. are you exactly plotting your $z_t$ embeddings, and they happen to lie precisely along this grid? If yes, I feel this is a slightly stronger result as you currently present, given this means that the latent space has mirrored the transition dynamics in a rather impressive fashion. +6. Related to that point, I found Figure 3 b) to be slightly too hard to understand and parse. The mapping of the colours of the arrows is not provided, and the correspondence between “what 3D object is actually moving where” and “which of the coloured circles correspond to which other cubes in the image” is hard to do (especially given the arbitrary rotation). 
Could you add arrows/annotations to make this clearer? 
Alternatively, presenting this as a sequence might help: e.g. show the sequence of real 3D images, along with the trajectory it traces on the 2D grid. +7. Figure 4 a) was also hard to interpret. Seeing these learnt filters did not tell much, and I felt that you were trying too hard to impose meaning on these, or at least it wasn’t clear to me what to take of them directly. I would have left this in the Appendix. Figure 4 b) on the other hand was great, and I would put more emphasis on it. +8. There are no details on how the actual test data used to generate Table 1 was created, and what “unseen environment instances” would correspond to. It would be good to add this to the Appendix, and point forward to it at the end of the first paragraph of Section 4.6, as if you are claiming that combinatorial generalization is being tested this should be made explicit. I found Table 1 to be great, complete, and easy to parse. +9. It would be quite interesting to discuss how your work relates to [1], as the principles and goals are quite similar. 
On a similar note, if you wanted to extend your 2D shape environment from a gridworld to a continuous one with more factors of variations, their Spriteworld environment [2] might be a good candidate. + + +References: +[1] Nicholas Watters, Loic Matthey, Matko Bosnjak, Christopher P. Burgess, Alexander Lerchner, “COBRA: Data-Efficient Model-Based RL through Unsupervised Object Discovery and Curiosity-Driven Exploration”, 2019, https://arxiv.org/abs/1905.09275 +[2] Nicholas Watters, Loic Matthey, Sebastian Borgeaud, Rishabh Kabra, Alexander Lerchner, “Spriteworld: A Flexible, Configurable Reinforcement Learning Environment”, https://github.com/deepmind/spriteworld/ + +",8,,ICLR2020 +rknUpWqgz,2,rJNpifWAb,rJNpifWAb,Flipout is an important contribution for weight-perturbation algorithms,"Typical weight perturbation algorithms (as used for e.g. Regularization, Bayesian NN, Evolution +Strategies) suffer from a high variance of the gradient estimates. This is caused +by sharing a weight perturbation by all training examples in a minibatch. More specifically +sharing perturbed weights over samples in a minibtach induces correlations between gradients of each sample, which can +not be resolved by standard averaging. The paper introduces a simple idea, flipout, to +perturb the weights quasi-independently within a minibatch: a base perturbation (shared +by all sample in a minibatch) is multiplied by a random rank-one sign matrix (different +for every sample). Due to its special structure it is possible to vectorize this +per-sample-operation such that only matrix-matrix products (as in the default forward +propagation) are involved. The incurred computational cost is roughly twice as much +as a standard forward propagation path. The paper also proves that this approach +reduces the variance of the gradient estimates (and in practice, flipout should +obtain the ideal variance reduction). In a set of experiments it is demonstrated +that a significant reduction in gradient variance is achieved, resulting +in speedups for training time. Additionally, it is demonstrated that +flipout allows evolution strategies utilizing GPUs. + +Overall this is a very nice paper. It clearly lays out the problem, describes +one solution to it and shows both theoretically as well as empirically +that the proposed solution is a feasable one. Given the increasing importance +of Bayesian NN and Evolution Strategies, flipout is an important contribution. + +Quality: Overall very well written. Relevant literature is covered and an important +problem of current research in ML is tackled. + +Clarity: Ideas/Reasons are clearly presented. + +Significance: The presented work is highly significant for practical applicability +of Bayesian NN and Evolution Strategies.",8,3.0,ICLR2018 +BklmCVWU3X,1,HyexAiA5Fm,HyexAiA5Fm,"good effort for scalable unbalanced OT, theoretical aspect might be problematic ","### post rebuttal### authors addressed most of my concerns and greatly improved the manuscript and hence I am increasing my score. + +Summary: + +The paper introduces a static formulation for unbalanced optimal transport by learning simultaneously a transport map T and scaling factor xi . + +Some theory is given to relate this formulation to unbalanced transport metrics such as Wasserstein Fisher Rao metrics for e.g. Chizat et al 2018. + +The paper proposes to relax the constraint in the proposed static formulation using a divergence. 
furthermore using a bound on the divergence , the final discrepancy proposed is written as a min max problem between the witness function f of the divergence and the transport map T , and scaling factor xi. + +An algorithm is given to find the optimal map T as a generator in GAN and to learn the scaling factor and the witness function of the divergence with a neural network paramterization , the whole optimized with stochastic gradient. + +Small experimentation on image to image transportation with unbalance in the classes is given and show how the scaling factor behaves wrt to this kind of unbalance. + + +Novelty and Originality: + +The paper claims that there are no known static formulations known with a scaling factor and a transport map learned simultaneously. We refer the authors to Unbalanced optimal Transport: Geometry and Kantrovich Formulation Chizat et al 2015. In page 19 in this paper Equation 2.33 a similar formulation to Equation 4 in this paper is given. (Note that phi corresponds to T and lambda to xi). This is known as the monge formulation of unbalanced optimal transport. The main difference is that the authors here introduce a stochastic map T and an additional probabilty space Z. Assuming that the mapping is deterministic those two formulations are equivalent. + +Correctness: + +The metric defined in this paper can be written as follow and corresponds to a generalization of the monge formulation in chizat 2015 : +L(mu,nu)= inf_{T, xi} int c_1(x,T_x(z) ) xi(x) lambda(z) dmu(x) + int c_2(x_i(x)) dmu(x) + s.t T_# (xi mu)=nu +In order to get a kantorovich formulation out of this chizat et al 2015 defines semi couplings and the formulation is given in Equations 3.1 page 20. + +This paper proposes to relax T_# (xi mu)=nu with D_psi (xi \mu, \nu) and hence proposes to use: + +L(mu,nu)= inf_{T, xi} int c_1(x,T_x(z) ) xi(x) lambda(z) dmu(x) + int c_2(x_i(x)) dmu(x)+ D_psi (xi \mu, \nu) + +Lemma 3.2 of the paper claims that the formulation above corresponds to the Kantrovich formulation of unbalanced transport. I doubt the correctness of this: + +Inspecting the proof of Lemma 3.2 L \geq W seems correct to me, but it is unclear what is going on in the proof of the other direction? The existence of T_x is not well supported by rigorous proof or citation? Where does xi come from in the third line of the equalities in the end of page 14? I don’t follow the equalities written at the end of page 14. + +Another concern is the space Z, how does the metric depend on this space? should there be an inf on all Z? + +Other comments: + +- Appendix A is good wish you baselined your experiments with those algorithms. + +- The experiments don’t show any benefit for learning the scaling factor, are there any applications in biology that would make a better case for this method? + +- What was the architecture used to model T, xi, and f? + +- Improved training dynamics in the appendix, it seems you are ignoring the weighting while optimizing on theta? than how would the weighing be beneficial ?",6,4.0,ICLR2019 +qvFAiTJJDFS,3,Ut1vF_q_vC,Ut1vF_q_vC,Unfair comparison and limited novelty,#NAME?,2,5.0,ICLR2021 +Bkx6D1fx9H,3,B1e9Y2NYvS,B1e9Y2NYvS,Official Blind Review #3,"This paper investigates the robustness of Neural Ordinary differential equations (ODEs) against corrupted and adversarial examples. The crux of the analysis is based on the separation property of ODE integral curves. 
The insights from empirical robustness evaluation show that controlling the difference between neighboring integral curves is able to improve neural ODE's robustness. In general, neural ODE is a hot research topic in recent years, and a paper advancing knowledge in this area about understanding its various characteristics is certainly welcome. The paper is well motivated and clearly written. One aspect that confuses me a little originally is the different effects of getting ridding of the dependency on the time t and adding the steady state regularization. It would be nice to elucidate which part makes more contributions? Furthermore, to compare the robustness of the new approach with CNN, the input data consists of original images and their Gaussian-noise based perturbed samples. Since the paper already involves the evaluation using adversarial examples, it will make the paper much more stronger to show that when training both the new approach and the CNN with adversarial training, the proposed regularization can still lead to better robustness. ",6,,ICLR2020 +rJgm2PsCFB,3,HkgbKaEtvB,HkgbKaEtvB,Official Blind Review #2,"The paper presents an approach to discrete input selection for NNs, using the Gumbel-Softmax trick at its core. It motivates this problem in the context of communicating data over a network with limited bandwidth budget. It proposes constructing different kinds of masks that can be applied over channels or pixels in the input, grounding the discussion in the image domain. This can be seen as a special case of Feature Selection, with image specific substructures motivating the choice of mask types. + +There is very little novelty in this work over that presented in Abubakar Abid et. al. [1], where the idea of using Gumbel-Softmax as a differentiable Feature Selection algorithm has already been expounded at depth, both in unsupervised as well as supervised settings. The current work draws directly from the supervised form in [1]. The only incremental contribution in this work is the specific mask types and mask-specific losses. + +Pros +• Interesting approach to extend the framework in [1] to CNNs, with use of masks and mask-specific loss +• Clear motivation for the network bandwidth limited use case + +Cons +• Hardly any technical novelty because the core ideas are already presented as well as applied to the same task in [1] +• It is very surprising that the authors do not even cite [1] in their paper, despite their work being extremely closely related to it +• Most of the discussion in the Related Work section is unrelated to the specific task they tackled in the paper (i.e., input/feature selection). The second paragraph in this section talks about ‘gradient-driven search’ for discrete selection, which has been recently explored not only in [1] but also related G-S applications like [2], [3], but the authors seem unaware of this line of works +• The authors do not compare their approach against any existing baselines from literature for this task, again with the most apt being [1] and baselines therein. This makes it hard to understand the true value of their proposals such as mask types, schedule that adjusts both ‘tau’ and ‘lambda’ during training etc. 
+ +[1] Abubakar Abid et al., “Concrete Autoencoders for Differentiable Feature Selection and Reconstruction”, ICML 2019, (https://arxiv.org/abs/1901.09346) +[2] Hanxiao Liu et al., “DARTS: Differentiable Architecture Search”, ICLR 2019 +[3] Bichen Wu et al., “FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search” CVPR 2019 +",3,,ICLR2020 +hD6fQlYHu_m,5,OcTUl1kc_00,OcTUl1kc_00,Incremental contribution,"The paper considers the long-range dependency and proposes four levels of injection of longer-range graph structure information based on random walks with restart (RWR). Experimental results show that the proposed models perform well on the tasks of node classification, graph classification, and counting triangles. +Utilizing long-range dependency is not new in graph neural networks. I do not think the authors give enough reviews about important related tasks; their related work section focuses more on RWR. Considering the motivation and the solved issues of graph neural networks, more relevant literature in GNN domains should be added, such as MixHop (Abu-El-Haija et al., 2019), Snowball (Luan et al., 2019), APPNP (Klicpera et al., 2019 ), GDC (Klicpera et al., 2019). Compared with those works, the RWR regularization seems incremental. Adding the RWR features is just a new feature and adding the RWR regularization term actually can be translated into a kind of message passing schema. +Moreover, some details of the method are not very clear. For example, how do we calculate $S_{I,j}$? When we add the regularization, do we use all node pairs or just the node pairs within some distance? If using all node pairs, is the computational complexity too high? +",4,4.0,ICLR2021 +rklgSCRptr,2,Syx7WyBtwB,Syx7WyBtwB,Official Blind Review #2,"The paper presents a way of using generated explanations of model predictions to help prevent a model from learning ""unwanted"" relationships between features and class labels. This idea was implemented with a particular explanation generation method from prior work, called contextual decomposition (CD). For a given feature, the corresponding CD can be used to measure its importance. The proposed learning objective in this work optimizes not only the cross entropy loss, but also the difference between the CD score of a given feature and its explanation target value. Experiments show that this new learning algorithm can largely improve the classification performance. + +I like the high-level idea of this work and agree that there is not much work on using prediction explanations to help improve model performance. However, there are two major concerns of the model and experiment design. + +First, it seems like the proposed method requires whoever use it already know what the problem is. For example, + +- in section 3.3, the model inputs include a collection of features and the corresponding explanation target values. +- in section 4.1, it is already known that some colorful patches only appear in some non-cancerous images but not in cancerous images. +- it is even more obvious in section 4.2 and 4.3, because in both experiments, the training and test examples were altered on purpose to create some mismatch. + +My question is that if we already know the bias or the mismatch, why not directly use this information in the regularization to penalize some features? Is it necessary to resort to some explanation generation methods? + +My second concern is more like a personal opinion. 
In the experiment of section 4.2, if the colors are good indicators of these digits in the training set, I don't it is wrong for a model to capture these important features. However, the way of altering examples in the same class with different colors in training and test sets seems questionable, because now, the distributions of training and test images are different. On the other hand, if we already know color is the issue, why not simply convert the images into black-and-white? A similar argument can also be applied to the experiment in section 4.3 + +Overall, I like the idea of using explanations to help build a better classifier. However, I am concerned about the value of this work. +",3,,ICLR2020 +r1LAwb9xz,1,SyUkxxZ0b,SyUkxxZ0b,"Proposes and analyzes one very simple artificial data set, looking for insights about adversarial examples; Despite some good motivations, the significance of the results is not clearly established.","The idea of analyzing a simple synthetic data set to get insights into open issues about adversarial examples has merit. However, the results reported here are not sufficiently significant for ICLR. + +The authors make a big deal throughout the paper about how close to training data the adversarial examples they can find on the data manifold are. E.g.: “Despite being extremely rare, these misclassifications appear close to randomly sampled points on the sphere.” They report mean distance to nearest errors on the data manifold is 0.18 whereas mean distance between two random points on inner sphere is 1.41. However, distance between two random points on the sphere is not the right comparison. The mean distance between random nearest neighbors from the training samples would be much more appropriate. + +They also stress in the Conclusions their Conjecture 5.1 that under some assumptions “the average distance to nearest error may decrease on the order of O(1 / d) as the input dimension grows large.” However, earlier they admitted that “Whether or not a similar conjecture holds for image manifolds is unclear and should be investigated in future work.” So, the practical significance of this conjecture is unclear. Furthermore, it is well known that in high dimensions, the distances between pairs of training samples tends towards a large constant (e.g. making nearest neighbor search using triangular inequality pruning infeasible), so extreme care much be taken to not over generalize any results from these sorts of synthetic high dimensional experiments. + +Authors note that for higher dimensional spheres, adversarial examples on the manifold (sphere shell) could found, but not smaller d: “In our experiments the highest dimension we were able to train the ReLU net without adversarial examples seems to be around d = 60.” Yet,in their later statement in that same paragraph “We did not investigate if larger networks will work for larger d.”, it is unclear what is meant by “will work”; because, presumably, larger networks (with more weights) would be HARDER to avoid adversarial examples being found on the data manifold, so larger networks should be less likely “to work”, if “work” means avoid adversarial examples. In any case, their apparent use of only h=1000 unit networks (for both ReLU and quadratic cases) is disappointing, because it is not clear whether the phenomena observed would be qualitatively similar for different fully-separable discriminants (e.g. different h values with different regularization costs even if all such networks had zero classification errors). 
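To make the comparison I am asking for concrete, the relevant baseline statistic is cheap to compute; a quick sketch (my own illustrative code, arbitrary sizes):

    import torch

    n, d = 2000, 500
    x = torch.nn.functional.normalize(torch.randn(n, d), dim=1)   # training points on the unit sphere
    dist = torch.cdist(x, x)
    dist.fill_diagonal_(float('inf'))
    mean_nn = dist.min(dim=1).values.mean()                # mean nearest-neighbour distance
    mean_rand = torch.cdist(x[:100], x[100:200]).mean()    # mean distance between random pairs
    print(mean_nn.item(), mean_rand.item())

For realistic sample sizes in high dimension the two numbers are typically not far apart, which is exactly why the headline comparison in the paper needs the nearest-neighbour baseline to be interpretable.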
+ +The authors repeat the following exact same phrase in both the Introduction and the Conclusion: +“Our results highlight the fact that the epsilon norm ball adversarial examples often studied in defence papers are not the real problem but are rather a tractable research problem. “ +But it is not clear exactly what the authors meant by this. Also, the term “epsilon norm ball” is not commonly used in adversarial literature, and the only reference to such papers is Madry et al, (2017), which is only on ArXiv and not widely known — if these types of adversarial examples are “often studied” as claimed, there should be other / more established references to cite here. + +In short, this work addresses the important problem of better understanding adversarial examples, but the simple setup has a higher burden to establish significance, which this paper as written has not met. + +",4,4.0,ICLR2018 +i0U03TKMV3,1,aLtty4sUo0o,aLtty4sUo0o,Recommendation to Accept,"This paper studies the quickest change detection for Markovian data, when both the parameters of pre- and post-change distributions are unknown. The main contribution is a scalable algorithm that sequentially estimates the unknown parameters and plug-in to classical detection schemes to get the stopping rule. A notable feature is that this is a joint estimation and detection framework. And the authors incorporate several tools, like SGD, annealing, penalization, into the detection task, which turns out to have good performance compared with existing benchmarks. + +Overall, this paper is clearly-written and well-organized, and the numerical examples support the claims made in the paper. + +Minor comments: +1. Usually in classical change-point detection literature, people assume the pre-change distribution is known since it can be estimated from historical (nominal) data, and the framework proposed in this paper can obviously be applied in such a setting as well. Therefore, I think it might be interesting to add one comparison in such setting (i.e., only post-change parameters is unknown and need to estimated). In such a case, the GLR and adaptive methods do not need to learn theta_0 offline and we can have a fair comparison of the performance of learning post-change parameters and also the detection delay. + +2. In Appendix A.1, the introduction to SHIRYAEV Algorithm, it seems that there is a missing \rho in the denominator of the statistics S. The reason is that only under this \rho-scaled version of likelihood-ratio can the recursion in A.1 holds. + +--------- After rebuttal --------- +Thanks to the authors for the response and updated paper. I keep my original score and recommend acceptance for this paper.",7,4.0,ICLR2021 +btT3XtCro3O,2,gtwVBChN8td,gtwVBChN8td,Review,"The authors propose a deterministic policy-gradient algorithm that extends the TD3 algorithm (Fujimoto 2018). The main claim is that it reduces overestimation issues in a more effective way. Two Q-critics are maintained with separate parameters, but updated using the same transitions. Then a convex combination of these critics is used in the deterministic policy gradient update. The mixture parameter is learned on a slower time-scale to minimize this convex combination over states (instead of taking the minimum of the 2 critics per batch as in TD3). Another contribution in the paper is the “Unbiased” variant of the algorithm (UAD3), which addresses the off-policy nature of the replay mechanism of the AD3 algorithm described above. 
My understanding is that this is simply a version of the algorithm that does not use any replay mechanism and samples the state iid from the on-policy distribution, so it isn’t a novel idea in itself. + +There are two theorems given to justify the algorithm choices, but I want to question their validity. The first one says that AD3 converges asymptotically, but no formal statement of what this means is given and the proof for it in the appendix only states broad facts about stochastic approximation, but nothing specific that applies to the AD3 algorithm. Theorem 2 is misleading in another way, it says that AD3 has “the property of asymptotical expected policy improvement”, but it only really says that the critic value will be increasing, not that the actual policy value is increasing (and so an actual policy improvement step). Moreover, the proof contains some approximation steps which are not justified. + +The approach is tested in two simple continuous control environments (maze + reacher task). (Are these using the full state as input?). There the proposed approaches perform better it seems than the baselines (TD3 and DDPG), but there isn’t any analysis to understand whether that was due to better critics - why not plot the estimated and true returns during learning to see whether AD3 indeed does better than the other critic update strategies? The experimental section is missing details to make these results reproducible and interpretable, for example what network architecture was used for the policy and critic? All learning curves have rather strange oscillation patterns. Is that an artifact of the smoothing used? How many seeds were used to obtain each learning curve? + +At the moment, the advantages of the proposed approach are neither demonstrated theoretically or empirically in a satisfactory way (see comments above). At least one of these aspects need to be improved significantly before the paper is ready for acceptance. + +Other comments: + +The objective to decide how to mix the critics could be better motivated. Making sure that the critic increase may not be the best criterion to ensure stability and reduce over/underestimation for example. + +Minor things: + +* “In actor-critic methods, the policy is deterministic…” Actor-critic methods are actually more commonly described in a stochastic policy setting I believe. This work seems to be based on the deterministic policy gradient paper which is not cited. +* Citation style should use parentheses around names in most cases in the text. +* Gattami 2019 is not the right reference for the Bellman equations and derived RL updates. +* Lots of equations are redundant and add unnecessary complexity. For example Eq 3 and Eq 4 are the same equations with different parameters, so why not call the two parameters w_1 and w_2 and write the equation once with w_i for i \in {1,2} +* The hat on the \lambda in Eq 6,7 are hard to see because they are attached to the left bracket. +* adopts -> adopt +* Transition slots -> tuples? +* Minus distance/reward -> negative +",3,4.0,ICLR2021 +S1ez9A-QKH,1,rJguRyBYvr,rJguRyBYvr,Official Blind Review #3,"After rebuttal: my rating remains the same. +I have read other reviewers' comments and the response. Overall, the contribution of retraining and detection with previously explored kernel density is limited. + +================= +Summary: +This paper proposes new regularization techniques to train DNNs, which after training, make the crafted adversarial examples more detectable. 
The general idea is to minimize the inter-class variance and maximize the intra-class distance, at some feature layer. This involves regularization terms: 1) SiameseLoss, an existing idea of contrastive learning known can increase inter-class margin; 2) reduce variance loss (RVL), a variance term on deep features, and 3) reverse cross entropy (RCE), a previously proposed term for detection purpose. The motivation behind seems intuitive and the empirical results demonstrate moderate improve in detection AUC, compared to one existing technique (e.g RCE). + +My concerns: +1. The proposed technique requires retraining the networks to get a few percents of detection improvement. This is a disadvantage compared to standard detection approaches such as [1] and [2] which do not need to retain the network. I am surprised that these standard detection methods were not even mentioned at all. Retraining with fixed loss becomes problematic when the networks have to be trained using their own loss functions due to application-specific reasons. Moreover, the detection performance reported in this paper is not better than the one reported in [2] (ResNet, CIFAR-10, 95.84%) which do not need retraining. + +2. There are already well-known margin-based loss functions, such as triplet loss [4], center loss [5], large-Margin softmax loss [6], and many others, which are not mentioned at all. + +3. In terms of retraining-based detection, higher AUCs have been reported in [3] for a neural fingerprinting method. + +4. Incorrect references to existing works. The second sentence in Intro paragraph 2: Metzen, et al, .... these are not adversarial training. Xu, et al. (feature squeezing) is not a randomization technique. + +5. The ""baseline"" method reported in Table 2, is confusing. RCE is also a baseline? You mean conventional cross entropy (CE) training? + +6. Some of the norms are not properly defined, which can be confusing in adversarial research. For example, from Equation (1) to (4). The ""Frobenius norm used here"" statement in Equation (3), don't know this F norm comes from. + + +[1] Characterizing adversarial subspaces using local intrinsic dimensionality. ICLR, 2018 +[2] A simple unified framework for detecting out-of-distribution samples and adversarial attacks. NeurIPS, 2018 +[3] Detecting Adversarial Examples via Neural Fingerprinting. arXiv preprint arXiv:1803.03870, 2018 +[4] Facenet: A unified embedding for face recognition and clustering. CVPR, 2015. +[5] A Discriminative Feature Learning Approach for Deep Face Recognition. ECCV, 2016. +[6] Large-Margin Softmax Loss for Convolutional Neural Networks. ICML 2016.",3,,ICLR2020 +rJeoQlwTKB,2,rkltE0VKwH,rkltE0VKwH,Official Blind Review #2,"Overall I like the approach in the paper. It proposes a nice 2 pronged method for exploiting exploration via intrinsic rewards for multi-agent systems. The parts that a bit lacking with the current version of the paper in this are the evaluation tasks are few and a bit simple and I think there needs to be more discussion on the ""coverage"" of the intrinsic reward types. Are the ones proposed motivated by the tasks in the paper or are they sufficient for tasks in general? Last using a more recent novelty metric could allow the method to work on more interesting/complex tasks. + +More detailed feedback: +- It would be good to include more learning curves in the main text for the paper. 
+- The fact that applying intrinsic motivation to multi-agent simulations seems like a natural idea would be to convert the problem to a ""single"" agent problem to compare against the ""normal"" application of intrinsic rewards. This might be another baseline to consider for comparison. +- It says that all agents share the same replay buffer. Does this also imply that every agent is performing the same task there are just many agents? This does not make the problem very multi-agent with different goals. Would it affect the algorithm significantly to work on an environment where the agents have various types of goals? +- As is noted in the text, this method appears to work well in the centralized training scheme that many have adopted recently. However, It makes me wonder if there is a way to employ these exploration schemes in a non-centralized training form. The ability to ask other agents in the world about there preferences and novelty of states appears to be a strong assumption, especially in a multi-agent robotics problem. +- While the authors note that the intrinsic rewards used in this work are not comprehensive it would be good to note how comprehensive they are. Are there a few that were left out on purpose. Do the authours believe this set is sufficient. This statement makes it seem like the authors just tried a few options and found one that worked. It would be good to expand on this discussion more. +- More detail for Figure 1 would be helpful to understand the overall network design. While that figure it helpful maybe it would be good to include a version that goes into detail for the 2 agent environment. Then a more compressed n agent version can also be shown. +- The paper describes a policy selector that is a type of high-level policy for HRL. This design seems rather unique in that this part of the policy can optimizing for which intrinsic reward to toggle based on the extrinsic rewards observed. I like it. It is noted that entropy is important for this design. Can this be analyzed in an empirical way? Is this true for most environments/tasks? +- Task 2 seems a bit contrived. Is there another instance of this type of task elsewhere in another paper? It would be better to use more standard tasks if they are available. +- Before section 6.1 the paper is discussing rewards the are received. It would be good to more explicit about where these rewards are coming from. I think it is meant that these rewards are the extrinsic rewards but it does not say. +- As noted just before section 6.1 it seems for the collection of tasks 1-3 it is already obvious what types of intrinsic rewards should be used. It would be good to include more tasks where this decision is less obvious. +- Why are there ""black holes"" in the environment? Also if an agent steps into a black hole they are crushed never to be seen again. What you describe sounds more like a wormhole where one end is non-stationary... Also, can the agents detect the presence of a black hole in some way? +- It appears the novel metric is count based. While this can work in practice it seems a rather simple metric. Is it possible to use something more like ICM or RND that was referenced in the paper? Especially for the VizDoom environment? +- In table 2 where are some of the numbers bold? It would be good to include this information in the caption for the table. +- I am not sure if the discussion on the behaviours the intrinsic reward functions result in are very surprising. 
Maybe there is a more interesting behaviour that results from the combination of two intrinsic rewards? +",3,,ICLR2020 +Byl6BILU27,1,SyeQFiCcF7,SyeQFiCcF7,"Good direction for research on capsules, but results too weak and idea too incremental","This paper presents an extension of Capsule Networks, Siamese Capsule Networks (SCNs), that can be applied to the problem of face verification. Results are reported on the small AT&T dataset and the LFW dataset. + +I like the direction that this paper is taking. The original Capsules work has been looking at fairly simple and small scale datasets, and the natural next step for this approach is to start addressing harder datasets, LFW being one of them. Also face verification is a natural problem to look at with Capsules. + +However, I think this paper currently falls short of what I would expect from an ICLR paper. First, the results are not particularly impressive. Indeed, SCN doesn't outperform AlexNet on LFW (the most interesting dataset in the experiments). Also, I'm personally not particularly compelled by the use of the contrastive loss as the measure of performance, as it is sensitive to the scaling of the particular representation f(x) used to compute distances. Looking at accuracy (as in other face verification papers, such as DeepFace) for instance would have been more appropriate, in my opinion. I'm also worried about how hyper-parameters were selected. There are A LOT of hyper-parameters involved (loss function hyper-parameters, architecture hyper-parameters, optimizer hyper-parameters) and not much is said about how these were chosen. It is mentioned that cross validation was used to select some margin hyper-parameters, but results in Table 1 are also cross-validation results, which makes me wonder whether hyper-parameters were tuned on the performance reported in Table 1 (which of course would be biased). + +The paper is also pretty hard to read. I recognize that there is a lot of complicated literature to cover (e.g. prior work on Capsule Networks has introduced variations on various aspects which are each complicated to describe). But as it currently reads, I can honestly say that I'm not 100% sure what exactly was implemented, i.e. which components of previous Capsule Networks were actually used in the experiments and which weren't. For example, I wasn't able to figure out which routing mechanism was used in this paper. The paper would strongly benefit from more explicitly laying out the exact definition of SCN, perhaps at the expense of enumerating all the other variants of capsules and losses that previous work has used. + +Finally, regardless of the clarify of the paper, the novelty in extending Capsule Networks to a siamese architecture is arguably pretty incremental. This wouldn't be too much of a problem if the experimental results were strong, but unfortunately it isn't the case. + +In summary: + +Pros +- New extension of Capsule Networks, tackling a more challenging problem than previous work + +Cons +- Novelty is incremental +- Paper lacks clarity and is hard to read +- Results are underwhelming + +For these reasons, I'm afraid I can't recommend this paper be accepted. 
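One more remark on evaluation, with a small illustration (my own code, not from the paper): the value of the contrastive loss depends on the overall scale of the embedding, so reporting it as the headline metric is hard to interpret across models.

    import torch

    def contrastive_loss(z1, z2, same, margin=1.0):
        d = (z1 - z2).norm(dim=1)
        return (same * d.pow(2) + (1 - same) * (margin - d).clamp(min=0).pow(2)).mean()

    z1, z2 = torch.randn(64, 16), torch.randn(64, 16)
    same = torch.randint(0, 2, (64,)).float()
    print(contrastive_loss(z1, z2, same).item())
    print(contrastive_loss(0.5 * z1, 0.5 * z2, same).item())   # same pairs, rescaled embedding, different loss

Verification accuracy at a fixed operating point, as in the usual face-verification protocols, would make the numbers much easier to compare.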
+ +Finally, I've noted the following typos: +- hinton1985shape => use proper reference +- within in => within +- that represent => that represents +- a Iterated => an Iterated +- is got => is obtained +- followed two => followed by two +- enocded => encoded +- a a pair => a pair +- such that to => such as to +- there 1680 subjects => there are 1680 subjects +- of varied amount => of the varied amount +- are used many => are used in many +- across the paper: lots of in-text references should be in parenthesis + +",3,4.0,ICLR2019 +yP-fM74kyAS,2,AWOSz_mMAPx,AWOSz_mMAPx,Part of the main result might already be known (concern clarified by revision),"The main result of the paper states that a strict local minmax point is a stable critical point of t-GDA for some large enough t, and that any non-strict local minmax can be made unstable by s-GDA if we choose s large enough. + + +-Major issues + +My greatest concern is that the first part of the main result, that a strict local minmax is stable for t-GDA with all large, but finite t, is already known (Jin et al. 2020). Specifically, the proof of Lemma 40 in (Jin et al. 2020) shows that for all large enough finite t, the Jacobian of t-GDA only has eigenvalues whose real part is smaller than 0, which then implies the stability of t-GDA for a finite t. + +From what I can see, the reason why Jin et al. 2020 stated their results in terms of infinite timescale separation is because they did not have a uniform bound on how large the timescale t should be, and therefore in general it can be made as large as possible (but finite). The proof in the current submission has exactly the same feature: for every game, there is a finite t that makes t-GDA stable, but in general this t can be made arbitrarily large. It seems to me that the authors have not qualitatively improved over (Jin et al. 2020) (although I believe the bounds in this paper are tighter). + +In the same vein, the converse statement also more or less appeared in (Jin et al. 2020); see the proof of Theorem 28, p.24-25 therein. + +Due to the above, I cannot see the claimed novelty of providing the first finite timescale separation for GDA, hence my rating. I'm willing to change my score should the authors convince me that I have misunderstand something. + + +-Minor issues +1. The authors claimed that ""On the empirical side, it has been widely demonstrated that timescale separation in gradient descent-ascent is crucial to improving the solution quality when training generative adversarial networks."" I believe this is an overstatement of what we currently know about GANs; see + +https://arxiv.org/pdf/1711.10337.pdf + +for a comprehensive empirical study of the effects on the timescale for GDA, which is not as conclusive as the authors stated. I would therefore suggest to tone down the sentence. + +2. Appendix B.3 is quite weird. F(x_k) here should be a vector-valued mapping but the authors seem to view it as a function. Also, by F(k) = O(M^k), did the authors mean ||x_{k+1} − x*|| = O(M^k)? + + +---- + +Post-revision evaluation: + +The authors have modified the statements of the main theorems as well as including a more detailed comparison to previous works, which clarifies my concern. I have thus increased my score. + +The technical contributions bring new insight into the studying of scale separation of GDA, and enables a tight characterization of many toy examples. I believe these are solid contributions and should be valued. 
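For reference, the update rule that the analysis concerns is, as I read it, plain gradient descent-ascent with a timescale ratio tau; a minimal sketch in my own notation (a toy payoff, not the paper's setting):

    import torch

    def tau_gda_step(x, y, f, lr=0.01, tau=10.0):
        # x: minimizing player, y: maximizing player, f: scalar payoff f(x, y)
        gx, gy = torch.autograd.grad(f(x, y), (x, y))
        with torch.no_grad():
            x_new = (x - lr * gx).requires_grad_(True)        # descent at the base learning rate
            y_new = (y + tau * lr * gy).requires_grad_(True)  # ascent, tau times faster
        return x_new, y_new

    f = lambda x, y: x * y - 0.1 * y ** 2
    x = torch.tensor(1.0, requires_grad=True)
    y = torch.tensor(1.0, requires_grad=True)
    for _ in range(200):
        x, y = tau_gda_step(x, y, f)

This pair of updates is the object that all of the stability results refer to.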
+ +On the down side, I'd like to point out that the ""practical implication"" in this paper is a bit of stretch since the ImageNet experiments are run with RMSprop, whereas the analysis of this paper is highly specialized to GDA. + +Of course, studying adaptive algorithms in min-max games is exceedingly hard and well beyond the scope of this paper. What I recommend the authors is then: + +1. Explicitly notify the readers of the difference between RMSprop and GDA. + +2. Find a nontrivial but simple example where 1-GDA provides an okay baseline (say 7-layer CNNs for mixture of Gaussians or MNIST). Increase the time scale to show if it exhibits a similar behavior that a small $\tau$ gives the best result. This is directly verifying what the theory is saying, and hence feels more valuable to me. + +",6,4.0,ICLR2021 +FV8mgwcEn_,2,#NAME?,#NAME?,Review,"`The paper argues that the existing way of using Translation Memory (TM) in neural machine translation (NMT) is sub-optimal. Therefore it proposes TMG-NMT, namely Translation Memory Guided NMT, which consists of two parts, a universal memory encoder and a TM guided decoder. Experiments are performed to demonstrate that their method can significantly improve the translation quality and show strong adaptation for a new domain. + +Pros: None + +Cons: + +1. The main concern of this paper is that, the contributions are quite limited. The authors claimed three contributions: n-gram retrieval, universal encoder, and using the copy mechanism. Basically, none of them is novel. + - Bapna & Firat (2019) and Xu et al. (2020) have used n-gram matching for retrieval. + - In late 2020, what is the novelty of using a multilingual BERT when encoding source sentences and retrieved TM sentences? Very little if any. + - Likewise, using the copy mechanism to tackle rare word problems seems a regular approach. + + Overall, I don't see any of these so-called contributions are truly technically original. This paper seems a very hurry combination of some existing techniques. I basically learn nothing new from reading this submission. + +2. Another one of the key concerns about the paper is the lack of rigorous experimentation to study the usefulness of the proposed method. Despite the paper stating that there have been earlier work (Gu et al. 2017, Can & Xiong. 2018, Xia et al. 2019) that explore Translation Memory in NMT, the paper does not compare with them and only compare to non-TM-guided baselines, making the improvement less convincing. In addition, what is the language pair evaluated in this paper, which was not even mentioned... + +3. Considering the limited results, a deeper analysis of the proposed method would have been nice. Is the semantic relationship between the source sentence and TM sentences well learned in TMG-NMT? What kind of translation error can be well addressed with the help of TM? Further analysis of the proposed model would provide greater insight to the community. + +4. Section 5.4: The results would have been more complete if another setting is considered where the transformer is adapted to the target domain without using the TM mechanism, such as fine-tuning the vanilla transformer on the provided TM parallel sentences. In this way, the adaptation ability of TMG-NMT could be better proven. + +5. To be honest, the writing of the current version seems a disaster. Not to mention the impressive amount of grammar errors, many parts of the storytelling are logically incoherent. For example, + - ""Although Bapna & Firat (2019) and Xu et al. 
(2020) **also** use the n-gram method to search, they still need to select the corresponding sentence that maximizes the n-gram similarity with the source sentence. "" - The usage of ""also"" is so weird, when you didn't even mention you are using n-gram method in advance... And by the way, what is the difference between your ngram matching and theirs? You should've made it clear. + - ""To obtain sufficient training corpus and train network parameters more fully and effectively, in this paper, we also modify the retrieval algorithm and use a pre-trained language model (PLM) to initialize the encoder’s parameters. Partially inspired by phrase-based SMT, we don’t compute the sentence level similarity score between two sentences in our retrieved method. If two sentences have a common n-gram segment, we assume that they are similar, and the sentence pairs of the TM database can provide a useful segment to help improve the translation quality. Currently, many studies have proven that PLM can offer valuable prior knowledge to enhance the translation performance of NMT (Weng et al., 2020; Song et al., 2019), So we also employ PLM to initialize the parameters of the encoder and give encoder well-trained parameters as a starting point."" - What is the logical relationship b/w the first sentence and the second one? + - Please check minors for more details. + + +***** +Minors: +1) It would have been nice to see that the format of the reference are unified. +2) Khandelwal et al, 2020 [1] propose a novel way to incorporate Translation Memory into NMT which may bring you more thoughts towards using TM. + +Typos: too many. +1. Equation (2): a redundant close paren +2. Section 3.1 penultimate paragraph: l-th encoder layer -> L-th encoder layer +3. Section 4.2 First paragraph: N->M n->m + +Grammar errors: + +Too many. E.g., In the contribution part in the intro, the second and the third items start with ""does"" and ""apply"". What are the subj of these two verbs? + +Please try to properly use Grammarly to check your writing before submission. + + +[1] ""Nearest Neighbor Machine Translation. Khandelwal et al. 2020 arXiv."" + + +****** +Reasons for score: +The novelty of this paper is basically none. The experimental results are limited and the comparison with prior work is none, which cannot fully demonstrate the effectiveness of the proposed method. +",2,5.0,ICLR2021 +BJGxx7DVl,3,B1ZXuTolx,B1ZXuTolx,Incremental with too little empirical evidence and insufficiently developed info-theoretic argument.,"The paper proposes a modified DAE objective where it is the mapped representation of the corrupted input that is pushed closer to the representation of the uncorrupted input. This thus borrows from both denoising (DAE) for the stochasticity and from the contractive (CAE) auto-encoders objectives (which the paper doesn’t compare to) for the representational closeness, and as such appears rather incremental. In common with the CAE, a collapse of the representation can only be avoided by additional external constraints, such as tied weights, batch normalization or other normalization heuristics. While I appreciates that the authors added a paragraph discussing this point and the usual remediations after I had raised it in an earlier question, I think it would deserve a proper formal treatment. Note that such external constraints do not seem to arise from the information-theoretic formalism as articulated by the authors. 
This casts doubt regarding the validity or completeness of the proposed formal motivation as currently exposed. What the extra regularization does from an information-theoretic perspective remains unclearly articulated (e.g. interpretation of lambda strength?). + +On the experimental front, empirical support for the approach is very weak: few experiments on synthetic and small scale data. The modified DAE's test errors on MNIST are larger than those of Original DAE all the time expect for one precise setting of lambda, and then the original DAE performance is still within the displayed error-bar of the modified DAE. So, it is unclear whether the improvement is actually statistically significant. +",4,5.0,ICLR2017 +ryeI_q3v3Q,1,B1GMDsR5tm,B1GMDsR5tm,review,"This paper presents an improvement on the local/derivative-free learning algorithm equilibrium propagation. Specifically, it trains a feedforward network to initialize the iterative optimization process in equilibrium prop, leading to greater stability and computational efficiency, and providing a network that can later be used for fast feedforward predictions on test data. Non-local gradient terms are dropped when training the feedforward network, so that the entire system still doesn't require backprop. There is a neat theoretical result showing that, in the neighborhood of the optimum, the dropped non-local gradient terms will be correlated with the retained gradient terms. + +My biggest concern with this paper is the lack of significant literature review, and that it is not placed in the context of previous work. There are only 12 references, 5 of which come from a single lab, and almost all of which are to extremely recent papers. Before acceptance, I would ask the authors to perform a literature search, update their paper to include citations to and discussion of previous work, and better motivate the novelty of their paper relative to previous work. Luckily, this is a concern that is addressable during the rebuttal process! If the authors perform a literature search, and update their paper appropriately, I will raise my score as high as 7. + +Here are a few related topic areas which are currently not discussed in the paper. *I am including these as a starting point only! It is your job to do a careful literature search. I am completely sure there are obvious connections I'm missing, but these should provide some entry points into the citation web.* +- The ""method of auxiliary coordinates"" introduces soft (often quadratic) couplings between post- and pre- activations in adjacent layers which, like your distributed quadratic penalty, eliminate backprop across the couplings. I believe researchers have also done similar things with augmented Lagrangian methods. A similar layer-local quadratic penalty also appears in ladder networks. +- Positive/negative phase (clamped / unclamped phase) training is ubiquitous in energy based models. Note though that it isn't used in classical Hopfield networks. You might want to include references to other work in energy based models for both this and other reasons. e.g., there may be some similarities between this approach and continuous-valued Boltzmann machines? +- In addition to feedback alignment, there are other approaches to training deep neural networks without standard backprop. examples include: synthetic gradients, meta-learned local update rules, direct feedback alignment, deep Boltzmann machines, ... 
+- There is extensive literature on biologically plausible learning rules -- it is a field of study in its own right. As the paper is motivated in terms of biological plausibility, it would be good to include more general context on the different approaches taken to biological plausibility. + +More detailed comments follow: + +Thank you for including the glossary of symbols! + +""Continuous Hopfield Network"" use lowercase for this (unless introducing acronym) + +""is the set non-input"" -> ""is the set of non-input"" + +""$\alpha = ...$ ... $\alpha_j \subset ...$"" I could not make sense of the set notation here. + +would recommend using something other than rho for nonlinearity. rho is rarely used as a function, so the prior of many readers will be to interpret this as a scalar. phi( ) or f( ) or h( ) are often used as NN nonlinearities. + +inline equation after ""clamping factor"" -- believe this should just be C, rather than \partial C / \partial s. +Move definition of \mathcal O up to where the symbol is first used. + +text before eq. 7 -- why train to approximate s- rather than s+? It seems like s+ would lead to higher accuracy when this is eventually used for inference. + +eq. 10 -- doesn't the regularization term also decrease the expressivity of the Hopfield network? e.g. it can no longer engage in ""explaining away"" or enforce top-down consistency, both of which are powerful positive attributes of iterative estimation procedures. + +notation nit: it's confusing to use a dot to indicate matrix multiplication. It is commonly used in ML to indicate an inner product between two vectors of the same shape/orientation. Typically matrix multiplication is implied whenever an operator isn't specified (eg x w_1 is matrix multiplication). + +eq. 12 -- is f' supposed to be h'? And wasn't the nonlinearity earlier introduced as rho? Should settle on one symbol for the nonlinearity. + +This result is very cool. It only holds in the neighborhood of the optimum though. At initialization, I believe the expected correlation is zero by symmetry arguments (eg, d L_2 / d s_2 is equally likely to have either sign). Should include an explicit discussion of when this relationship is expected to hold. + +""proportional to"" -> ""correlated with"" (it's not proportional to) + +sec. 3 -- describe nonlinearity as ""hard sigmoid"" + +beta is drawn from uniform distribution including negative numbers? beta was earlier defined to be positive only. + +Figure 2 -- how does the final achieved test error change with the number of negative-phase steps? ie, is the final classification test error better even for init eq prop in the bottom row than it is in the top? + +The idea of initializing an iterative settling process with a forward pass goes back much farther than this. A couple contexts being deep Boltzmann machines, and the use of variational inference to initialize Monte Carlo chains + +sect 4.3 -- ""the the"" -> ""to the""",7,5.0,ICLR2019 +rJx5HLM-oB,3,HkxnclHKDr,HkxnclHKDr,Official Blind Review #4,"Overview: + +The paper tackles the representation learning problem where the aim is to learn a generic representation that is useful for a variety of downstream tasks. A two-level optimization framework is proposed: an inner optimization over the specific problem-at-hand, and an outer optimization over other similar problems. The problem is studied in two settings of the imitation learning framework with the additional aim of providing mathematical guarantees in terms of sample efficiency on new tasks. 
An extensive theoretical analysis is performed, and some preliminary empirical results are presented. + +Decision: + +In its current form, the paper should be rejected because (1) the empirical analysis is incomplete – the baseline isn't very appropriate, the results are not conclusive, details are scattered or not included, (2) the literature survey does not connect the proposed approach with existing approaches, and does not convince the reader why all the existing approaches have not been compared against empirically, (3) the paper is generally unpolished and needs more work before being considered for acceptance. + +Details: + +The paper makes both theoretical and empirical claims. I did not have the time to thoroughly verify the theoretical claims and took them at face value. I consider the theoretical guarantees associated with the proposed approach a welcome and valuable contribution to this field that has recently been relying primarily on limited empirical work to assess any method. + +The empirical results presented in the paper do not sufficiently support the claims of sample efficiency. One of the main issues with the empirical analysis is the choice of the baseline, which learns a policy from scratch. This does not help make conclusions about the sample efficiency of the proposed method on new tasks. A better baseline would be one that learns some representation from the T previous tasks, which would help infer if the proposed method to learn representations is actually more sample efficient on new tasks or not. There is also no comparison with existing approaches that are mentioned in the Related Work section. If those aren’t appropriate baselines for this problem, a small explanation of the reasons why would help readers understand why they haven’t been compared against. Additionally, an analysis of statistical significance of the results is missing and would significantly help in gauging the efficacy of the proposed approach. + +The paper notes that these are some preliminary experiments. The completion of the empirical analysis would definitely make a stronger case for this paper to be accepted. + +Minor comments to improve the paper: + +- Error bars in the plot, specification of number of runs, and other such experimental details would be very helpful in interpreting the results. +- It would help a reader if the paper was more self-contained, e.g., if terms like supp(\eta), \bar{s}, \tilde{s} are defined more clearly. +- It would also help to say what the proofs intuitively mean, e.g., for a new task drawn from this particular distribution of tasks, the agent would achieve close-to-X performance within Y samples – something along those lines. +- There are some typos, e.g., 'possibility'->'possibly' on page 1, missing $H$ in specification of MDPs on page 2, 'exiting'->'exciting' on page 8, some latex symbols in Appendix D, etc. +- The bibliography has a lot of issues – some references are incorrectly parsed (e.g., Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. 03 2017), others are inconsistent (e.g., ""In NIPS"" and ""In Advances in Neural…”; the arXiv ones). + + ",3,,ICLR2020 +RELWFkTuRLr,3,Srmggo3b3X6,Srmggo3b3X6,Innovative paper with clear presentation,"The present paper aims to understand the generalization capability of self-supervised learning algorithms that fine-tune a simple linear classifier to the labels. 
Analyzing generalization in this case is challenging due to a data re-use problem: the same training data that is used for self-supervised learning is also used to fit the labels. The paper addresses this issue by implicitly conditioning on the training covariates x and then deriving generalization bounds that depend only on (hypothetical) noise to the labels y. The paper show that, empirically, the dominant factor in generalization error is a certain quantity called the ""memorization gap"", which can also be upper-bounded via theoretical analysis (the theoretical bound seems to be loose by about a factor of 4 compared to the empirical measurement, but is still non-vacuous in many cases). Interestingly, this is *not* the case for standard supervised learning, likely to the higher-complexity models used to fit the labels; in that case the memorization gap is high, but a different gap (called the ""rationality gap"") is large in magnitude but negative. + +Overall, the paper is clearly presented, innovative, and has interesting empirical and theoretical results. It seems like a clear accept to me, with my only uncertainty that I am not completely familiar with the related literature. I am also not sure why the authors could not use the Rademacher complexity---are there theoretical obstacles to using it to upper-bound generalization error in this setting, or is the problem that it is too large? If the latter, then have you considered using your approach in settings other than just the self-supervised setting in order to improve on Rademacher complexity bounds? + +==== + +Framing comments / questions: + +-I don't like the word rationality, since it has a technical meaning in Bayesian statistics that is not the same as the usage here (i agree they are somewhat similar in flavor, but I think it's confusing to conflate them). + +-I'm not sure it's correct to say that SS+S is a dominant methodology. In practice we would almost always do full fine-tuning on the self-supervised representation, rather than just the final layer. Still, starting with final layer fine-tuning is a reasonable start for analysis. + +-It seems an important point of your analysis is that we can condition on x and then just look at label noise for measuring generalization. It seems like empirical Rademacher complexity bounds also condition on x, so is there a fundamental difference here? (I think you try to address this in Remark 3.3 but I didn't understand your point there.) + +====== + +A few presentation comments: + +-I didn't understand this claim: "" An optimal Bayesian procedure would have zero rationality gap, and indeed this gap is typically zero or small in practice."" + +-Drawing lines between the dots (and shading the area under the curve) in Figure 1 is inappropriate, since the different points don't follow a logical linear progression (it's really just a scatter plot). + +-In Fact I, why do we need to take the max with zero? The result is still true even without the max, I believe. + +-In Fact I, it would be helpful to comment on the effect of changing eta. Do we expect certain of these quantities to get bigger or smaller in that case? Any heuristic intuition for how to choose the best eta? + +-Section 2.1 is a bit dense. + +-I liked Figure 2 a lot.",7,4.0,ICLR2021 +Syxv3Mo2YS,1,BygMreSYPB,BygMreSYPB,Official Blind Review #2,"The paper proposes a new deep learning approach based on Takens’s theorem to identify the dynamics of partially observed chaotic systems. 
In particular, the method augments the state using the solution of an ODE. Experiments on Lorenze-64 dynamics and sea level anomaly demonstrate the advantage of the proposed method over state-of-the-art baselines. + ++ The unification of Taken’s embedding theorem and deep learning provides a novel perspective into dynamical systems ++ Impressive experiment results compared with baselines including RNN and latent ODE + +- The proposed method requires knowledge of the underlying dynamic model to solve the ODE, which is not fair for other methods +- The model is trained using data from the same initial conditions, which is essentially overfitting. The authors should provide experiments for dataset from different initial conditions. +- The writing is not very clear. For example, how to solve the optimization problem in Eqn (7), as the augmented states u_{t-1} are unknown? How to find the bijective mapping M for general dynamical systems? + +Minor: question mark in section 4 page 6. Figures 2 plots are difficult to read, pls provide more details in columns and rows.",3,,ICLR2020 +Byl2kX8c2Q,3,Syf9Q209YQ,Syf9Q209YQ,Interesting Approach with Good Results,"Review for MANIFOLD REGULARIZATION WITH GANS FOR SEMISUPERVISED LEARNING +Summary: +The paper proposed to incorporate a manifold regularization penalty to the GAN to adapt to semi-supervised learning. They approximate this penalty empirically by calculating stochastic finite differences of the generator’s latent variables. +The paper does a good job of motivating the additional regularization penalty and their approximation to it with a series of experiments and intuitive explanations. The experiment results are very through and overall promising. The paper is presented in a clear manner with only minor issues. +Novelty/Significance: +The authors’ add a manifold regularization penalty to GAN discriminator’s loss function. While this is a simple and seemingly obvious approach, it had to be done by someone. Thus while I don’t think their algorithm is super novel, it is significant and thus novel enough. Additionally, the authors’ use of gradients of the generator as an approximation for the manifold penalty is a clever. +Questions/Clarity: +It would be helpful to note in the description of Table 3 what is better (higher/lower). Also Table 3 seems to have standard deviations missing in Supervised DCGANs and Improved GAN for 4000 labels. And is there an explanation on why there isn’t an improvement in the FID score of SVHN for 1000 labels? +What is the first line of Table 4? Is it supposed to be combined with the second? If not, then it is missing results. And is the Pi model missing results or can it not be run on too few labels? If it can’t be run, it would be helpful to state this. +On page 11, “in Figure A2” the first word needs to be capitalized. +In Figure A1, why is there a dark point at one point in the inner circle? What makes the gradient super high there? +What are the differences of the 6 pictures in Figure A7? Iterations? +",7,4.0,ICLR2019 +dwG5d1LF4ou,3,zYmnBGOZtH,zYmnBGOZtH,Lacking in experimental evaluation,"The paper considers the problem of estimating instance-independent label noise. More formally, it is assumed that the true labels for any data point are modified based on a noise transition matrix, and the goal is to estimate this noise transition matrix. 
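To fix notation for what follows (my own illustration, not the authors' code): with C classes, instance-independent noise is a single C-by-C matrix Q with Q[i][j] = P(observed label j | true label i), applied identically to every example; symmetric noise of rate eps, for instance, looks like this:

    import numpy as np

    def symmetric_noise_matrix(num_classes, eps):
        q = np.full((num_classes, num_classes), eps / (num_classes - 1))
        np.fill_diagonal(q, 1.0 - eps)
        return q   # row i is the distribution of the noisy label given true class i

    q = symmetric_noise_matrix(10, 0.4)
    rng = np.random.default_rng(0)
    true_labels = rng.integers(0, 10, size=5)
    noisy_labels = [rng.choice(10, p=q[t]) for t in true_labels]

Recovering Q from noisy observations alone is the estimation problem at stake here.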
The paper proposed an information-theoretic approach for this task, the key idea behind which is to estimate if a particular dataset has maximum entropy with respect to the labels. This estimation problem is solved using a recent discovery that the training dynamics of a neural network can be used to infer the presence of label noise. + +Strengths: + +1. The problem of learning with instance-independent noise has received some attention from the community and could be of interest. +2. The proposed information-theoretic framework is conceptually interesting, and also comes with some theoretical guarantees. + +Weaknesses: + +1. The paper does not show that the approach leads to better downstream neural networks in the presence of instance-independent noise. The current experiments only show that the approach can find better transition matrices Q in terms of KL divergence. Since this is simple, synthetic noise model this is not very convincing. The paper needs a much more thorough experimental evaluation to demonstrate that the approach can improve the state-of-the-art on downstream learning tasks. The experiments are also only done on CIFAR-10, whereas most of the related work considers at least a few other datasets. +2. I also think that the setting needs to be motivated better. As can be seen in Table 1 and 2, MPEIA and GLC which need only 0.5% clean samples can actually do better than the proposed approach, which is this such a hard requirement? I also don’t find the ours-2 and ours-3 results convincing and a bit misleading, since we should then also consider combinations of other models. + +Overall, I am not in favor of acceptance because of the experimental evaluation not being convincing. However, the proposed algorithm is interesting and has potential, if the authors can build more on it in the future then it should make for a good paper. + +Other points:
 +1. The discussion of “anchor points” and “mixture models” is quite unclear in the introduction. +2. I found the discussion and notation in Section 2.3 to be a bit convoluted. I think there should be a much clearer description of the approach. + +--------Updates after author response-------- + +I thank the authors for the detailed response and appreciate the additional experiment. However, I still believe that evaluation on downstream tasks is essential to demonstrate the superiority of the approach, and I unfortunately do not agree with the authors that it implicit that the approach will yield better downstream networks. Therefore, I cannot raise my score, but would encourage additional experimentation. ",4,4.0,ICLR2021 +B1e8GSdatH,3,rJl05AVtwB,rJl05AVtwB,Official Blind Review #3,"The authors propose Chordal-GCN which is based on the chordal decomposition method post-ordered clique tree and propagates the features based on the order within each subgraph in order to reduce memory usage. The authors show that Chordal-GCN outperforms GCN [1] on all four datasets and argue that Chordal-GCN reduces memory usage. +The idea of using Chordal graphs to GCN is novel and interesting. However, my main concern lies in the experiment results. + +1) To my best knowledge, the proposed Chordal- match SOTA results on Cora, Citeceer, and Pubmed. However, since these datasets are small and easy to run, I would like to see the mean and standard deviation of the accuracy of all models you ran. Can you also provide the results of the commonly used ""random split setting""[1]? + +2) What is the epoch time of the Chordal-GCN? Can you also report it in Table 2? Without including the pre-processing time, we don't know the overall training time of the method. + +3) Given that the main concern is the memory usage, the authors should compare to a strong baseline, SGC [2], which is a linear classifier trained on top of propagated features with memory/space complexity O(d) when using mini-batch training. This is much smaller than the proposed method O(Lc_2d + Ld^2). +Also, SGC is at least two magnitudes faster to train (2.7s vs 0.987*410=367.8s + unknown pre-processing time) and more accurate (94.9 vs 94.2) than the proposed Chordal-GCN on the largest Reddit dataset. The authors emphasize that the proposed method is scalable. Please compare it to SGC in Table 2. +Nevertheless, there is some chance that the authors can apply the same method to SGC and speed it up further as long as the preprocessing time is relatively small. + +4) Based on Table 2, Cluster-GCN uses less memory and is more accurate and faster to train than Chordal-GCN. Can you justify why people should use the proposed method instead? + +5) There are some missing citations. These papers [3,4,5,6] achieved previous SOTA results and should be included in the Tables. + +References: +[1] Kipf and Welling: Semi-Supervised Classification with Graph Convolutional Networks (ICLR 2017) +[2] Wu et al.: Simplifying Graph Convolutional Networks (ICML 2019) +[3] Klicpera et al.: Predict then Propagate: Graph Neural Networks meet Personalized PageRank (ICLR 2019) +[4] Gao and Ji: Graph U-Nets (ICML 2019) +[5] Zhang et al.: GaAN: Gated Attention Networks for Learning on Large and Spatiotemporal Graphs (UAI 2018) +[6] Fey: Just Jump: Dynamic Neighborhood Aggregation in Graph Neural Networks (ICLR-W 2019)",3,,ICLR2020 +HJexYyOstB,1,BkgeQ1BYwS,BkgeQ1BYwS,Official Blind Review #1,"Update: I thank the authors for their response. 
I believe the paper has been improved by the additional baselines, number of seeds, clarifications to related work and qualitative analysis of the results. I have increased my score to 3 since I still have some concerns. I strongly believe the baselines should be tuned just as much as the proposed approach on the tasks used for evaluation. The baselines were not evaluated on the same environments in the original papers, so there is not much reason to believe those parameters are optimal for other tasks. Moreover, the current draft still lacks comparisons against stronger exploration methods such as Pseudo-Counts (Ostrovski et al. 2017, Bellemare et al. 2016) or Random Network Distillation (Burda et al 2018). + +Summary: +This paper proposes the use of a generative model to estimate a Bayesian uncertainty of the agent’s belief of the environment dynamics. They use draws from the generative model to approximate the posterior of the transition dynamics function. They use the uncertainty in the output of the dynamics model as intrinsic reward. + +Main Comments: + +I vote for rejecting this paper because I believe the experimental section has some design flaws, the choice of tasks used for evaluation is questionable, relevant baselines are missing, the intrinsic reward formulation requires more motivation, and overall the empirical results are not convincing (at least not for the scope that the paper sets out for in the introduction). + +While the authors motivate the use of the proposed intrinsic reward for learning to solve tasks in sparse reward environments, the experiments do not include Moreover, some of the tasks used for evaluation do not have very sparse reward (e.g. acrobot but potentially others too). Without understanding how this intrinsic reward helps to solve certain tasks, it is difficult to assess its effectiveness. While state coverage is important, the end goal is solving tasks and it would be useful to understand how this intrinsic reward affects learning when extrinsic reward is also used. Some types of intrinsic motivation can actually hurt performance when used in combination with extrinsic reward on certain tasks. + +I am not sure why the authors chose to not compare against VIME (https://arxiv.org/pdf/1605.09674.pdf) and NoisyNetworks (https://arxiv.org/pdf/1706.10295.pdf) which are quite powerful exploration methods and also quite strongly related to their our method (e.g. more so than ICM). + +Other Questions / Comments: + +1. You mention that you use the same hyperparameters for all models. How did you select the HPs to be used? I am concerned this leads to an unfair comparison given that different models may work better for different sets of HPs. A better approach would be to do HP searches for each model and select the best set for each. +2. Using only 3 seeds does not seem to be enough for robust conclusions. Some of your results are rather close +3. How did you derive equation (1)? Please provide more explanations, at least in the appendix. +4. Why is Figure 3 missing the other baselines: ICM & Disagreement? Please include for completeness +5. Please include the variance across the seeds in Figure 4 (b). +6. How is the percentage of the explored maze computed for Figure 5? Is that across the entire training or within one episode? What is the learned behavior of the agents? I believe a heatmap with state visitation would be useful to better understand how the learned behaviors differ within an episode. e.g. 
Within an episode, do the agents learn to go as far as possible from the initial location and then explore that “less explored” area or do they quasi-uniformly visit the states they’ve already seen during previous episodes? +7. In Figure 6 (b), there doesn’t seem to be a significant difference between your model and the MAX one. What happens if you train them for longer, does MAX achieve the same or even more exploration performance as your model? I’m concerned this small difference may be due to poor tuning of HPs for the baselines rather than algorithmic differences? +8. For the robotic hand experiments, can you provide some intuition about what the number of explored rotations means and how they relate to a good policy? What is the number of rotations needed to solve certain tasks? What kinds of rotations do they explore -- are some of them more useful than others for manipulating certain objects? This would add context and help readers understand what those numbers mean in practice in terms of behavior and relevance to learning good / optimal policies. +",3,,ICLR2020 +rklsxjcMpQ,3,Sye7qoC5FQ,Sye7qoC5FQ,Nice first try but needs improvement,"The topic of this paper is interesting; however, the significance of the work can be improved. I recommend that the authors test the vulnerability of node embeddings on various random graph models. Examples of random graph models include Erdos-Renyi, Stochastic Kronecker Graph, Configuration Model with power-law degree distribution, Barabasi-Albert, Watts-Strogatz, Hyperbolic Graphs, Block Two-level Erdos-Renyi, etc. That way we can learn what types of networks are more susceptible to attacks on random-walk based node embeddings and perhaps look into why some are more vulnerable than others.",5,5.0,ICLR2019 +r1pQ4-zNe,2,S1J0E-71l,S1J0E-71l,Misleading,"This paper proposes to use previous error signal of the output layer as an additional input to recurrent update function in order to enhance the modelling power of a dynamic system such as RNNs. + +-This paper makes an erroneous assumption: test label information is not given in most of the real world applications, except few applications. This means that the language modelling task, which is the only experiment of this paper, may not be the right task to test this approach. Also, comparing against the models that do not use test error signal at inference time is unfair. We cannot just say that the test label information is being observed, this only holds in online-prediction problems. + +-The experiment is only conducted on one dataset, reporting state-of-the-art result, but unfortunately this is not true. There are already more than four papers reporting better numbers than the one reported in this task, however the author did not cite them. I understand that this paper came before the other papers, but the manuscript should be updated before the final decision. + +-The model size is still missing and without this information, it is hard to judge the contribution of the proposed trick. +",3,5.0,ICLR2017 +HyeZcWk9Kr,1,ByxT7TNFvH,ByxT7TNFvH,Official Blind Review #3,"This work proposes to leverage a pre-trained semantic segmentation network to learn semantically adaptive filters for self-supervised monocular depth estimation. Additionally, a simple two-stage training heuristic is proposed to improve depth estimation performance for dynamic objects that move in a way that induces small apparent motion and thus are projected to infinite depth values when used in an SfM-based supervision framework. 
Experimental results are shown on the KITTI benchmark, where the approach improves upon the state-of-the-art. + +Overview: + ++ Good results ++ Doesn't require semantic segmentation ground truth in the monodepth training set + +- Not clear if semantic segmentation is needed +- Specific to street scenes +- Experiments only on KITTI + +The qualitative results look great and the experiments show that semantic guidance improves quantitative performance by a non-trivial factor. The qualitative results suggest that the results produced with semantic guidance are sharper and more detailed. However, it is not clear that using features from a pre-trained semantic segmentation network is necessary. The proposed technical approach is to use the pixel-adaptive convolutions by Su et. al. to learn content-adaptive filters that are conditioned on the features of the pre-trained semantic segmentation network. These filters could in principle be directly learned from the input images, without needing to first train a semantic segmentation network. The original work by Su et. al. achieved higher detail compared to their baseline by just training the guidance network jointly. Alternatively, the guidance network could in principle be pre-trained for any other task. The main advantage of the proposed scheme is that the guidance path doesn't need to be trained together with the depth network. On the other hand, unless shown otherwise, we have to assume that the network needs to be pre-trained on some data that is sufficiently close to the indented application domain. This would limit the approach to situations where a reasonable pre-trained semantic segmentation network is available. + +The proposed heuristic to filter some dynamic objects is very specific to street scenes and to some degree even to the KITTI dataset. It requires a dominant ground plane and is only able to detect a small subset of dynamic motion (e.g. apparent motion close to zero and object below the horizon). It is also not clear what the actual impact of this procedure is. Section 5.4.2 mentions that Abs. Rel decreases from 0.121 to 0.119, but it is not clear to what this needs to be compared to as there is no baseline in any of the other tables with an Abs. Rel of 0.121. Additionally, while the authors call this a minor decrease, the order of magnitude is comparable to the decrease in error that this method shows over the state-of-the-art (which the authors call statistically significant) and also over the baselines (c.f. Table 2). Can the authors clarify this? + +Related to being specific to street scenes: The paper shows experiments only on the KITTI dataset. The apparent requirement to have a reasonable semantic segmentation model available, make it important to evaluate also in other settings (for example on an indoor dataset like NYU) to show that the approach works beyond street scenes (which is one of the in practice not so interesting settings for monocular depth estimation since it is rather easy to just equip cars with additional cameras to solve the depth estimation problem). + +Need for a reasonable segmentation model: It is not clear in how far the quality of the segmentation network impacts the quality for the depth estimation task. What about the domain shift where the segmentation model doesn't do so well? Even if the segmentation result is not used directly, the features will still shift. How much would depth performance suffer? 
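To make the dependence on the guidance features concrete, a minimal sketch of a pixel-adaptive convolution in the spirit of Su et al. is given below; the shapes, the Gaussian kernel on guidance-feature differences and the toy sizes are illustrative assumptions only, not the authors' implementation.

```python
# Illustrative sketch of a pixel-adaptive convolution: the learned spatial
# kernel W is shared across locations, while the per-location reweighting K
# depends on guidance features f (e.g., from a pretrained segmentation net).
import numpy as np

def pixel_adaptive_conv(v, f, W, sigma=1.0):
    # v: (H, Wi, C_in) input, f: (H, Wi, C_g) guidance,
    # W: (k, k, C_in, C_out) kernel -> output (H, Wi, C_out)
    H, Wi, _ = v.shape
    k = W.shape[0]
    r = k // 2
    v_pad = np.pad(v, ((r, r), (r, r), (0, 0)))
    f_pad = np.pad(f, ((r, r), (r, r), (0, 0)))
    out = np.zeros((H, Wi, W.shape[3]))
    for i in range(H):
        for j in range(Wi):
            patch_v = v_pad[i:i + k, j:j + k, :]
            diff = f_pad[i:i + k, j:j + k, :] - f[i, j]
            K = np.exp(-0.5 * np.sum(diff * diff, axis=-1) / sigma ** 2)
            out[i, j] = np.einsum('xy,xyc,xycd->d', K, patch_v, W)
    return out

# toy usage with made-up sizes
y = pixel_adaptive_conv(np.random.randn(8, 8, 4),
                        np.random.randn(8, 8, 16),
                        0.1 * np.random.randn(3, 3, 4, 8))
print(y.shape)  # (8, 8, 8)
```

The sketch makes the point above explicit: the filter weights themselves are not conditioned on the input image, only the per-location reweighting is, and that reweighting is driven entirely by the pretrained guidance features.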
+ +Summary: +While the results look good on a single dataset, I have doubts both about the generality of the proposed approach as well as the need for the specific technical contribution. + +=== Post rebuttal update === +The authors have addressed many of my initial concerns and provided valuable additional experimental evaluations. While I'd like to upgrade my recommendation to weak accept, I strongly encourage the authors to provide additional experiments on different datasets (at least NYU). ",6,,ICLR2020 +Hyg-cD3ycB,2,r1lHAAVtwr,r1lHAAVtwr,Official Blind Review #3,"This paper proposes a regularization strategy motivated with principles of hierarchical, hyperspherical and discrete metric learning. Through regularization of as designed in level-wise, group-wise with the hierarchy of network, in their experiments with classification dataset, better performance are achieved with various distance. + +Pros: +1: I think the paper is well organized and motivated, the regularization of parameters in deep neural network is one of the center problem for effective learning. +2: The proposed strategy is effective with their experiments, various datasets and objective metrics are adopted to validate the regularization. Combination ablation study is sufficient. + +Cons: +1: The paper is also related with several popular normalization strategies such as weight normalization/standardization, group/batch normalization. It would be more convincing that some comparison could be performed against these strategies. +2: There would be better to show its performance using larger dataset such as ImageNet or COCO detection. +",6,,ICLR2020 +rJgl-c_P_B,1,H1gax6VtDB,H1gax6VtDB,Official Blind Review #3,"The construction and learning of structured world models is an interesting area of research that could in principle enable better generalisation and interpretability for predictive models. The authors overcome the problem of using pixel-based losses (a common issue being reconstruction of small but potentially important objects) by using a contrastive latent space. The model otherwise makes use of a fixed number of object slots and a GNN transition model, similarly to prior approaches. The authors back up their method with nice results on 3D cubes and 3-body physics domains, and reasonable initial results on two Atari games, with ablations on the different components showing their contributions, so I would give this paper an accept. + +The comparisons to existing literature and related areas is very extensive, with interesting pointers to potential future work - particularly on the transition model and graph embeddings. As expected, the object-factorized action space appears to work well for generalisation, and could be extended/adapted, but setting a fixed number of objects K is a clearly fundamentally limiting hyperparameter, and so showing how the model performs under misspecification of this hyperparameter is useful to know for settings where this is known (2D shapes, 3D blocks, 3-body physics). The fact that K=1 is the best for Pong but K=5 is the best for Space Invaders raises at least two questions: can scaling K > 5 further improve performance on Space Invaders, and is it possible to make the model more robust to a greater-than-needed number of object slots? On a similar note, the data collection procedure for the Atari games seems to indicate that the model is quite sensitive to domains where actions rarely have an impact on the transition dynamics, or the interaction is more complex (e.g. 
other agents exist in the world) - coming up with a synthetic dataset where the importance of this can be quantified would again aid understanding of the authors' proposed method.",8,,ICLR2020 +xK-Kp1WtSsl,1,kvhzKz-_DMF,kvhzKz-_DMF,Recommendation to reject due to lack of novelty and exclusively empirical contributions. ,"**Short summary of the paper**: +The authors apply the Deep Variational Information Bottleneck (DVIB) to a NLP setting, using pretrained BERT +as a fixed part of the encoder and fine-tune subsequent MLP layers of the encoder as well as an MLP decoder. +The proposed architecture shows state-of-the-art results compared to other recent regularization methods, especially in low-resource and out-of-domain benchmarks. + +**Contributions**: +- Proposal of the use of DVIB with large-scale pretrained models such as BERT in a NLP setting (low significance) +- Extensive experiments showing higher generalization and robustness to bias compared to other SOTA regularization methods in NLI benchmarks and low-resource transfer learning (medium significance) + +**Pros**: +- The shown results show SOTA results in terms of generalization for a wide range of benchmarks with only marginal increase of model complexity (in terms of # of parameters & training-time). + + +**Cons**: +Limited novelty & incremental contribution: +- Although SOTA results are shown in very extensive experiments, the methodical contribution is rather marginal, +as it boils down to adding a pre-trained BERT to the encoder part of a DVIB. +- Aside from the pre-trained BERT part, no contributions or changes to a vanilla DVIB architecture were made. +- The novelty mainly stems from applying the DVIB to a new specific setting (""fine-tuning large-scale language models on low-resource scenarios""). + +**Style**: +Overall, the paper is well written and structured. + +**Experiments**: +In principal, the experimental setup seems well reasoned, comprehensible & extensive. +However, I'm rather concerned about the general concept of ""fine-tuning across random seeds"". +In my opinion the random seed should not be a tunable hyperparameter. + +**Minor Comments**: +I think for the effect of the Lagrange parameter on the losses (Figure 3), an IB curve plotting the two mutual information terms against each other for different betas would be more suitable. +",4,4.0,ICLR2021 +p9ETLFbbs76,4,pQq3oLH9UmL,pQq3oLH9UmL,"Review of ""Achieving Explainability in a Visual Hard Attention Model through Content Prediction""","This paper presents a visual hard-attention image classification model. The difference to standard classification methods such as CNN is that the model provides an explainable inner structure by default, that can be inspected to see what the model focused on. The difference to other state-or-the-art hard-attention models is that this model is differentiable, allowing for more robust and stable optimization. + +On a positive note, the presented method is sound and mathematically principled, and the description of it is complete and technically correct. The paper is also well written, well organized, and easy to read. The relevant related work is cited. + +However, the paper suffers from two major flaws. +Firstly, the contribution of the proposed method with respect to other recent hard-attention models based on reinforcement learning it is not well motivated - other than that this model is differentiable. 
The last paragraph in the Related Work provide no statement whatsoever as to what the present method contributes over the latest methods in the literature. +Secondly, the baseline hard-attention model in the experiments, (Mnih et al. 2014), is very old and it is not surprising that the proposed method outperforms it. A more interesting baseline would be a later hard-attention model such as (Elsayed et al. 2019). Moreover, the used datasets are all quite simplistic, and it would be more interesting with a more realistic one. + +Due to the above, the recommendation is Reject - but the authors are strongly encouraged to do experiments on more challenging data and compare to a newer baseline.",4,3.0,ICLR2021 +r1gHCrFlM,3,SyPMT6gAb,SyPMT6gAb,"Well written paper, good contribution by leveraging several diverse work but average/limited applicability","This paper studies off-policy learning in the bandit setting. It develops a new learning objective where the empirical risk is regularized by the squared Chi-2 divergence between the new and old policy. This objective is motivated by a bound on the empirical risk, where this divergence appears. The authors propose to solve this objective by using generative adversarial networks for variational divergence minimization (f-GAN). The algorithm is then evaluated on settings derived from supervised learning tasks and compared to other algorithms. + +I find the paper well written and clear. I like that the proposed method is both supported by theory and empirical results. + +Minor point: I do not really agree with the discussion on the impact of the stochasticity of the logging policy in section 5.6. Based on Figure 5 a and b, it seems that the learned policy is performing equally well no matter how stochastic the logging policy is. So I find it a bit misleading to suggest that the learned policy are not being improved when the logging policy is more deterministic. Rather, the gap reduces between the two policies because the logging policy gets better. In order to better showcase this mechanism, perhaps you could try using a logging policy that does not favor the best action. + +quality and clarity: +++ code made available ++ well written and clear +- The proof of theorem 2 is not in the paper nor appendix (the authors say it is similar to another work). + + +originality ++ good extension of the work by Swaminathan & Joachims (2015a): derivation of an alternative objective and use of a deep networks +. This paper leverages a set of diverse results + +significance +- The proposed method can only be applied if propensity scores were recorded when the data was generated. +- no test on a real setting +++ The proposed method is supported both by theoretical insights and empirical experiments. ++ empirical improvement with respect to previous methods + + +details/typos: + +3.1, p3: R^(h) has an indexed parenthesis +5.2; and we more details +5.3: so that results more comparable",7,3.0,ICLR2018 +HJlmbJG9hQ,1,S1G_cj05YQ,S1G_cj05YQ,"Simple approach, but limited novelty, and needs some improvement in exposition and benchmarking of related work","This paper proposes an approach to mitigate catastrophic forgetting in supervised learning by regularizing activations. 
The paper views previous techniques (EWC, SI, and GEM) under a multi-task learning lens, and then proposes an additional loss term to minimise the KL between activations from previous and current models, on previous tasks - this is based on a memory which stores some previous samples and their corresponding activations. + +I think it is a simple and intuitive approach and a well-written paper. Unfortunately I have a number of concerns that I think preclude publication in the current state. + +First, in terms of related work, I believe this is very similar to Learning without forgetting (LwF), with the difference that the KL-divergence is computed on samples kept from the previous tasks. This is briefly mentioned in the paper, but I think it needs to be made more explicit, and LwF should be a baseline in the experiments to clearly indicate the benefit of keeping this data. There is also a relationship to EWC: given the connection between the Fisher information and KL, it can be viewed as minimising the KL divergence in parameter space, rather than in activation space (which is the case here). Also note that EWC uses the true Fisher rather than the empirical, contrary to the derivation in equation (2). +There are also a number of papers that haven’t been cited in the related work [1][2][3][4]. + +Second, I think the motivation in Section 3.1 could be more convincing. Most importantly, it’s not clear to me that the decision boundary *shouldn’t* change for previously misclassified examples, as this could be an opportunity for backwards transfer. +Further, I don’t think the point in the last paragraph about having a small data portion is relevant, since they are from the same data distribution, and we would expect misclassified samples to be in the same (low) frequency in Fisher estimation as overall. I think the point of this paragraph is just that it is important to consider the entire predictive distribution of previous tasks rather than the probability of the correct class, so this should be stated more clearly and then justified. + +Finally, I think the experimental justification could be improved as well. Beyond permuted MNIST (which it has been argued is not as useful as other baselines [4]), only the final performance on split notMNIST / CIFAR-100 is reported. Some comments and questions: +- The accuracies of EWC (and possibly SI) in the table are worse than reported in previous work (eg. [1]), so I think this needs to be examined. +- What is the fine-tuning baseline (I don't believe it is actually clearly defined)? How can it be so low in figure 2a but better in 2b? +- I think plots over time (performance on all tasks) would be much more useful than the final performance in Table 2 and Fig 2. +- Errors and error bars would be beneficial for all results. +- Table 1 should also include the references provided. + +Some other comments and questions: +- Compared to eqn (2), eqn (6) is missing the ½ constant. +- Typos in section 5.3: ""SI performs better than SI"", and VAR instead of SAR. +- Section 2, unclear of meaning of ""coined with the likelihood"" (should this be “coincide”?) +- The first line should be “Humans have the ability to learn...” In general, I think the introduction could use another proofread for grammar and readability as I saw a few minor things. + +[1] Nguyen, Cuong V., et al. ""Variational Continual Learning."" ICLR, 2018. +[2] Schwarz, Jonathan, et al. ""Progress & Compress: A scalable framework for continual learning."" ICML, 2018. +[3] Shin, Hanul, et al. 
""Continual learning with deep generative replay."" NIPS, 2017. +[4] Farquhar, Sebastian, and Yarin Gal. ""Towards Robust Evaluations of Continual Learning."" arXiv, 2018. +",4,4.0,ICLR2019 +H1g_BV_c2m,2,BJgQB20qFQ,BJgQB20qFQ,"Seems novel, but the evaluations could use some work"," +Summary: +Search-based policies are stronger than a reactive policies, but the resulting time consumption can be exponential. Existing solutions include designing a plan from scratch given a complete problem specification or performing iterative rewriting of the plan, though the latter approach has only been explored in problems where the action and state spaces are continuous. + +In this work, the authors propose a novel study into the application of iterative rewriting planning schemes in discrete spaces and evaluate their approach on two tasks: job scheduling and expression simplification. They formulate the rewriting task as a reinforcement learning problem where the action space is the application of a set of possible rewriting rules to modify the discrete state. + +The approach is broken down into two steps. In the first step, a particular partition of the discrete state space is selected as needing to be changed by a score predictor. Following this step, a rule selector chooses which action to perform to modify this state space accordingly. + +In the job scheduling task, the partition of the state space corresponds to a single job who’s scheduled time must be changed. the application of a rule to rewrite the state involves switching the order of any two jobs to be run. In the expression simplification task, a state to be rewritten corresponds to a subtree in the expression parse tree that can be converted to another expression. + +To train, the authors define a mixed loss with two component: +1. A mean squared error term for training the score predictor that minimizes the difference between the benefit of the executed action and the predicted score given to that node +2. An advantage actor critic method for training the rule selector that uses the difference between the benefit of the executed action and the predicted score given to that node as a reward to evaluate the action sampled from the rule set + +Pros: + +-The approach seems to be relatively novel and the authors address an important problem. +-The authors don’t make their approach more complicated than it needs to be + +Cons: + +Notation: The notation could be a lot clearer. The variable names used in the tasks should be directly mapped to those defined in the theory in Section 2. It wasn’t clear that the state s_t in the job scheduling problem was defined as the set of all nodes g_j and their edges and that the {\hat g_t} corresponds to a single node. Also, there are some key details that have been relegated to the appendix that should be in the main body of the paper (e.g., how inference was performed) + +Evaluation: The authors perform this evaluation on two automatically generated synthetic datasets. It’s not clear that the method would generalize to real data. Why not try the approach on a task such as grammar error correction? Additionally, I would have liked to see more analysis of the method. Apart from showing the comparison of the method with several baselines, the authors don’t provide many insights into how their method works. How data hungry is the method? Seeing as the data is synthetically generated, how effective would the method be with 10X of the training data, or 10% of it? 
Were any other loss functions attempted for training the model, or did the authors only try the Advantage Actor Critic? What about a self-critical approach? I'd like to see more analysis of how varying different components of the method such as the rule selector and score predictor affect performance.",5,3.0,ICLR2019 +H1eWdO9lTQ,3,ByfyHh05tQ,ByfyHh05tQ,"Interesting application of RL to DNA, new SotA perf, some theoretical novelty","I'm happy with the revisions the authors have made, as I find that they call out the novel contributions a bit more explicitly. Specifically I see some novel work in the area of simultaneous multi-task/meta-RL and black box optimization of the policy net architectures. I don't think calling this NAS is justified; calling it bayesopt or black box opt is fair. NAS uses a neural net to propose experiments over structured graphs of computation nodes. This work appears to be simpler hyperparameter optimization. + +==== + +Quality: +The work is well done, and the experiments are reasonable/competitive, showcasing other recent work and outperforming. + +Clarity: +I thought the presentation was tolerable. I was a bit confused by Table 1 until reading the prose at the bottom of page 7 indicated Table 1 is presenting percentages, not integer quantities. The local improvement step is not very clearly explained. Are all combos tried across all mismatched positions, or do we try each mismatched position independently holding the others to their predicted values? What value of zeta did you end up using? It seems like this is essential to getting good performance. It is completely unclear to me what the 'restart option' does. + +Originality: +Using RL in this specific application setting seems relatively new (though also explored by RL-LS in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6029810/). On the other hand, the approach used doesn't seem to be substantially different than anything else typically used for policy gradient RL. The meta-learning approach is interesting, though again not too different from multi-task approaches (though these are perhaps less common in RL than in general deep learning). + +Significance: +Likely to be of practical utility in the inverse design space, specifically therapeutics, CRISPR guide RNA design, etc. Interesting to ICLR as an application area but probably not much theory/methods interest. + + +On balance I lean slightly against accepting and think this is a better fit to either a workshop or a more domain-specific venue (MLHC http://mucmd.org/ for example).",6,4.0,ICLR2019 +fEzTo8YBIkR,2,VwU1lyi5nzb,VwU1lyi5nzb,Review,"This paper proposes a new framework to inherently break the independence assumption of start-end decision making in extractive question answering models. The general idea is to expand the current point-to-point product to a 3-D convolution. Since each point absorbs information within the sliding window, the probability (i, j) does not solely depend on $h_i, h_j$. The basic motivation makes sense to me, but I still have the following concerns about the paper: +1) the proposed convolutional network though incorporates more information, it's constrained to be the nearest information in its sliding window, let's say 2 or 3, and the probability of p(i, j) thus depends on $h_{i-2:i+2}, h_{j-2:j+2}$. However, I don't think this modification will help the model make a better decision since the nearest surrounding words' information is already well encoded in the lower-layers of the transformer. 
The errors the model made are mainly due to a lack of global reasoning, the more further-away context is actually a critical issue for breaking the bottleneck. The proposed span-image framework doesn't seem to touch on the core problem in the current QA datasets. +2) the proposed method obtains similar or even worse performance on top-answer selection, this point kind of reflects my previous concern. Adding more neighboring word context does not help the model make a wiser decision, and thus the final scores remain similar to the standard BERT QA model. +3) the results in in-house datasets are not quite convincing. I think you can add some heuristics to the BERT QA answer span selection to improve its top-k results like adding diversity, prevent overlapping, etc. I'm not sure whether the gap will still remain as significant as reported in the paper. +",4,4.0,ICLR2021 +BvmU2V48tyB,4,ZPa2SyGcbwh,ZPa2SyGcbwh,This paper proposes a progressive label correction algorithm by correcting labels and refine the model iteratively.,"Comments: +Label noise is very frequently in many real world applications. However, the noise can be with different distributions. If we build the learning model under a certain distribution, it is difficult to capture the discriminative information. In this paper, without assuming that the noise is a certain distribution, the proposed method can handle the general noise, and it mainly target a new family of feature-dependent label noise, which is much more general than commonly used i.i.d. label noise and encompasses a broad spectrum of noise patterns. The experimental results show that the proposed method is promising. Meanwhile, the theoretical analysis of the proposed method is well inferred. + +Strong points: +[1] The theoretical foundation of the proposed method is strong. +[2] The experimental results of the proposed method are promising. +Weak points: +[1] Some details about the experiments are not clear, such as the experimental settings of the compared methods. +[2] It is better to show the connection between the polynomial margin diminishing noise and the other noises. + +Accept reason: +[1] The paper has shown a promising performance than several state-of-the-art methods. The noise assumption is more general than the traditional types. Hence, the paper may provide a novel way to deal with noise labels. + +Feedbacks: +[1] I found that the step size \beta has an influence on the threshold \theta, and how to set it. It is necessary to show the details about \beta, which has directly influence on the results. +[2] The noise in the experiments is more than 30%. Whether the proposed method is suitable to the high-level noise. It is better to show the results without any noise. +[3] Since the polynomial margin diminishing noise is general, whether the polynomial margin diminishing noise can represent any noise functions in theoretical. +[4] The details of all the compared method may need to be provided.",7,4.0,ICLR2021 +n1qRsuWsQ-e,3,yoVo1fThmS1,yoVo1fThmS1,Presenting a robust method to the noisy training dataset for novelty detection by modeling mixture of Gaussian with outlier and inlier distribution in the latent space,"This study proposes a novel method that can work well even the training data is corrupted by partial data from the unknown domain. Though it deals with the well-known problem called 'Noisy data/label', its approach is not the same thing as the previous works as it focuses on variational autoencoder on the task of novelty detection. 
And its arguments and statistical assumptions are followed by mathematical proofs. + +Overall, it is an interesting approach and I believe it would give a good way to ML practitioners who are struggling with noisy datasets in real-world applications. However, there some questions/comments about the article which may make the study more consolidate: + +Questions +- In the description of the proposed method, MAW, Discriminator generates Loss(Lw1) by comparing between Zgen and Zhyp. And Zhyp is unimodal distribution while Zgen follows MoG. I wonder whether there is a risk that inlier and outlier distributions are mixed(combined) as the loss makes the generator generates just the same mu/sigma regardless of the domains. If so, is there any equilibrium trick required so that the generator would not be strong too much? + +- Though it is hard to pre-estimate how the outlier distribution looks like, it is more common to assume the outlier distribution has multi-modal than uni-modal. However, the proposed method approximates the outlier distribution as unimodal Gaussian distribution. Is it possible to model the outliers as multi-modal distribution such as MoG? + +- In the experiment with the multiclass dataset, the number of possible inlier domains is the same as the number of classes in the dataset. And the characteristic of 'training data' may be different by each combination. I wonder the experiment of this study covered all possible sets. + And the corrupted data is sampled randomly from the other classes. Is there any deviation in the performance by each sampling? + +- This study aims to generate the model to be robust to corrupted training data. However, in the result, it is not clear that the proposed method is more robust than others as the AUC/AP from MAW falls (maybe greater than others) as the outlier ratio increases. The authors may give explanations about the result in detail. + + +Additional Comments +- The readability of Figure 2, 3 is not good. How about showing them on the tables? +- This study shows the superiority from four datasets (image, non-image). However, there is more dataset widely used for novelty detection such as (Fashion) MNIST or MVTech. The authors may consider doing the same experiments on the other dataset. +-The authors may compare the method not only to the novelty detection methods, but many previous works which also aims to be robust to noisy data(or label) in the training process. +",5,4.0,ICLR2021 +SJlrME_6Yr,3,SygEukHYvB,SygEukHYvB,Official Blind Review #1,"This paper studied the effectiveness of Conditional Entropy Bottleneck (CEB) on improving model robustness. Three tasks are considered to demonstrate its effectiveness; generalization performance over clean test images, adversarially perturbed images, and images corrupted by various synthetic noises. The experiment results demonstrated that CEB improves the model robustness on all considered tasks over the deterministic baseline and adversarially-trained classifiers. + +The proposed idea is simple, easy to implement, and generally applicable to various classification networks. I especially enjoyed the nice experiment results; indeed, it has been widely observed that many existing approaches on adversarial training sacrifice the classification performance to improve the robustness over adversarial examples [1]. The proposed approach sidesteps such a problem since it does not require adversarial retraining. 
It is surprising to see that the proposed method is still able to achieve stronger robustness than the models based on adversarial training. + +One of my major concerns is that it has a large overlap with Fischer (2018) in terms of methodology, experiment settings, and empirical observations, which limits the general contribution of the paper. Fischer (2018) first proposed CEB objective and showed its effectiveness in various tasks, including the adversarial defense. Although this paper extends the results on adversarial defense to CIFAR 10 dataset and includes additional ablative experiments and experiments on other types of input noises, it ends up confirming very similar observations/conclusions in Fischer (2018). Although Fischer (2018) is an unpublished work, I think that it is fair to consider that as a prior work since it is properly cited in the paper. + +My other concern is that the experiment misses comparisons against other adversarial defense approaches, which makes it difficult to understand the degree of robustness this model can achieve. The current comparisons are mostly focused on deterministic and VIB baselines, which are useful to understand the core argument of the paper, but insufficient to understand how useful CEB could be in the purpose of adversarial defense. Especially, I believe that some recent approaches that do not require adversarial training, such as [A2], are worth comparisons. + +Below are some minor comments/questions on the paper. +1. Section 3.2: For this experiment, BN is removed from the classification network; it would be still beneficial to see the baseline performance with BN (deterministic model) to better compare the classification performance on clean test data. +2. Section 3.3: The performance on both baseline and the proposed model on clean data is far lower than the SOTA models. Some clarification would be helpful. +3. It would be great to see further clarifications of improvement in CEB over VIB; currently, it is very simply justified that it is because CEB optimizes tighter variational bound on Eq.(3) than VIB. But it would also be great to see justifications from various angles (e.g. in the context of adversarial defense). +",3,,ICLR2020 +S1klbTulM,1,rJWrK9lAb,rJWrK9lAb,"interesting gan architecture, evaluations limited","This paper proposes a new GAN model whereby the discriminator (rather than being a binary classifier) consists of an encoding network followed by an autoregressive model on the encoded features. The discriminator is trained to maximize the probability of the true data and minimize the probability of the generated samples. The authors also propose a version that combines this autoregressive discriminator with a patchGAN discriminator. The authors train this model on cifar10 and stl10 and show reasonable generations and inception scores, comparing the latter with existing approaches. + +Pros: This discriminator architecture is well motivated, intuitive and novel. the samples are good (though not better than existing approaches as far as I can tell). The paper is also well written and easy to read. + +Cons: As is commonly the case with GAN models, it is difficult to assess the advantage of this approach over exiting techniques. The samples generated form this model look fine, but not better than existing samples. 
The inception scores are ok, but seem to be outperformed by other models (though this shouldn't necessarily be taken as a critique of the approach presented here as inception scores are an approximation to what we care about and we should not be trying to tune models for better inception scores). + +Detailed comments: +- In terms of experiments, I think think paper is missing the following: (1) An additional dataset -- cifar and stl10 are very similar, a face dataset for example would be good to see and is commonly used in GAN papers. (2) the authors claim their method is stable, so it would be good to see quantitative results backing this claim, i.e. sweeps over hyper-parameters / encoding/generator architectures with evaluations for different settings. +- the idea of having some form of recurrent (either over channels of spatially) processing in the discriminator seems more general that the specific proposal given here. Could the authors say a bit more about what they think the effects of adding recurrence in the discriminator vs optimizing the likelihood of the features under the autoregressive model? + +Ultimately, the approach is interesting but there is not enough empirical evaluations. +",5,4.0,ICLR2018 +SJlmuE8k5S,2,S1eALyrYDH,S1eALyrYDH,Official Blind Review #1,"The authors proposed an end-to-end method (E2Efold) to predict RNA secondary structure. The method consists of a Deep Score Network and a Post-Process Network (PPN). The two networks are trained jointly. The score network is a deep learning model with transformer and convolution layers, and the post-process network is solving a constrained optimization problem with an T-step unrolled algorithm. Experimental results demonstrate that the proposed approach outperforms other RNA secondary structure estimation approaches. + +Overall I found the paper interesting. Although the writing can be improved and some important details are missing. + +Major comments +As the authors point out, several existing approaches for unrolling optimization problems have been proposed. It would be helpful to clarify the methodological novelty of the proposed algorithm compared to those. + + +Training details and implementation details are missing; these hinder the reproducibility of the proposed approach. The author stated pre-training of the score network, how is the PPN and score network updated during the joint training? Does the model always converge? The authors vaguely mentioned add additional logistic regression loss to Eq9 for regularization. What is a typical number of T? How does varying T affect the performance, both in terms of training time (and convergence) and in terms of accuracy/F1? + +Minor comments +The 29.7% improvement of F1 score overstates the improvements compared to non-learning approaches.. This performance was computed on the dataset (RNAStralign) on which E2Efold was trained. A fair comparison, as the authors also stated, is on the independent ArchiveII data. On this data, E2Efold has F1 score 0.686 versus 0.638 for CONTRAfold. The author should report performance improvement under this line. + + +It would be helpful to report performance per RNA category, both for RNAstralign data and ArchiveII data, while the ArchiveII data should still remain independent. Different models may have their strengths and weaknesses on different RNA types. + + +It is not obvious to me how the proximal gradient was derived to (3)-(5). It would be helpful if the authors show some details in the supplements. 
+ + +Why is there a need to introduce an l_1 penalty term to make A sparse? + + +On which data is Table 6? + +Typos, etc. +The references are not consistently formatted +“structure a result” -> “structure is a result” +“a few hundred.” -> “a few hundred base pairs.” +“objective measure the” -> “objective measures the” +“section 5” -> “Section 5” (in several places) +In the equation above Equation 2, should it be -\rho||\hat{A}||_{1} instead of plus? Otherwise, the “max” could be made arbitrarily large. +",8,,ICLR2020 +HkgkHmv5FS,1,BJlyi64FvB,BJlyi64FvB,Official Blind Review #1,"This paper considers the effect of network width of the neural network and its ability to capture various intricate features of the data. In particular, the central claim of this paper is what the title claims ""Wider networks learn features that are better"". They make this claim using the visualization technique called ""activation atlasses"". They find that wider networks learn features in the hidden neurons that are more ""interpretable"" in this visualization framework. Additionally, they also notice that fine-tuning a _linear model_ using the learned features for the wider networks provide better accuracy for new (but related) tasks over the shallower counterparts. For most experiments of this paper, ""shallow network"" refers to a width of 64 and ""wide network"" refers to a width of 2048. The main datasets used for the experiments are MNIST, CIFAR 10/100 and a ""translated"" version of MNIST images. + + +Overall the paper is written well and the ideas and results are communicated crisply. I have a few comments. First, regarding the related work, I think that the reader would be served better if the authors also list the recent works related to effect of network width on convergence and generalization (e.g., [1] and references that cite this). The reason I say this is so that the reader should not (wrongly) interpret that this is the first work that finds ""favorable"" properties of wider networks (the paper does not make this claim, but it is easy for a reader to interpret it). Second, I find it slightly concerning that a lot of findings have been extrapolated from just one architecture. In particular, I find the experiments in section 5 to be the most informative (and also objective), since it is a single number which is easy to think about. To be clear, I like the visualization experiments and it gives credibility to the claim about interpretability. Given that there are many levers in a neural net (batch norm, architectural choices, hyper-params etc.) one could fiddle with, to make the claim made in the introduction one needs a more extensive set of experiments. I acknowledge that the authors say they haven't explored the possibility of fine-tuning the hyper-params for instance, but I think considering some of these choices is really helpful. This will help _isolate_ the effect of width independent of the architecture choice. + +Given the above observations, my current decision of this paper is that it doesn't meet the bar. I find the results promising but the paper is not yet ready. + + +[1] - https://papers.nips.cc/paper/8076-neural-tangent-kernel-convergence-and-generalization-in-neural-networks.pdf",3,,ICLR2020 +SJlT6ASJcS,3,r1xQNlBYPS,r1xQNlBYPS,Official Blind Review #1,"This submission belongs to the area of multi-view modelling. 
In particular, the submission describes construction of multi-view language models that (i) can generate text simultaneously in multiple languages, (ii) can generate text in one or more languages conditioned on text from another language. This submission extends previously proposed KERMIT from two views to more than two views. I believe this paper could be of interest to multi-view modelling/learning community. + +Though the original KERMIT approach is very interesting and you application of it to more than two views is also interesting I find the presentation to be poor. In particular I find section 2 to be hard if not impossible to understand without referring to the original paper where the story, equations, nomenclature are much more clearly explained. Even though your extension from two views to multiple is simple I find reliance on a diagram to be a mistake as I find your description not to be very clear. Given that there are no equations to support the reader and that the original equations are not adequate I find it hard to understand Sections 2 and 3. The key experimental result in Table 1 is only briefly commented on despite featuring multiple models with different strength and weaknesses, multiple types of inference. If space is of concern I would suggest removing Figure 2 (or changing input from non-English to English and removing or removing another qualitative table). +",1,,ICLR2020 +XL_eVUnzZ--,2,lXoWPoi_40,lXoWPoi_40,Review,"Summary: +This paper did an empirical study on the learning rate (LR) schedule for deep neural networks (DNNs) training. The authors argue that the density of wide minima is lower than sharp minima and then show that this makes keeping high LR necessary. Finally, they propose a new LR schedule that maintains high LR enough long. + +Pros: +- The problem this paper studies is import for DNNs training. The proposed LR schedule is simple and has the potential to be used widely. +- The authors conduct extensive empirical tests to support their claim and the experimental design is reasonable. + +Cons: +- I’m not fully convinced by the hypothesis that wide minima have lower density. The empirical results can be explained by other hypotheses as well. For example, it is also possible that wide minima are farther away from the initialization. I think the authors need to either provide theoretical analysis or come up with new experiments to further verify this hypothesis. +- The proposed LR schedule does not seem necessary. One could easily achieve the same purpose by existing LR schedules, e.g. use a step decay LR schedule. +- The novelty is low. The main novelty of the paper is the above hypothesis, but it is not supported enough. The proposed LR schedule is a slightly modified version of the existing LR schedule. Thus the contribution of this paper seems incremental. + +",4,2.0,ICLR2021 +SygfkBvaYH,2,ryl-RTEYvB,ryl-RTEYvB,Official Blind Review #2,"This paper proposes an efficient method to (differentiably) estimate input-output Jacobian. The method is useful for Jacobian regularization. The regularization improves robustness and generalization of networks. + +I tend to vote for rejection. There are two concerns. 1) This paper needs to demonstrate the effectiveness of the input-output Jacobian regularization over the input gradients regularization. 2) It is doubtful whether the regularizer provides the same benefits mentioned in Experiment 3.1 for other datasets than MNIST. 
+ +Major comments: +1) This paper needs to demonstrate applications that the regularization of input-output Jacobian is more beneficial than that of input gradients. Input gradients regularization has repeatedly appeared in the literature, as the paper mentions. For example, [1] regularized input gradients to improve the robustness against adversarial examples. The input gradients regularization is computationally more efficient than Varga et al. (2017), with which the submitted paper compares the proposed method. If input gradients regularization is sufficient, it limits the impact of the submitted paper. It is strongly encouraged to demonstrate when and why the input-output Jacobian regularization is preferable. +2) Experimental results on CIFAR10 and ImageNet show accuracy degradation on clean test data. It is questionable whether we can reach the same conclusion with the experiments 3.1 on those datasets. + +[1] Andrew Slavin Ross and Finale Doshi-Velez. ""Improving the Adversarial Robustness and Interpretability of Deep Neural Networks by Regularizing their Input Gradients."" AAAI 2018 + +Update ===== + +Thank you for the authors' response. I agree that the regularization of the input-output Jacobian is potentially superior to double back-prop. However, as authors are aware of, it needs to be validated experimentally. I think this paper should not leave this as a future work, and hence keep the review score.",3,,ICLR2020 +SJxDQYG0FS,1,HJlk-eHFwH,HJlk-eHFwH,Official Blind Review #2,"This paper presents a voice conversion approach using GANs based on adaptive instance normalization (AdaIN). The authors give the mathematical formulation of the problem and provide the implementation of the so-called AdaGAN. Experiments are carried out on VCTK and the proposed AdaGAN is compared with StarGAN. The idea is ok and the concept of using AdaIN for efficient voice conversion is also good. But the paper has a lot of issues both technically and grammatically, which makes the paper hard to follow. + +1. On writing +There are glaring grammar errors in numerous places. e.g. + -- ""Although, there are few GAN-based systems that produced state-of-the-art results for non-parallel VC. Among these algorithms, even fewer can be applied for many-to-many VC task. At last, there is the only system available for zero-shot VC proposed by Qian et al. (2019)."" This is hard to parse. + -- ""helps generator to make ..."" -> ""helps the generator make ..."" + -- ""let assume"" -> ""Let's assume"" + -- ""We know that the idea of transitivity as a way to regularize structured data has a long history."" what does it mean? + -- ""the generator of AdaGAN is consists of Encoder and Decoder."" -> ""consist of"" + -- ""After training of AdaGAN for large number of iteration of $\tau$ , where theoretically $\tau \rightarrow \infty$."" where is the second half of the sentence? + +2. On math notation + The math notation is messy and there are lots of inaccuracies. e.g. + -- $X_{i} \in p_{X}(\cdot|Z_{i},U_{i})$ should be $X_{i} \sim p_{X}(\cdot|Z_{i},U_{i})$ + -- ""generate the distribution denoted by $\hat{X}_{Z_{1}\rightarrow Z_{2}}$"" -> why $\hat{X}_{Z_{1}\rightarrow Z_{2}}$ becomes a distribution? + -- ""$p_{N}(\cdot|Z_{1},U_{1})$, $p_{N}(\cdot|Z_{2},U_{1})$"" in Eq.14, $N$ should be replaced by the random variable. + -- $S'_{X}$ and $S'_{Y}$ should be $S_{X'}$ and $S_{Y'}$ in line 15 in the algorithm + +3. On technical details: + -- In Fig.1 (b), why is there only one input to the discriminator? 
How do you inject the adversarial samples and how do you generate adversarial samples? +-- In section 4.4, ""in encoder and decoder all layers are Linear layers"". Are you referring to fully-connected layers? Linear layers are usually referred to those with linear activation functions. +-- The experiments are claimed to be zero-shot, but 3-5s of speech is required. can you explain? + +Although the samples sound OK, given its current form, the paper needs significant re-work. + +P.S. rebuttal read. I will stay with my score.",1,,ICLR2020 +SyxwgLc0tB,2,HkxTwkrKDB,HkxTwkrKDB,Official Blind Review #1,"The paper presents proof that the DeepSets and a variant of PointNet are universal approximators for permutation equivariant functions. The proof uses an expression for equivariant polynomials and the universality of MLP. It then shows that the proposed expression in terms of power-sum polynomials can be constructed in PointNet using a minimal modification to the architecture, or using DeepSets, therefore proving the universality of such deep models. + +The results of this paper are important. In terms of presentation, the notation and statement of theorems are precise, however, the presentation is rather dry, and I think the paper can be significantly more accessible. For example, here is an alternative and clearer route presenting the same result: one may study the simple case of having single input channel, for which the output at index ""i"" of an equivariant polynomial is written as the sum of all powers of input multiplied by a polynomial function of the corresponding power-sum. This second part is indeed what is used in the proof of the universality of the permutation invariant version of DeepSets, making the connection more visible. Generalizing this to the multi-channel input as the next step could make the proof more accessible. + +The second issue I would like to raise is related to discussions around the non-universality of the vanilla PointNet model. Given the fact that it applies the same MLP independently to individual set members, it is obvious that it is not universal equivariant (for example, consider a function that performs a fixed permutation to its input), and I fail to see why the paper goes into the trouble of having theorems and experiments just to demonstrate this point. If there were any other objectives beyond this in the experiments could you please clarify? + +Finally, could you give a more accurate citation (chapter-page number) for the single-channel version of Theorem 2.? +",6,,ICLR2020 +XxE5d9Isotw,4,6KZ_kUVCfTa,6KZ_kUVCfTa,PREDICTIVE CODING FOR PLANNING IN LATENT SPACE,"Problem Setup: The paper proposes a mutual information objective to learn a latent representation which +can be used for planning. The paper note that most of the existing model based RL methods learn a +model of the world via reconstruction objective, which requires to predict each and every detail of the visual +input, and hence can be detrimental in case of noisy inputs or in the presence of distractors. + +Proposed idea: In order to tackle this problem, the paper proposes a mutual information objective to maximize +the mutual information between the latent codes at distinct time steps. In order to capture the history of the past, +the authors utilize a recurrent model (from Dreamer Model) to encode information about the history of the trajectory. +The paper also uses 2 different objectives in order to prevent the representation from collapsing. 
The paper proposes +to use a mutual information objective between the observation and the encoding of the observation (as in Dreamer), +as well as consistency objective in the latent space (already used before). Essentially the underlying idea behind the proposed method is not new per se, but as far as I know this is the first paper, which has shown to make it work on DeepMind Control tasks. + +Experiments: The authors compare the proposed method to DREAMER model on 6 DeepMind Control (DMC) tasks. +The authors also evaluate the robustness of the proposed method by evaluating the capability of the proposed method +in dealing with complicated backgrounds (given the scenario, when the entity of interest occupies a small region in the input). + +Clarity: The paper is clearly written. + +References: Their are bunch of references that could be cited. Shaping belief paper [1] also uses a CPC style objective for learning a model of the environment. [2] also learns a model of the environment by predicting only the relevant information by constructing a temporal information bottleneck but still within the framework of maximum likelihood prediction. [3] also uses a mutual information based objective and without any explicit reconstruction. + +- [1] Shaping Belief States with Generative Environment Models for RL https://arxiv.org/abs/1906.09237 +- [2] Learning dynamics model in reinforcement learning by incorporating the long term future https://arxiv.org/abs/1903.01599 +- [3] Dreaming: Model-based Reinforcement Learning by Latent Imagination without Reconstruction + https://arxiv.org/abs/2007.14535 + +Scalability: It would be interesting to see how the proposed method evaluates on more challenging tasks just as on atari or on continuous control tasks such as ""box"" stacking which requires some relational reasoning. Since the underlying idea has been tried in some other context and in this work the contribution is to make it work for deep RL problems, it becomes important to evaluate on more challenging problems and tasks. + +====== + +After Rebuttal: I have read the rebuttal, as well as reviews by other reviewers. I keep my original score. Hope to see a better version of the paper soon. + +",5,5.0,ICLR2021 +BJg036X15S,3,rJlcLaVFvB,rJlcLaVFvB,Official Blind Review #3,"This work proposed a new model called Sparse Deep Predictive Coding, which introduced top-down connections between consecutive layers, to improve the solutions to hierarchical sparse coding (HSC) problems. Instead of decomposing the HSC problem into independent subproblems, the proposed model added a new term to the loss function, which represents the influence of the latter-layer on the current layer. + +#Pros: +-- The proposed model adopted the idea from predictive coding and came up with a relatively novel idea for HSC problems. +-- The experiments are solid. The experiments evaluated the proposed methods with different hyper-parameter settings and three real-world datasets. +-- The figures in the result sections are well designed and concise. + +#Cons: +-- The mathematical description of the main problem and the proposed model is not clear. For example, the dimensionality for the variables in Eq.(1) is not clarified. +-- The test procedure is not clear. How the internal state variables are obtained for the test set is not clarified. +-- The proposed model was only compared with a basic Hierarchical Lasso network. There are not any state-of-art methods included as baseline methods. 
+ +#Detailed comments: + +(1) The proposed model is named as sparse DEEP predictive coding, however, the experiments only considered the SDPC and Hi-La networks with 2 layers. I am wondering if a deeper structure will improve the performance? + +(2) For the structure shown in Fig.1, the decoding dictionaries are $D^T_i$, but I am confused why the encoding dictionaries are reciprocal to encoding dictionaries. Does it come from the optimization updates shown in Eq.(3)? + +(3) According to Eq.(1), $x$ is a vector and $D$ is a 2d matrix. However, the real inputs in the experiments are images and $D$ is a convolutional filter with 4 dimensions. How are the matrices reshaped? + +(4) For section 2.2 and 2.3, the number/index of samples is not shown in the loss function for training. The loss should be over the whole training set. Besides, the test procedure is not clarified. + +(4) The number of iterations using FISTA of SDPC and Hi-La networks is shown to compare the rate of convergence. However, considering both models are solving a lasso-type regression problem, I would suggest using coordinate descent for optimization. + +(5) For the main result of prediction error, why is the “global prediction error” more important than the reconstruction error? Is the first-layer prediction error the reconstruction error? If yes, Fig.2 shows that Hi-La has a lower prediction error compared to SDPC for the first layer. + +(6) Two minor comments on writing: +(a) It would be better to have a separate section for 2.5 since it describes the dataset and is not related to the proposed model. +(b) A typo of “neuronal implementation” exists in the introduction section. +",3,,ICLR2020 +Q_5PI0NaIvZ,2,jM76BCb6F9m,jM76BCb6F9m,Worthwhile investigation but lack of humility and truthfulness,"This paper presents a new learnable representation fo audio signal classification and compares it to the classical mel-filterbanks representation and two other learnable representations on a broad range of audio classification tasks, from birdsongs to pitch, instrument, language or emotion recognition. The proposed representation combines several parameterized representation techniques from the recent litterature. It is reported to yield on par or better classification results than the other methods on several of these tasks using single- or multi-task learning. + +Pros: +- Learning an ultimate, universal, generic representation for all audio signals that renders the 80 years old mel-frequency scale obsolete is certainly an attractive goal +- The proposed representation carefully and elegantly combines the best parts of several recently proposed parameterized representations and enjoys a nice interpretability while requiring few parameters to learn. +- Comparing different audio representations on such a broad range of audio classification task is a welcome and unmatched effort, to the best of my knowledge. + +Cons: +- The paper lacks humility in its story-telling and its style. It employs formulations such as ""lived through the history of audio"", or ""challenging the historical statu quo"" when refereing to mel-frequency representations, although by the authors' own admition in the paper, a large amount of research effort has already been given in recent years towards learnable audio representations (the authors cite a dozen papers but there are more). Hence, this paper is not a first attempt. And despite the pompous use of ""universal"" in the title, I believe it is not a last attempt either. 
The authors claim that the proposed representation ""outperform mel-filterbanks over several tasks with a unique parametrization"" but this is far from clear when looking at the results carefully. In the majority of the tasks, the representation performs either slightly worse, equal, or about 0.5% better than Mel-filterbanks. It is not clear whether such improvement is significant, since no error bar or standard deviation is provided in the results (a sadly common habit in the audio litterature). The only tasks where a truly significant improvement is reported are language identification and emotion recognition, which are also the tasks where all the methods perform the poorest. It looks like any significant difference between the 4 compared approaches would vanish if these two tasks were omitted. The reason why the proposed representation performs well on these two very specific tasks is not clear and not discussed. +- More generally, the paper would be much more valuable if it gave a sense of WHAT is actually learned by the proposed method. Is the final representation significantly different from a mel-filterbank? Given how close to mel most reported results are, this is doubtful. In fact, Fig. A.1. strongly suggests that LEAF just re-learned mel, but strangely this figure is never commented. Some comments on the learned compression-parameters would also be appreciated. +- At least one important comparison point is clearly missing in the reported results: STFT + PCEN or mel-filterbanks + PCEN, e.g., Wang et al. (2017) or Schlüter & Lehner (2018) [note that the latter already uses sPCEN rather than PCEN, contrary to the authors' claim] . Omitting this from the comparisons prevents one from knowing whether the proposed parameterized Gabor filterbank brings any advantage over another time-frequency representation like STFT or mel-filterbanks. Less critically, another missing comparison point is LEAF + CNN14, in Table 4. +- What the authors refer to as ""audio"" in the title and throughout the paper is in fact much more narrow, namely ""audio classification"". Learnable audio representations have been studied in a broader context in recent years, e.g., speech enhancement, source separation, dereverberation, sound localization or audio (re-)synthesis. In fact, one of the important breakthroughs recently brought by learnable audio frontends was in source separation with the paper TasNet (Luo et al. 2018) which is not cited by the authors. In the same context, (Ditter and Gerkmann 2020) presented a learnable gammatone-like filterbank and showed that fully-parameterized learned filterbanks tended to have logarithmic spread in frequencies. Moreover, the use of learnable analytical filterbanks/Hilbert pairs due to their envelop extraction/shift invariant properties was already discussed in depth in (Pariente et al. 2019) [cited in the paper]. + +Overall, while comparing different learnable audio representations on a broad range of audio classification tasks is a timely and worthwhile topic, and while the proposed representation elegantly combines several recent ideas in this area, the general presentation and angle of the paper strongly lacks humility. Instead of the proposed title, something like ""Benchmark of learnable audio representations on a broad range of classification tasks"" would be more truthful to the work. To make the investigation more worthwhile and insightful, additional comparison points (STFT + PCEN, mel-frequency + PCEN, Gabor + log, etc.) 
as well as an analysis of what the model has actually learned would be needed. + +======= Review edit after authors' revisions ====== +The changes made by the authors in the title, abstract, introduction and conclusion to narrow the scope of the paper, better contextualize it, and make it more humble and truthful are very welcome. The extra experiments, figures, addition of error bars and new statistical tests are also a real plus. In doing so, the authors addressed all of my major concerns. + +For these reasons, changed my evaluation score from 5 to 8. +",8,5.0,ICLR2021 +HJg3kV22FS,1,Hkxi2gHYvH,Hkxi2gHYvH,Official Blind Review #3,"The paper proposes a reward shaping method which aim to tackle sparse reward tasks. The paper first trains a representation using contrastive predictive coding and then uses the learned representation to provide feedback to the control agent. The main difference from the previous work (i.e. CPC) is that the paper uses the learned representation for reward shaping, not for learning on top of these representation. This is an interesting research topic. + +Overall, I am leaning to reject this paper because (1) the main contribution of the paper is not clear (2) the experiments are missing some details and does not seem to support the claim that the proposed methods can tackle the sparse reward problem. + +First of all, it would’ve been better to have a conclusion section, so the readers can see the contributions of the paper. After reading the paper, I still do not understand what are the contributions of the paper and what're from the previous works. The paper does not provide well justification why CPC feature can provide useful information for reward shaping. The paper does not provide a new method to learn predictive coding. It does not provide a novel reward shaping method (the “Optimizing on Negative Distance” method is very similar to [1]). So, I am not sure what’re the contributions of this paper. + +Moreover, I am not convinced that the proposed method can tackle long horizon and sparse reward problems. As the paper discuss in introduction, learning in sparse reward environment is hard because it relies on the agent to enter the goal during exploration. However, the proposed approach seems only able to work in environments where exploration with random policy can generate trajectories that contain sufficient environment dynamics (e.g. dynamics near the goal states). How can the method learn that information without entering the goal? + +Furthermore, it seems that the proposed approach only works for goal-oriented tasks (since we need to know the goal state for reward-shaping). I think this should be clearly stated in the paper. + +There are some missing details which makes it difficult to draw conclusions: +1. How is the ‘success rate’ computed (e.g. in figure 7 and table 1). +2. How were the parameters selected (e.g. table 5 in the appendix). Why did you use the default the parameters? +3. How many runs are the curve averaged over and what’s the shaded region (e.g. one standard error)? Most of the results in the paper seem not statistically significant. +4. In figure 4, five domains are mentioned but only three of them are tested in the section 5. +5. Section 6.2 seems irrelevant to the paper. What’s the purpose of this section? +6. Figure 10 shows the result of using CPC feature directly vs. reward shaping. Are both feature using the same NN architecture, same PPO parameters, and same control setting? 
Moreover, the reward shaping method assumes we know the goal state but using CPC feature does not. Is it a fair comparison? + +The paper has some imprecise parts: +1. The definition of MDPs (in section 2) is imprecise. For example, how is the expectation defined? how is the initial state sampled? What does $p\in\mathcal{P}$ (last line in the first paragraph) mean where $\mathcal{P}$ is the state transition function? + +Minor comments which do not impact the score: +1. Figure 1 should come before figure 2. +2. It would have been better if there is a short description of how the hand-shaped rewards is designed for each domain in the main text. + +[1] The Laplacian in RL: Learning Representations with Efficient Approximations +",3,,ICLR2020 +U21z4qScv2-,1,l-LGlk4Yl6G,l-LGlk4Yl6G,Anonymous review of subspace splitting paper,"Summary: The paper introduces the problem of subspace spitting, in which an observed mixed-features vector is to be partitioned such that the identified partitions match with given subspaces. The main results of the paper lie in deriving sufficient and necessary conditions for identifiability of these partitions when the subspaces and the entries of the features are randomly positioned in the ambient dimension and the subspaces, respectively. The conditions simply require that there are more entries associated with each subspace than the dimension of the subspace. The paper also presents algorithms to perform the splitting. + +Strengths: +- The problem statement is novel. I did not see previous formulations of this problem. +- The paper is generally well-written. +- The paper has a good balance of theory and algorithm development. + +Weaknesses: +- The experimental results section is rather weak. It would be good to include some realistic examples from some concrete applications. +- While the paper has a dedicated section on motivating applications, they are not that convincing. The metagenomics application is more plausible, but I do not think this model applies well to recommender systems. The examples provided seem to be included to justify the proposed model. Perhaps there are meaningful relevant applications, but for the current version the problem setup is not sufficiently justified and seems somewhat contrived. +- The assumptions about the subspaces need to be more explicit, especially when discussing the algorithms. For example, the random sampling algorithm seems to require full knowledge of the dimensions of these subspaces which may be impractical. +- I have not fully verified this argument, but I am under the impression that the result of the main theorem is trivial. Isn't it obvious that the span of the restriction of a subspace to a given partition of size m, the whole R^m? I believe the main result can just follow from this simple observation. +- Reproducibility: the authors did not include code for their developed algorithms in the supplementary material. + +Impact: Provided that the problem setup and model are better justified, I believe this work could open up new research questions in machine learning and data analysis, including (mixture) variants of well-studied problems on matrix completion and robust learning. + +Overall I like the paper, especially that the problem setup itself seems novel. However, the model is not sufficiently justified, the assumptions are somewhat questionable, and the experimental section is lacking. As such I would rate this as a marginal acceptance. 
+",6,4.0,ICLR2021 +BJ4AfUoeG,2,Syhr6pxCW,Syhr6pxCW,Shines Light on Deficiencies in Conditional GAN: borderline accept,"This paper presents a pixel-matching based approach to synthesizing RGB images from input edge or normal maps. The approach is compared to Isola et al’s conditional adversarial networks, and unlike the conditional GAN, is able to produce a diverse set of outputs. + +Overall, the paper describes a computer visions system based on synthesizing images, and not necessarily a new theoretical framework to compete with GANs. With the current focus of the paper being the proposed system, it is interesting to the computer vision community. However, if one views the paper in a different light, namely showing some “blind-spots” of current conditional GAN approaches like lack of diversity, then it can be of much more interest to the broader ICLR community. + +Pros: +Overall the paper is well-written +Makes a strong case that random noise injection inside conditional GANs does not produce enough diversity +Shows a number of qualitative and quantitative results + +Concerns about the paper: +1.) It is not clear how well the proposed approach works with CNN architectures other than PixelNet +2.) Since the paper used “the pre-trained PixelNet to extract surface normal and edge maps” for ground-truth generation, it is not clear whether the approach will work as well when the input is a ground-truth semantic segmentation map. +3.) Since the paper describes a computer-vision image synthesis system and not a new theoretical result, I believe reporting the actual run-time of the system will make the paper stronger. Can PixelNN run in real-time? How does the timing compare to Isola et al’s Conditional GAN? + +Minor comments: +1.) The paper mentions making predictions from “incomplete” input several times, but in all experiments, the input is an edge map, normal map, or low-resolution image. When reading the manuscript the first time, I was expecting experiments on images that have regions that are visible and regions that are masked out. However, I am not sure if the confusion is solely mine, or shared with other readers. + +2.) Equation 1 contains the norm operator twice, and the first norm has no subscript, while the second one has an l_2 subscript. I would expect the notation style to be consistent within a single equation (i.e., use ||w||_2^2, ||w||^2, or ||w||_{l_2}^2) + +3.) Table 1 has two sub-tables: left and right. The sub-tables have the AP column in different places. + +4.) “Dense pixel-level correspondences” are discussed but not evaluated. +",6,4.0,ICLR2018 +BkjrLVG4x,1,HyWDCXjgx,HyWDCXjgx,Contribution not clear enough; concerns about data set itself,"The manuscript is a bit scattered and hard to follow. There is technical depth but the paper doesn't do a good job explaining what shortcoming the proposed methods are overcoming and what baselines they are outperforming. + +The writing could be improved. There are numerous grammatical errors. + +The experiments in 3.1 are interesting, but you need to be clearer about the relationship of your ResCeption method to the state-of-the-art. The use of extensive footnotes on page 5 is a bit odd. ""That is a competitive result"" is vague. A footnote links to ""http://image-net.org/challenges/LSVRC/2015/results"" which doesn't seem to even show the same task you are evaluating. ResCeption: ""The best validation error is reached at 23.37% and 6.17% at top-1 and top-5, respectively"". Single model ResNet-152 gets 19.38 and 4.49, respectively. 
Resnet-34 is 21.8 and 5.7, respectively. VGGv5 is 24.4 and 7.1, respectively. [source: Deep Residual Learning for Image Recognition, He et al. 2015]. I think it would be more honest for you to report results of competitors and say that your model is worse than ResNet and slightly better than VGG on ImageNet classification. + +3.5, retrieval on Holidays, is a bit too much of a diversion from the goal of this paper. If this paper is more about the novel architecture and less about the particular fashion attribute task then the narrative needs to change accordingly. + +Perhaps my biggest concern is that this paper is missing baselines (e.g. non recurrent models, attribute classification instead of detection) and comparisons to prior work by Berg et al. + +""Our policy restricts to reveal much more details about the internal dataset"" This is a significant issue. The dataset used in this work cannot be shared? How are future works going to compare to your benchmark? +",3,3.0,ICLR2017 +BJAFIxfNl,1,S1RP6GLle,S1RP6GLle,Interesting paper,"The paper presents an amortised MAP estimation method for SR problems. By learning a neural network which learns to project to an affine subspace of SR solutions which are consistent with the LR method the method enables finding propoer solutions with by using a variety of methods: GANs, noise assisted and density assisted optimisation. +Results are nicely demonstrated on several datasets. + +I like the paper all in all, though I feel the writing can be polished by quite a bit and presentation should be made clearer. It was hard to follow at times and considering the subject matter is quite complicated making it clearer would help. Also, I would love to see some more analysis of the resulting the networks - what kind of features to they learn? ",7,2.0,ICLR2017 +r1lTNkw6tS,3,HyeaSkrYPH,HyeaSkrYPH,Official Blind Review #1,"This paper attempts to extend the Interval Bound Propagation algorithm from (Gowal et al. 2018) to defend against adversarial patch-based attacks. In order to defend against patches which could appear at any location, all the patches need to be considered. This is too computationally expensive, hence they proposed to use a random subset of patches, or a U-net to predict the locations of the patches and then use those patches to train. The algorithm is tested on the MNIST and CIFAR-10 datasets and it was shown that sometimes the IBP approach is useful for defense, although often with a significant loss on accuracy on clean data (e.g. on CIFAR the loss on clean accuracy is an astounding 300% -- from 66.5% - 35.7%). + +I think the technical contribution of this paper is a bit weak in that they mostly followed the original IBP and the only novelties are the random patch training and guided patch training. I partially like how the experiments are conducted, especially the one that generalizes to other shapes. On the other hand, the networks that are tested seem pretty poor by any standard. An experiment that is definitely missing is a CIFAR network that performs a little better than the current one. Clean accuracy of only 66.5% and 47.2% are very lousy for CIFAR. + +Another missing experiment is one that would test on different epsilon values. I couldn't find what are the current epsilon values used? + +Besides, since this work is testing on adversarial patches, I would like to at least have it applied to some real-life images with patches that are of real-life size. 
I could care a bit less on how good it is, but one can still make an empirical test (e.g. certified defense accuracy on 5x5 patches, but empirical test using real-life sized patches 40x40 or 80x80) and see how the results would be. All the experiments mentioned above would significantly strengthen the experiments section of the paper. + +I don't think I read anywhere a confirmation that the testing is performed on all patches of the prescribed size. Could the authors please confirm whether this is true? + +Minor: +There is a typo in Eq. (5) and Eq. (6), where the second term multiplied by |W^(k)| should be \underline{z}^(k-1) - \bar{z}^(k-1) instead of \underline{z}^(k-1) + \bar{z}^(k-1) + +You should mention that |W^(k)| stand for element-wise absolute value when it first appears.",6,,ICLR2020 +bCip_qLEjK2,1,QpNz8r_Ri2Y,QpNz8r_Ri2Y,Assumptions need better motivation,"Strength: +Model learning is an important component for offline RL, which is usually done independently from policy evaluation / optimization. The authors propose a new model learning method for offline RL that takes policy evaluation error into consideration / as regularization. +An upper bound is derived to guarantee the worst case performance. In terms of policy evaluation, the authors show empirical advantages over previous model-based offline OPE algorithms. In terms of control, the authors show empirical advantages over existing model-based and model-free algorithms in the challenging D4RL dataset. + +Weakness & Points to be clarified: +My major concern is the assumption used in the paper. +The assumption about B_\phi in Theorem 4.3 looks not well motivated. When should we expect there exists such a B_\phi? How large are B_\phi and \bar{k}? If B_\phi and \bar{k} are very large, I feel the bound in theorem 4.3 can be very loose. I think the paper may benefit from clarifying more on this assumption. + +Minor comments: +The authors compare with both model-based and model-free approaches in the control setting. +In OPE, however, only model-based approach is compared. I would suggest to add more model-free baselines, e.g., Fitted-Q-Evaluation [1] and DICEs, to motivate the necessity for learning a model. + +Overall I think the empirical results are convincing and I am happy to increase my score if the assumptions are further clarified. + +[1] Voloshin, Cameron, et al. ""Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning."" arXiv preprint arXiv:1911.06854 (2019). + +=================== + +(Nov 24) The authors addressed my concerns in the reply so I increased my score to 6.",6,4.0,ICLR2021 +#NAME?,4,lf7st0bJIA5,lf7st0bJIA5,A so-called 3D object discovery method without any 3D evaluation,"The paper proposed an unsupervised learning model, POD-Net, that learns to discover objects from video. The authors develop an inference model that performs image segmentation and object-based scene decompositions on overlapping sub-patches, and a generative model, which contains an unprojection step, a constant velocity dynamic model and a VAE, to reconstruct the original scene. With the novel dynamic model to predict motions for the 3D object-primitives, the POD-Net learns to segment objects with better physics. + +++ Strong points: +The paper explores the use of motion cues to train a self-supervised model to extract object-based scene representations from videos. 
And in the approach that to unproject 2d masks into 3d primitives to compute the motion is novel to me, it allows the proposed approach to better discover objects with physical occupancy on the 2D videos. + +Overall, the paper is well written. In particular, the intuition behind the method is described well. The Method section, the POD-Net, is easy to read and understand. And the Evaluation section is well structured, it is clear how the models are setted on different datasets. + +++ Concerns: + +The main concern on the paper is that although it claimed itself as a 3D object discovery method, all its evaluations are done on 2D datasets with 2D metrics. Although there are some reasonable improvements shown on these metrics, we do not know what is the capability of this work in terms of recovering 3D segmentation mask and 3D pose. This I consider incomplete for a work that claims its main difference w.r.t. prior work to be getting to 3D object discovery. + +There have been a significant amount of prior work on unsupervised 3D object discovery (many of them on RGB-D) that is missed by the authors: + +Herbst et al. Toward Object Discovery and Modeling via 3-D Scene Comparison. ICRA 2011 +Karpathy et al. Object Discovery in 3D scenes via Shape Analysis. ICRA 2013 +Ma and Sibley. Unsupervised Dense Object Discovery, Detection, Tracking and Reconstruction. ECCV 2014 + +Datasets such as + +Lai et al. A Large-Scale Hierarchical Multi-View RGB-D Object Dataset. ICR 2011 +Georgakis et al. Multiview RGB-D Dataset for Object Instance Detection. 3DV 2016 + +exist and they provide ways to evaluate unsupervised 2D-3D object discovery (one can start from RGB and deduct 3D pose and velocity). So I don't think the authors have enough excuses to not show any 3D results. + +The author claims that the POD-Net is an unsupervised method, meanwhile bashing other methods of using pre-training (last paragraph of Section 1). However, their unprojection model and the project model is pre-trained and it is the same kind of supervision as the pre-trained segmentation models in other work. + +Minor concerns: + +-- In Sec 3.3, ""model surprisal"" doesn't sound to me like proper English, or maybe I'm missing something. If you are introducing a new phrase as a term you probably want to define it first. + +-- In Sec 3.1 and 3.2, it is not clear how the author counts object discovery performance w.r.t. the time dimension, e.g. how is it handled if the ground truth mask is completely occluded? Is tracking consistency taken into account? +",5,4.0,ICLR2021 +SJgb3cQF2Q,2,SJzSgnRcKX,SJzSgnRcKX,Nice discussion of what type of information is actually encoded by contextualized word embeddings,"This paper provides new insights on what is captured contextualized word embeddings by compiling a set of “edge probing” tasks. This is not the first paper to attempt this type of analysis, but the results seem pretty thorough and cover a wider range of tasks than some similar previous works. The findings in this paper are very timely and relevant given the increasing usage of these types of embeddings. I imagine that the edge probing tasks could be extended towards looking for other linguistic attributes getting encoded in these embeddings. + +Questions & other remarks: +-The discussion of the tables and graphs in the running text feels a bit condensed and at times unclear about which rows are being referred to. +-In figures 2 & 3: what are the tinted areas around the lines signifying here? Standard deviation? Standard error? 
Confidence intervals? +-It seems the orthonormal encoder actually outperforms the full elmo model with the learned weights on the Winograd Schema. Can the authors comment on this a bit more? +",7,4.0,ICLR2019 +ByltbXgKnQ,2,r1l-e3Cqtm,r1l-e3Cqtm,"Good approach to deep learning based video compression, but empirical section needs work","Summary +======= +This work on video compression extends the variational autoencoder of Balle et al. (2016; 2018) from images to videos. The latent space consists of a global part encoding information about the entire video, and a local part encoding information about each frame. Correspondingly, the encoder consists of two networks, one processing the entire video and one processing the video on a frame-by-frame basis. The prior over latents factorizes over these two parts, and an LSTM is used to model the coefficients of a sequence of frames. The compression performance of the model is evaluated on three datasets of 64x64 resolution: sprites, BAIR, and Kinetics600. The performance is compared to H.264, H.265, and VP9. + +Review +====== +Relevance (9/10): +----------------- +Compression using neural networks is an unsolved problem with potential for huge practical impact. While there has been a lot of research on deep image compression recently, video compression has not yet received much attention. + +Novelty (6/10): +--------------- +This approach is a straightforward extension of existing image compression techniques, but it is a reasonable step towards deep video compression. + +What's missing from the paper is a discussion of how the proposed model would be applied to model video sequences longer than a few frames. In particular, the global latent state will be less and less useful as videos get longer. Should the video be split into multiple sequences treated separately? If yes, how should they be split and what is the impact on performance? + +Empirical work (2/10): +---------------------- +Unfortunately, the experiments focus too much on trying to make the algorithm look good at the expense of being less informative and potentially misleading. + +Existing video codecs such as H.265 and software like ffmpeg are optimized for longer, high-resolution videos, but even the most realistic dataset used here (Kinetics600) only contains short (10 frames) low-resolution videos. I suggest the authors at least add the performance of classical codecs evaluated on the entire video sequence to their plots. The current reported performance can be viewed as splitting the videos into chunks of 64x64x10, which makes sense for an autoencoder which has been trained to learn a global representation of short videos, but is clearly not necessary and detrimental to the performance of the classical codecs. I think adding these graphs would provide a more realistic view of the current state of video compression using deep neural nets. + +For the classical codecs, were the binary files stripped of any file format container and headers before counting bits? This would be crucial for a fair comparison, especially for small videos where the overhead might be significant. + +More work could be done to ensure the reader that the hyperparameters of the classical codecs such as GOP or block size have been sufficiently tuned. + +What is the frame rate of the videos used? I.e., how much time do 10 frames correspond to? + +The videos were downsampled before cropping them to 64x64 pixels. What was the resolution before cropping? 
+ +The authors observe that the Kalman prior performs worse than the LSTM prior. This may be due to limitations of the encoder, which processes images frame-by-frame, which makes it hard to decorrelate frames while preserving information. I am wondering why the frame encoder is not at least processing one neighboring frame. (Note: A sufficiently powerful encoder could represent information in a fully factorial way; e.g. Chen & Gopinath, 2001). + +Clarity: +-------- +The paper is well written and clear.",5,4.0,ICLR2019 +xzD-F9c79O,4,TtYSU29zgR,TtYSU29zgR,"Review for paper ""Primal Wasserstein Imitation Learning""","Summary: +This paper proposes to use Wasserstein distance in the primal form for imitation learning. Compared with its dual form and f-divergence minimization variants, it avoids the unstable minimax optimization. In order to compute the Wasserstein distance in primal form, they also propose a greedy approximation. Their experiments demonstrate that this method has a better performance compared with baseline methods. + +Pros: ++ The greedy approximation of the primal Wasserstein distance is a clean solution, and it works well for MuJoCo tasks, even if in the LfO setting ++ The paper presents a series of experiments and ablation studies, which showed quite strong performance. In the ablation, the experiments about PWIL-support is quite convincing. The agent may stay on the supports for other the density/support estimation-based methods, while PWIL doesn't. ++ The paper is well-written and easy to follow. It provides enough implementation details and codes, which is reproducible + +Some questions & concerns: ++ I'm wondering if the uniform distribution assumption can be improved, by incorporating certain density estimation methods. ++ Why the complexity of the algorithm is O((|S| + |A|)D)? It should also be linear in T ++ One limitation is that you need carefully selected metrics, e.g. L1, standardized L2. And it's generally hard to compute Wasserstein distance in the image domain. ++ For the visual imitation experiments, the figures only contain PWIL without the results of other baselines. They can still be trained on the feature from TCC. + +",6,4.0,ICLR2021 +SJppHuogG,2,SJCq_fZ0Z,SJCq_fZ0Z,"Sparse attention backtracking, an alternative to (T)BPTT","re. Introduction, page 2: Briefly explain here how SAB is different from regular Attention? + +Good paper. There's not that much discussion of the proposed SAB compared to regular Attention, perhaps that could be expanded. Also, I suggest summarizing the experimental findings in the Conclusion.",8,4.0,ICLR2018 +5R9M62gbVow,3,iOnhIy-a-0n,iOnhIy-a-0n,"Interesting, well written, limited originality","Accelerating convergence of replica exchange stochastic gradient mcmc via variance reduction + +Summary: + +The paper presents a variance reduction technique to achieve more efficient swaps in replica exchange stochastic gradient Langevin dynamics MCMC. The paper provides detailed analysis of the method as well as empirical evaluation on some standard deep learning tasks. + +Positive: + +1. Overall I would say that the paper is well written and and it is fairly easy to follow the presentation and details in the derivations. +2. The topic is very timely and the method appears to be very useful. As an attractive method for minibatched Bayesian inference, stochastic gradient Langevin Dynamics samplers are of high interest, but tuning the algorithm can be somewhat finicky in my experience. 
Replica exchange is sometimes extremely useful, and finding good defaults for these types of methods is important. +3. Experimental validation is reasonable (although a bit limited) and the methods chosen for comparison are resonable. +4. A comprehensive set of appendices are included to provide further details. Although I did not go through the appendices in detail, I find it appealing that further information is provided for readers wishing to apply these methods in practice. + +Negative: + +1. The authors do not provide a reference software implementation. This makes it more difficult for readers to verify the results and might limit the impact of the paper. I would highly appreciate that the authors would create and share a minimal implementation. +2. The novelty / originality is limited: A well known type of variance reduction applied in a new way/context where it makes perfect sense though. + +Recommendation: + +Good paper. Accept. + +",7,3.0,ICLR2021 +BkxnW6eCFS,3,r1xI-gHFDH,r1xI-gHFDH,Official Blind Review #3,"The paper presents an unsupervised method for graph embedding. + +Despite having good experimental results, the paper is not of the quality to be accepted to the conference yet. The approach is rather a mix of previous works and hence not novel. + +In particular, the algorithm for WL decomposition is almost fully taken from the original paper with a slight modification. Advantage of using it for unlabeled data is poorly motivated as unlabeled graphs can easily take statistics such as degree as the node labels, which was shown well in practice. + +Modified PV-DBOW is in fact the same algorithm as the original CBOW model but applied to different context. It has been used in many papers, including Deep GK, graph2vec, anonymous walks. + +Also, the Figure 1. is taken from the original paper of WL kernel. The algorithms 1 and 2 are taken from the original papers with slight modifications. + +There is no discussion of [1], which uses CBOW framework, has theoretical properties, and produces good results in experiments. There is no comparison with GNN models such as [2]. + +I would be more interested to see explanation of the obtained results for each particular dataset (e.g. why MUTAG has 92% accuracy and PTC 67%); what so different about dataset and whether we reached a limit on most commonly used datasets. + +[1] Anonymous Walk Embeddings? ICML 2018, Ivanov et. al. +[2] How Powerful are Graph Neural Networks? ICLR 2019, Xu et. al.",3,,ICLR2020 +ryeoubswhX,1,B1zMDjAqKQ,B1zMDjAqKQ,"interesting idea, but writing quality could be improved","The authors proposed an unsupervised learning framework to learn multisensory binding, using visual and auditory domain from animal videos as example. First, the visual and auditory inputs are autoencoded, and these latent codes are binding using a recurrent self-organizing network (Gamma-GWR). Furthermore, the authors proposed the expectation learning idea, which is inspire by psychology literature. In short, after the first pass of training using the real data. The authors fine tuned the model to bind the real data from one domain and the reconstructed data from another domain. This could be a good idea, as the authors pointed out, human usually bind all kinds of yellow bird to a same mental 'chirping' sounds. So, this expectation learning could potentially group the representation to a canonical one. Also, the authors showed in Table 1 that with the expectation learning, the model's recognition accuracy is improved a bit. 
I think it would be interesting to show the reconstruction output example (as in Fig. 3) for both model with and without expectation learning. To see if it is as the authors claim, that the model with expectation learning is reconstructing the missing modality with more canonical images/sounds. (This may not be the goal in other practice, though I'm convinced it is a potentially good psychological model as it explain well the multisensory imagery effect (Spence & Deroy, 2013). + +I found this manuscript quite hard to follow though. The description seems sometime not flowing very smoothly. And there are some clear typos and mess up of math notations make the reading unpleasant. I have noted down several points below, and hope the authors could improve in the next iteration. + +1. The description of variational autoencoder is not well written. The citation (Chen, 2016) is not the standard VAE paper people usually cite (unless the author is adopting something specific from the Chen's paper.). For example, the authors wrote ""the KL divergence between the encoded representation and a sample from the Gaussian distribution"" which sounds incorrect to me. + +2. Why a Variational autoencoder is necessary for visual domain, but a regular autoencoder is used in auditory domain? + +Typos: +1. page 2, 2nd line: a online --> an online +2. Use subscript I-1 to mean the winner neuron at t-1, I think this is not quite clear. I suggest to follow the notation in (Parisi & Wermter 2017), use I(t-1), which is easier to follow. +3. page 7, 2nd line: more than 17% for audio. -> for vision. +4. page 8, 3rd line: not on the original high-abstraction data. Do the authors mean highly specific data? That seems make more sense. +5. Several notation mismatch here and there. for example, in formula 6 it is w_j^s, but in the text below it become w_{j,s}. +",5,2.0,ICLR2019 +HyxjNYwTKr,2,Skl4LTEtDS,Skl4LTEtDS,Official Blind Review #2,"Based on the intuition that smaller action space should be easier to learn, the author proposes a curriculum learning approach which learns by gradually growing the action space. An agent using simpler action space can generate experiences to be used by all the agents with larger action spaces. The author presents experiments on simple domains and also on a challenging video game. In general, it is an interesting research work. I think the author can improve the paper in the following aspect. + +1. Motivation. Curriculum learning is based on the idea that tasks can be arranged in a sequential manner and those tasks learned earlier can be somehow helpful for subsequent tasks. Although it is clear that small action space should be easy to learn, it is unclear why those off-policy samples can be helpful for more complex action space. Smaller action space can correspond to a completely different optimal policy. Imagine that in a tabular environment, two actions A, B are available, and the optimal action is to always take B. Then if the agent with full action space uses the experiences generated by the agent with action space {A} may get completely wrong action values. There must be some constraint of the underlying MDP. The author may provide some experiments in tabular case to illuminate the issue. + +2. Relevant works. I think the author should include some discussions regarding large action spaces, since one goal of the proposed method is to handle such situation. There are several works should be discussed. 
For example, Deep Reinforcement Learning in Large Discrete Action Spaces, Function-valued action space for PDE control. The former handle the large discrete action space by learning an action embedding; while the latter attempts to leverage the regularity in the action space by introducing a convolutional structure for the output of the actor network and hence the proposed method can scale to arbitrary action dimensions. ",3,,ICLR2020 +Vj3MDVsydGU,4,Y5TgO3J_Glc,Y5TgO3J_Glc,Review,"# Overall summary +This paper proposes a generative model for sequential data structured which uses latent relational constraints and conditions upon them to generate the actual sequence. The authors propose that such a model can be better suited for data like poetry or music where such constraints are an important feature of the data. + +# Strengths +- The method provides a strong and useful inductive bias for modeling sequences which we can expect to have strong relational constraints, without having to learn them from scratch. +- The method incorporates a discrete optimization-based component which can lead to much more interpretable results. + +# Weaknesses +- The method requires significant hand design of the constraints in order to work well. This may be difficult or counter-effective when the constraints are not easily described for the domain at hand. +- It is not entirely clear when the relational constraint optimization will produce good results. There is no theory presented about what it might find. +- The empirical results presented are not especially strong. + +# Comments on the evaluation +It should be possible to estimate a lower bound on the log likelihood of the proposed model on a held-out test set, rather than using different models. After all, the goal is that the proposed method can model these sequential data with relational constraints better than the existing methods. Therefore, it's not clear that MusicAutoBot, MusicVAE, BERT, etc. putting higher or lower likelihood on samples from the proposed method should necessarily be a strong indication of the method's quality. +To provide a tighter lower bound, it's possible to sample z multiple times and average, as in IWAE: https://arxiv.org/abs/1509.00519 +Furthermore, if the priors about the data that the proposed method presupposes are correct, then it should be possible to show greater data efficiency than the baseline methods. +I would be willing to revise the score if the authors can provide further explanation/data about this. +",6,4.0,ICLR2021 +SyxsGxuG5r,3,Sklyn6EYvH,Sklyn6EYvH,Official Blind Review #1,"The authors applied residual learning machenism in VAE learning, which I have seen such methods in Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks (https://arxiv.org/abs/1506.05751), which basically applied the residual learning method in GAN. But the authors fail to discuss the relationship and difference with this paper. +Also, the best paper from ICML 2019 claimed that unsupervised learning method can not really disentangle the features. They claim \beta-VAE, factor VAE is not good. The authors shall all discuess this point. Otherwise, it is not convincing to the readers. + +",3,,ICLR2020 +Sy7QPPYxM,2,HyXBcYg0b,HyXBcYg0b,Residual Gated Graph ConvNets,"The paper proposes an adaptation of existing Graph ConvNets and evaluates this formulation on a several existing benchmarks of the graph neural network community. In particular, a tree structured LSTM is taken and modified. 
The authors describe this as adapting it to general graphs, stacking, followed by adding edge gates and residuality. + +My biggest concern is novelty, as the modifications are minor. In particular, the formulation can be seen in a different way. As I see it, instead of adapting Tree LSTMs to arbitary graphs, it can be seen as taking the original formulation by Scarselli and replacing the RNN by a gated version, i.e. adding the known LSTM gates (input, output, forget gate). This is a minor modification. Adding stacking and residuality are now standard operations in deep learning, and edge-gates have also already been introduced in the literature, as described in the paper. + +A second concern is the presentation of the paper, which can be confusing at some points. A major example is the mathematical description of the methods. When reading the description as given, one should actually infer that Graph ConvNets and Graph RNNs are the same thing, which can be seen by the fact that equations (1) and (6) are equivalent. + +Another example, after (2), the important point to raise is the difference to classical (sequential) RNNs, namely the fact that the dependence graph of the model is not a DAG anymore, which introduces cyclic dependencies. + +Generally, a clear introduction of the problem is also missing. What are the inputs, what are the outputs, what kind of problems should be solved? The update equations for the hidden states are given for all models, but how is the output calculated given the hidden states from variable numbers of nodes of an irregular graph? + +The model has been evaluated on standard datasets with a performance, which seems to be on par, or a slight edge, which could probably be due to the newly introduced residuality. + +A couple of details : + +- the length of a graph is not defined. The size of the set of nodes might be meant. + +- at the beginning of section 2.1 I do not understand the reference to word prediction and natural language processing. RNNs are not restricted to NLP and I think there is no need to introduce an application at this point. + +- It is unclear what does the following sentence means: ""ConvNets are more pruned to deep networks than RNNs""? + +- What are ""heterogeneous graph domains""? +",3,4.0,ICLR2018 +Rtp1tIz8K3N,1,Pzj6fzU6wkj,Pzj6fzU6wkj,IsarStep: a Benchmark for High-level Mathematical Reasoning,"The authors propose a new benchmark task to evaluate the high-level reasoning capabilities of machine learning models (specifically sequence-to-sequence models) in the context of proof assistants. The task consists of predicting the intermediate proposition from its surrounding ones, namely its previous and its subsequent propositions. The experimental analysis provides evidence on the difficulty of the task at hand. The authors propose also a solution based on a hierarchical transformer, which is able to better capture the mathematical relations of intra- and inter-propositions compared to existing sequence-to-sequence models, as demonstrated by quantitative as well as qualitative analyses. + +The paper is clearly written and has a good balance between technicality and readability. +Provided examples are pedagogical to better understand the introduced concepts. Also, it is positive the fact that the authors are prone to publish their data and code to foster the reproducibility of the experiments. + +The major drawback with the paper is in the weak/not well supported motivations of the proposed benchmark task. 
Also, a discussion of the differences between the proposed model and existing hierarchical transformer architectures is missing. Please, refer below to more detailed comments. + +Taking into account that it's not clear to me why the proposed benchmark task is necessary to advance the research in the field of proof assistants, I consider the paper marginally below the acceptance threshold and therefore recommend for an initial rejection. Nevertheless, I'm willing to raise my score if the authors can provide a better explanation on their motivations or provide more convincing arguments supporting the need of their proposed benchmark task. Furthermore, I suggest the authors to discuss some missing related work on hierarchical transformers. + +DETAILED COMMENTS + +The authors argue that ""solving the IsarStep task will be potentially helpful for improving the automation of theorem provers, because proposing a valid intermediate proposition will help to reduce their search space significantly"". In general, I agree with the authors that developing benchmarks is an essential driving factor in research and that designing methods able to reduce the search space is essential to improve the automation of theorem provers. I'm not able to see why and how IsarStep can drive this advancement though. +Proofs, both procedural and declarative ones, are inherently sequential and IsarStep breaks this sequentiality by assuming that the proposition subsequent to the missing one is given. For instance, consider the same example used in Section 2 to prove the irrationality of the square root of 2. Why can statement (3) be considered given and in which practical situations does the task of predicting (2) given (1) and (3) occur? Does the IsarStep task occur in practice when proving new conjectures? Wouldn't it be more natural to predict (2) and subsequently (3) by having only (1)? + +Furthermore, in which sense is solving IsarStep ""a first step towards the long-term goal of sketching complete human-readable proofs automatically""? Can you elaborate more on that? + +Hierarchical transformers have been already proposed in natural language for the purposes of document summarization [1-2]. Can you relate with these existing works and particularly discuss what are the architectural novelties of your proposed transformer, as this is one of the contributions listed in the introduction? + +MINOR COMMENTS + +In the experimental section regarding the visualisation of attention, can you specify what is F2 and what is F3? + +[1] Zhang et al. HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization. ACL 2019 +[2] Liu and Lapata. Hierarchical Transformers for Multi-Document Summarization. ACL 2019 + +######################### + +UPDATE + +Authors have clarified the doubts raised by my questions. I believe that the task proposed in the paper provides new insights on the weaknesses of deep learning models. Therefore, solving the task is important to advance the automation of proof assistants through machine learning. Based on this, I recommend for acceptance.",6,4.0,ICLR2021 +rJe7toL2Fr,2,BkeyOxrYwH,BkeyOxrYwH,Official Blind Review #3,"This paper proposes an architecture for synthesizing tools to be used in a reaching task. Specifically, during training the agent jointly learns to segment an image of a set of three tools (via the MONet architecture) and to classify whether one the tools will solve the given scene. 
At test time, one of the three tools is selected based on which seems most feasible, and then gradient descent is used to modify the latent representation of the tool in order to synthesize a new tool to (hopefully) solve the scene. The paper demonstrates that this approach can achieve ok performance on familiar scenes with familiar tools, but that it fails to generalize when exposed to unfamiliar scenes or unfamiliar tools. The paper reports a combination of the quantitative results showing that optimizing the latent space can lead to successful synthesis in some cases, and qualitative results showing that the synthesized tools change along interpretable dimensions such as length, width, etc. The combination of these results suggest that the model has learned something about which tool dimensions are important for being able to solve the types of reaching tasks given in the paper. + +While I think this paper tackles a very interesting, important, and challenging problem, I unfortunately feel it is not ready for publication at ICLR and thus recommend rejection. Specifically, (1) neither the particular task, results, or model are not very compelling, (2) there are no comparisons to meaningful alternatives, and (3) overall I am not quite sure what conclusions I should draw from the paper. However, given the coolness of the problem of tool synthesis, I definitely encourage the authors to continue working on this line of work! + +1. The task, results, and model are not very compelling. Any of these three things alone would not necessarily be a problem, but given that all three are true the paper comes across as a bit underwhelming. + +- First, while the task can be construed as a tool synthesis task, it doesn’t come across to me as very ecologically valid. In fact, the task seems to be more like a navigation task than a tool synthesis task: what’s required is simply to draw an unbroken line from one part of the scene to another, rather than actually generate a tool that has to be manipulated in an interesting way. Navigation has been studied extensively, while synthesis of tools that can be manipulated has not, which makes this task both not very novel and disappointing in comparison to what more ecologically-valid tool synthesis would look like. For example, consider a variation of the task where you would have to start the tool at the red region and move it to the green region. Many of the tools used here would become invalid since you wouldn’t actually be able to fit them through the gaps (e.g. Figure 2E). + +- Second, given that the “synthesis” task is more like a navigation task, the results are somewhat disappointing. When provided with a feasible solution, the model actually gets *worse* even in some of the in-sample scenes that it has seen during training (e.g. scene types C and D) which suggests that it hasn’t actually learned a good generative model of tools. Generalization performance is pretty bad across the board and is only slightly better than random, which undermines the claim in the abstract that “Our experiments demonstrate that the synthesis process modifies emergent, task-relevant object affordances in a targeted and deliberate way”. While it’s clear there is successful synthesis in some cases, I am not sure that the results support the claim that the synthesis is “targeted” or “deliberate” given how poor the overall performance is. 
+ +- Third, the model/architecture is a relatively straightforward combination of existing components and is highly specialized to the particular task. As mentioned above, this wouldn’t necessarily be a problem if the task were more interesting (i.e. not just a navigation task) and if the results were better. I do think it is cool to see this use of MONet but I’m skeptical that the particular method of optimizing in the latent space is doing anything meaningful. While there is prior work that has optimized the latent space to achieve certain tasks (as is cited in the paper), there is also a large body of work on adversarial examples which demonstrate that optimizing in the latent space is also fraught with difficulty. I also suspect this is the reason why the results are not particularly good. + +2. While I do appreciate the comparisons that are in the paper (to a “Random” version of TasMON that moves in a random direction in the latent space, and to “FroMON” agent which is not allowed to backpropagate gradients from the classification loss into MONet), these comparisons are not particularly meaningful. The difference between FroMON performance and TasMON tool imagination performance (I didn’t test tool utility) across tasks is not statistically significant (z(520, 544)=-0.8588, p=.38978), so I don’t think it is valid to claim that “a task-aware latent space can still provide benefits.” The Random baseline is a pretty weak baseline and it would be more interesting to compare to an alternative plausible architecture (for example, which doesn’t use a structured latent space, or which doesn’t have a perceptual frontend and operates directly on a symbolic representation of the tools/scene). + +3. Overall, I am not quite sure what I am supposed to get out of the paper. Is it that “task relevant object affordances are implicitly encoded as directions in a structured latent space shaped by experience”? If so, then the results do not support this claim and so I am not sure what to take away. Is it that the latent space encodes information about what makes a tool feasible? If so, then this is a bit of a weak argument---of *course* it must encode this information if it is able to do the classification task at all. Is it that tool synthesis is a challenging problem? If so, then the lack of strong or canonical baselines makes it hard to evaluate whether this is true (and the navigation-only synthesis task also undermines this a bit). + +Some additional suggestions: + +It would be good to include a discussion of other recent work on tool use such as Allen et al. (2019) and Baker et al. (2019), as well as on other related synthesis tasks such as Ha (2018) or Ganin et al. (2018). + +The introduction states that “tool selection and manufacture – especially once demonstrated – is a significantly easier task than tool innovation”. While this may be true, it is a bit misleading in the context of the paper as the agent is doing something more like tool selection and modification rather than tool innovation (and actually the in-sample scenes are more like “manufacture”, which the agent doesn’t always even do well on). + +It would be helpful to more clearly explain scene types. Here is some suggested phrasings: in-sample = familiar scenes with familiar tools, interpolation = novel scenes with familiar tools, extrapolation = novel scenes with novel tools. 
+ +I was originally confused how psi’ knew where to actually place the tool and at what orientation, and whether the background part of the rendering process shown in Figure 1. I realized after reading the supplemental that this is not done by the agent itself but by separate code that tries to find the orientation and position of the tool. This should be explained more clearly in the main text. + +In Table 1 it would be helpful to indicate which scene types are which (in-sample, interpolation, extrapolation). + +Allen, K. R., Smith, K. A., & Tenenbaum, J. B. (2019). The Tools Challenge: Rapid Trial-and-Error Learning in Physical Problem Solving. arXiv preprint arXiv:1907.09620. + +Baker, B., Kanitscheider, I., Markov, T., Wu, Y., Powell, G., McGrew, B., & Mordatch, I. (2019). Emergent tool use from multi-agent autocurricula. arXiv preprint arXiv:1909.07528. + +Ganin, Y., Kulkarni, T., Babuschkin, I., Eslami, S. M., & Vinyals, O. (2018). Synthesizing programs for images using reinforced adversarial learning. arXiv preprint arXiv:1804.01118. + +Ha, D. (2018). Reinforcement learning for improving agent design. arXiv preprint arXiv:1810.03779.",1,,ICLR2020 +r1l0QRy0FH,2,Hklo5RNtwS,Hklo5RNtwS,Official Blind Review #3,"This work explores two uses of Wasserstein distances (WD) within reinforcement learning: the first is a variant of policy gradient, where WD is used to guide the policy search (instead of alternative such as Trust-region used in TRPO); the second is a variant of evolutionary search where WD is used again to guide the policy updates. + +One of the strengths of the work is to clarify the notion of Behavior embeddings (Sec.3), which I expect can have several uses in RL. In this paper, the behavioral embeddings are assumed to be given; it would be interesting to discuss/explore learning these embeddings. + +Section 4 of the paper reviews key concepts related to WD. This is much harder to follow for an RL researcher, and would be improved by adding some intuition relating the material presented to the concepts of Sec.3. Furthermore, this confusion carries out in Sec.5. For example, what is the best way to think of \lambda_1 and \lambda_2? And the maps s_1 and s_2? What are necessary/desirable properties of P^\phi_b? There are also many steps packed in Alg.2 & Alg.3, which are difficult to unpack. For example, what are the \epsilon (step 1., Alg.3), scalars or vectors, how are they sampled? It would be helpful to have a discussion of the complexity (both data & compute) of both algorithms. + +Section 6 presents empirical results for each proposed algorithm. Corresponding baselines are presented, but I would be interested to see a wider set of baseline methods. The literature is rich with methods in these classes, both variants of TRPO and ES. It’s necessary to at least pick a representative sample to show and compare (e.g. GAE, SAC). I am also puzzled by the actual results presented, for example the Hopper reward shown in Fig.3 seems much worse (by orders of magnitude) compared to that reported in the SAC paper (Haarnoja et al. 2018). + + +",3,,ICLR2020 +rklAzwpVhm,1,Hkg1csA5Y7,Hkg1csA5Y7,"A good attempt, but lacks sufficient explanation and reasoning","This paper presents a new quasi-Newton method for stochastic optimization that solves a regularized least-squares problem to approximate curvature information that relaxes both the symmetry and secant conditions typically ensured in quasi-Newton methods. 
In addition to this, the authors propose a stochastic Armijo backtracking line search to determine the steplength, which uses an initial steplength of 1 but switches to a diminishing steplength in later iterations. In order to make this approach computationally tractable, the authors propose updating and maintaining a Cholesky decomposition of a crucial matrix in the Hessian approximation. Although it is a good attempt at developing a new method, the paper ultimately lacks a convincing explanation (both theoretical and empirical) supporting their ideas, as I will critique below. + +1. Stochastic Line Search + +Determining a steplength in the stochastic setting is a difficult problem, and I appreciate the authors’ attempt to attack this problem by looking at stochastic line searches. However, the paper lacks much detail and rigorous reasoning in the description and proof for the stochastic line search. + +First, the theory gives conditions under which the Armijo condition holds in expectation. Proving anything about stochastic line searches is particularly difficult, so I’m on board with proving a result in expectation and doing something different in practice. However, much of the detail on how this is implemented in practice is lacking. + +How are the samples chosen for the line search? If we go along with the proposed theory, then when the function is reevaluated in the line search, a new sample is used. If this is the case, can one guarantee that the practical Armijo condition will hold? How often does the line search fail? How does the choice of the samples affect the cost of evaluating the line search? + +The theory also suggests that the particular choice of c depends on the iteration, in particular on the inner product between the true search direction and the true gradient at iteration k. Does this allow for a fixed c to be used? How is c chosen? Is it fixed or adaptive? What happens as the true gradient approaches 0? + +The algorithm also places a limit on the number of backtracks permitted that decreases as the iteration count increases. What does the algorithm do when the line search fails? Does one simply take the step although the Armijo condition does not hold? + +In deterministic optimization, BFGS typically needs a smaller steplength in the beginning as the algorithm learns the scale of the problem, then eventually accepts the unit steplength to obtain fast local convergence. The line search proposed here uses an initial steplength of $\min(1, \xi/k)$ so that in early iterations, a steplength of 1 is used and in later iterations the algorithm uses a $\xi/k$ steplength. When this is combined with the diminishing maximum number of backtracking iterations, the method will eventually reduce to an algorithm with a steplength of $\xi/k$. Why is this preferred? Are the other algorithms in the numerical experiments tuned similarly? + +The theory also asks for a descent direction to be ensured in expectation. However, it is not the case that $E[\hat{p}_k^T \hat{g}_k] = E[\hat{p}_k]^T g_k$, so it is not correct to claim that a descent direction is ensured in expectation. Rather, the condition requires the angle between the negative stochastic gradient direction and the search direction to be acute in expectation. + +All the proofs also depend on a linear Taylor approximation that is not well-explained, and I’m wary of proofs that utilize approximations in this way.
Indeed, the precise statement is that $\hat{f}_{z’}(x + \alpha \hat{p}_z) = \hat{f}_{z’}(x) + \alpha \hat{p}_z’ \hat{g}_z(x + \bar{\alpha} \hat{p}_z)$, where $\bar{\alpha} \in [0, \alpha]$. How does this affect the proof? + +Lastly, I would propose that the authors change the name of their condition to the “Armijo condition” rather than using the term “1st Wolfe condition”, since the Wolfe condition is typically associated with the curvature condition (p_k’ g_new >= c_2 p_k’ g_k) and hence refers to a very different line search. + +2. Design of the Quasi-Newton Matrix + +The authors develop an approach for designing the quasi-Newton matrix that does not strictly impose symmetry or the secant condition. The authors claim that this is done because “it is not obvious that enforced symmetry necessarily produces a better search direction” and “treating the [secant] condition less strictly might be helpful when [the Hessian] approximation is poor”. This explanation seems insufficient to justify why relaxing these conditions via a regularized least-squares approach would yield a better algorithm, particularly in the noisy or stochastic setting. The lack of symmetry seems particularly strange; one would expect the true Hessian in the stochastic setting to still be symmetric, and one would still expect the secant condition to hold if the “true” gradients were accessible. It is also unclear how this approach takes advantage of the stochastic structure that exists within the problem. + +Additionally, the quasi-Newton matrix is defined based on the solution of a regularized least-squares problem with a regularization parameter lambda. It seems to me that the key to the approximation is the balance between the two terms in the objective. How is lambda chosen? What is the effect of lambda as a tuned parameter, and how does it affect the quality of the Hessian approximation? It is unclear to me how this could be chosen in a more systematic way. + +The matrix also does not ensure positive definiteness, hence requiring a multiple of the gradient direction to be added to the search direction. In this case, the key parameter beta must be chosen carefully. What is a typical value of beta that is used for each of these problems? One would hope that beta is small, but if it is large, it may suggest that the search direction is primarily dominated by the stochastic gradient direction and hence the quasi-Newton matrix is not useful. The interplay of these different parameters needs to be investigated carefully. + +Lastly, since (L-)BFGS uses a weighted Frobenius norm, I am curious why the authors decided to use a non-weighted Frobenius norm to define the matrix. How does changing the norm affect the Hessian approximation? + +All of these questions place the onus on the numerical experiments to see if these relaxations will ultimately yield a better algorithm. + +3. Numerical Experiments + +As written, although the range of problems is broad and the numerical experiments show much promise, I do not believe that I could replicate the experiments conducted in the paper. In particular, how are SVRG and L-BFGS tuned? How is the steplength chosen? What (initial) batch sizes are used? Is the progressive batching mechanism used? (If the progressive batching mechanism is not used, then the authors should refer to the original multi-batch paper by Berahas, et al. [1], which does not increase the batch size and uses a constant steplength.)
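To make the Section 1 questions concrete, here is how I currently read the proposed line search. This is only my own sketch, not the authors’ code: the way samples are drawn for each evaluation, the choice of c, and how the backtracking cap shrinks with k are all assumptions on my part, which is precisely what I would like the paper to spell out.

```python
# Rough sketch of the stochastic Armijo backtracking line search as I read it.
# f_hat(x, z) and g_hat(x, z) are the minibatch loss/gradient on batch z;
# sample_batch() draws a batch.  Whether a fresh batch is used for every
# re-evaluation, and what happens when the search fails, are my guesses.
import numpy as np

def stochastic_armijo_step(x, p, f_hat, g_hat, sample_batch, k,
                           c=1e-4, rho=0.5, xi=1.0, max_backtracks_0=20):
    alpha = min(1.0, xi / k)                        # initial steplength min(1, xi/k)
    max_backtracks = max(1, max_backtracks_0 // k)  # cap that shrinks with k (assumed form)
    z = sample_batch()
    f0 = f_hat(x, z)
    slope = float(np.dot(g_hat(x, z), p))           # should be negative for a descent direction
    for _ in range(max_backtracks):
        z_new = sample_batch()                      # fresh sample per evaluation? unclear
        if f_hat(x + alpha * p, z_new) <= f0 + c * alpha * slope:  # Armijo sufficient decrease
            return x + alpha * p, alpha
        alpha *= rho
    return x + alpha * p, alpha                     # behaviour on failure is also unclear
```

If this reading is roughly right, then the sufficient-decrease test compares function values computed on different samples, which is exactly why I ask above how often the condition can fail in practice and what is done when it does.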
+ +In addition, a more fair comparison would include the stochastic quasi-Newton method in [2] that also utilize diminishing steplengths, which use Hessian-vector products in place of gradient differences. Multi-batch L-BFGS will only converge if the batch size is increased or the steplength diminished, and it’s not clear if either of these are done in the paper. + +Typos/Grammatical Errors: +- Pg. 1: Commas are needed in some sentences, i.e. “Firstly, for large scale problems, it is…”; “…compute the cost function and its gradients, the result is…” +- Pg. 2: “Interestingly, most SG algorithms…” +- Pg 3: Remove “at least a” in second line +- Pg. 3: suboptimal, not sup-optimal +- Pg. 3: “Such a solution”, not “Such at solution” +- Pg. 3: Capitalize Lemma +- Pg. 4: fulfillment, not fulfilment +- Pg. 7: Capitalize Lemma +- Pg. 11: Before (42), Cov \hat{g} = \sigma_g^2 I +- Pg. 11: Capitalize Lemma + +Summary: + +In summary, although the ideas appear to provide better numerical performance, it is difficult to evaluate if the ideas proposed in this paper actually yield a better algorithm. Many algorithmic details are left unanswered, and the paper lacks mathematical or empirical evidence to support their claims. More experimental and theoretical work is needed before the manuscript can be considered for publication. + +References: +[1] Berahas, Albert S., Jorge Nocedal, and Martin Takác. ""A multi-batch l-bfgs method for machine learning."" Advances in Neural Information Processing Systems. 2016. +[2] Byrd, Richard H., et al. ""A stochastic quasi-Newton method for large-scale optimization."" SIAM Journal on Optimization 26.2 (2016): 1008-1031. +[3] Schraudolph, Nicol N., Jin Yu, and Simon Günter. ""A stochastic quasi-Newton method for online convex optimization."" Artificial Intelligence and Statistics. 2007.",4,4.0,ICLR2019 +1DFqPsD2riD,2,DM6KlL7GeB,DM6KlL7GeB,"Interesting ideas, but needs some more work","### Summary +This work presents 1) Semi-Relaxed Quantization (SRQ), a method that targets learning low-bit neural networks, 2) DropBits, a method that performs dropout-like regularization on the bit width of the quantizers with an option to also automatically optimise the bit-width per layer according to the data, and 3) quantised lottery ticket hypothesis. SRQ is an extension of Relaxed Quantization (RQ), which is prior work, in two ways; firstly the authors replace the sampling from the concrete relaxation during training to deterministically selecting the mode (which is non-differentiable) and, secondly, they propose a specific straight-through gradient estimator (STE) than only propagates the gradient backwards for the elements that were selected in the forward pass. DropBits is motivated from the perspective of reducing the bias of the STE gradient estimator by randomly dropping grid points associated with a specific bit-width and then renormalising the SRQ distribution over the grid. This essentially induces stochasticity in the sampling distribution for the quantised value (which was removed before by selecting the mode in SRQ). The authors further extend DropBits in a way that allows for learning the drop probabilities for each bit-width, thus allowing for learning mixed-precision networks. Finally the authors postulate the quantised lottery ticket hypothesis, which refers to that “one can find the learned bit-width network which can perform better than the network with the same but fixed bit-widths from scratch”. 
+ +### Pros +- This work provides a set of additions that improves upon prior work +- The DropBits method is novel and allows for learning the bit-width in a straightforward manner +- The results improve upon recent works that learn quantised neural networks + +### Cons +- Some claims from the authors are misleading while others are not precise +- The computational complexity of the method is not discussed +- Some experimental settings might not be consistent with some of the baselines + +### Detailed feedback +This work tackles the problem of learning quantized neural networks and the authors show empirically that their proposed method achieves good results. The DropBits extension is particularly interesting in that it allows for learning the appropriate bit-width of a given tensor via traditional pruning approaches. I also like the fact that the authors explain illustratively their proposed approach via several figures, which provide a nice boost to clarity. Nevertheless, I believe that there are still some important aspects that need to be addressed before I recommend acceptance for this work. + +First of all, the comparison against the prior work on figure 1 is misleading; the authors compare the *entire categorical distribution* (i.e., the one obtained after discretising the logistic onto the quantization grid) of SRQ at Fig 1(b) with a *single sample* from the concrete relaxation of the same distribution at Fig. 1(a) right for RQ. In fact, the underlying categorical distribution will be the same for both SRQ and RQ in the specific example of figure 1. Furthermore, it is worthwhile to notice that the underlying categorical distribution (i.e., pre relaxation) does have support for the value of -a (as the p(g_i = -a) is nonzero at Fig.1 (b)), thus it is not unreasonable that there are specific samples which lead to the quantised value being -a, thus incurring larger quantization loss. + +Furthermore, I believe that the discussion about SRQ misses some important points that would improve the clarity of the work if they are addressed. Selecting the most probable point in the categorical distribution for the forward pass is equivalent to rounding to the nearest grid point, which can be done much more efficiently than computing the entire categorical over the grid and then taking the argmax. In addition, this is also the same as taking sigma -> 0 for the logistic distribution that is to be discretized in the forward pass. Finally the authors argue that their novel multi-class STE reduces the variance of the gradient estimator but no formal justification is given apart from some hand-wavy arguments. Why is the variance lower if the gradient only flows through to r_imax? Furthermore, for the second point with respect to the benefits of the multi-class STE; while it does seem desirable that it aggressively clusters the weights and activations around the grid-points, I wonder how much can that hinder convergence. Do you ever observe that the weights can be prematurely “stuck” (and thus lead to a bad local minimum) and do the weights ever move further away than just the closest grid point? DropBits could potentially help with the latter part, but it would be interesting to see what happens without it. + +DropBits in my opinion is the main novel idea of this work, and it is an interesting way to learn the bit-precision of each tensor in the network. I have two main points for this section in general and DropBits in particular. 
The main motivation behind DropBits seems to be converting the sampling distribution of SRQ (which is deterministic) to a stochastic one. If this is the case, then why have it be deterministic in the first place? You could just sample from the original categorical distribution (by, e.g., using the Gumbel-Softmax STE which gives samples exactly on the grid) in the forward pass and use your multi-class STE approximation in the backward pass. It would be interesting to see how that fares with SRQ + DropBits and would highlight whether the main benefit of DropBits was the regularization aspect (and not that of improving the sampling distribution). As for DropBits in particular; it seems that you drop bits independently with each of the gates z_1, z_2, … (i.e., figure 3). If this is the case, then you could end up with a non-uniform grid that cannot be exactly represented as a fixed point tensor (e.g., on figure 3b you could have z_1 = 0 and z_2 = 1). If this is the case, then comparison against other approaches that use uniform quantization is not apples-to-apples. + +Finally, a couple of other things that I believe should be addressed; the authors don’t make any discussions about the computational and memory complexity of the resulting algorithm. It seems that for every individual weight and activation they first have to construct the categorical distribution over the entire grid (which can quickly become very large, e.g., for 8 bits there are 256 categories), in order to take the weighted sum. This doesn’t seem to scale very well. How expensive is something like this in practice and how long do experiments take on, e.g., Imagenet? Furthermore, the quantised lottery ticket hypothesis (QLTH) is a bit peculiar. The original lottery ticket hypothesis (LTH) was about finding sparse networks at initialisation that can be trained from scratch and achieve the same accuracy as the original dense equivalents. This is different than what the authors articulate here, specifically that it is about finding sparse networks that are easier to train compared to sparse networks obtained from pruning. As a result, their QLTH seems to state the opposite than what the original LTH was about; it states that a QLT is obtained when you manage to find a network X that a.) has smaller bit width than the original network and b.) has better performance than a network initialised to the bit-width of X and trained to convergence. Following the arguments of the LTH, I would expect a QLT to be obtained when you can quantise a neural network to a specific bit-width at initialisation and when you train from scratch that particular quantised network, you obtain the same performance as the full precision equivalent. I would thus encourage the authors to clarify this point and better align with the original LTH. + +Based on the aforementioned points, I cannot at the moment recommend acceptance for this work. Nevertheless, as I believe DropBits is an interesting idea, I would encourage the authors to put in the effort and rework the paper by addressing these points over the rebuttal. +",6,4.0,ICLR2021 +rkgOJZaaKH,3,H1l2mxHKvr,H1l2mxHKvr,Official Blind Review #3,"A new task is suggested, similarly to FSL the test is done in an episodic manner of k-shot 5-way, but the number of samples for base classes is also limited. The model is potentially pre-trained on a large scale dataset from another domain. 
The suggested method is applying spatial attention according to entropy criteria (or certainty) of the original classifier (from a different domain). + + +I think the suggested task is important and more realistic than the usual FSL benchmarks. I would modify it so instead of discarding mini-imagenet classes that are overlapping with Places I would discard the problematic Places classes. This way it will be easier to compare to standard FSL. Also, I don’t understand why for CUB the benchmarks includes k={0,1,5} while for mini-imagenet it is k={0,20,50}, obviously k={0,1,5} are more interesting. + +As for the suggested method, I find it hard to judge since there are no strong baselines to compare against. Also, the ablation study of removing the attention and/or adaptation doesn’t result in a definitive conclusion. + + +Update: +While your comments do weaken some of my concerns, I'm afraid it is not enough for changing my previous rating. I think being more careful about the benchmark definition with regards to train/test overlap and comparing to stronger baselines will help improve the paper for future submissions.",3,,ICLR2020 +Syx1siQK37,2,B1lfHhR9tm,B1lfHhR9tm,"New framework has a lot of potential, but the experiments, motivations, and related work are missing details","Update: I've updated my score based on the clarifications from the authors to some of my questions/concerns about the experimental set-up and multi-task/single-task differences. + +Original Review: +This paper provides a new framework for multitask learning in nlp by taking advantage of the similarities in 10 common NLP tasks. The modeling is building on pre-existing qa models but has some original aspects that were augmented to accommodate the various tasks. The decaNLP framework could be a useful benchmark for other nlp researchers. + +Experiments indicate that the multi-task set-up does worse on average than the single-task set-up. I wish there was more analysis on why multi-task setups are helpful in some tasks and not others. With a bit more fine-grained analysis, the experiments and framework in this paper could be very beneficial towards other researchers who want to experiment with multi-task learning or who want to use the decaNLP framework as a benchmark. + +I also found the adaptation to new tasks and zero-shot experiments very interesting but the set-up was not described very concretely: + -in the transfer learning section, I hope the writers will elaborate on whether the performance gain is coming from the model being pretrained on a multi-task objective or if there would still be performance gain by pretraining a model on only one of those tasks. For example, would a model pre-trained solely on IWSLT see the same performance gain when transferred to English->Czech as in Figure 4? Or is it actually the multi-task training that is causing the improvement in transfer learning? + -Can you please add more detail about the setup for replacing +/- with happy/angry or supportive/unsupportive? What were the (empirical) results of that experiment? + +I think the paper doesn’t quite stand on its own without the appendix, which is a major weakness in terms of clarity. The related work, for example, should really be included in the main body of the paper. I also recommend that more of the original insights (such as the experimentation with curriculum learning) should be included in the body of the text to count towards original contributions. 
+ +As a suggestion, the authors may be able to condense the discussion of the 10 tasks in order to make more room in the main text for a related work section plus more of their motivations and experimental results. If necessary, the main paper *can* exceed 8 pages and still fit ICLR guidelines. + +Very minor detail: I noticed some inconsistency in the bibliography regarding full names vs. first initials only.",5,3.0,ICLR2019 +r1gMP1TKnQ,1,B1xY-hRctX,B1xY-hRctX,interesting directions but unclear novelty and some claims that are too strong,"The paper introduces Neural Logic Machines, a particular way to combine neural networks and first order but finite logic. + +The paper is very well written and structured. However, there are also some downsides. + +First of all, Section 2.1 is rather simple from a logical perspective and hence it is not clear what this gets a special term. Moreover, why do mix Boolean logic (propostional logic) and first order logic? Any how to you deal with the free variables, i.e., the variables that are not bounded by a quantifier? The semantics you define later actually assumes that all free variables (in your notation) are bounded by all quantifiers since you apply the same rule to all ground instances. Given that you argue that you want a neural extension of symbolic logic (""NLM is a neural realization of (symbolic) logic machines"") this has to be clarified as it would not be an extension otherwise. + +Furthermore, Section 2.2 argues that we can use a MLP with a sigmoid output to encode any joint distribution. This should be proven. It particular, given that the input to the network are the marginals of the ground atoms. So this is more like a conditional distribution? Moreover, it is not clear how this is different to other approaches that encode the weight of weighted logical rule (e.g. in a MLN) using neural networks, see +e.g. + +Marco Lippi, Paolo Frasconi: +Prediction of protein beta-residue contacts by Markov logic networks with grounding-specific weights. +Bioinformatics 25(18): 2326-2333 (2009) + +Now of course, and this is the nice part of the present paper, by stacking several of the rules, we could directly specify that we may need a certain number of latent predicates. +This is nice but it is not argued that this is highly novel. Consider again the work by Lippi and Frasconi. We unroll a given NN-parameterized MLN for s fixed number of forward chaining steps. This gives us essentially a computational graph that could also be made differentiable and hence we could also have end2end training. The major difference seems to be that now objects are directly attached with vector encodings, which are not present in Lippi and Frasconi's approach. This is nice but also follows from Rocktaeschel and Riedel's differentiable Prolog work (when combined with Lippi and Frasconi's approach). +Moreover, there have been other combinations of tensors and logic, see e.g. + +Ivan Donadello, Luciano Serafini, Artur S. d'Avila Garcez: +Logic Tensor Networks for Semantic Image Interpretation. +IJCAI 2017: 1596-1602 + +Here you can also have vector encodings of constants. This also holds for + +Robin Manhaeve, Sebastijan Dumancic, Angelika Kimmig, Thomas Demeester, Luc De Raedt: +DeepProbLog: Neural Probabilistic Logic Programming. CoRR abs/1805.10872 (2018) + +The authors should really discuss this missing related work. This should also involve +a clarification of the ""ILP systems do not scale"" statement. 
At least if one views statistical relational learning methods as an extension of ILP, this is not true. Probabilistic ILP aka statistical relational learning has been used to learn models on electronic health records, see e.g., the papers collectively discussed in + +Sriraam Natarajan, Kristian Kersting, Tushar Khot, Jude W. Shavlik: +Boosted Statistical Relational Learners - From Benchmarks to Data-Driven Medicine. Springer Briefs in Computer Science, Springer 2014, ISBN 978-3-319-13643-1, pp. 1-68 + +So the authors should either discuss SRL and its successes, separating SRL from ILP, or they cannot argue that ILP does not scale. In the related work section, they decided to view both as ILP, and, in turn, the statement that ILP does not scale is not true. Moreover, many of the learning tasks considered have been solved with ILP, too, of course in the ILP setting. Any ILP systems have been shown to scale beyond those toy domains. +This also includes the blocks world. Here relational MDP solvers can deal e.g. with BW worlds composed of 10 blocks, resulting in MDPs with several million states. And the can compute relational policies that solve e.g. the goal on(a,b) for arbitrary number of blocks. This should be incorporated in the discussion of the introduction in order to avoid the wrong impression that existing methods just work for toy examples. + +Coming back to scaling, the current examples are on rather small datasets, too, namely <12 training instances. Moreover, given that we learn a continuous approximation with a limit depth of reasoning, it is also very likely that the models to not generate well to larger test instances. So the scaling issue has to be qualified to avoid to give the wrong impression that the present paper solves this issue. + +Finally, the BW experiments should indicate some more information on the goal configuration. This would help to understand whether an average number of moves of 84 is good or bad. Moreover, some hints about the MDP formulation should be provided, given that there have been relational MDPs that solve many of the probabilistic planning competition tasks. And, given that the conclusions argue that NLMs can learn the ""underlying logical rules"", the learned rules should actually be shown. + +Nevertheless, the direction is really interesting but there several downsides that have to be addressed. ",5,5.0,ICLR2019 +HJejsv4An7,1,rylV6i09tX,rylV6i09tX,"An interesting visualization paper, but not always so convincing","This paper uses visualization methods to study how adversarial training methods impact the decision surface of neural networks. The authors also propose a gradient-based regularizer to improve robustness during training. + +Some things I liked about this paper: +The authors are the first to visualize the ""decision boundary loss"". I also find this to be a better and more thorough study of loss functions than I have seen in other papers. The quality of the visualizations is notably higher than I've seen elsewhere on this subject. + +I have a few criticisms of this paper that I list below: +1) I'm not convinced that the decision surface is more informative than the loss surface. There is indeed a big ""hole"" in the middle of the plots in Figure 4, but it seems like that is only because the first contour is drawn at too high a level to see what is going on below. More contours are needed to see what is going on in that central region. +2) The proposed regularizer is very similar to the method of Ross & Doshi. 
It would be good if this similarity was addressed more directly in the paper. It feels like it's been brushed under the rug. +3) In the MNIST results in Table 1: These results are much less extensive than the results for CIFAR. It would especially be nice to see the MinMax results since those of commonly considered to be the state of the art. The fact that they are omitted makes it feel like something is being hidden from the reader. +4) The results of the proposed regularization method aren't super strong. For CIFAR, the proposed method combined with adversarial training beats MinMax only for small perturbations of size 3, and does worse for larger perturbations. The original MinMax model is optimized for a perturbation of size 8. I wonder if a MinMax result with smaller epsilon would be dominant in the regime of small perturbations. ",5,5.0,ICLR2019 +r1lYBi9cKH,2,rygBVTVFPB,rygBVTVFPB,Official Blind Review #1,"In this paper, the author maps the problem of time series PDE into a naive reinforcement learning problem. Under the MDP assumption, the author sets the initial state of the particles as the current state, the flux at all spaces as the possible actions, and map the state-action pair deterministically to the next state of the particle diffusion. The reward is defined as the two norms between the prediction and the Burger’s equation. The naiveness comes from the fact that the typical reinforcement learning problem, the agent needs to decide how to choose an action. In this paper, it is formulated as an intrinsic proper that follows Burger’s equation instead. + +While the motivation is interesting, the author argues this work is novel due to it does not fall under supervised learning, but rather reinforcement learning. This perspective is not completely correct. The correct category for this work would be more similar to imitation learning using WANO’s algorithm as the expert label. This is a field of supervised reinforcement learning. + +The author’s work has brought the possibility of using neural network architecture in the field of particle diffusion. The benefit is the improved estimation of how particles diffuse in long-horizon conditions. The author has shown in their paper their simple fully connected network has already performed better prediction than the current state of the art non-neural network model: WENO. + +While the framing of the problem is perhaps novel in the space of PDE, algorithmically there needs to have a breakthrough or new invention. The lack of comparison with other neural-network-based models also hurts the credibility of the model. Therefore, I reject this paper under the ICLR conference. I would suggest that this paper would be better suited as a paper submission under the perspective science field conference instead. + +Some suggestions to further improve this paper: The author could add CNN and RNN structure to the prediction model. These structures would further expand other possibilities in the solution space. CNN would help turn the limited 1D problem to a higher-dimensional, a more real-world like problem space. RNN is known for its’ ability to model long horizon problems, perhaps even better breakthrough would happen with these architectures. + +As a whole, the paper is written very well such that even nonexpert can grab onto the logic flow of this paper. The weaknesses of the paper are the lack of diversity in comparison with other models and the paper needs some level of novel breakthrough in an algorithmic sense. 
+",3,,ICLR2020 +BkgcGwG9hQ,3,rJfW5oA5KQ,rJfW5oA5KQ,Interesting theoretical work on establishing sample complexity bounds for learning certain distributions using GANS,"This paper explores how discriminators can be designed against certain generator classes to reduce mode collapse. The strength of the paper is on establishing the sample complexity bounds for learning such distributions to show why they can be effectively learned. The work is important in understanding the behaviour of GANs. The work is original and significant. A few comments that need to be addressed are listed as below: + +1. I found the paper is a bit hard to follow in the beginning, due to its structure. In Section 1, it first gives introduction and then talks about the novelty of the paper; it then shows more background work followed by more introduction of the proposed work; after that, Section 1.4 talks more related work. It makes reading confusing in the beginning. + +2. The authors wrote that ""In practice, parametric families of functions F such as multi-layer neural networks are used for approximating Lipschitz functions, so that we can empirically optimize this objective eq. (2) via gradient-based algorithms as long as distributions in the family G have parameterized samplers. (See Section 2 for more details.)"" I am not sure how Section 2 gives more details. + +3. There are some typos and the references are not very carefully edited. For example, in Theorem 4.5, ""the exists a ..."" -> ""there exists a ...""; in reference, gan -> GAN.",8,2.0,ICLR2019 +1biktfSjxZ-,4,4kWGWoFGA_H,4kWGWoFGA_H,"The topic this paper tackled may be important for the real-world application, and the method presented in the paper might be effective. However, the paper lacks experiments and evidence to properly support the authors’ claim. Also, the technical novelty is not significant. Therefore, I chose ""rejection"" as an evaluation of this paper. If all the issues below are fully addressed, I may reconsider my assessment of this paper.","The authors explored the robustness of video machine learning models to bit-level corruption. They investigated previous methods such as Out-Of-Distribution (OOD) detection and adversarial training and found that they are not effective enough to defense against the bit-level corruption. Accordingly, this paper proposed a new framework, Bit-corruption Augmented Training (BAT), which utilizes the knowledge about corruption by bit-level data augmentation at the training stage. Also, the authors argue that the proposed method outperforms the previous methods in handling the bit-level corrupted dataset. + +While the proposed method seems simple and more effective than the previous studies, the authors do not currently provide a sufficient amount of evidence to support their claim. + +Pros +- The proposed defense technique is effective on the bit-level corruption of videos. Also, it is simple to apply in real-world deployment. +- This is the first work that addresses the robustness against bit-level corruption of videos. + +Cons +- The technical novelty of this paper is not significant. They proposed a new framework for robustness to video bit-level corruption. However, data augmentation method is the main technical contribution this paper proposes. I think adopting existing technique to bit-level corruption problem is not significantly novel. +- There is insufficient evidence that the proposed method is better than previous studies. I think additional experiments on OOD detection are needed. 
The authors implemented only one method as a baseline, ODIN [1]. Since there exist many other state-of-the-art methods than ODIN, such as Mahalanobis [2] and Outlier exposure [3], they should also be compared. In addition, there is no explanation of an input preprocessing method proposed in ODIN. The authors need to articulate why they omit the preprocessing. Also, additional experiments on UCF101 by using corruption-agnostic and corruption-aware defenses would make the paper more convincing. +- Furthermore, in Section 4.3, the rationale behind the importance of detecting low-level corrupted samples by OOD is not elaborated. As the authors mentioned, largely corrupted videos are likely to be misclassified, while the accuracy for videos with low-level corruption does not decrease. +- For clarity, I recommend the authors to correct the minor typos in the paper. The adversarial training in Section 4.3 seems to be worse compared to the no-defense baseline by 8.6 points (not 8.1 points) on clean data. + +[1] Liang et al., ""Enhancing the reliability of out-of-distribution image detection in neural networks."", ICLR'18 + +[2] Lee et al., ""A simple unified framework for detecting out-of-distribution samples and adversarial attacks."", NIPS'18 + +[3] Hendrycks et al., ""Deep anomaly detection with outlier exposure."", ICLR'19 + +---------------------------------- +After rebuttal: + +I appreciate the authors for thoughtful response and additional experimental results, which are helpful for further understanding of the manuscript. Especially, the additional experiment on the recent OOD detection method addresses my concern about the evidence that the previous OOD studies are not sufficient for defending the bit-level corruption. + +Unfortunately, I am still not sure about the technical novelty of this paper. I agree that the paper proposed a new problem setting, but I do not think that the technical novelty is significant, given the proposed approach of just applying the data augmentation simply at a bit level, rather than at a pixel level. + +Due to this concern, I want to keep my rating of ""4. Ok but not good enough - rejection"" as it is. +",4,2.0,ICLR2021 +lLRXhjVkmi6,1,6FsCHsZ66Fp,6FsCHsZ66Fp,Review for new robust neural network definition ,"In this paper, the authors consider the task of adversarial/robust learning with respect to neural networks. The problem is a well-motivated one: suppose there is a neural network that on input a training set T={(x_i,y_i)} does a good classification job, but an adversary comes along and modifies some parts of T, then it is very possible that the neural network will classify almost incorrectly and hence isn't robust. In this paper they consider this setting and ask if we can naturally make neural networks robust.  + +In this direction, the authors propose a major change: in the conventional neural networks, one has sigmoid function which on input x, outputs sigma( w^T x ) (let's ignore bias for the time being) and the authors observe this functions is neither Lipschitz nor is it robust to noise. Instead, the author propost an ell_inf neuron which is just || w - x ||_inf. In this direction, they consider a neural network that is built out of ell_inf neurons. Simply by definition, it isn't too hard to see that this function is Lipschitz with respect to the ell_inf norm. The authors go on to show that every function can be represented using this new ell_inf NNs with sufficiently many neurons and sufficiently large depth. 
Furthermore, they go on to perform certain simulations for MNIST data.  +In my opinion here are the pros and cons of the paper: + +1) Pros: I think the problem is very well motivated and has been extremely well-studied. I am not an expert in this area, but i find their ell_inf neuron pretty interesting as well. Their simulations also seem very intriguing that such neurons seem to work after all (which is slightly surprising to me based on what I say next) + +2) Cons: In my opinion, the authors do not make a sincere effort to compare both the models. A simple example where the new model is *extremely* inefficient is simply to compute the inner product function. It is easy to do it in the standard neural model (albeit its not robust), but in the new model, even the non-robust setting, I don't think the inner product can be computed easily. SO it seems to me that their ""fix for robustness""might lack the decades of research that has been done in understanding and proving results about the standard sigmoid function. This is an important aspect which is missing in their work.  + +Overall, I think the idea is nice, but I'd tend towards rejection since their fix could be nice if they can show that everything computable in the standard NN model can be computed in their new ell_inf NN model (with approximately the same complexity), but this seems to be missing in its current form.",4,3.0,ICLR2021 +5NwFaKQ5y-Y,3,fycxGdpCCmW,fycxGdpCCmW,Simple yet effective method for improvement over JEM,"The paper proposes HDGE - a simple method to improve over JEM. JEM is optimized using a combination of two terms: +$\log p(y|x) + \log p(x)$ + +The first term is optimized using the standard cross-entropy loss, while the second term is optimized using SGLD. Running SGLD chains in each iteration can cause instability. In HDGE, instead of optimizing $\log p(x)$, an approximation to the conditional density $\log p(x|y)$ is optimized. The idea is to approximate the normalization constant $Z(\theta)$ with an empirical averaging of energy functions over a large memory bank. This yields a simple objective to optimize. The benefit of using such an approximation is that this eliminates the need for running SGLD, thereby improving the stability of training. + +The idea itself is simple and intuitive. Experiments show that HDGE consistently outperform / perform on-par with JEM on image classification, OOD detection and calibration. + +Should we call this a contrastive objective? Im not super convinced if the objective of $\log p(x|y)$ can be called a contrastive objective. Because in contrastive losses, we always focus on pairs of samples, i.e., we contrast the representation of one sample to another, while the objective in this paper takes the form similar to cross-entropy loss instead. Should the loss be called something instead? + +I would like the authors to have a discussion on the training stability of HDGE compared to JEM. It looks like HDGE would be more stable since we don't need to run SGLD, but this message should be made more clear as this is the most important improvement over JEM. + +The performance improvement in table 1 is marginal. So, it is important to perform multiple runs and report mean and standard deviations to understand the statistical significance of the results. + +Can HDGE be used for generative modeling? i.e., how do you sample from p(x)? Experiments in appendix show that HDGE can be used in combination with JEM for generative modeling, but this again requires running SGLD. 
Can HDGE be used in isolation for generative modeling tasks? + +How does the performance compare with other SOTA metods for OOD detection and calibation? Some comparisons that could be done include (but not limited to): Ren et al., ""Likelihood Ratios for Out-of-Distribution Detection"", Padhy et al., ""Revisiting One-vs-All Classifiers for Predictive Uncertainty and Out-of-Distribution Detection in Neural Networks"", Morningstar et al. ""Density of States Estimation for Out-of-Distribution Detection"", etc. + +",6,3.0,ICLR2021 +BJlxTZh6Yr,2,SkxpxJBKwS,SkxpxJBKwS,Official Blind Review #3,"Authors in introduce a new competitive/cooperative physics-based environment in which different teams of agents compete in a visual concealment and search task with visibility-based team-based rewards (although There are no explicit incentives for agents to interact with objects in the environment). They show that, complex behaviour emerge as the episode progresses and agents are able to learn 6 emergent skills/(counter-)strategies (including tool use), where agents intentionally change their environment to suit their needs. Agents trained using self-play + +In my opinion, this is an excellent paper which main contribution is to provide experimental evidence that relevant and complex skills and strategies can emerge from multi-agent RL competing scenarios. + +Minor comments: + +- Hide&seek rules and safety issues: is it not supposed that hiders and the seekers could not get together (i.e., hiders cannot push seekers or as we can see in some videos)? Furthermore, it is surprising (one would say worrying) that hiders identified the barriers as an impediment to the seeker (not only as a way to hide). I wouldn’t say that this is a “ human-relevant strategies and skills “ as the authors claim. Hider agents even double walled seekers! + +- Have the authors thought about joining the Animal-AI Olympics (http://animalaiolympics.com/) competition? It would be a great opportunity to to test the skills of your agents in a further general testing scenario. They provide an arena (test-bed) which contains 300 different intelligent tests for testing the cognitive abilities of RL agents (https://www.mdcrosby.com/blog/animalaiprizes1.html) which have to interact with the environment. +",8,,ICLR2020 +ByeTlXM6Fr,1,ryesZANKPB,ryesZANKPB,Official Blind Review #2,"This paper proposes a new meta-learning method (ML3) that meta-learns a loss function that is able to generalize across tasks. Building upon bi-level optimization framework as in MAML, instead of using a task-specific loss function in the inner loop, the authors compute adapted parameters of the model using a parametrized loss network and learn the loss network via backpropagation. Experiments are conducted on supervised sinusoid regression and binary digit classification as well as on model-based and model-free RL benchmarks. + +Overall, this paper is an extension to the gradient-based meta-learning algorithms such as MAML. While the idea is natural, there is a prior work [1] that has investigated the effectiveness of learned loss in gradient-based meta-learning, which seems pretty similar to this paper. I wonder how this method could be compared to [1] in various domains. + +Besides, I wonder how important the extra information added during the meta-training time is and the authors should present comparison to ML3 without the extra information. + +Moreover, I believe comparing ML3 to more recent meta-learning algorithms such as various MAML variants (e.g. 
MAML++), PEARL, LEO, etc. would be important to show the effectiveness of ML3. Right now, the method is only compared to ML3 with task loss, which seems not very conclusive. + +[1] Yu, T., Finn, C., Xie, A., Dasari, S., Zhang, T., Abbeel, P., & Levine, S. (2018). One-shot imitation from observing humans via domain-adaptive meta-learning. arXiv preprint arXiv:1802.01557.",3,,ICLR2020 +H1gCgZOrKB,1,S1gEFkrtvH,S1gEFkrtvH,Official Blind Review #3,"This paper proposes BasisVAE for acquiring a disentangled representation of VAE. +Though the topic is much of interest for this conference, I cannot support its acceptance because the paper leaves many aspects unexplained in the model design. + +In particular, the following points need justified and clarified. +1) Theorem 1 is difficult to follow. +The claim of the theorem is unclear. +I suppose it says ELBO can be written as a sum with respect to z_i given p(z)=\prod_i p(z_i), but the statement is not clear enough from the text. +Proof of Lemma 1 is logically incomplete. Discuss the cases n>2. +Derivation of equation (6) from (5) seems erroneous: p(x|z_1, ..., z_n) = \prod_{i=1}^n p(x|z_i) / p^{n-1}(x) does not hold in general even if z_i's are independent p(z_1, ..., z_n)=\prod_{i=1}^n p(z_i). + +2) Connection between the objective function and Theorem 1 is unclear. +BasisVAE uses a linear combination of Eqs. (9,10,11) as its objective function. +How Theorem 1 motivates this formulation? + +3) Reconstruction error (9). +The text says \ell of Eq. (9) is the binary function and configured as in (Bojanowski et al. 2017). +However, Bojanowski et al. used a weighted l1 error Laplacian Pyramid representation. +Furthermore, the original VAE formulation uses a conditional log-likelihood log p(x|z) for the reconstruction term. +How is binary function \ell related the likelihood? + +4) KL regularization term (10). +For computing this term, the output of encoder c=f(x) should be converted into z. +Notation of N(f(x), \Sigma) is confusing. + +5) Figure 6 shows diversity in many factors. +Figure 6 is not as impressive for disentangled images since many factors change by varying a single basis. +Is this an expected result?",1,,ICLR2020 +Bylfi7tez,1,r17Q6WWA-,r17Q6WWA-,"This paper proposed a new block to combine domain-specific information from related tasks, in order to improve generalization of the target tasks. Although the relative improvement seems high (24.31%), its novelty is a little limited, and the target task in this submission(5 landmarks detection) is too simple to prove the effectiveness. ","Pros: +1. This paper proposed a new block which can aggregate features from different tasks. By doing this, it can take advantage of common information between related tasks and improve the generalization of target tasks. + +2. The achievement in this paper seems good, which is 24.31%. + +Cons: +1. The novelty of this submission seems a little limited. + +2. The target task utilized in this paper is too simple, which only detects 5 facial landmarks. It is hard to say this proposed work can still work when facing more challenging tasks, for example, 60+ facial landmarks prediction. + +3. "" Also, one drawback of HyperFace is that the proposed feature fusion is specific to AlexNet,"" In the original submission, HyperFace is based on AlexNet, but does this mean it can only work on AlexNet?",5,5.0,ICLR2018 +HygsMMPq2X,3,rkxt8oC9FQ,rkxt8oC9FQ,Review: Interesting paper and impressive results; why novel vs. 
standard PSM?,"========= Summary ========= + +The authors propose a novel method for counterfactual inference (i.e. individual/heterogeneous treatment effect, as well as average treatment effect) with neural networks. They perform propensity score matching within each minibatch in order to match the covariate distributions during training, which leads to a doubly robust model. + +PM is evaluated on several standard semi-synthetic datasets (jobs, IHDP, TCGA) and PM shows state-of-the-art performance on some datasets, and overall looks quite promising. + +======= Comments ======= + +The paper is well-written, presents a novel method of some interest to the community, and shows quite good performance across a range of relevant benchmarks. + +I have one major issue with this work: I don't see why propensity-score matching *within* a minibatch should provide a substantial improvement over propensity-score matching across the dataset (Ho et al 2011). I find the cursory explanation given (""it ensures that every gradient step is done in a way that is approximately unbiased"") unconvincing, since (a) proper SGD training should be robust to per-batch biases during training (the expected loss is identical for both methods, correct?), and (b) biases should go away in the limit of large batch sizes. If indeed SGD required unbiased *minibatches* then standard minibatch SGD wouldn't work at all. + +Looking at the experimental details in the appendix, it appears that the MatchIt package was used to do PSM, rather than a careful comparison under the same conditions. Are the exact matching procedure, PS estimator model, choosing ""one of 6 closest matches by propensity score"", batch size, etc. the same between your PM implementation and MatchIt? I'd be very curious to see the results of a controlled comparison between Alg S1 and S2 under the same conditions (i.e. run your PM implementation on the whole dataset), and perhaps even some more clever experiments illustrating why matching within a minibatch is important. + +Another hypothesis for why PM is better than PSM is that the matching distribution for PM changes at each epoch (at least due to the randomization among the 6 closest matches). Could it be that the advantage of PM is that it actually provides a randomized rather than constant distribution of matched points? + +Can the authors provide more motivation for why PM should outperform PSM? Or some more careful comparison of these methods isolating the benefits of PM? I think a convincing justification and comparison here could change my opinion, as I like the paper otherwise. Thanks! + +Detailed Comments: + +- There is insufficient explanation of the PM method in the main text. The method is only mentioned in a single sentence buried in the middle of a long paragraph ""In PM, we match every sample within a minibatch..."". This should be made more clear, e.g. by moving Algorithm S1 to the main text. +- The discussion on Model Selection and the argument for nearest-neighbor PEHE is clever and well-supported by the experiments. +- In Table 3 and 4, it's not clear which numbers are reported by the original authors and which were replicated by the authors.",5,3.0,ICLR2019 +BeP6Z_DBeL,4,loe6h28yoq,loe6h28yoq,Neat theoretical results; questionable for a practical application,"This paper studies to train a certifiable robust model against data poisoning attacks using nearest neighbors. 
The paper studies the voting mechanism in the nearest neighbor models, and presents a relationship between the poisoning instances and the difference between the majority votes and the second majority votes. Such a relationship will result in a guarantee on the lower bound of a training model's accuracy, which is referred to as Certified Accuracy (CA). The theoretical results are neat. The experiments are conducted on MNIST and CIFAR, and results show better CA than previous approaches of DPA and Bagging. + +My main concern is the limitation of the applicable machine learning models, which seems restricted to only kNN and rNN models --- they may not yield the best performance on most interesting tasks. For example, from Fig 2 and 3, we can see that even when poisoning size e=0, the accuracy (which should be identical to CA) is far below the SOTA on the corresponding MNIST and CIFAR tasks. + +Also, the lower bound is established with respect to s_a - s_b - e <= k-e. Therefore, to be able to handle larger poisoning size e, one has to employ a larger k (in the kNN case) or larger r (in the rNN case). Such choices of hyperparameter typically hinder the accuracy on the clean dataset. It is not clear how such a restriction can be mitigated in a practical setup. + +Due to the above concern, I'm borderline on work. +",5,2.0,ICLR2021 +ryl6iV7AtB,2,BJgAf6Etwr,BJgAf6Etwr,Official Blind Review #2,"Summary: This paper proposes to augment crosslingual data with heuristic swaps using aligned translations, somewhat like what bilingual humans do in code-switching. I think this paper investigates a neat extension of the XNLI dataset, which is in fact the sort of thing it was created to enable! It also looks at SQuaD translations (but, I'd have preferred a bit more depth on one of these datasets over having both, but I understand why you made this rhetorical choice). + +Your augmentation extension to XNLI also uncovers a bunch of surprising results, like that code-switched utterances help models do better than monolingual ones! My main issue, if i had to find one, is that the paper doesn't try to offer (even possible) explanations for the unexpected results; maybe try to find space for more of these in a discussion section? Finally, this paper is really fun and well written, thanks for the effort! I'm going to leave a bunch of questions: it would be cool to see some in the final, but if they don't fit, you can consider them for a follow up. + +Questions: +-Are all ""portions"" full sentences? Did performance change based on which ""portions"" you swapped? In the human code-switching literature, there are syntactic generalizations about what gets switched. If you analyze the swapping, you could figure out which parts of the sentence (say, verb phrases v. prepositional phrases, beginning v. middle v. end, etc.) mattered more for NLI performance. I'd love to know the answer to that question! +-you say this: ""The BLEU score of the translation system has little effect on a language’s performance as a cross-lingual augmentor. "" Any ideas on why? +-you also say this: ""for every language a XLDA approach exists that improves over the standard approach"", what a tantalizing statement! Why did that happen?! +-Are there any generalizations over whether typologically similar languages are better augmentors for each other than they are for really different ones? I feel like if you could redo your XLR method (fig. 
4) by adding augmentors in order from most similar to least (or vice versa), and you might find the answer to this. +-for XNLI, I'd love to see if you have differences by label (maybe in an appendix?) + +Small Notes: +-the text in fig1 should be bigger. +-too many Ms and Ls, you had me chuckling at all the acronym puns! +-define ""augmentor"" somewhere",8,,ICLR2020 +Hyl2WZpf5H,3,BJxbOlSKPr,BJxbOlSKPr,Official Blind Review #2,"This paper considers the problem of having compact yet expressive KD code for NLP tasks. The authors claim that the proposed differentiable product quantization framework has better compression but similar performance compared to existing KD codes.The authors present two instances of the DPQ framework: DPQ-SX using softmax to make it differentiable, and DPQ-VQ using centroid based approximation. While DPQ-SX performs better in terms of performance and compression, DPQ-VQ has the advantage in scalability. + +- Significance +It's understandable that the size of the embedding is important, but there's been a lack of explanation as to why this should be done only through KD codes. Hence, it is doubtful how big the impact of the proposed framework is. + +- Novelty +Just extending and making Chen et al., 2018b's distilling method to be differentiable has limited novelty. + +- Clarity +The paper is clearly written in most places, but there were some questions about the importance and logic of statements. + +- Pros and cons +Compared to Chen et al., 2018b, there is no need to use expensive functions, and performance is better. But, the baseline consists only of algorithms using KD codes; there might be many disadvantages compared to other types of algorithms. + +- Detailed comments and questions +1. It is true that the parameters for embedding make up a large part of the overall parameters, but I would like some additional explanation of how important they are to learning. It is usually not necessary to train the entire embedding vector on GPU, so it would not be a big issue in the actual learning process. +2. In a similar vein, it would be nice to show which of the embedding vector size or the LSTM model size contributes significantly to performance improvements. If LSTM model size contributes more, the motivation would be weakened. +3. It would be nice to add more baselines such as Nakayama 2017 as well as the standard compression/quantization methods used in other deep networks. And please explain why we should use KD codes to reduce embedding size. Also, why the distilling in Chen et al., 2018b is a problem? +4. Did you run all experiments just one time? There is no confidence interval. +5. DPQ models have different compression ratios depending on the size of K and D. It would be great to show the change in PPL according to the compression ratio of DPQ models. +6. Can we apply it to pre-trained models like BERT?",3,,ICLR2020 +gah7TKW8Nlr,3,OcTUl1kc_00,OcTUl1kc_00,"This work is quite constructive, but low innovation","Summarization + +The authors formalize four levels of injection of graph structural information, and use them to analyze the importance of long-range dependencies. Among these four different structural information injections, the authors design various graph analysis tasks to evaluate the superiority of the proposed methods, and the experimental results could be reasonable and easy to follow + + +Strong points + +1)This paper is good writing and easy to understood. 
The proposed Random Walk with Restart (RWR) Matrix and RWR Regularization are quite reasonable to boost the model performance of graph neural networks (GNNs), and easy to follow. + +2)From the experimental results, the proposed methods have been proven efficient in various graph analysis tasks (node classification, graph classification and triangle counting) for different GNNs (GCN, GraphSage and GAT). + + +Weak points: The main weakness could be innovation and experiments + +Innovation: + +The proposed methods are quite heuristic, and I assume there could be many other improvements: + +1)In section 2.2, the combination of Random Walk with Restart (RWR) Matrix and GCN is too heuristic, why not try other feature fusion methods rather than straightforward concatenation. + +2)The developed RWR Regularization can be regarded as a formulation of Laplace Regularization, and only thing you do is replacing the Laplace matrix with the RWR matrix. Actually, there are also other graph construction methods (like consine similarity matrix etc) to replace the RWR matrix in RWR Regularization, you need to introduce additional experiments to prove the advantages of RWR matrix. + +Experiments + +1)The experimental results in Table. 1 are not so convinced. I agree with the point that your work don’t focus on defining new state-of-the-art results, but you still need to provide the node classification comparisons with the same train/validation/test split defined by Yang et al. (2016). + +2)As your definitions, \lambda is a trade-off hyperparameter, but I miss the setting and ablation study of this important hyperparameter. + +3)Why not try AD+RWRREG? From the results, this combination seems could be better (like GCN, Diffpool in Table.2). + +Questions: + +My questions have been included in Weak points part + +Additional Feedback + +1)Time complexity of constructing Random Walk with Restart (RWR) Matrix.",6,4.0,ICLR2021 +rJexO5ZynQ,1,B1l08oAct7,B1l08oAct7,"Two advances for variational Bayes on neural networks. Expectations are done deterministically (as in PBP), not by Monte Carlo, thus reducing variance. The weight prior is learned with length scales by empirical Bayes. Both should make VB training more robust, but experiments do not show that.","Summary: + +This work is tackling two difficulties in current VB applied to DNNs (""Bayes by backprop""). First, MC approximations of intractable expectations are replaced by deterministic approximations. While this has been done before, the solution here is new and very interesting. Second, a Gaussian prior with length scales is learned by VB empirical Bayes alongside the normal training, which is also very useful. + +The term ""fixing VB"" and some of the intro is not really supported by the rather weak experiments, done on small datasets and networks, where much older work like Barber&Bishop would apply without any problems. While interesting and potentially very useful novelties are presented, and the writing is excellent, both experiments and motivation can be improved. + +- Quality: Extremely well written paper, I learned a lot from it. Approximations are + tested, great figures to explain things. And the major technical novelty, the + expression for , is really interesting and useful. +- Clarity: Excellent writing until it comes to the experiments. Here, important + details are just missing, for example what q(w) is (fully factorized Gaussian?). + Very nice literature review, also historical. 
+- Originality: The idea of matching Gaussian moments along the network graph is + previously done in PBP (Lobato, Adams), as acknowledged here. Porting this from + ADF to VB gives dDVI. PBP also has the property that a DL system gives you the + gradients. Having said that, I think dDVI may be more useful than PBP. + While Barber&BIshop 98 is cited, they miss the expression for in + there. Now, what is done here, is more elegant, does not need 1D quadrature. +- Significance: Judging from the existing experiments, the significance may be + rather small, *if one only looks at test log likelihood*. I'd still give this the + benefit of the doubt, as in particular dDVI could be really interesting at large + scale as well. But the authors may tone down their language a bit. + To increase significance, I recommend to comment beyond just test log + likelihood scores. For example: + - Does the optimization become simpler, less tuning required, more automatic? + Would one not expect so, given you make a big point out of reducing variance? + Does it converge faster? + - Can you do something with your posterior that normal DNN methods cannot + do? Better decisions (bandits, active learning, HPO)? Continual learning? + In the end, who really cares about test log likelihood? + +Experiments: +- What is the q(w) family being used here? Fully factorized Gaussian? I + suppose so for dDVI. But for DVI? Not said anywhere, in main paper or + Appendix +- A bit disappointing. Why not evaluate at least dDVI with diagonal q(w) on + some much larger models and datasets? Why not quote numbers on speed + and robustness of learning, etc? Show what you really gain by reducing the + variance. +- Experiments are OK, but on pretty small datasets, and for single hidden + layer NNs. On such data and models, the Barber&Bishop 98 method could + be run as well +- Was MCVI run with re-parameterization? This is really important. If not, + this would be an important missing comparison. Please be clear in the main + text +- Advantages over MCVI are not very large. At least, dDVI should be faster to + converge than MCVI. + Can you say something about robustness of training? Is it easier to train + dDVI than MCVI? +- Why not show the PBP-1 results, comparing to dDVI, in the main text? Are they + obtained with the same model? dDVI is doing better. + +Other points: +- Please acknowledge the expression in Barber&Bishop 98. Yours is + more elegant and faster (does not need 1D quadrature) +- Relation to PBP: Note that dDVI has an advantage in practice. With PBP, I need + to compute gradients for every datapoint. In dDVI, I can do mini-batch + updates. +- I just *love* the header ""Wild approximations"". I tend to refer to this kind of work + as ""weak analogies"". Why do you not also compare against this, and show it really + does not work? +",7,5.0,ICLR2019 +rJepBpantB,2,rJxq3kHKPH,rJxq3kHKPH,Official Blind Review #1,"Updated review: Thanks for your comments. I feel the latest version of the paper is better than the previous version. +However, as stated by other reviewers as well, the claims of the paper are quite ambiguous. Another example from the author response is the point about how Chapter 6 of the Elements of Infomation Theory is related to Gambler's loss. This is not clear to me. I would not object to accepting the paper but I find it difficult to recommend accept for this paper. Perhaps the authors can be more clear in their claims. 
+ +---------------------------------------------------------------------------------------------------------------------------------------------------------- + +Summary: The paper focusses on the problem of noisy labels in supervised learning with deep neural networks. The paper, in turn, proposes an early stopping criterion for handling label noise. The early stopping criterion is dependent on a new loss function that is defined as the log of true label + weight on a reservation option? The paper shows that when the labels are corrupted then the propose early stopping criterion does better than early stopping criteria obtained via the validation set. + +\The first section of the paper establishes that when label noise is present in the dataset, then there are three stages to training a deep neural network. +The learning stage where the highest accuracy on the test set is achieved. +The gap stage where test set accuracy goes down. +Memorization stage corresponds to when a deep neural network memorizes corrupt labels and test accuracy goes completely down. +I cannot understand figure 1(a). The y-label says accuracy but it seems that the plot is about loss. What dataset was this and what architecture of DNN was used? The plot shows that the DNN achieved a 100% accuracy in 5 epochs. Is this result meaningful? Before establishing a hypothesis based on this should the hypothesis not be tested on multiple datasets. +The paper says that these stages are persistent across multiple architectures and datasets and as proof the paper says ‘we verified that’. Why can’t the reader see the experiments? By across datasets does the paper mean MNIST and CIFAR? By across architecture does the paper mean the two architectures mentioned in the appendix one each for MNIST and CIFAR respectively? + +The paper makes the assumption that label noise is symmetrically corrupted. Why and where does such an assumption hold? What happens to the proposed method if that is not true. + +Assumption 2: During the gap stage the model has learned nothing about the corrupt data points. +How is that even possible? + +Equation 1: So the loss function proposed is log(f(x)_y + (1/o) f(x)_m+1) . What is y here? The true label? Why is y called a point mass? Is this different from the cross-entropy loss + log loss on m+1 ? + +I do not understand equations 2 to 5. + +For figure 3 again what datasets were used? + +“ Making random bet will help with making money and a skilled gambler will not make such bets” +Why does making random bet help with making money? If random is good how can a skilled gambler exist in such a game? What is this skill? + +k denotes the sum of probability of predicting anything that is not y or m+1 (it does not denote prediction). + +In the experiments section what was the symbol for the rate of corruption changed from epsilon to r. Are they different? + +What is nll? + +It seems that gamblers loss best shines when the corruption rate is as high as 80% . That is 80 percent of the data is corrupted. Does this mean that if I trained with only 20% of the non-corrupt data I would still get a 99% accuracy on MNIST (even without gamblers loss)? A comparison of this sort would have been useful. +One astonishing result the paper presents is that with gambler’s loss even with 80% corrupt labels a 94% test accuracy is possible on MNIST dataset. 
I think this is significant; it raises the question: is it required to label all the data points in a dataset to achieve high accuracy, or is it possible to achieve just as much with only 20% of the labels?",3,,ICLR2019 +BJeuzz5TnX,3,SyxaYsAqY7,SyxaYsAqY7,"Interesting ideas, insufficient experimental evaluation.","The paper makes three rather independent contributions: a) a method for constructing adversarial examples (AE) utilizing second-order information, b) a method for certifying classifier robustness, c) a method to improve classifier robustness. I will discuss these three contributions separately. + +a) Second order attack: Miyato et al. (2017) propose a method for constructing AE for the case where the gradient of the loss is vanishing. In this case, at a given point, the direction of steepest loss ascent can be approximated by the gradient at a randomly sampled nearby point. Miyato et al. (2017) show how this can be derived as a very crude approximation of the power method. The authors of the current paper apply this attack to the adversarially trained networks of Madry et al. (2017). They find that the *L_infinity* trained networks of that work are not as *L_2* robust as originally claimed. I find this result interesting, highlighting a failure case of first-order methods (PGD) for evaluating adversarial robustness. However, it is important to note that these were models that were *not* trained against an L2 attack and thus should not be expected to be very robust to one. Therefore, this result does not identify a failure of adversarial training as the authors seem to suggest but rather a failure of the original evaluation of Madry et al. (2017). It is also worth noting that this finding is specific to MNIST given the results currently presented. This might be explained by the fact that robust MNIST models tend to learn thresholding filters (Madry et al., 2017) which might cause gradient obfuscation. + +b) Adversarial robustness certification: The authors propose a method for certifying the robustness of a model based on the Renyi divergence. The core idea is to define a stochastic classifier that randomly perturbs the input before classifying it. Given such a classifier, one can construct the probability distribution over classes. The authors prove that given the gap between the first and second most likely classes, one can construct a bound on the L2 norm of perturbations required to fool the classifier. This method is able to certify the adversarial accuracy of some classifier to relatively small epsilon values. While I think the theoretical arguments are elegant, I find the overall contribution incremental given the work of Mathias et al. (2018). Both methods seem to certify robustness of roughly the same scale. One component missing from the experimental evaluation is how the certifiable accuracy differs between robust and non-robust models. Currently there are only results for a single model (Figure 1) and it is not clear from the text which one it is. Given that there exists a section titled ""improved certifiable robustness"" I would at least expect a result where a model with higher certifiable accuracy is constructed. + +c) Improved robustness via stability training: The authors propose a method to make a classifier more robust to input noise. They add a regularization term to the training loss that penalizes a change in the probabilities predicted by the network when the input is randomly perturbed. 
In particular, they use the cross-entropy loss between the probability distributions predicted at the original and the perturbed point. The goal is to train a model that is more robust to random perturbation which will then hopefully translate to robustness to adversarial perturbation. This method is evaluated against the proposed attack (a) and is found to be more robust to that attack than previous adversarially trained models. Overall, I find the idea of stability training interesting. However, I find the current evaluation severely lacking. First of all, these models should be evaluated against a standard PGD adversary (missing from Table 1). Even if that method is unreliable when applying random noise to the input at each step it is still an important sanity check. Additionally, in order to deal with the stochasticity of the model one should experiment with a PGD attack that estimates the gradient using multiple independent noise samples (see https://arxiv.org/abs/1802.00420). Finally, other attacks such as black-box attacks and finite-differences attacks should potentially be considered. Given how other defenses based purely on data augmentation during training or testing were bypassed it is important to apply a certain amount of care when evaluating the robustness of a model. + +Overall, while I think the paper contains interesting ideas, I find the current evaluation lacking. I recommend rejection for now but I would be willing to update my score based on author responses. + +Minor comments to the authors: +-- Last paragraph of first page: ""Though successful in adversarial defensing, the underlying mechanism is still unclear."", adversarial training has a fairly principled and established underlying mechanism, robust optimization. +-- Figure 2 left: is the natural line PGD or SO? +-- The standard deviation of the noise used is very large relative to the pixel range. You might want to comment on that in the main text. +-- Figure 3: How was the Madry model trained? L_inf or L_2?",4,5.0,ICLR2019 +6Z0fKkYp941,3,BEs-Q1ggdwT,BEs-Q1ggdwT,"A more practical RL algorithm for mean-variance MDP problem, justified by detailed experimental results"," +In this paper the authors proposed a new mean-variance algorithm whose policy gradient algorithm is simpler than those of other SOTA methods and has an unbiased gradient. Instead of formulating the problem as a traditional mean-variance constrained problem, the authors utilized quadratic utility theory and formulated the problem as a variance minimization problem with a mean reward equality constraint. Then by reformulating the problem with the penalized problem and opening up the variance formulation, they showed that this mean-variance formulation does indeed have an unbiased policy gradient that does not require advanced techniques such as double sampling or Fenchel duality. To demonstrate the effectiveness of this method on balancing risk and return, they also evaluate their method on several risk-sensitive RL benchmarks (such as portfolio optimization) and compare it with a wide range of risk-sensitive RL methods. + +In general, I find this paper well-written with ample discussions of state-of-the-art mean-variance RL algorithms. 
I also like the flow of this paper, which first enumerates the existing issues of mean-variance RL approaches (which require double sampling or a Fenchel duality trick to get an unbiased gradient for actor-critic), and then proposes an alternative algorithm with an unbiased gradient that is much simpler yet circumvents the aforementioned complexity. They also demonstrate the performance of this new method on a sufficient number of experiments, ranging from discrete-action Atari domains to the portfolio optimization problem, including comparisons with most known mean-variance RL algorithms, and show that the proposed method achieves some of the best results. + +However, I do have several questions about this paper: +1) How does the per-step variance discussion in Section 3.2 relate to the proposed method? + +2) Can the authors provide more motivation for the problem formulation (1)? It's different from the standard Markowitz mean-variance formulation. How does one set \psi instead of treating it as a tunable parameter? Even if one is convinced that (1) is the ""right"" RSRL formulation to solve, why is the penalty formulation in (3) equivalent to the original formulation of (1)? Is there any formal proof?",6,4.0,ICLR2021 +KjHZ8ijc6w0,3,Mh1Abj33qI,Mh1Abj33qI,Official Blind Review #3,"This paper extends the geometric scattering network by relaxing its scattering construction to enable training / data-driven learning. There are three major modules in the proposed network architecture: diffusion module, scatter module, and aggregation module. They conduct experiments on two tasks: whole graph classification and graph regression. + +The idea to relax the geometric scattering network is novel to the best of my knowledge. However, I have the following concerns: + +* Why is it that LEGS only outperforms other GCN methods on biological datasets? The major advantage of LEGS compared with other low-pass filter based GCNs is that it goes beyond low frequencies and considers richer notions of regularity. Why doesn't this advantage manifest in performance on other types of graph data (e.g. social networks)? + +* The result in Table 2 does not seem promising. If LEGS only performs well on graphs that exhibit certain properties, showing results on synthetic datasets would help. + +* I suggest that the authors report results on larger datasets like QM9. All experiments are conducted on datasets with no more than 5000 instances. Or is that due to computational complexity and scalability issues? + +* What are the advantages of the proposed method when compared with scattering-GCN [1]? +They also address the problem of oversmoothing and scattering-GCN is also learned in a data-driven fashion. How does scattering-GCN perform if we obtain whole-graph features by aggregating node features obtained by their method? Why is it not included in the baseline? + +[1] Yimeng Min, Frederik Wenkel, and Guy Wolf. Scattering gcn: Overcoming oversmoothness in graph convolutional networks. arXiv preprint arXiv:2003.08414, 2020. + +",6,3.0,ICLR2021 +y2mFAXttNp,4,xHqKw3xJQhi,xHqKw3xJQhi,Jointly learning graph and labels," +In this work the authors start from the following basic observation, which I will state in terms of binary classification. In an ideal setting where there exist two labels, the graph structure should be two distinct connected components, and according to the author(s) the most natural choice is that each component is a clique. 
However, when one performs semi-supervised learning, edges going across communities over-smooth the labels, and especially in the absence of many labeled points this causes big issues in node classification. For this reason the authors assume that the graph is actually a noisy version of some latent graph. They incorporate a variational approach to GCNs, as a novel architecture that iteratively refines the node labels and the graph. Figure 2 nicely summarizes how the proposed architecture enhances the community structure. Some comments to the authors follow. + + +- Could the authors discuss alternative choices to l2 minimization, such as the paper Algorithms for Lipschitz Learning on Graphs by Rasmus Kyng et al., which uses Lp-norm minimization as a means to avoid over-smoothing in a different but super relevant context of label propagation? +- The authors discuss drop edge. While this is an issue of the DropEdge method, wouldn't it make sense to remove edges that have higher effective resistance instead of random sampling? The latter is more likely to kill intra-edges, instead of the inter-edges that cause the oversmoothing. +- What happens when instead of having in the ideal setting two connected components that are cliques, there are two bipartite cliques instead? This would capture a notion of 'heterophily' instead of 'homophily' that naturally creates a clique. Can your method be extended to this case? +- Can you prove that when the input graph is a stochastic block model your method provably results in the right classification? It seems such a claim could be plausible to prove analytically, at least when the gap of intra- vs inter-community edges is large enough. +",6,4.0,ICLR2021 +vlqTqWqjcaq,2,om1guSP_ray,om1guSP_ray,"The method is novel, but the performance improvement is slight","Summarization + +The authors propose a novel pooling layer based on edge cuts in graphs, where a regularization function is introduced to produce edge scores via minimizing the minCUT problem. Through extensive experiments, the authors have shown that the proposed EdgeCut pooling structure can achieve comparable performance in various graph analysis tasks. + +Strong points + +1) The paper is well written and easy to understand. As far as I know, the proposed EdgeCut is a novel pooling layer based on edge cutting, which is reasonable and can explore hierarchical graph structures. +2) The authors have provided code in the supplement and the experimental results are easy to follow. + +Weak points + +The main weakness could be the model performance, and there are not enough experiments to prove the efficiency of the proposed EdgeCut. + +1) For graph classification in Table 1, is there any thought about the reason why the proposed EdgeCut is worse than g-U-Nets? And there is no detailed experimental analysis in Section 4.1. + +2) Similar question: for node classification in Table 3, the performance of EdgeCut is still worse than GAT in two of three datasets, and even only obtains a slight improvement compared to the most basic GCN in Table 3. Notably, the performance of EdgeCut without regularization is even worse than GCN in both Citeseer and Pubmed datasets. The authors should provide more comprehensive experimental analysis and explain the reason why the performance improvement is so slight and even worse than the basic GCN. 
+ +3) There are only two quantitative experiments in this paper; additional qualitative experiments, like visualizations of graphs after pooling, should also be included to demonstrate the model's efficiency, even if only making use of toy data. + +Questions: + +My questions have been included in the Weak points part. + +Additional Feedback: + +1) Can you provide time complexity comparisons for the proposed EdgeCut and other baselines?",5,3.0,ICLR2021 +HJlhggcgM,3,H135uzZ0-,H135uzZ0-,New setup for CNN with half precision that gets 2X speedup on training,"This work presents a CNN training setup that uses a half-precision implementation that can get a 2X speedup for training. The work is clearly presented and the evaluations seem convincing. The presented implementations are competitive in terms of accuracy, when compared to the FP32 representation. I'm not an expert in this area but the contribution seems relevant to me, and sufficient for publication.",6,3.0,ICLR2018 +rJeb2tXkqB,2,HJxdTxHYvB,HJxdTxHYvB,Official Blind Review #3,"The paper presents a new attack, called the shadow attack, that can maintain the imperceptibility of adversarial samples when out of the certified radius. This work not only aims to target the classifier label but also the certificate by adding large perturbations to the image. The attacks produce a 'spoofed' certificate, so even though these certified systems are meant to be secure, they can be attacked. Theirs seems to be the first work focusing on manipulating certificates to attack strongly certified networks. The paper presents the shadow attack, which is a generalization of the PGD attack. It involves the creation of adversarial examples, and the addition of a few constraints that force these perturbations to be small, smooth, and without many color variations. For certificate spoofing the authors explore different spoofing losses for l-2 (attacks on randomized smoothing) and l-inf (attacks on CROWN-IBP) norm-bounded attacks. + +Strengths: The paper is well written and well motivated. The work is novel since most of the current work focuses on the imperceptibility and misclassification aspects of the classifier, but this work addresses attacking the strongly certified networks. + +Weakness: It would be good to see some comparison to the state of the art.",8,,ICLR2020 +J0wH2wfUnT2,1,n4IMHNb8_f,n4IMHNb8_f,"Transformer for planning: interesting, but few points need clarification."," +Summary: + +The paper provides an interesting direction in the field of spatial path planning. The method is interesting as it is fully learnt in an end-to-end fashion. The key idea is to use a transformer-like architecture to model long-range dependencies. Also, the paper extends its findings to out-of-distribution maps and the cases where the ground truth map is not known to the agent. + +########################################################################## + +Reasons for score: + +Overall, I think the paper is slightly below the acceptance threshold. The motivations and the architecture are clearly presented, but I think there is a lack of clarity in the way the methods and the results are presented, especially in the most interesting section, the one about the mapper. Also, the analyses are limited to reconstructions of the map and some attention weights in the appendix. + +########################################################################## + +Pros: + +1. Interesting new approach with potential applications in both navigation and manipulation tasks. +2. The out-of-distribution results are convincing. + +Cons: + +1. 
Although the method is a clear advantage over previous ones it still requires to know the ground truth distance to get the map. + +2. Instead of using a paragraph to explain the transformer architecture, which is relatively well know, my suggestion would be clarifying the mapper section with more notation to clearly state the loss. This would already help to most likely increase my score. + +3. In section 3.2 the authors claim that “While it is possible to train a separate mapper model to predict maps from observations, this requires map annotations which are expensive to obtain and often inaccurate”. I don’t think this is true: e.g. Gregor K. et al., 2019 -> https://arxiv.org/pdf/1906.09237.pdf present an example of how a map can be learnt. + +4. It is not clear how much the model relies on having perfect distances as supervision vs. noisy ones. An analysis on this topic I think would be important. + +5. No Analysis of computational complexity +",6,3.0,ICLR2021 +r1e6vm9pFH,3,rklB76EKPr,rklB76EKPr,Official Blind Review #3," +[Summary] +This paper studies the relationship between gradient clipping in stochastic gradient descent and robustness to label noise. Theoretical results show that gradient clipping in general is not robust to symmetric label noise. The paper then proposes a variant of gradient clipping (cl-clipping) that induces label noise robustness. Experiments support these claims on synthetic datasets and typical classification benchmarks. + +[Decision] +The first contribution, that gradient clipping does not induce robustness to label noise, is an important negative result given the prominence of gradient clipping and datasets with noisy labels. The second contribution, cl-clipping, amounts to minimizing a non-convex loss with saturating regions but, as far as I know, these properties are necessary for robustness to label noise. Theoretical results are limited to SGD with mini-batch size 1 but the insights carry over to larger mini-batches in the experiments. Overall, I recommend acceptance. + +[Comments] +The parameter tau controls robustness, and a higher noise level requires a higher tau. There is little discussion on how this parameter is chosen in the experiments. On the synthetic dataset, the Huberized loss uses tau=1 and the partially Huberized loss uses tau=2. How are these values chosen? Did the authors observe a U-shaped curve when sweeping over tau? On the real-world datasets, tau is fixed for each method across different noise levels. Does this mean that a single value of tau worked best regardless of the noise level, or was it tuned for a particular noise level? + +Proposition 4 shows that symmetric noise breaks down the clipping method in Eq (7) which can be seen as a special case of gradient clipping. I might be missing something here, but it is not obvious to me that, when the norm of x is constant across the samples, Eq (7) is equal to gradient clipping. +",6,,ICLR2020 +UwfUB2i6kic,3,PKubaeJkw3,PKubaeJkw3,Finalization step is crucial! ,"Summary: + +In one-shot differentiable NAS, a supergraph is usually trained (via bilevel optimization as in DARTS, or other approximations to bilevel such as gumbel softmax, etc). After supergraph training, a final architecture is obtained by taking the operator at each edge which has the highest architecture weight magnitude. This step is usually termed as the 'finalization' step. 
(In DARTS the finalization step actually orders incoming edges by the max of the architecture weight magnitudes at each edge and selects the top two edges and the corresponding maximum architecture weight in them as the final operators.). This paper examines this quite ad hoc step very closely. It finds that the magnitude of architecture weights (alphas commonly in this niche literature) are misleading. It shows by careful ablation experiments that alpha magnitudes are very much not useful in selecting good operators. + +By taking inspiration from the ""unrolled estimation"" viewpoint of ResNet prior work it shows that DARTS converging to degenerate architectures where alphas over parameters operators like skipconnect is actually to be expected when finalization step relies on the magnitude of alpha. + +The paper proposes a much more intuitive finalization step which just picks the operator at each edge which if removed from the supergraph results in the largest drop in validation accuracy. To bring back the supergraph to convergence a few epochs of further training is carried out between operator selection. + +Experiments show that just by carefully thinking about the finalization step in differentiable one-shot NAS, one can obtain much better performance. In fact, one does not even need architecture weights at all! Don't worry about complicated bilevel optimization, gumbel softmax approximation, etc. Just train a supergraph and pick operators progressively. + +Comments: + +- The paper is wonderfully written! Thanks! + +- As I read a paper I try to think without looking at the experiments, what set of experiments I would try to run to prove/disprove the hypotheses proposed. Afterwards I go through the experiments and see if those experiments were actually run (or if they differed why). In this case, every experiment and more were already run. Particularly towards the end I was thinking what if we just got rid of all the alphas and just trained a supergraph as usual and did the PT finalization as proposed. And lo and behold, it actually works better! + +- This paper is actually throwing a big wrench in one-shot differentiable NAS literature. Many papers are being written which try to improve/fix DARTS and DARTS-like methods. If I were to believe the experiments, I don't actually need to do any of that. I have some questions I hope to discuss with the authors: + +1. Is all the complicated bilevel optimization (often popular as 'metalearning' currently) not useful in the case of NAS? (This is not really the authors' burden to answer but I am just hoping to see if they have any insights.) + +2. Can we view the PT finalization step as a progressive pruning step? So if I were to turn this into a method which produces a pareto-frontier of models (e.g. accuracy vs. memory/flops/latency etc), we first train a big supergraph and then progressively prune out operators one at a time as proposed here and take a snapshot of the supergraph and plot it on the x-y plot (where say x is latency and y is accuracy) and pick the ones clearly on the pareto-frontier and train them from scratch? (Again not really authors' burden but curious if they have any insights) + +3. Figure 4 suggests that training the supergraph anymore than 20 epochs only hurts performance (no matter which finalization procedure is used, of course PT has far less of a drop). Does bilevel optimization actually hurt with weight sharing? 
+",10,5.0,ICLR2021 +rylSHQ6aYH,2,B1lj20NFDS,B1lj20NFDS,Official Blind Review #3,"The work describes an application of a spatial point process for solving problems with missing data. The authors introduce a novel method based on a non-parametric definition of point intensities for the multivariate case. The method incorporates VAE framework to effectively handle missing points via smooth intensity estimation and enjoys amortized inference for efficient computations and quick prediction generation. Using a sequence of mild assumptions, the authors show connection to a popular VAE-based collaborative filtering model, which turns out to be a special case of their approach. + + +This is a rigorous study providing theoretically justified evidence on the effectiveness of the proposed approach. Apart from the issues in the last part of experiments with classical collaborative filtering task (which will be detailed below), the work presents a solid research. I would therefore vote for accepting it. + + +The text is well structured, and all key points are clearly explained. The problem solved by the authors is well described, and the motivation for this work is convincing. The way point process theory is applied constitutes a rigorous probabilistic approach. The authors convincingly justify the need for all approximations and simplifications made in the model. One of key results making the entire model feasible is supported by the corresponding theorem proved by the authors. I haven’t carefully verified all the derivations, though. + + +My major concern is related to the last part with experiments on the Movielens data. As the authors state, “applications without explicit spatial information, we embed each event into a latent space as a vector.” “No spatial information” is exactly the case with the standard collaborative filtering task, which the authors attempt to solve. This leads to an introduction of an additional model like GNN, which is unrelated to the main approach. As GNN is involved it’s not immediately obvious that the improvement over standard VAE architecture, observed in the experiments on ML-100K and ML-1M, is due to a better point process modelling. No evidence is provided to argue that this is not simply due to a good compression or a good data preprocessing achieved by a GNN architecture itself. Therefore, the results on a pure recommender systems part are not convincing. What would happen if GNN was trained and fed into another (simpler) algorithm? Maybe a simple KNN based algorithm would produce comparable or even better results? As indicated by the work of [Dacrema, Cremonesi, Jannach 2019] on “A Worrying Analysis of Recent Neural Recommendation Approaches”, VAE-CF (along with several other recently proposed neural network-based methods) is inferior to even properly tuned kNN-based models. I would not be surprised, if a kNN model trained on GNN output would produce even better results than the proposed VAE-SPP. + +Another related question is how incorporating GNN affects the training time? Is it comparable to that of VAE-CF or is it much worse? Computational performance is an important part in making practical decisions and should be also considered. + +Furthermore, both ML datasets used for tests are too small and not very representative to make any generalized conclusions. 
Even on a larger ML-20M dataset an optimal SVD-based model can be trained within several minutes on a standard CPU on a laptop (according to my experiments, VAE-CF would take at least twice longer on Tesla K80). Therefore, it can hardly be considered a realistic example. In practice, there could be millions and hundreds of millions of items. The authors even mention it in the in the introduction, using it as a vehicle to motivate their approach. However, computing similarities between that many entities can be a laborious task on its own, which adds an extra layer of complexity and again is not directly related to the main approach. It can easily become a bottleneck or make further computations inefficient. More efficient similarity computations may in contrast reduce the resulting accuracy. +The issue can get even worse, because, unlike classical MF methods, there’s still no proper support for sparse operations in NN frameworks. In the VAE framework it means that, during the training, user batches will be converted into dense arrays and may become inefficient to work with in terms of memory and CPU utilization (a few non-zero entries vs. hundreds of millions of explicitly stored zeroes). +In spite of all this, I’d also suggest rephrasing “We validate these benefits through extensive experiments” as it sounds a bit exaggerated (if we are considering real recommender systems applications). I agree that the proposed approach is potentially applicable in real cases for recommender systems, however, there’s still not enough evidence for this. In fact, I don’t even think that completely removing the part with ML-100K and ML-1M datasets would make the whole work any worse. Clearly stating the region of applicability of the proposed approach would be enough. Right now some statements in this section in contrast are raising concerns rather than convincing the reader. The wording should be at least changed, so that readers do not get an impression that the case with classical CF task is solved purely by the proposed VAE-SPP approach. + +Other remarks to help improve the text: +1) “… points are more likely to … form clusters than the simple Poisson process …” the sentence seems to be inconsistent. +2) “The generative process of our model can be described as follow:” -> … as follows: +3) Page 4, last paragraph, line 6 – shouldn’t the upper bound for summation be N_u instead of just N? + +References: +Dacrema, Maurizio Ferrari, Paolo Cremonesi, and Dietmar Jannach. ""Are we really making much progress? A worrying analysis of recent neural recommendation approaches."" In Proceedings of the 13th ACM Conference on Recommender Systems, pp. 101-109. ACM, 2019.",8,,ICLR2020 +jz-Ikw-TJxT,2,7IDIy7Jb00l,7IDIy7Jb00l,"Offline Meta-RL using a variational approach coupled with state and reward relabeling. Discusses an important setting, and some new algorithmic ideas therein, but whether these ideas will work in practice is inconclusive. ","The paper studies the problem of offline Meta-Reinforcement Learning (RL). In this problem, N separate RL environments are considered which are drawn from a specific underlying distribution. For each such environment, M trajectories each of horizon H are provided beforehand. The task is to train an RL agent that performs well in expectation on a new RL environment drawn from the same distribution. The authors adapt the online VariBAD algorithm (Zintgraf et al., 2020) by the use of a techniques called state relabeling and reward relabeling. 
+ +The VariBAD algorithm formulates online Meta-RL as a Bayesian Adaptive MDP (BAMDP) and solves it approximately using a variational approach. The authors, due to lack of online data, use partial offline trajectories to create belief over RL environment. Next, this belief is used to map offline Meta-RL to the BAMDP setting -- a.k.a. state relabelling. Once mapped to BAMDP off-the-shelf offline RL techniques are used to solve the problem approximately. The authors further introduce reward relabelling that adds new trajectories by mixing rewards among the RL environments conditioned on the state and action. This is claimed to be useful if the reward distribution is independent of the transition matrix distribution of the RL. + +The authors provide multiple experiments with sparse rewards to show the VariBAD with relabelling works in some established benchmark RL tasks (albeit in the new offline Meta-RL setting). + +Pros: +* The offline Meta-RL is a very relevant problem with a lot of applications in the coming years. Designing a methodology to solve this problem using off-the-shelf techniques is a novel pursuit, which is carried out in the paper. + +* The idea of state and reward relabeling is novel in the meta-RL setup, as per my understanding (as an outsider to the Meta-RL community but with a background in RL). + +Cons: +1. Methodology: + +* The paper does not explore properly the (in)-famous ""deadly triad"" which is known to make offline RL difficult. Given the state-space of BAMDP captures beliefs over (possibly) complicated RL environments, the effect of ""deadly triad"" is conceivably even more prominent in the current formulation. + +* I am not sure why the reward relabeling is able to solve the MDP (a.k.a. aliasing in the offline RL community). In particular, ""Note that our relabelling effectively samples data from an MDP with transitions Pi and reward Ri, which has non-zero prior probability mass under the decoupled prior assumption"" -- the above statement is very vague given finite trajectories. The MDP aliasing (a part of the deadly triad) is closely related to this issue. + +* The state relabelling is heavily reliant on the existing VariBAD algorithm, and off-the-shelf RL techniques. Therefore, the algorithmic contribution seems somewhat lacking in my opinion. + +2. Experiments: +* The provided experiments lack any conclusive evidence of the effectiveness of reward-labeling. Why in the Half-Cheetah-Vel experiment reward relabeling performs worse -- is the reward independence assumption not true here? Will reward relabeling always perform worse? + +* The effectiveness of the Thompson sampling baseline used is not clear to me. Does no other simple baseline exist? I am alright with not comparing it with the concurrent works. However, such a comparison would have been much more convincing. + +",5,3.0,ICLR2021 +7wdTNl6c3e4,4,udaowxM8rz,udaowxM8rz,Modification to ImageNet-C too marginal,"This paper proposes a new dataset for estimating robustness to distribution shift, in particular corruption robustness. They accomplish this by proposing an alternative to ImageNet-C, ImageNet-NOC, which uses different corruptions. They consider corruptions not in ImageNet-C, and they argue that their dataset is superior because they have more ""balance and coverage."" They select corruptions that are ""decorrelated"" in a specific sense. + +I appreciate the analysis in this paper and provides useful comparisons to ImageNet-C. Unfortunately experimentation is not thorough. 
They don't show results with AugMix nor DeepAugment, so the analysis only uses SIN and ANT. + +The paper is missing key qualifiers. +""In other words, the expression (1) estimates if increasing the robustness to c2 implies an increase of robustness to c1 and vice versa."" +... provided that the model fits the exact corruption c2. +""Then, the higher (1) is, the more the robustnesses to c1 and c2 are correlated, i.e. the more c1 and c2 overlap."" +... provided the models are robustified through exactly training on c1 or c2. +Without these qualifications, the claims are far too general. +Contrast and fog can be negatively correlated with the rest of the corruptions under some robustness interventions. For example, adversarial training really harms contrast and fog corruption robustness, though it might help other corruptions like noise (and a less expansive testbed might not catch this). This demonstrates that their robustness correlations require qualification and are less predictive that this paper suggests. + +This paper is fairly similar to Is Robustness Robust? On the interaction between augmentations and corruptions (submission #704). If these papers have very different ratings, then there's a problem with this review process. + +Nitpicks: + +The bibtex for the citations are messed up. For example, we see ""(Evgenia Rusak, 2020)"" instead of ""(Rusak et al., 2020)"". Author ordering for papers in the bibtex is often scrambled, some are separated by commas, and some are not. In general this paper's formatting is a little unrefined. + +""weight decay set to 10e-4."" -> ""weight decay set to $10^{-4}$."" + +""For instance, a benchmark that contains a motion blur corruption covers the defocus blur corruption, because the robustnesses [sic] towards these two corruptions are correlated (Vasiljevic et al., 2016)."" +Vasiljevic et al. (https://arxiv.org/pdf/1611.05760v1.pdf) show the opposite. Fine-tuning on defocus blur did made models slightly worse on horizontal motion blur (Figure 2).",5,5.0,ICLR2021 +rki0XSHlf,2,HkbmWqxCZ,HkbmWqxCZ,"An interesting paper, but needs more work","This paper presents mutual autoencoders (MAE). MAE aims to address the limitation of regular variational autoencoders (VAE) for latent representation learning — VAE sometimes simply ignores the latent code z, especially with a powerful decoding distribution. The idea of MAE is to optimize the VAE objective subject to a constraint on the mutual information between the data x and latent code z: setting the mutual information constraints larger will force the latent code z to learn a meaningful representation of the data. An approximation strategy is employed to approximate the intractable mutual information. Experimental results on both synthetic data and movie review data demonstrate the effectiveness of the MAEs. + +Overall, the paper is well-written. The problem that VAEs fail to learn a meaningful representation is a well-known issue. This paper presents a simple, yet principled modification to the VAE objective to address this problem. I do, however, have two major concerns about the paper: + +1. The proposed idea to add a mutual information constraint between the data x and latent code z is a very natural fix to the failure of regular VAEs. However, mutual information itself is not a quantity that is easy to comprehend and specify. This is not like, e.g., l2 regularization parameter, for which there exists a relatively clear way to specify and tune. 
For mutual information, at least it is not clear to me, how much mutual information is “enough” and I am pretty sure it is model/data-dependent. To make it worse, there exist no metrics in representation learning for us to easily tune this mutual information constraint. It seems the only way to select the mutual information constraint is to qualitative inspect the model fits. This makes the method less practical. + +2. The approximation to the mutual information seems rather loose. If I understand correctly, the optimization of MAE is similar to that of a regular VAE, with an additional parametric model r_w(z|x) which is used to approximate the infomax bound. (And this also adds an additional term to the gradient wrt \theta). r_w(z|x) is updated at the same time as \theta, which means r_w(z|x) is quite far from being an optimal r* as it is intended, especially early during the optimization. Further more, all the derivation following Eq (12-13) are based on r* being optimal, while in reality, it is probably not even close. This makes the whole approximation quite hand-waving. + +Related to 2, the discussion in Section 6 deserves more elaboration. It seems that having a flexible encoder is quite important, yet the authors only mention lightly that they use the approximate posterior from Cremer et al. (2017). Will MAE not work without this? How will VAE (without the mutual information constraint) work with this? A lot of the details seem to be glossed over. + +Furthermore, this work is also related to the deep variational information bottleneck of Alemi et al. 2017 (especially in the appendix they derived the VAE objective using information bottleneck principle). My intuition is that using a larger mutual information constraint in MAE is somewhat similar to setting the regularization \beta to be smaller than 1 — both are making the approximating posterior more concentrated. I wonder if the authors have explored this idea. + + +Minor comments: + +1. It would be more informative to include the running time in the presented results. + +2. Since the goal of r_w(z | x) is to approximate the posterior p(z | x), what about directly using q(z | x) to approximate it? + +3. In Algorithm 1, should line 14 and 15 be swapped? It seems samples are required in line 14 as well. + +4. Nitpicking: technically the model in Eq (1) is not a hierarchical model. + +",5,4.0,ICLR2018 +S1gdft7NcS,4,B1eXygBFPH,B1eXygBFPH,Official Blind Review #3,"The paper addresses a real problem. Most attacks on graphs can be easily identified [1]. This paper argues that if one rewires the graph (instead of adding/deleting nodes/edges) such that the top eigenvalues of the Laplacian matrix are only slightly perturbed then the attacker can go undetected. + +The paper should address the following issues: + +1. There is no discussion on tracking the path capacity of the graph as measured by the largest eigenvalue of the adjacency matrix and the eigengaps between the largest in module eigenvalues of the adjacency matrix . Rewiring often affects the path capacity even if one makes sure the degree distribution is the same and restricts the rewiring to 2-hop neighbors. + +2. Rewiring affects edge centrality and so one needs to show that the proposed algorithm doesn't change the distribution over edge centrality. + +3. In social networks, the highest eigenvalues of the adjacency matrix are very close to each other because of all the triangles. 
The paper will be stronger if it included how the proposed method performs under various random graph models -- e.g., Gnp random graph, preferential attachment, and small-world. + +Miscellaneous notes: + +- The captions for the figures should be more informative. + +- Table 2 should list more characteristics of the graphs such as number of nodes, number of edges, exponent of the degree distribution, global clustering coefficient, average clustering coefficient, diameter, average path length. + +- ""Zgner &Gnnemann"" is misspelled. + +- ""As we can observed from the figures, ..."" has a typo in it. + +__________________________________________________ +[1] B. Miller, M. Çamurcu, A. Gomez, K. Chan, T. Eliassi-Rad. Improving Robustness to Attacks Against Vertex Classification. In The 15th International Workshop on Mining and Learning with Graphs (held in conjunction with ACM SIGKDD’19), Anchorage, AK, August 2019. +",6,,ICLR2020 +B_JMJWpYaCW,2,EdXhmWvvQV,EdXhmWvvQV,new data augmentations for improved contrastive learning; questions on empirical validation,"This paper focuses on contrastive learning for performing self-supervised network pre-training. Two components are proposed: First, to select semantically similar images that are pulled together in the contrastive learning, the paper proposes ""center-wise local image mixture"" (CLIM) - both k-means clustering and knn neighbors are computed, and then for a given anchor image x, and images x' that fall within the same cluster and are a knn neighbor (and additionally closer to the cluster center than x) are selected as a positive match to x. This is motivated from the perspective of allowing for consideration of both local similarity and global aggregation. This selection is further modified by the use of cutmix data augmentation, where (x, x') are combined via a binary mask. This is motivated from the perspective of allowing for some smoothing regularization to handle potentially noisy matches from CLIM. + +The second component is multi-resolution data augmentation, a variant of crop augmentation that focuses specifically on enabling scale invariance by maintaining the same aspect ratio and performing the cutmix augmentation at different image resolutions. + +Positives: ++ interesting proposed method for expanding the neighborhood space of considered positive matches for contrastive learning ++ ablation study provided to show the improvement from each proposed component (sample selection, cutmix, multi-resolution) ++ generally good empirical performance on several tasks - linear evaluation on imagenet, semi-supervised learning with few labels, transfer learning + +Neutral: +- overall novelty is moderate; I would consider the main novelty to be in the selection of positive matches, as the cutmix and multi-resolution augmentations are largely leveraging existing ideas. + +Negatives: +- ablation studies only use 200 training epochs. From appendix C, is it clear there is a big difference in accuracy from 200 to 800 or more epochs. I would like to know how the improvements from the proposed methods still hold up after longer training. +- hyper-parameter selection: there are several different parameters to be set in the proposed work: multi-resolution scales, number of clusters and neighbors for CLIM, alpha in cut-mix, with the differences in accuracy between settings approaching the difference between say knn+cutmix and center-wise+cutmix. 
It appears that these hyper-parameters were directly set using the ImageNet linear evaluation, which seems like it has some potential for overfitting then. I would have liked to know how well the results generalize if these hyper-parameters are set on some separate validation data.
- as a more minor point, I'm curious how much the restriction in equation (2) matters for CLIM - in other words, what if, instead of equation (2), we simply let $\Omega_p = \Omega_1 \cap \Omega_2$?

Overall summary:

Given the overall improvements from the proposed method, I'd be inclined toward accept, if the concerns I raised regarding the empirical evaluation were addressed.",6,4.0,ICLR2021
v9rKISMsML,4,chPj_I5KMHG,chPj_I5KMHG,The research question and the main contributions are not clear.,"This paper introduces DECSTR, which is an agent having a high-level representation of spatial relations between objects. DECSTR is a learning architecture that discovers and masters all reachable configurations from a set of relational spatial primitives. They demonstrated the characteristics in a proof-of-concept setup.

In the introduction, the inspiration obtained from developmental psychology is described. Motivation and background are broadly introduced. A wide range of related works are introduced in section 2.
The motivation and target of this paper are ambitious and important.

However, from the ""methods"" part, i.e., section 3, this paper is hard to follow.
The supplementary material helps to understand it. However, I believe some of the informative and detailed information in the supplementary material should be moved to the main manuscript.

The proposed method, i.e., DECSTR, is comprised of many components. Therefore, the main contribution is also not clear. # What is the main argument of the paper?
Experimental conditions are also hard to follow.

In evaluation, Figure 1 shows ablation studies alone, i.e., comparison with the variants of DECSTR.
Therefore, the contribution of the paper is hard to grasp.

We can understand what kind of task is achieved in this paper.
Currently, the paper somehow seems to be a demonstration of DECSTR.
In this sense, if the authors state research questions, challenges, and contributions of this paper more clearly, that will make this paper more impactful.",4,3.0,ICLR2021
phqAnt6FLhf,3,UiLl8yjh57,UiLl8yjh57,Promising Application of DRL to a classic wireless problem,"This paper addresses the long-standing problem of scheduling and resource allocation in wireless networks using modern Deep Reinforcement Learning techniques.
It is clearly written and easy to follow but suffers from several minor typos.
The methodology is well justified and thoroughly motivated.
Experimental evaluation seems thorough and provides convincing results.

The MDP is not described thoroughly enough:
What is your reward, action space, state space, observations?
How is the allocation deadline incorporated into the reward?
It would be nice to have these details listed in a sub-section somewhere in Section 3.

Regarding the evaluation, ""synthetic"" traffic patterns are used.
Can you use real world traces with a simulator for evaluation (similar to https://github.com/hongzimao/pensieve)?
Also, real-world applicability is not addressed.
Will the inference times for the deep network lead to any significant overheads when measured at the time scale of wireless communications? 

Overall, the evaluation setup seems preliminary to me and needs more work to provide assurance of real-world usability.",7,4.0,ICLR2021
SylxYe7TtS,2,ryx0nnEKwH,ryx0nnEKwH,Official Blind Review #3,"This paper develops an improved Batch Normalization method, called BNSR. BNSR applies a nonlinear mapping to modify the skewness of features, which is believed to keep the features similar, speed up the training procedure, and further improve the performance.

I have several concerns:
1. To investigate the impact of the similarity of the feature distributions, the author proposes four settings, including BN and BNSR. However, the added noise of the last three settings not only makes the features dissimilar but also breaks the nature of zero mean and unit variance. It is still unclear whether the similar distribution of features makes BNSR outperform BN.
2. The skewness is used to measure the asymmetry of the probability distribution, but not the similarity between two different distributions. Distributions with zero mean, unit variance, and near zero skewness could still be very different.
3. Based on “for ... X with zero mean and unit variance, there is a high probability that ... lies in the interval (-1, 1)”, the paper introduces φ_p(x), where p > 1, to decrease the skewness of the feature map x. However, for a standard normal distribution, about 32% of the elements of X have absolute values larger than 1. Figures 6 & 7 also show that for real features in a neural network, a significant number of elements lie outside (-1, 1). Will this lead to instability during training? To better understand the effect of φ_p(x), I think ρ (the Pearson’s second skewness coefficient), right before and after φ_p(x), should be shown for each layer at several epochs.
4. The results on CIFAR-100 and Tiny ImageNet are not convincing enough in my opinion. Some further experiments on ImageNet with a reasonable baseline would make the results more convincing.
",3,,ICLR2020
ZHhZTAnVym0,3,YZ-NHPj6c6O,YZ-NHPj6c6O,"An interesting metric of disentanglement, but claims and technical aspects need to be clarified","The paper contributes an interesting new measure for disentanglement which consists in: (1) applying to each latent representation a (supervised) linear transformation mapping it to the origin representation, and (2) measuring the average distance of these canonicalized representations to the origin to obtain a mean discrepancy measure of disentanglement. By adding this measure as an additional loss term in a topological VAE, the authors show that the model can learn disentangled representations with a limited amount of supervision on some simple image datasets.

Strong points:
- the measure of disentanglement proposed is promising and novel.
- the mathematical derivations and the simulation results are convincing.

Weak points:
- some claims in the paper are misleading because they imply that the measure proposed is unsupervised, although it is totally supervised (see below for details).
- some technical aspects of the paper lack clarity (see below for details).
- the measure of disentanglement proposed requires a lot of prior knowledge on the data and on the structure of the representation: for all pairs of data points on which the measure is calculated, we need to know what group element this transformation corresponds to, and we need to know to which linear transformation this group element corresponds to in the latent representation. 
This is impractical in many datasets where the transformations are unknown, and in many models where the latent equivariant operators are learned and not pre-specified.

I recommend rejecting this paper due to the clarity concerns and the concerns about the validity of the claims, unless they can be addressed satisfactorily during the rebuttal period.

Main concerns to be addressed:
1) It is impossible to understand from the abstract, intro and related work section whether the proposed measure of disentanglement requires supervision about the transformation between pairs of data points or not. In fact, the measure is 100% supervised. I agree that the model proposed using this measure as an additional loss term is only supervised on part of the data, but the measure itself is supervised. Here are examples of misleading claims:
-Abstract: ""Although several works focus on learning LSBD representations, none of them provide a metric to quantify disentanglement. Moreover, such methods require supervision on the underlying transformations for the entire dataset, and cannot deal with unlabeled data.""
-Related Work: ""Moreover, their methods require supervision on the transformation relationships among datapoints for the entire training dataset.""
2) Some technical aspects of the paper are unclear:
- LSBD is clearly defined, SBD is mentioned multiple times but never defined.
- the ∆VAE model is never described, nor its architecture. The reader is required to read another paper to understand how this model works.
- ""An easy-to-compute metric to quantify LSBD given certain assumptions (see Section 4), which acts as an upper bound to a more general metric (derived in Appendix C)."" This more general metric is not described in the main text, and it is unclear what the underlying motivation is for this alternative metric (and reading the technical description in the appendix did not help me understand the motivation).
- Enigmatic discussion points: ""Our LSBD metric and method require a number of assumptions, as explained in Section 4. This limits the applicability of the metric and method, but also provides a clear direction to what needs to be done to obtain and quantify LSBD representations if these assumptions are relaxed."", ""Moreover, our metric is in fact an upper bound to a more general metric (see Appendix C), which is however less straightforward to compute."" What are the clear directions on what needs to be done? How should one think of this more general metric?

Additional feedback:
- The LSBD assumptions are restated three times in the main text (p2,3,4), which is unnecessarily redundant.
",5,3.0,ICLR2021
rJxFYm3TtB,2,rkxoh24FPH,rkxoh24FPH,Official Blind Review #3,"The paper addresses a question on whether mutual information (MI) based models for representation learning succeed primarily thanks to the MI maximization. The motivation of the work comes from the fact that although MI is known to be problematic in treatment, it has been successfully applied in a number of recent works in computer vision and natural language processing. The paper conducts a series of experiments that constitute convincing evidence of a weak connection between the InfoMax principle and these practical successes by showing that maximizing established lower bounds on MI is not predictive of the downstream performance and that, contrary to the theory, higher-capacity instantiations of the critics of MI may result in worse downstream performance of learned representations. 
The paper concludes that there is a considerable inductive bias in the architectural choices inside MI models that are beneficial for downstream tasks and note that at least one of the lower bounds on MI can be interpreted as a triplet loss connecting it with a metric learning approach. +I consider this paper to be a considerable contribution to the understanding of what underlies the performance of unsupervised representation learning with MI maximization and provides a good discussion and analogous insights from parallel works and a number of possible directions to explore. Although I still have a couple of questions addressing which will help to understand the paper. +- In experiment 3.1, when using RealNVP and maximizing the lower bound of MI may it be the case that the representations are learned to benefit the downstream task because of the form of the lower bounds? In other words, the lower bound is possibly not very tight and its maximization has a side effect of weights adjusting to yield simple representations useful for a linear classifier? +- It would be interesting to see which factor contributes more to the performance: I_NCE being a triplet loss or an inductive bias in the design choice?",8,,ICLR2020 +B1l-SROznQ,1,SkloDjAqYm,SkloDjAqYm,"i liked this paper last time i reviewed it, and i like it still :)","last time i had two comments: +1. the real data motifs did not look like what i'd expect motifs to look like. now that the authors have thresholded the real data motifs, they do look as i'd expect. +2. i'm not a fan of VAE, and believe that simpler optimization algorithms might be profitable. i acknowledge that SCC requires additional steps; i am not comparing to SCC. rather, i'm saying given your generative model, there are many strategies one could employ to estimate the motifs. i realize that VAE is all the rage, and is probably fine. in my own experiments, simpler methods often work as well or better for these types of problems. i therefore believe this would be an interesting avenue to explore in future work.",8,5.0,ICLR2019 +9hiF4UAdyJr,3,uMDbGsVjCS4,uMDbGsVjCS4,Generation of complex surfaces with spatial and temporal discriminators under cycle-consistance supervision of down-sampling reconstruction loss,"This paper introduces the adversarial generative network into a physical simulation task, i.e., reconstruction and refining of complex surfaces. Several losses are applied in the proposed framework, such as MSE, adversarial losses, low-resolution reconstruction loss. Experimental results seem good. + +Pros: ++ As far as I know, this is the first work that uses GANs to refine complex surfaces. ++ Some specific changes have been made in the proposed framework. Specifically, MSE loss is introduced to help the low-frequency reconstruction, a spatial discriminator and also a temporal discriminator are used during the training progress, the adversarial losses also provide the ability of learning from unpaired samples, and to overcome the shortage of unpaired samples that cannot learn an exactly map function, a down-sampling reconstruction loss is applied as a cycle-consistence loss. All these loss functions look reasonable. ++ This paper is well-written and easy to read. + +Cons: +- The main shortage of this paper is the experimental part. First, the comparison method is too weak, several good GAN frameworks, e.g. WGAN-GP, PU-GAN, and SN-GAN, are proposed and none of them is compared. Second, many loss functions are introduced in the proposed method. 
However, how these losses reflect the experimental results is not well studied. An ablation study is desired. +- The frequency part is interesting. But I am not sure if I miss something, I do not understand how the method leverage the frequency? Do you use the frequency block for reconstruction loss, or only use it to guide the training progress? I am confused about this. +- Although many loss functions are introduced, the proposed method is not novel. Most of the losses have been proposed previously. I think the contribution of this paper is not sufficient. However, I would like to read the response from the author about this, especially the frequency part.",5,4.0,ICLR2021 +SJk2VBm4g,3,ryZqPN5xe,ryZqPN5xe,needs stronger experimental validation,"This paper proposes a method of augmenting pre-trained networks for one task with an additional inference path specific to an additional task, as a replacement for the standard “fine-tuning” approach. + +Pros: +-The method is simple and clearly explained. +-Standard fine-tuning is used widely, so improvements to and analysis of it should be of general interest. +-Experiments are performed in multiple domains -- vision and NLP. + +Cons: +-The additional modules incur a rather large cost, resulting in 2x the parameters and roughly 3x the computation of the original network (for the “stiched” network). These costs are not addressed in the paper text, and make the method significantly less practical for real-world use where performance is very often important. + +-Given these large additional costs, the core of the idea is not sufficiently validated, to me. In order to verify that the improved performance is actually coming from some unique aspects of the proposed technique, rather than simply the fact that a higher-capacity network is being used, some additional baselines are needed: +(1) Allowing the original network weights to be learned for the target task, as well as the additional module. Outperforming this baseline on the validation set would verify that freezing the original weights provides an interesting form of regularization for the network. +(2) Training the full module/stitched network from scratch on the *source* task, then fine-tuning it for the target task. Outperforming this baseline would verify that having a set of weights which never “sees” the source dataset is useful. + +-The method is not evaluated on ImageNet, which is far and away the most common domain in which pre-trained networks are used and fine-tuned for other tasks. I’ve never seen networks pre-trained on CIFAR deployed anywhere, and it’s hard to know whether the method will be practically useful for computer vision applications based on CIFAR results -- often improved performance on CIFAR does not translate to ImageNet. (In other contexts, such as more theoretical contributions, having results only on small datasets is acceptable to me, but network fine-tuning is far enough on the “practical” end of the spectrum that claiming an improvement to it should necessitate an ImageNet evaluation.) + +Overall I think the proposed idea is interesting and potentially promising, but in its current form is not sufficiently evaluated to convince me that the performance boosts don’t simply come from the use of a larger network, and the lack of ImageNet evaluation calls into question its real-world application. + +=============== + +Edit (1/23/17): I had indeed missed the fact that the Stanford Cars does do transfer learning from ImageNet -- thanks for the correction. 
However, the experiment in this case is only showing late fusion ensembling, which is a conventional approach compared with the ""stitched network"" idea which is the real novelty of the paper. Furthermore the results in this case are particularly weak, showing only that an ensemble of ResNet+VGG outperforms VGG alone, which is completely expected given that ResNet alone is a stronger base network than VGG (""ResNet+VGG > ResNet"" would be a stronger result, but still not surprising). Demonstrating the stitched network idea on ImageNet, comparing with the corresponding VGG-only or ResNet-only finetuning, could be enough to push this paper over the bar for me, but the current version of the experiments here don't sufficiently validate the stitched network idea, in my opinion.",4,4.0,ICLR2017 +BklFCPgRKH,2,rJe4_xSFDB,rJe4_xSFDB,Official Blind Review #3,"This paper presents a general approach for upper bounding the Lipschitz constant of a neural network by relaxing the problem to a polynomial optimization problem. And the authors extend the method to fully make use of the sparse connections in the network so that the problem can be decomposed into a series of much smaller problems, saving large amount of computations and memory. Even for networks that don't have high-level sparse connections, the proposed method can still help to reduce the size of the problem. This paper also compares the proposed LiPopt method with another solution derived from a quadratically constrained quadratic program reformulation. Compared with this method, the LiPopt method can handle cases with more parameters efficiently. + +Calculating a TIGHT upper bound of a neural network efficiently is very valuable and useful in many areas in deep learning community. And I really like the potential to use this LiPopt method to upper bound local Lipschitz constant in a given neighboring region, which will be very useful in certificated robustness application, etc.. + +I also like that the authors present results on networks trained on real-world dataset (MNIST). My only suggestion is that I'd like to see LiPopt's computation time and memory usage compared to its counterparts, as the authors argue the proposed method can fully exploit the sparse connections to reduce the problem size. + +======= +Update: I am satisfied with the authors' solid response and would like to raise my score.",8,,ICLR2020 +Hkxxm_Qsh7,2,B1exrnCcF7,B1exrnCcF7,Review of Disjoint Mapping Network for Cross-modal Matching of Voices and Faces,"# Summary + +The article proposes a deep learning-based approach aimed at matching face images to voice recordings belonging to the same person. + +To this end, the authors use independently parametrized neural networks to map face images and audio recordings -- represented as spectrograms -- to embeddings of fixed and equal dimensionality. Key to the proposed approach, unlike related prior work, these modules are not directly trained on some particular form of the cross-modal matching task. Instead, the resulting embeddings are fed to a modality-agnostic, multiclass logistic regression classifier that aims to predict simple covariates such as gender, nationality or identity. The whole system is trained jointly to maximise the performance of these classifiers. 
Given that (face image, voice recording) pairs belonging to the same person must share equal for these covariates, the neural networks embedding face images and audio recordings are thus indirectly encouraged to map face images and voice recordings belonging to the same person to similar embeddings. + +The article concludes with an exhaustive set of experiments using the VGGFace and VoxCeleb datasets that demonstrates improvements over prior work on the same set of tasks. + +# Originality and significance + +The article follows-up on recent work [1, 2], building on their original application, experimental setup and model architecture. The key innovation of the article, compared to the aforementioned papers, lies on the idea of learning face/voice embeddings to maximise their ability to predict covariates, rather than by explicitly trying to optimise an objective related to cross-modal matching. While the fact that these covariates are strongly associated to face images and audio recordings had already been discussed in [1, 2], the idea of actually using them to drive the learning process is novel in this particular task. + +While the article does not present substantial, general-purpose methodological innovations in machine learning, I believe it constitutes a solid application of existing techniques. Empirically, the proposed covariate-driven architecture is demonstrated to lead to better performance in the (VGGFace, VoxCeleb) dataset in a comprehensive set of experiments. As a result, I believe the article might be of interest to practitioners interested in solving related cross-modal matching tasks. + +# Clarity + +The descriptions of the approach, related work and the different experiments carried out are written clearly and precisely. Overall, the paper is rather easy to read and is presented using a logical, easy-to-follow structure. + +In my opinion, perhaps the only exception to that claim lies in Section 3.4. If possible, I believe the Seen-Heard and Unseen-Unheard scenarios should be introduced in order to make the article self-contained. + +# Quality + +The experimental section is rather exhaustive. Despite essentially consisting of a single dataset, it builds on [1, 2] and presents a solid study that rigorously accounts for many factors, such as potential confounding due to gender and/or nationality driving prediction performance in the test set. + +Multiple variations of the cross-modal matching task are studied. While, in absolute terms, no approach seems to have satisfactory performance yet, the experimental results seem to indicate that the proposed approach outperforms prior work. + +Given that the authors claimed to have run 5 repetitions of the experiment, I believe reporting some form of uncertainty estimates around the reported performance values would strengthen the results. + +However, I believe that the success of the experimental results, more precisely, of the variants trained to predict the ""covariate"" identity, call into question the very premise of the article. Unlike gender or nationality, I believe that identity is not a ""covariate"" per se. In fact, as argued in Section 3.1, the prediction task for this covariate is not well-defined, as the set of identities in the training, validation and test sets are disjoint. In my opinion, this calls into question the hypothesis that what drives the improved performance is the fact that these models are trained to predict the covariates. 
Rather, I wonder if the advantages are instead a ""fortunate"" byproduct of the more efficient usage of the data during the training process, thanks to not requiring (face image, audio recording) pairs as input. + +# Typos + +Section 2.4 +1) ""... image.mGiven ..."" +2) Cosine similarity written using absolute value |f| rather than L2-norm ||f||_{2} +3) ""Here we are give a probe input ..."" + +# References + +[1] Nagrani, Arsha, Samuel Albanie, and Andrew Zisserman. ""Learnable PINs: Cross-Modal Embeddings for Person Identity."" arXiv preprint arXiv:1805.00833 (2018). +[2] Nagrani, Arsha, Samuel Albanie, and Andrew Zisserman. ""Seeing voices and hearing faces: Cross-modal biometric matching."" Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.",6,3.0,ICLR2019 +HkxAIib0Fr,3,S1xCcpNYPr,S1xCcpNYPr,Official Blind Review #3,"This work tries to build a sub-pile of the test data to save the testing time with minimum effect on the test adequacy and the output distribution. In this paper, the work is done by adding a test-sample search algorithm on top of the HGS algorithm to balance the output distribution. + +However, the novelty of the proposed work is limited, and there is no evidence to show the proposed algorithms can be applied to other related works. Furthermore, the result does not present a strong success: the error of output distribution is much worse than the compared work. + +Other comments to the proposed manuscript are: + +1. In Definition 1 the authors declare that the goal is to satisfy f(T,M)=f(T’M) and g(T,M)=g(T’M), and then in the following paragraphs they change it to f(T,M)≈f(T’M) and g(T,M)≈g(T’M) with no justification. More explanation is needed. + +2. In Table 2, the authors try to compare the output distribution. To better demonstrate the change between raw testing set and proposed subset, I think that it could be better to present the metrics of distribution or the accuracy of each class instead. +",3,,ICLR2020 +r1zs_jEVe,2,Hy-2G6ile,Hy-2G6ile,"Promising work, but too preliminary for a major conference","The paper introduces Gated Multimodal Units GMUs, which use multiplicative weights to select the degree to which a hidden unit will consider different modalities in determining its activation. The paper also introduces a new dataset, ""Multimodal IMDb,"" consisting of over 25k movie summaries, with their posters, and labeled genres. + +GMUs are related to ""mixture of experts"" in that different examples will be classified by different parts of the model, (but rather than routing/gating entire examples, individual hidden units are gated separately). They are related to attention models in that different parts of the input are weighted differently; there the emphasis is on gating modalities of input. + +The dataset is a very nice contribution, and there are many experiments varying text representation and single-modality vs two-modality. What the paper is lacking is a careful discussion, experimentation and analysis in comparison to other multiplicative gate models---which is the core intellectual contribution of the paper. For example, I could imagine that a mixture of experts or attention models or other gated models might perform very well, and at the very least provide interesting scientific comparative analysis. I encourage the authors to continue the work, and submit a revised paper when ready. 
+ +As is, I consider the paper to be a good workshop paper, but not ready for a major conference.",4,4.0,ICLR2017 +r92j6btvNAN,2,jwgZh4Y4U7,jwgZh4Y4U7,"The paper proposes Temporal and Object Quantification Nets (TOQ-Nets), which can be used for learning composable action concepts from time sequences that describe the properties and relations of multiple entities. The authors test their model on two artificial benchmarks and demonstrate the effectiveness of their approach.","Strengths: +- The authors address a well formulated and an important problem for lots of practical scenarios. +- The paper is well written. +- The experiments demonstrate that the proposed method outperforms several baselines. + +Weaknesses: +- I personally found the proposed approach to be very cumbersome and complicated. It consists of many different components that are not conceptually intuitive. Contrary to some of the recent approaches for modeling visual relations (i.e. Non-Local Networks or Space-Time Video Graphs), the proposed model is much more difficult to understand. In my opinion, this is a big disadvantage because the models that are most useful to the community are usually conceptually (and technically) simple, yet effective. +- The authors assume that the input features to the TOQ net are hand-engineered, and thus, are not learnable. This is very different to most modern approaches, which typically try to learn all the features from raw pixels end-to-end. Very few recent methods (to the best of my knowledge) rely on hand-engineered features for video modeling/action recognition. This begs the question how applicable the proposed approach is to modern computer vision community. +- The biggest weakness of the paper is its experimental evaluation. The authors evaluate their approach on 2 small-scale artificial video datasets. Thus, it is not clear whether the proposed approach would generalize to real datasets such as Kinetics, Something-Something, EPIC-Kitchens, etc. Most current action recognition methods evaluate their model on these large-scale real-world datasets. Thus, I think it is imperative that the authors would conduct thorough experiments not only on their small artificial datasets but also on the datasets that are most often used by the action recognition community. In particular, datasets like Something-Something, EPIC-Kitchens, or Charades require spatiotemporal relation modeling as demonstrated by prior work (i.e. Space-Time Video Graphs). +- The comparison with the pixel-level baselines (i.e. Non-Local Networks, or Space-Time Video Graphs) might not be exactly fair. The authors adapt these baselines to the dataset/task specific scenarios. Compared to their own model, the authors don't have as many incentives to tune these baseline models for their specific tasks. Thus, I believe that the experiments on the real-world large scale datasets such as Kinetics, Something-Something and Charades are essential for validating that the proposed approach is better than these prior methods. +- Missing relevant work: Wang et al., ""Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks."" (CVPR 2020). + +Rebuttal Requests: +- The authors should include thorough experiments on the real-world large-scale datasets such as Kinetics, Something-Something, and Charades. The currently used datasets are not sufficient to verify the generalizability and usefulness of the proposed approach. 
+ +==Post Rebuttal Response: + +I read the rebuttal, and unfortunately it didn't address most of my pressing concerns. I appreciate the authors' efforts to add new experiments on other datasets. However, in my opinion these new datasets are not very relevant for the action recognition community, i.e. they are small, and they are rarely used to compare the effectiveness of a particular model. In my initial review, I listed a few datasets that are most commonly used for action recognition comparisons. In my view, without comparisons on these more popular datasets it is very difficult to tell the real value of the proposed approach. If the authors could demonstrate close to state-of-the-art performance on those datasets I would be more convinced that the proposed approach is effective. Currently, most of the comparison are done w.r.t baselines that are implemented by the authors which is insufficient in my opinion. Therefore, I stand by my original recommendation of rejecting the paper.",3,4.0,ICLR2021 +rkqwomzrl,5,Hy-lMNqex,Hy-lMNqex,"Incremental, perhaps better suited for an architecture conference (ISCA/ ASPLOS)","The authors present TARTAN, a derivative of the previously published DNN accelerator architecture: “DaDianNao”. The key difference is that TARTAN’s compute units are bit-serial and unroll MAC operation over several cycles. This enables the units to better exploit any reduction in precision of the input activations for improvement in performance and energy efficiency. + +Comments: + +1. I second the earlier review requesting the authors to be present more details on the methodology used for estimating energy numbers for TARTAN. It is claimed that TARTAN gives only a 17% improvement in energy efficiency. However, I suspect that this small improvement is clearly within the margin of error ij energy estimation. + +2. TARTAN is a derivative of DaDianNao, and it heavily relies the overall architecture of DaDianNao. The only novel aspect of this contribution is the introduction of the bit-serial compute unit, which (unfortunately) turns out to incur a severe area overhead (of nearly 3x over DaDianNao's compute units). + +3. Nonetheless, the idea of bit-serial computation is certainly quite interesting. I am of the opinion that it would be better appreciated (and perhaps be even more relevant) in a circuit design / architecture focused venue.",5,5.0,ICLR2017 +ouomYiRD1JN,2,0pxiMpCyBtr,0pxiMpCyBtr,Novel reparametrization of monotonic lattice regression which shows significant empirical gains. ,"**Summary** +The authors propose KFL, an efficient reparametrization of monotonic lattice regression using Kronecker factorization. The goal is to achieve efficiency both in terms of computations and in terms of the number of parameters. The authors show that the proposed KFL has storage and computational costs that scale linearly in the number of input features. +Experimental results show that KFL has better speed and storage space. The authors also provide necessary and sufficient conditions for a KFL model to be monotonic with respect to some features. 
+ +**+ves** ++ This appears to be a new parametrization of monotonic lattice regression with parametric and computational advanages ++ Theoretical results show necessary and sufficient conditions for KFL to be monotonic, and this allows the design of training algorithms; there are also theoretical results on the capacity ++ Experiments on public and proprietary datasets show that KFL maintains error rate while needing less time to train and fewer parameters. + +**Concerns** +- The gains in training and eval time (especially compared to simplex) seem modest. However the number of parameters is significantly reduced. I wonder if a larger dataset/more complex task would demonstrate the benefits more clearly. + +",7,3.0,ICLR2021 +Sk-qjGYlz,2,Hkc-TeZ0W,Hkc-TeZ0W,"In a previous work [1], an auto-placement (better model partition on multi GPUs) method was proposed to accelerate a TensorFlow model’s runtime. However, this method requires the rule-based co-locating step, in order to resolve this problem, the authors of this paper purposed a fully connect network (FCN) to replace the co-location step.","In a previous work [1], an auto-placement (better model partition on multi GPUs) method was proposed to accelerate a TensorFlow model’s runtime. However, this method requires the rule-based co-locating step, in order to resolve this problem, the authors of this paper purposed a fully connect network (FCN) to replace the co-location step. In particular, hand-crafted features are fed to the FCN and the output is the prediction of group id of this operation. Then all the embeddings in each group are averaged to serve as the input of a seq2seq encoder. + +Overall speaking, this work is quite interesting. However, it also has several limitations, as explained below. + +First, the computational cost of the proposed method seems very high. It may take more than one day on 320-640 GPUs for training (I did not find enough details in this paper, but the training complexity will be no less than the in [1]). This makes it very hard to reproduce the experimental results (in order to verify it), and its practical value becomes quite restrictive (very few organizations can afford such a cost). + +Second, as the author mentioned, it’s hard to compare the experimental results in this paper wit those in [1] because different hardware devices and software versions were used. However, this is not a very sound excuse. I would encourage the authors to implement colocRL [1] on their own hardware and software systems, and make direct comparison. Otherwise, it is very hard to tell whether there is improvement, and how significant the improvement is. In addition, it would be better to have some analysis on the end-to-end runtime efficiency and the effectiveness of the placements. + + [1] Mirhoseini A, Pham H, Le Q V, et al. Device Placement Optimization with Reinforcement Learning[J]. arXiv preprint arXiv:1706.04972, 2017. 
https://arxiv.org/pdf/1706.04972.pdf +",5,4.0,ICLR2018 +r1MU1AtlG,2,rkHVZWZAZ,rkHVZWZAZ,interesting work with several contributions and large experiments with some but not all recent approaches,"This paper proposes a novel reinforcement learning algorithm containing several contributions made by the authors: 1) a policy gradient algorithm that uses value function estimates to improve the policy gradient, 2) a distributed multi-step off-policy algorithm to estimate the value function, 3) an experience replay buffer mechanism that can handle sequences and (4) a distributed architecture, where threads are dedicated to either learning or interracting with the environment. Most contributions consist in improvements to handle multi-step trajectories instead of single step transitions. The resulting algorithm is evaluated on the ATARI domain and shown to outperform other similar algorithms, both in terms of score and training time. Ablation studies are also performed to study the interest of the 4 contributions. + +I find the paper interesting. It is also well written and reasonably clear. The experiments are large, although I was disappointed that PPO was not included in the evaluation, as this algorithm also trains much faster than other algorithms. + +quality ++ several contributions ++ impressive experiments + +clarity +- I found the replay buffer not as clear as the other parts of the paper. +. run time comparison: source of the code for the baseline methods? ++ ablation study showing the merits of the different contributions +- Methods not clearly labeled. For example, what is the difference between Reactor and Reactor 500M? + +originality ++ 4 contributions + +significance ++ important problem, very active area of research ++ comparison to very recent algorithms +- but no PPO in the evaluation",7,2.0,ICLR2018 +HJxJpq863X,3,ryM_IoAqYX,ryM_IoAqYX,"interesting problem setting and analysis, unclear conclusion from the analysis and experiments"," +Summary: + +This paper studies the convergence properties of loss-aware weight quantization with different gradient precisions in the distributed environment, in which servers keeps the full-precision weights and workers keeps quantized weights. The authors provided convergence analysis for weight quantization with full-precision, quantized and quantized clipped gradients. Specifically, they find that: 1) the regret of loss-aware weight quantization with full-precision gradient converge to an error related to the weight quantization resolution and dimension d. 2) gradient quantization slows the convergence by a factor related to gradient quantization resolution and dimension d. 3) gradient clipping renders the speed degradation dimension-free. + +Comments: + +Pros: + +- The paper is generally well written and organized. The notation is clean and consistent. Detailed proofs can be found in the appendix, the reader can appreciate the main results without getting lost in details. + +- The paper provides theoretical analysis for the convergence properties of loss-aware weight quantization with full-precision gradients, quantized gradient and clipped quantized gradient, which extends existing analysis beyond full-precision gradients, which could be useful for distributed training with limited bandwidth. + +Cons: + +- It is unclear what problems the authors try to solve. 
The problem is about gradient compression, or how the gradient precision will affect the convergence for training quantized nets in the distributed environment, in which workers have limited computation power and the network bandwidth is limited. It is an interesting setting, however, the author does not make it clear the questions they are asking and how the theoretical results can guide the practical algorithm design. + +- The authors mentioned that quantized gradient slows convergence (relative to using full-precision gradient) in contribution 2 while also claims that quantizing gradients can significantly speed up training of quantized weights in contribution 4, which is contradictory to each other. + +- It is not clear what relaxation was made on the assumptions of f_t in section 3.1. The analysis are still based on three common assumptions: 1) f_t is convex 2) f_t is twice differentiable 3) f_t has bounded gradients. The assumptions and theoretical results may not hold for non-convex deep nets. E.g., the author does not valides the theorems results on d with neural networks but only with linear models in section 4.1. + +- The author demonstrate training quantized nets in the distributed environment with quantized gradients, however, no comparison is made with other related works (e.g., Wen et al, 2017). + +Questions: + +- Theorem 1 is an analysis for training with quantized weights and full-precision gradients, which is essentially the same setting as BinaryConnect. Similar analysis has been done in Li et al, 2017. What is the difference or connection with their bound? + +- It is not clear how gradienta are calculated w.r.t. quantized weights on worker, is straight through estimator (STE) used for backpropagation through Q_w? + +- In section 3.3, why is \tilde{g}_t stochastically quantized gradient? How about statiscally quantized gradients? + +- Why do the authors use linear model in section 4.1? Why are the solid lines in Figure 3 finished earlier than dashed lines? For neural networks, a common observation is that the larger the dimension d, the better the generalization performance. However, Figure 3 and Theorem 1 seem to be contradictory to this common belief. Would it possible to verify the theorem on deep nets of different dimension? + +- Why does the number of worker affect the performance? I failed to see why the number of workers affect the performance of training if it is a synchronized distributed training with the same total batch size. After checking appendix C, I think it is better to discuss the influence of batch sizes rather than the number of workers. + +- Why is zero weight decay used for CIFAR-10 experiment but non-zero weight decay for imagenet experiment? How was weight decay applied in Adam for quantized weights? + +Minor issues: +- The notation of full-precision gradient w.r.t quantized weights in Figure 1 should be \hat{g}_t, however, g_t is used. +",6,4.0,ICLR2019 +sDe09X7uNcV,4,YNnpaAKeCfx,YNnpaAKeCfx,Adaptively select mini-batch for sensitive groups to improve model fairness,"The problem is well-motivated. The paper is well-organized and written. The claims are well-supported by theoretical analysis and experimental results. In contrast with previous work that requires nontrivial re-configurations in machine learning, FairBatch formalizes the sampling probability as an implicit connection between the inner (for fairness criterion) and outer (task) optimizer in bilevel optimization. 
+[Pro 1] This paper provides insights into fairness and machine learning from the lens of bilevel optimization, emphasizing the inverse proportional relationship between sampling probability and loss for sensitive groups. +[Pro 2] FairBatch is not only consistent with intuition but also easy to implement. +[Con 1] It would be helpful for the authors to summarize their contributions more from the fairness aspect. I am confused about the ability of FairBatch to mitigate disparate accuracy. Is it designed especially for the unfairness caused by minimizing average error that fits majority populations [to clarify my point, please refer to the 2nd cause in Section 2.1, abs-1810-08810]? +[abs-1810-08810] Alexandra Chouldechova, Aaron Roth: The Frontiers of Fairness in Machine Learning. CoRR abs/1810.08810 (2018) +[Con 2] Is the number of total disparities (as the authors claimed in Section 3.2) d in Section 3 the lambda dimension? No explanation is provided for this notion. If true, is d equal to n_z-1? Why not C(n_z, 2)? (C stands for Combination). In practice, when the sensitive space is sometimes continuous and big, there would be too many pairs to compute. More details about d will strengthen the submission. I am not requesting more experiments, but want to understand how FairBatch works in practice. +[Con 3] As for the experiments, the analysis fails to go beyond single-dimensional performance summaries. Also, alpha is a vital hyper-parameter. However, the criterion used for selecting the final setting is not clear in this paper.",6,5.0,ICLR2021 +IiJ4fWRXFUn,4,RHY_9ZVcTa_,RHY_9ZVcTa_,"Seems like a good paper, but not clear to me what are the practical benefits","This paper claims that, through Canonical Correlation Analysis on the representations learnt by popular deep models, we can show that the representations learnt on the same dataset, with same model using different sets of parameters, the representations are linearly identifiable. + +This seems like a nice thing to know. But I however am not sure about few things. +1) I think the paper doesn't really discuss the implications of this enough. That is, I would like to see few example applications where we take advantage of this fact. As is, this contribution in my opinion remains as a cute thing to know. +2) It would be interesting to how much would change in initializations would effect the CCA curves. That is, if I initialize my network within a wider range, the identifiability result would still hold in practice? This would be interesting to add in my opinion. +3) Ideally I would like to see if this statement hold for different hyperparameters also. + +Overall, the conclusion is interesting and I think that it could be published. + ",6,3.0,ICLR2021 +WNjzYYi6bWG,1,xF5r3dVeaEl,xF5r3dVeaEl,Official Blind Review #3,"This paper considers an algorithm for learning and using an opponent model that is only conditioned on an agent's local information (history of actions, observations, and rewards). A variational autoencoder (VAE) is trained to predict opponent observations and actions, given the agent's local information. While the opponent's observations and actions are required to train the VAE decoder, they're not needed at play time for the encoder. The authors show that with static opponents, the output of the encoder is informative, and conditioning the agent policy on local information plus encoder output outperforms just using the local information. + +The paper was well written, and was to easy follow. 
Experiments seemed appropriate to demonstrate the author's claims. I had only a few specific comments. + ++++ Points after discussion about Meta-RL, HiP-MDPs, and MARL. + +Like Reviewer 4, my view of MARL is more general than requiring all agents to be learning, even if most work does have one algorithm training multiple seats. I don't think this actually changes anything in the current draft, but should hopefully not appear in future edits. + +The related works should include discussion of the HiP-MDP paper. + +These points fit together: there should be a consistent placement of this paper, and related works. Given fixed opponents, the multiagent problem is equivalent to a single agent problem. It doesn't seem relevant whether or not the unobserved, unknown variables correspond to different but similar environments, or different but similar opponents in a single fixed environment. + ++++ + +Section 4.2 ""In Sections 1 and 2, it was noted that most agent modelling methods assume access to the opponent's observations and actions both during training and execution. To eliminate this assumption ..."" +Weakened seems like a better choice of words than eliminated: it does still assume access to the opponent's observations during training. + +Section 5 +What is the size of the environment? Initial random placement? I'm trying to get some idea of the magnitudes of the rewards, which are based on Euclidean distance. +Are the speaker-listener and double speaker-listener experiments using the OpenAI multiagent particle environments? If so, cite this (The MADDPG paper that is already cited for opponent algorithms?)",6,3.0,ICLR2021 +s9OSlXigbc5,1,DegtqJSbxo,DegtqJSbxo,"An important problem, but the results are hard to interpret","#### Summary + +In this paper, the authors evaluate the performance of classifiers trained and then later tested on both adversarially generated perturbations as well as more natural perturbations. By considering six different natural perturbations, they show empirically that natural perturbations can improve performance against clean and adversarially-perturbed images. They also show that adversarial training does not improve performance on unseen natural transformations. + +#### Strengths + +- This paper studies an interesting problem. Natural perturbations have gained increasing interest from the robustness community recently. While there are many works that study the trade-offs (e.g. in the test accuracy of clean and adversarialy perturbed data) with respect to adversarial perturbations, these trade-offs are much less well understood for more ""natural"" notions of robustness. + +- The performance normalization step here is interesting, and in my opinion is an useful way to compare notions of robustness across domains. As this problem receives more attention, I believe that this will be a useful metric for comparing results for the community. + +- The result that (in some cases) adversarial training hurts natural robustness and natural robustness helps adversarial robustness is interesting and noteworthy. A more in-depth analysis of these phenomena and whether they hold for broader classes of natural transformations would be quite interesting. + +#### Weaknesses + +- Although the title advertises that this paper considers ""natural"" perturbations of data, it is arguable as to whether these transformations are representative of transformations that are more likely to be ""found in the real world"" (quoted from the Introduction). 
Although it seems true that adversarial perturbations may not occur naturally, I'm not entirely convinced that these transformations are any more likely to be encountered. Moreover, based on Figure 1, the elastic, wave, gaussian noise, and gaussian blur transforms all look quite similar to one another. A more convincing argument could be made here if a more diverse set of realistic perturbations was considered (e.g. changing the weather conditions in iamges). + +- The notation throughout this paper is at times confusing. I bullet the instances where I felt the notation was unclear: + + + it seems unintuitive as to why $\zeta_n^A$ and $\zeta_n^t$ need to depend on $n$. + + The authors write that each image $x_n$ is an element of $\mathbb{R}^2$ -- I assume this is a typo. + + The symbol $\circledast$ is undefined -- is this a pointwise convolution? + + The constant $\alpha$ is not defined in the definition of $\zeta^O$, and the notation $b^{x_c, t,r}$ is confusing -- I'm still unsure of how $\zeta^O = \min(x_n, b^{x_c,t,r})$ would be applied in practice. What is the size of $b$? + + The notation $x_n^{N(\mu,\sigma^2)}$ is confusing -- I assume this means that $x_n \sim N(\mu,\sigma^2)$? + + The ""shift operator"" in the definition of the wave transformation is also undefined. + + In the definition of the accuracy drop, the symbol $\alpha$ is reused; previously, it was also used to denote parts of $\zeta_S$ and $\zeta^E$. To this end, the definition of $\alpha$ here is confusing -- shouldn't it be defined in terms of the cardinality of these sets? + + Why is the cross entropy loss indexed by lower-case $s$, which seems to be undefined? + +- The figures are quite hard to parse. In particular, although the message of Fig. 2 is clear from context, the symbols all overlap, which when we look at later figures makes hard to interpret. Also, it's unclear why all the lines are connected, as each data point on the line simply corresponds to a different dataset. + +- In some sense, the fact that natural transformations improve clean accuracy is already known. The training scheme that is introduced for training using natural perturbations is simply data-augmentation. And many works have previously shown that data-augmentation (e.g. [1], [2], [3]) improves clean accuracy. + +- I'm a bit surprised that adversarial training does not decrease the performance on clean images (e.g. Fig 3a) for CIFAR-10. Unless I have misunderstood, this is at odds with other works (e.g. [4]) that claim that robustness is at odds with accuracy. + +- I'm not sure how to interpret Fig 5. In particular, it seems that in some cases, training with data augmentation improves performance and in others, performance suffers when evaluating on unseen transformations. This makes the claim of the paper feel somewhat weaker, given that the message emphasized in the intro only holds for a subset of the combinations of training/testing transformations. + +#### Final thoughts + +While I think that this paper studies an interesting problem, and has some interesting ideas such as normalizing the accuracy drop to create a fairer comparison, I feel that the experiments don't fully support the claimed conclusion. Specifically, the phenomena of interest, such as that data augmentation improves performance on unseen natural transformations and on adversarially perturbed inputs, seems to not hold in general; on the contrary, it appears that a non-negligible amount of the time, it also hurts performance. 
Further, the presentation is hard to follow at times due to grammatical mistakes and undefined notation. Also, the natural transformations under consideration are all quite similar -- it would be very interesting to see if these kinds of results would hold with respect to much more severe changes in images, such as adding snow or changing the background colors. For these reasons, I am leaning toward suggesting rejection for this paper. + +#### References + +[1] https://link.springer.com/article/10.1186/s40537-019-0197-0 +[2] https://ieeexplore.ieee.org/document/8388338 +[3] https://arxiv.org/pdf/1708.04896.pdf +[4] https://arxiv.org/pdf/1805.12152.pdf",4,5.0,ICLR2021 +Skl6siujnm,2,SkgToo0qFm,SkgToo0qFm,Decent application paper and setup for siamese networks,"Summary: +This paper uses siamese networks to define a discriminative function for predicting protein-protein interaction interfaces. They show improvements in predictive performance over some other recent deep learning methods. +The work is more suitable for a bioinformatics audience though, as the bigger contribution is on the particular application, rather than the model / method itself. + +Novelty: +The main contribution of this paper is the representation of the protein interaction data in the input layer of the CNN + +Clarity: +- The paper is well written, with ample background into the problem. + +Significance: +- Their method improves over prior deep learning approaches to this problem. However, the results are a bit misleading in their reporting of the std error. They should try different train/test splits and report the performance. +- This is an interesting application paper and would be of interest to computational biologists and potentially some other members of the ICLR community +- Protein conformation information is not required by their method + +Comments: +- The authors should include citations and motivation for some of their choices (what sequence identity is used, what cut-offs are used etc) + +- The authors should compare to at least some popular previous approaches that use a feature engineering based methodology such as - IntPred + +- The authors use a balanced ratio of positive and negative examples. The true distribution of interacting residues is not balanced -- there are several orders of magnitude more non-interacting residues than interacting ones. Can they show performance at various ratios of positive:negative examples? In case there is a consistent improvement over prior methods, then this would be a clear winner +",5,3.0,ICLR2019 +BylWTKda27,3,S1lPShAqFm,S1lPShAqFm,"Interesting and inspiring observations, but need some further enhancement","This paper discusses the effect of increasing the widths in deep neural networks on the convergence of optimization. To this end, the paper focuses on RNNs and applications to NLP and speech recognition, and designs several groups of experiments/measurements to show that wider RNNs improve the convergence speed in three different aspects: 1) the number of steps taken to converge to the minimum validation loss is smaller; 2) the distance from initialization to final weights is shorter; 3) the step sizes (gradient norms) are larger. This in some sense complements the theoretical result in Arora et al. (2018) for linear neural networks (LNN), which states that deeper LNNs accelerates convergence of optimization, but the hidden layers widths are irrelevant. This also shows some essential difference between LNNs and (practical) nonlinear neural networks. 
+ +### comments about writing ### +The findings are in general interesting and inspiring, but the explanations need some further improvement. In particular, the writing lacks some consistency and clarity in the wordings. For example, it is unclear to me what ""weight space traversal"" means, ""training size"" is mixed with ""dataset size"", and ""we will show that convergence ... to final weights"" seems to be a trivial comment (unless there is some special meaning of ""convergence rate""), etc. It also lacks some clarity and organization in the results -- some more summarizing comments and sections (and in particular, a separate and clearer conclusion section), as well as less repetitions of the qualitative comments, should largely improve the readability of the paper. + +### comments about results ### +The observations included in the work may kick off some interesting follow-up work, but it is still a bit preliminary in the following sense: +1. It lacks some discussions with its connection to some relevant literature about ""wider"" networks (e.g., Wide residual networks, Wider or deeper: revisiting the ResNet model for visual recognition, etc.). +2. It lacks some discussions about the practical implication of the improvement in optimization convergence with respect to the widening of the hidden layers. In particular, what is the trade-off between the validation loss increase and the optimization convergence speed-up resulted from widening hidden layers? A heuristic discussion/approach should largely improve the impact of this work. +3. The simplified theory about LNNs in the appendix seems a bit too far from the explanation of the difference between the observations in this paper and Arora et al. (2018). + +### typos and small suggestions ### +1. It is suggested that the full name of LNN is provided at the beginning, and the font size should be larger in Figure 1. +2. There are some mis-spellings that the authors should check (e.g., gradeint -> gradient). +3. In formula (4), the authors should mention that the third line holds for all $t$ is a sufficient condition for the previous two equivalent lines.",5,3.0,ICLR2019 +Bye3SqyGcH,3,ryxMW6EtPB,ryxMW6EtPB,Official Blind Review #2,"This paper proposed to use the duality gap sup_f V(f, g*) – inf_g V(f*, g) as a metric for GAN training. It proves that this metric is an upper bound of F-distance. It also proves a generalization bound for this metric. Simulation resultson MNIST, CIFAR10, etc. are reported. + + The contribution of this paper is incremental due to the following reasons. + + 1) The duality gap is only an upper bound of the F-distance. This means that if the duality gap is zero then the learned distribution is the true distribution. However, the converse is not necessarily true: even if the algorithm starts with the true distribution, the duality gap may not be zero. Thus the metric is not a proper metric. + The proof of the upper bound is straightforward. + + 2) Another issue is the gap between the min-max formulation and the real training algorithm. As for GAN, due to the inexact update, it is not really solving the min-max problem. For the proposed metric, it is also impossible to solve sup_f V(f, g*) and inf_g V(f*, g) to reasonable accuracy. Thus what the algorithm is really doing, perhaps, is to optimizing a new loss which is the sum of the original loss and and an extra term. Viewing it as a “duality gap” seems to be far from the practical training. 
This discrepancy exists for GANs, but it is a bigger issue for the duality gap interpretation. + + 3) The simulation is not convincing. The reported FID for CIFAR10 using WGAN-GP is 54.4, which seems to be a bit high. I’m not sure whether it is due to parameter choice or due to weak D/G networks used in the simulation. If the paper cannot compare various architecture, it is more convincing to at least use some standard architecture, like DCGAN. Or at least report the parameter tuning effort made for getting the results. ",1,,ICLR2020 +HylnIU9AYr,2,SJeC2TNYwB,SJeC2TNYwB,Official Blind Review #1,"This paper tackles the out of distribution detection problem and utilizes the property that the calculation of batch-normalization is different between training and testing for detecting out-of-distribution data with generative models. The paper first empirically demonstrates that the likelihood of out-of-distribution data has a larger difference between training mode and testing mode, then provides a possible theoretical explanation for the phenomenon. The proposed scoring function utilizes such likelihood differences in a permutation test for detecting out-of-distribution data. The evaluation is performed on two small in-distribution image datasets and four out-of-distribution datasets with three types of generative models. + +We recommend a weak accept. Its clarity is good, and the strength of the paper has three parts. The first is the thorough observation of the likelihood changes between different modes of batch-normalization. Second, the theoretical explanation for the observed phenomenon is sound. The example in Figure 2 gives a good intuition of how mis-specification can happen. The last is the strong performance on the out-of-distribution detection with generative models. + +However, I have some concerns about the design of the scoring function (Section 5), which looks like it is carefully tuned: +1. Why do the authors use the permutation score ($T_{b,r1,r2}$) instead of likelihood difference ($\delta_{b,r1,r2}$)? The likelihood difference itself seems like it could be a good indication for OoD detection. +2. Why do the authors use interpolation between training and evaluation? It introduces extra hyperparameters (r1, r2) for the method. Is the performance sensitive to the choice of r1 and r2? + +One minor issue: +3. The sentence at the bottom of page 7 should be removed (""related work still needs some work, but there seems to be some bug on overleaf right now"") +",6,,ICLR2020 +BkDNhgGNg,2,ry4Vrt5gl,ry4Vrt5gl,,"The current version of the paper is improved w.r.t. the original arXiv version from June. While the results are exactly the same, the text does not oversell them as much as before. You may also consider to avoid words like ""mantra"", etc. +I believe that my criticism given in my comment from 3 Dec 2016 about ""randomly generated task"" is valid and you answer is not.",7,4.0,ICLR2017 +rkevTD1qh7,1,B1lKtjA9FQ,B1lKtjA9FQ,Potential overfitting criteria remain vague and were not properly validated,"Overview: +The authors aim at finding and investigating criteria that allow to determine whether a deep (convolutional) model overfits the training data without using a hold-out data set. +Instead of using a hold-out set they propose to randomly flip the labels of certain amounts of training data and inspect the corresponding 'accuracy vs. randomization‘ curves. 
They propose three potential criteria based on the curves for determining when a model overfits and use those to determine the smallest l1-regularization parameter value that does not overfit. +I have several issues with this work. Foremost, the presented criteria are actually not real criteria (expect maybe C1) but rather general guidelines to visually inspect 'accuracy over randomization‘ curves. The criteria remain very vague and seem be to applicable mainly to the evaluated data set (e.g. what defines a ’steep decrease’?). Because of that, the experimental evaluation remains vague as well, as the criteria are tested on one data set by visual inspection. Additionally, only one type of regularization was assumed, namely l1-regularization, though other types are arguably more common in the deep (convolutional) learning literature. +Overall, I think this paper is not fit for publication, because the contributions of the paper seem very vague and are neither thoroughly defined nor tested. + + +Detailed remarks: + +General: +A proper definition or at least a somewhat better notion of overfitting would have benefitted the paper. In the current version, you seem to define overfitting on-the-fly while defining your criteria. + +You mention complexity of data and model several times in the paper but never define what you mean by that. + + +Detailed: +Page 3, last paragraph: Why did you not use bias terms in your model? + +Page 4, Assumption. +- What do you mean by the data being independent? Independent and identically distributed? +- ""As in that case correlation in the data can be destroyed by the introduction of randomness making the data easier to learn.“ What do you mean by ""easier to learn""? Better generalization? Better training error? +- I don’t understand the assumptions. You state that the regularization parameter should decrease complexity of the model. Is that an assumption? And how do you use that later? +- What does ""similar scale“ mean? + +Page 4, Monotony. +- You state two assumptions or claims, 'the accuracy curve is strictly monotonically decreasing for increasing randomness‘ and 'we also expect that accuracy drops if the regularization of the model is increased’, and then state that 'This shows that the accuracy is strictly monotonically decreasing as a function of randomness and regularization.‘ Although you didn’t show anything but only state assumptions or claims (which may be reasonable but are not backed up here). +I actually don’t understand the purpose of this paragraph. + +- Section 3.3 is confusing to me. What you actually do here is you present 3 different general criteria that could potentially detect overfitting on label-randomized training sets. But you state it as if those measures are actually correct, which you didn’t show yet. + +My main concern here, besides the motivations that I did not fully understand (s.b.), is the lack of measurable criteria. While for criterion 1 you define overfitting as 'above the diagonal line‘ and underfitting as ‚below the line‘, which is at least measurable depending on sample density of the randomization, such criteria are missing for C2 and C3. Instead, you present vague of ’sharp drops’ and two modes but do not present rigorous definitions. You present a number for C2 in Section 5, but that is only applicable to the present data set (i.e. assuming that training accuracy is 1). + +Criterion 2 (b) is not clear. 
+- I neither understand ""As the accuracy curve is also monotone decreasing with increasing regularization we will also detect the convexity by a steep drop in accuracy as depicted by the marked point in the Figure 1(b)"" +nor do I understand ""accuracy over regularization curve (plotted in log-log space) is constant""? +Does that mean that you assume that whenever the training accuracy drops lower than that of the model without regularization, it starts to underfit? + +Due to the lack of numerical measures, the experimental evaluation necessarily remains vague by showing some graphs that show that all criteria are roughly met by regularization parameter \lambda=0.00011 on the cifar data set. In my view, this evaluation of the (vague) criteria is not fit for showing their possible merit. + + + + +",3,4.0,ICLR2019 +JS9dfthHV2c,4,ni_nys-C9D6,ni_nys-C9D6,Interesting contribution but not clearly presented,"This paper draws connections between reversible programming and reverse mode automatic differentiation and introduces a reversible DSL in Julia that can be used to calculate first and second order gradients. + +I reviewed a previous version of this paper for NeurIPS. + +I really like the idea of reversible programming and I think that a clear introduction of reversible programming and its use in automatic differentiation could be of interest to the machine learning community. However, I feel that this paper fails to clearly explain the use of reversible programming and its trade-offs compared to checkpointing and other existing approaches. + +As a paper, the first section is great, but then the authors leave me with many questions: How do checkpointing and reversible programming differ in memory usage? Given that the multiplier from listing 1 has 3 outputs, doesn't that mean that a program consisting of n chained multiplications still requires storing n * 2 + 1 outputs, similarly to regular AD? And doesn't binomial checkpointing allow for logarithmic memory usage in exchange for a logarithmic increase in runtime (rather than polynomial)? + +Rather than answering these questions, the paper jumps eagerly into Julia code snippets, metaprogramming, and CUDA kernels, which I don't feel actually serve to elucidate the message that reversible programming is of interest to the machine learning community. + +Although I feel that this version of the paper is an improvement over the version I reviewed for NeurIPS, I feel that it still fails to clearly introduce reversible programming and shed light on the subtle trade-offs between reversible programming, checkpointing, and regular AD. I encourage the authors to rewrite the paper with less of a focus on the implementation details of their framework, and a stronger focus on the memory and runtime trade-offs provided by all of these methods from a more high-level, theoretical perspective. + +Pros + +* Very relevant and interesting topic +* Well-written introduction +* Good code + +Cons + +* Fails to introduce the topic appropriately for an ML audience +* Does not clearly compare to advanced checkpointing methods +* Not well written; too many details about the software implementation that do not contribute to an understanding of the high-level technique",6,4.0,ICLR2021 +rk-GXLRgz,3,B1X0mzZCW,B1X0mzZCW,"Overall, a nice paper","This paper suggests a simple yet effective approach for learning with weak supervision. 
This learning scenario involves two datasets, one with clean data (i.e., labeled by the true function) and one with noisy data, collected using a weak source of supervision. The suggested approach assumes a teacher and student networks, and builds the final representation incrementally, by taking into account the ""fidelity"" of the weak label when training the student at the final step. The fidelity score is given by the teacher, after being trained over the clean data, and it's used to build a cost-sensitive loss function for the students. The suggested method seems to work well on several document classification tasks. + +Overall, I liked the paper. I would like the authors to consider the following questions - + +- Over the last 10 years or so, many different frameworks for learning with weak supervision were suggested (e.g., indirect supervision, distant supervision, response-based, constraint-based, to name a few). First, I'd suggest acknowledging these works and discussing the differences to your work. Second - Is your approach applicable to these frameworks? It would be an interesting to compare to one of those methods (e.g., distant supervision for relation extraction using a knowledge base), and see if by incorporating fidelity score, results improve. + +- Can this approach be applied to semi-supervised learning? Is there a reason to assume the fidelity scores computed by the teacher would not improve the student in a self-training framework? + +- The paper emphasizes that the teacher uses the student's initial representation, when trained over the clean data. Is it clear that this step in needed? Can you add an additional variant of your framework when the fidelity score are computed by the teacher when trained from scratch? using different architecture than the student? + + - I went over the authors comments and I appreciate their efforts to help clarify the issues raised.",7,4.0,ICLR2018 +SkxP4oloYB,2,HJe88xBKPr,HJe88xBKPr,Official Blind Review #3,"In this paper, the authors propose a method to train deep neural networks using 8-bit floating point representation for weights, activations, errors, and gradients. They use enhanced loss scale, quantization and stochastic rounding techniques to balance the numerical accuracy and computational efficiency. Finally, they get a slightly better validation accuracy compared to full precision baseline. Overall, this paper focuses on engineering techniques about mixed precision training with 8-bit floating point, and state-of-the-art accuracy across multiple data sets shows the effectiveness of their work. + +However, there are some problems to be clarified. +1. The authors apply several techniques to improve the precision for training with 8-bit floating point, but they do not show the gain for each individual. For example, how much improvement can this work achieve when just using enhanced loss scaling method or a stochastic rounding technique? This should be clearly presented and more experimental comparison is expected. + +2. The paper should present a bit more background knowledge and discussion on the adopted techniques. For instance, why the stochastic rounding method proposed in this article by adding a random value in probability can regulate quantization noise in the gradients? And why Resnet-50 demands a large scaling factor? + +3. On Table 3, in comparison with Wang et al. (2018), the authors use layers with FP32 (not FP16 in Wang). Thus, it is hard to say the improvement comes from the proposed 8-bit training. 
This should be clarified. + +4. How to set the hyper-parameters, such as scale, thresholds and so on, is not clear in the paper. There are no guidelines for readers to use these techniques. + +5. The authors did not give a clear description of the implement for the enhanced loss scaling. They apply different loss scaling methods for different networks. This should be explained in detail. + +6. In the experiment, for a single model, some layers are 8-bit, some layers are 32-bit and some layers are 16-bit. Is the 8-bit training only applicable for a part of the model? How do we know which layer is suitable for 8-bit training? ",6,,ICLR2020 +HJxiMPIitS,1,SklEhlHtPr,SklEhlHtPr,Official Blind Review #1,"This paper proposes to learn representations of protein and molecules for the prediction of protein-ligand binding prediction. + +The presentation of this paper is a bit lengthy and repetitive in some cases. The long descriptions of protein/drug descriptors are a nice overview, but it may be unnecessary as the authors in the end use other works’ embedding. + +The author points out that there are interpretability issues & the inability to capture the shape of the substructures with previous ligand descriptors, however, it seems that CDDD also is not interpretable and could not capture the shape as it operates on SMILES strings, although seems to have better predictive performance. + +For the protein descriptor, the author is missing several important descriptors that may not have the issues mentioned such as Protein Sequence Composition descriptor. + +The technical novelty is very limited. It seems the usage of CDDD and UniRep are its only difference from previous works such as DeepDTA, WideDTA, DeepConv-DTI, PADME and CDDD and UniRep are also from other works. It may be more suitable for a domain journal instead of ICLR which focuses on method innovation. + + +The experimental setup is solid with realistic considerations. However, it is missing many baselines such as DeepDTA, WideDTA, DeepConv-DTI, PADME and more classic methods such as SimBoost and KronRLS. ",1,,ICLR2020 +SJgryzvchX,2,HJG7m2AcF7,HJG7m2AcF7,"Interesting approach, proposes representation augmentation as opposed to representation learning and the proposed distance not used for training.","The paper proposes a method to augment representation of an entity (such as a word) from standard ""point in a vector space"" to a histogram with bins located at some points in that vector space. In this model, the bins correspond the context objects, the location of which are the standard point embedding of those objects, and the histogram weights correspond to the strength of the contextual association. The distance between two representations is then measured with, Context Mover Distance, based on the theory of optimal transport, which is suitable for computing the discrepancy between distributions. +The representation of a sentence is proposed to be computed as the barycenter of the representation of words inside. +Empirical study evaluate the method in a number of semantic textual similarity and hypernymy detection tasks. + +The topic is important. The paper is well written and well structured and clear. The method could be interesting for the community. However, there are a number of conceptual issues that make the design a little surprising. First, the method does not learn the representations. Instead, augments a given one and computes the context mover distance on top of that. 
But, if the proposed context mover distance is an effective distance, maybe representations are better to be ""learned"" based on the same distance rather than being received as inputs. +Also, whether an object is represented as a single point or as a distribution seems to be an orthogonal matter to whether the context predicts the entity or vice versa. This two topics are kind of mixed up in the discussions in this paper. + +Other issues: + +- One important technicality which seems to be missing is the exact value of p in Wp which is used. This becomes important for barycenters computations and the uniqueness of barycenters. +- Competitors in Table 1 are limited. Standard embedding methods are missing from the list. +- Authors raise a question in the title of the paper, but the content of the paper is not much in the direction of trying to answer the question. +- It is not clear why the ""context"" of hyponym is expected to be a subset of the context of the hypernym. This should not always be true. +- Table 4 gives the impression that parameter might not be set based on performance on validation set, but instead based on the performance on the test set. + +- Minor: +of of +data , +by +byMuzellec +CITE + +Overall, comparing strengths and shortcomings of the paper, I vote for the paper to be marginally accepted.",6,4.0,ICLR2019 +rcw069COy-b,2,kGvXK_1qzyy,kGvXK_1qzyy,"Interesting work, it is not clear to me how much it is novel w.r.t. standard multivariate CDT.","The authors present a novel change detection test for non i.i.d. data motivated by applications in RL. At first, they provide an offline version of the test, then they extend it to the online setting. + +The paper is clearly written and presents both theoretical results and convincing experimental results. My two concerns are about the novelty of what has been proposed w.r.t. standard CDT procedures and on the fact that a consistent part of the material of what has been proposed in the paper is deferred to the appendix. + +I would like to have more discussion on the difference between what has been proposed here and the standard multivariate CDTs, for instance: +Kuncheva, Ludmila I. ""Change detection in streaming multivariate data using likelihood detectors."" IEEE transactions on knowledge and data engineering 25.5 (2011): 1175-1180. +Boracchi, Giacomo, et al. ""Quanttree: histograms for change detection in multivariate data streams."" International Conference on Machine Learning. 2018. +A strong motivation of the novelty w.r.t. to the literature might make me increaese the paper score. + +In my opinion, the paper is not self-contained. Not even the main theorem proof are included (the sketches are not useful in understanding the proof line) and the experimental setting is described in details only in the additional material. I think you should rearrange some of the material from the appendix to the main paper and viceversa. + +I would have appreciated a more detailed description of Algorithm 1. In this version of the paper it is difficult to understand the procedure you proposed, if you do not refer to the appendix for details. + +In your setting the change in the episodic reward is only about the expected values. What happens if the new reward distribution changes in terms of covariance \Sigma? + +Minor: +""in RL ... life-time of the task."" I would have preferred a citation about this statement. Showing evidence using your experiment is a bit premature at this stage of the presentation. 
+assume strong assumptions -> require strong assumptions + +-------------------------------------------------------------------------------------------------- +After rebuttals the authors significantly improved the presented work, including and discussing some relevant work which was previously missing. +",6,3.0,ICLR2021 +SkxeCjmJqS,3,S1gwC1StwS,S1gwC1StwS,Official Blind Review #3,"This paper introduces the notion of barcodes as a topological invariant of loss surfaces that encodes the ""depth"" of local minima by associating to each minimum the lowest index-one saddle. An algorithm is presented for the computation of barcodes, and some small-scale experiments are conducted. For very small neural networks, the barcodes are found to live at small loss values, and the authors argue that this suggests it may be hard to get stuck in a suboptimal local minimum. + +I believe the concept of barcodes will be new to most members of the ICLR community (at least it was to me), and I appreciate the authors' effort to convey the ideas through multiple definitions in Section 2. I wasn't able to fully appreciate the importance of Definition 3, and Definitions 1 and 2 were tough to digest owing to imprecise language, but I think I got the main point. I was also unable to fully comprehend the definitions of ""birth"" and ""death"" in this context. I'd strongly encourage the authors to improve the readability of this section so that non-experts can follow the story. + +It seems like the main contribution is a new algorithm for computing barcodes of minima. I am unfamiliar with prior work in this direction, and I was also unable from the paper to infer what the main improvements were relative to the existing algorithms. I'd encourage the authors to state their explicit algorithmic improvements, and to demonstrate empirically that the new algorithm outperforms the prior ones in the expected ways. + +The main experiments are on extremely tiny neural networks, presumably owing to computational restrictions. The authors state that ""it is possible to apply it to large-scale modern neural networks"", but it's not clear to me how that would work or what additional algorithmic improvements (if any) would need to be made in order to do so. I don't think that the results on tiny neural networks have much relevance to practice, so I think the empirical data presented in this paper will have very limited impact. If there were results for practical models, it would be a different story. So I'd encourage the authors to devote additional effort to scaling up the method for use on practical neural network architectures. + +Overall, I think there may be some really nice ideas in this paper that could help shape our understanding of neural network loss surfaces, but the current paper does not explore those ideas fully and does not convey them in a sufficiently clear manner. I hope to see an improved version of this paper at a future conference, but I cannot recommend acceptance of this version to ICLR.",1,,ICLR2020 +B1X0w52xz,3,SyuWNMZ0W,SyuWNMZ0W,Importance sampling correction to MMD for handling class imbalance,"This paper presents a modification of the objective used to train generative networks with an MMD adversary (i.e. as in Dziugaite et al or Li et al 2015), where importance weighting is used to evaluate the MMD against a target distribution which differs from the data distribution. 
The goal is that this could be used to correct for known bias in the training data — the example considered here is for class imbalance for known, fixed classes. + +Using importance sampling to estimate the MMD is straightforward only if the relationship between the data-generating distribution and the desired target distribution is somehow known and computable. Unfortunately the treatment of how this can be learned in general in section 4 is rather thin, and the only actual example here is on class imbalance. It would be good to see a comparison with other approaches for handling class imbalance. A straightforward one would be to use a stratified sampling scheme in selecting minibatches — i.e. rather than drawing minibatches uniformly from labeled data, select each minibatch by sampling an equal number of representatives from each class from the data. (Fundamentally, this requires explicit labels for whatever sort of bias we wish to correct for, for every entry in the dataset.) I don't think the demonstration of how to compute the MMD with an importance sampling estimate is a sufficient contribution on its own. + +Also, I am afraid I do not understand the description of subfigures a through c in figure 1. The target distribution p(x) is given in 1(a), a thinning function in 1(b), and an observed distribution in 1(c). As described, the observed data distribution in 1(c) should be found by multiplying the density in 1(a) by the function in 1(b) and then normalizing. However, the function \tilde T(x) in 1(b) takes values near zero when x < 0, meaning the product \tilde T(x)p(x) should also be near zero. But in figure 1(c), the mode of p(x) near x=0 actually has higher probability than the mode near x=2, despite the fact that there \tilde T(x) \approx 0.5. I think this might simply be a mistake in the definition of \tilde T(x), and that rather it should be 1.0 - \tilde T(x), but in any case this is quite confusing. + +I also am confused by the results in figure 2. I would have thought that the right column, where the thinning function is used to correct for the class imbalance, would then have approximately equal numbers of zeros and ones in the generative samples. But, there are still more zeros by a factor of around 2. + +Minor note: please double-check references, there seem to be some issues; for example, Sutherland et al is cited twice, once as appearing at ICML 2016 and once as appearing at ICML 2017. + +",4,5.0,ICLR2018 +OaK82EPOM-R,3,8pz6GXZ3YT,8pz6GXZ3YT,Theoretical results for learning a one-hidden-layer neural network with sparse ground truth weights given a Gaussian input distribution,"Summary of review: + +This paper provides recovery guarantees for learning one-hidden-layer neural networks with sparse ground truth weights, given an isotropic Gaussian input distribution. The main result shows local convexity guarantees near the ground truth. Provided that a mask of the sparsity pattern is already known, this paper extends the tensor initialization approach of Zhong et al'17 to show a convergence guarantee for learning the sparse neural network. Simulations validate the local + +Setting: + +This paper focuses on learning a one-hidden-layer neural network, where the weight matrix of the hidden layer is sparse, given input samples from an isotropic Gaussian distribution. + +Results: + +(i) The first result is that within a small vicinity of the ground truth weight matrix, the standard mean squared loss for learning the neural network is convex. 
+ +(ii) The second result shows how to learn the ground truth weight matrix, by extending the tensor initialization approach in Zhong et al'17. + +(iii) Numerical results are provided to validate the above two theoretical results. + +Pros: + +- The sample size requirement of both result (i) and (ii) growly proportionally to the sparsity of the ground truth matrix, as opposed to the size of the matrix. This result is particularly interesting in light of recent empirical results about network pruning and learning sparse ConvNets (Neyshabur'20). + +Cons: + +- The authors prove the above results by adapting the proof of Zhong et al'17. In fact, since the input distribution is isotropic, standard concentration results apply whether or not the ground truth matrix is sparse. Therefore, it is unclear to the reviewer whether this result is as novel as the authors claim in the introduction. + +- The learning algorithm assumes knowledge of the sparsity mask. This seems like a strong assumption. Isn't the point of IMP to find this sparsity mask? Understanding how to find this sparsity mask seems like a more important question, but this is not discussed at all in this paper. + +Writing: + +Overall, this paper is easy to follow. The quality of writing is marginally acceptable. Please find several detailed comments below. + +- P1, ""the theoretical justification of winning tickets are remains elusive expect for a few recent works"" -> remove ""are"", replace ""expect"" with except + +- P4: ""an one-hidden-layer neural network"" -> a one-hidden-layer neural network + +- P5: here you say that $\varepsilon_1 = \Theta(\sqrt r)$, but $r > 1$ and $\varepsilon_1 < 1$. Please clarify. + +- P5: regarding the convergence for the vanilla GD algorithm. Please add a reference to this claim. + +- P5: ""accurate estimate"" -> replace ""estimate"" with estimation",5,4.0,ICLR2021 +oV8uZ3CDIL6,1,n1wPkibo2R,n1wPkibo2R,Review of An Efficient Protocol for Distributed Column Subset Selection in the Entrywise lp Norm,"SUMMARY: + +The paper considers the column subset selection (CSS) problem, which has received considerable attention in numerical linear algebra. It considers a distributed variant of CSS in the $\ell_p$ norm, where $p \in [1,2)$. Despite the attention this problem has received previously, it seems like this is a new setting that has not been considered before. The paper primarily provides theoretical results, but also does some experiments. + +I think this is an interesting paper and problem. However, there are a few things I'd like to see addressed before I can give a higher score. There are a few details in the proof of Theorem 6 that are unclear. Moreover, the readability of the paper could be improved. I provide further details below. + + +ADVANTAGES: + +- CSS is an important problem in numerical linear algebra. +- The setting under consideration seems to be new. +- The paper provides both theoretical and experimental results. + + +CONCERNS/QUESTIONS: + +- Section 1 would be easier to follow if the definitions in Section 2.1 were given when the concepts are first used in Section 1 instead. + +- The p-stable distribution is discussed throughout the paper, but doesn't seem to be defined anywhere, not even in the supplement. It would be helpful if the p-stable distribution was defined when it is first mentioned. Also, in the proof of Lemma 3.1, it would be helpful if you could point to a reference that contain the relevant facts about p-stable distributions. 
+ +- In Section 1.2, the first paragraph, you make a distinction between an approximation to OPT and an approximation to the best possible subset of columns. Can you clarify the distinction between these two types of approximation? + +- The numbering of the lemmas is confusing. The numberings used in the main paper are reused in the supplement, but for different results. For example, Lemma 1 in the main paper is called Lemma 2 in the supplement, and Lemma 2 in the main paper is called Lemma 3 in the supplement. This makes the proof of Theorem 6 confusing. + +- On the 4th line in Section 1.1., you say that the coordinator computes $\sum_i A_i S = A S$. But $A$ is of size $d \times n$, and each $A_i$ is of size $d \times n_i$, so this equality isn't right. Do you mean to say that $\sum_i \tilde{A}_i S = A S$, where each $\tilde{A}_i$ is of size $d \times n$ and with all entries zero except those columns corresponding to the columns in $A_i$ (or something to that effect)? + +- In the proof of Theorem 6, when you apply Lemma 5 the first time: Don't you need both $V'$ and $M$ to be defined in terms of the $p,2$-norm rather than the $p$ norm for Lemma 5 to apply? In other words, should they be defined as + +$V' := \arg \min_V |S A_I V - S A|_{p,2}$ + +and + +$M := \arg \min_M | S A_I M - S A T |_{p,2}$? + +- In the proof of Theorem 6, when you apply Alg. 4: It is not clear how Alg. 4 from the supplement is applied here. Alg. 4 outputs factors U, V such that UV approximates A. What is $(SAT)^*$? Is $(SAT)^* = UV$ here? + +- In the proof of Theorem 6, when you apply Lemma 5 the second time: It looks like this uses the first equality in Lemma 5 in the supplement, which requires $P_2^*$ to be a projection matrix. How can we know that it is? Moreover, even if the sampling matrices $T_i$ are chosen to satisfy Lemma 5, how do we know that the matrix $T$ obtained from the $T_i$'s in Alg. 1 also satisfies Lemma 5? + +- What is the run time of Algorithm 2? Does finding each $j^*$ require evaluating the objective for every choice of $j \in \overline{T}$, or is there a better way to do this? + +- In Section 6, in the Setup paragraph, you say that you report the minimum error over 15 trials. Is there a reason for reporting the minimum rather than the mean or median? Is the mean/median also competitive? + +- The paper ends abruptly with no conclusion. + + +MINOR CONCERNS/QUESTIONS: + +- In the sentence right after Lemma 2 in the main paper, should $|S A_T V - S A_T|_p$ be $|S A_T V - S A|_p$? + +- In Algorithm 2, should the argmin be over $j \in \overline{T}$ rather than $j \in A_{\bar{T}}$? + +- In Tables 3, 4 and 5 of the supplement, should k-CSS_{!,2} be k-CSS_{1,2} (i.e., a '1' instead of an exclamation point)? + + +################################### + +Update: + +The authors have addressed the concerns I had in my initial review. I have raised the score from 6 to 7.",7,3.0,ICLR2021 +BJHcawFxM,2,BySRH6CpW,BySRH6CpW,Simple idea that seem to work but the novelty is limited and some regularization choices might not do what is expected.,"This paper proposes training binary and ternary weight distribution networks through the local reparametrization trick and continuous optimization. 
The argument is that due to the central limit theorem (CLT) the distribution on the neuron pre-activations is approximately Gaussian, with a mean given by the inner product between the input and the mean of the weight distribution and a variance given by the inner product between the squared input and the variance of the weight distribution. As a result, the parameters of the underlying discrete distribution can be optimized via backpropagation by sampling the neuron pre-activations with the reparametrization trick. The authors further propose appropriate initialisation schemes and regularization techniques to either prevent the violation of the CLT or to prevent underfitting. The method is evaluated on multiple experiments. + +This paper proposed a relatively simple idea for training networks with discrete weights that seems to work in practice. My main issue is that while the authors argue about novelty, the first application of CLT for sampling neuron pre-activations at neural networks with discrete r.v.s is performed at [1]. While [1] was only interested in faster convergence and not on optimization of the parameters of the underlying distribution, the extension was very straightforward. I would thus suggest that the authors update the paper accordingly. + +Other than that, I have some other comments: +- The L2 regularization on the distribution parameters for the ternary weights is a bit ad-hoc; why not penalise according to the entropy of the distribution which is exactly what you are trying to achieve? +- For the binary setting you mentioned that you had to reduce the entropy thus added a “beta density regulariser”. Did you add R(p) or log R(p) to the objective function? Also, with alpha, beta = 2 the beta density is unimodal with a peak at p=0.5; essentially this will force the probabilities to be close to 0.5, i.e. exactly what you are trying to avoid. To force the probability near the endpoints you have to use alpha, beta < 1 which results into a “bowl” shaped Beta distribution. I thus wonder whether any gains you observed from this regulariser are just an artifact of optimization. +- I think that a baseline (at least for the binary case) where you learn the weights with a continuous relaxation, such as the concrete distribution, and not via CLT would be helpful. Maybe for the network to properly converge the entropy for some of the weights needs to become small (hence break the CLT). + +[1] Wang & Manning, Fast Dropout Training. + +Edit: After the authors rebuttal I have increased the rating of the paper: +- I still believe that the connection to [1] is stronger than what the authors allude to; eg. the first two paragraphs of sec. 3.2 could easily be attributed to [1]. +- The argument for the entropy was to include a term (- lambda * H(p)) in the objective function with H(p) being the entropy of the distribution p. The lambda term would then serve as an indicator to how much entropy is necessary. +- There indeed was a misunderstanding with the usage of the R(p) regularizer at the objective function (which is now resolved). +- The authors showed benefits compared to a continuous relaxation baseline.",6,4.0,ICLR2018 +SygDedWRtH,2,rJgUfTEYvH,rJgUfTEYvH,Official Blind Review #1,"The paper ""VideoFlow: A Conditional Flow-Based Model for Stochastic Video "" proposes a new model for video prediction from a starting sequence of conditionning frames. 
It is based on a state-space model that encodes successive frames in a continuous hierarchical state, with contraints on trajectories of the codes in this state. + +I like the invertible NN framework the model relies on. It allows to avoid variational autoencoding of frames via invertible deterministic transforms. Learning the dynamics of the video is therefore easier, since there is no need of any stochastic inference process. However, is there no risk of high latent vacancy in the representation space? Uncertainty of stochastic inference usually helps filling the space by considering larger areas of codes than deterministic process. Also, since at each step, the next code is conditionned by the whole past sequence of codes, besides the increasing complexity induced, I am wondering if such a model is able to efficiently encode the dynamics and the stochasticity of the video. In fact, a given z_t does not encode any dynamics nor uncertainty at that point, only the image (it cannot since it is fully determined via the invertible function from the image). Imagine that at a given point, two very different scenarios can follow, with very different following frames. In that case, how could the next state could encode these two different futures with a simple gaussian in the space ? Also, it would be useful to compare the model with a version where the invertible frame encoder and the sequential model would be learned separately, to better understand what the model really does during training. A study of the impact of the hierarchy depth would also be useful. + +Also, an additional real-world dataset would be useful for really assessing the performance of the model, since BAIR is known to be fully random and the past does not highly impact the future. A possible dataset would be KTH. Other baselines could also be considered, notably the famous approach from [Denton et al., 2017]. + +At last, the clarity of some parts could be improved. Notably the description of the sequential model in the space, whih is succintly given in the appendix. + +",6,,ICLR2020 +AhiKr6qN6Ma,1,VbLH04pRA3,VbLH04pRA3,"A complex algorithm with some impressive results, but lacking rigorous and/or intuitive justification.","This paper proposes a new algorithm for black-box optimization (BlendSearch) that combines global search methods together with local search methods. The stated goal is to ensure convergence to the global optimum, while avoiding configurations with a high cost (e.g. that take a long time to evaluate). + +Pros: +- Section 2 (related work) is well-written and quite comprehensive. +- Extensive experiments on different applications (XGBoost, LightGBM, DeepTables, fine-tuning NLP) which show promising results. + +Cons: +- The paper is lacking intuitive explanations for how and why the algorithm works so well. +- Theoretical and/or experimental justifications are needed for some of the key statements in the paper. + +My main criticism of the paper is regarding Section 3 and how the BlendSearch algorithm is described and presented. Clearly, the algorithm itself is a complex piece of engineering, and while that by itself is not a problem, very little intuition is provided to help the reader understand why the different design choices were made. Furthermore, it is not very clear to me how or why the fundamental claims made about the algorithm are true. For instance, it is stated that the algorithm will converge to the global optimum given sufficient budget. Is it guaranteed under all conditions? 
Maybe this needs a theoretical proof or maybe it is somehow obvious by construction, but either way the paper should explain this clearly. Additionally, it is not very clear why the config validator step helps to avoid configurations with a high cost: the process is described in technical detail but what is the intuition here? Fundamentally: why does local search help to avoid evaluating high cost configurations? + +In Section 1 it is stated that multi-fidelty methods (e.g. successive halving or Hyperband) can only be used when cheap proxies for the “accuracy assessment” exist. However, isn’t it always possible to construct such a proxy by taking a random subsample of the training data (e.g. as suggested in the Hyperband paper). In Section 4.1 the authors say that multi-fidelity methods were not considered as a baseline when tuning XGBoost and LightGBM, because they tried using both subsample size as well as number of iterations and in both cases it “hurt the performance”. It would improve the paper to include some experimental results to justify this statement. While of course it is true that a hyper-parameter configuration that works well on a small subsample (or small number of boosting iterations) may not work well on the full dataset (or with a large number of iterations), similar arguments can be made about the number of epochs when training neural networks (e.g. optimal learning rate for small number of epochs is not necessarily the optimal learning rate for a large number of epochs). This does not stop multi-fidelity methods being widely used to tune neural networks. + +Another area where I think the paper could be improved is with how the authors deal with parallelism. While there does seem to be a paragraph in the Appendix stating that BlendSearch can be parallelized, it is not discussed in the main manuscript. Methods like Hyperband and successive halving and in the extreme case, random search, admit very high degree of parallelism. Because of this, in many real applications they are often preferred over more complex methods. While the results of Section 4.3 indicate that BlendSearch using 1 worker can out-perform asynchronous successive halving (ASHA) using 16 workers which is a nice result, it is not clear whether any parallelism was employed for the results in Section 4.2. If no parallelism was used, does ASHA even make sense (e.g. without any parallelism there is no need for the ‘asynchronous’ aspect)?. The paper could be made stronger by showing how the performance of BlendSearch improves as more workers are added, and how it compares to schemes like Hyperband and AISA with the same number of workers. + +Additional comments: +- Figure 1 does not seem to be referenced from anywhere the text. There are also no error bars shown on the plots. Given that the algorithms being compared are highly sensitive to their initial conditions, it is not possible to make statements without running multiple times with different random seeds. +- In Section 4.2, the author state that ‘BlendSearch-HyperBand’ uses random search for global search and HyperBand pruning strategy. However, the concept of a pruning strategy is not discussed anywhere in the manuscript. It is hard to understand what this means. +- The notation used in Section 3 to group certain variables as “attributes” (e.g. P.CostFunc(.)) is quite non-standard and a bit hard to read, but maybe this is a personal preference. 
+- In Section 3 could be improved by carefully defining what exactly what is meant by a ‘thread’ in this context. +- In the experimental section, were the optimization algorithms re-run using different random seeds, or are all statistics generated over multiple folds with the same optimization seed? If the latter, the results could be severely biased if one algorithm uses a ‘lucky’ seed that results in a good sequence of configurations that work well across most datasets. +",7,4.0,ICLR2021 +rkxlxvd2KB,2,H1lma24tPB,H1lma24tPB,Official Blind Review #3,"The paper presents an extension of Glorot/He weight variance initiazation formula for the hypernetworks. Hypernetworks are the class of neural networks where one (hyper) model generates the weights for the another (main) network, introduced by Ha et.al in 2016. +Authors argue and show via experiments that standard weight init formulas do not work for hypernetworks, resulting in vanishing or exploding activations and re-derive the formula for convolutional/fully-connected networks + ReLU. +They show that proposed method allows hypernet training when the standard ways don`t. + +The technical contribution seems as logical and straightforward yet necessary step for hypernetwork-related research. + +Questions: + + - In standard NNs, initialization issues are mostly solved after introduction of BatchNorm. Wouldn`t it be the case for hypernetwork as well: to just add BN layers between main net layers? + + - Figure 2. What are is the y axis of the figure? Norm? Variance? Mean? The same question for the most of Figures in appendix + +- It would be nice to see how proposed method stands vs mentioned heuristics like M3 and M4. + +Overall I like the paper. + +Minor comments: + + + > ""This fundamental difference suggests that conventional knowledge about neural networks may not +apply directly to hypernetworks and radically new ways of thinking about weight initialization, +optimization dynamics and architecture design for hypernetworks are sorely needed."" + +I don`t see anything ""radically new"" in re-derivation of Xavier formula to the new type of network. + + +====== +Revision: + +Revised version addressed my concerns and the batchnorm-related experiment is indeed surprising. +Overall, I like the paper and increase evaluation to strong accept. + +",8,,ICLR2020 +H1ePzKpP27,1,SJgiNo0cKX,SJgiNo0cKX,"Interesting idea should be explained better, and lacks objective comparison to baselines","The paper addresses the problem of pixel-wise segmentation of lanes from images taken from a vehicle-mounted camera. The proposed method uses multiple passes through encoders decoders convnets, thereby allowing extract global features to inform better local features, and vice versa. Only qualitative baseline comparisons are presented by manually comparing the output of the network to reported results of other methods in [Pan et al.2017]. +It is unclear to me if the proposed multiple encoder-decoder network is a novel architecture, or a known architecture applied to a novel use case. In case of the former, more details should be given on the design of the network, how it is trained, etc. for reproducibility. The biggest problem however is the subjective manual comparison to existing methods, which the authors do in favor of a quantitative comparison using well-understand objective metrics. While they point out problems with evaluating segmentation with conventional accuracy metrics, no attempt is made to make a better objective measure. 
We are left to judge the results on only a few selected example frames. +It is also unclear how the method and evaluation strategy compares to methods which predict lanes as splines or other parameterized functions. E.g. see surveys on existing approaches, and discussion of different evaluation strategies, e.g. ""Recent progress in road and lane detection: a survey"" [Hillel et al.2014] and ""Visual lane analysis and higher-order tasks: a concise review"" [Shin, 2014]. +Throughout the paper, various fuzzy and unclear statements are made (see detailed comments below). The paper would be in a better shape if more time is spend to improve the writing, provide more details on the method, and extend the experiments. + +Pros: ++ multiple encoder-decoder stages could be beneficial for lane segmentation + +Cons: +- lacking evaluation and comparison to baseline methods +- missing details on proposed network architecture, making it hard to reproduce +- unclear what colors in figures for qualitative evaluation represent: are individual lanes also distinguished? + +Below are more detailed comments and questions: +* Abstract + * ""the capability has not been fully embodied for"" → Fuzzy statement, I don't understand what this means. + * ""In especial"" → check grammar +* Sec 1.: Introduction + * ""the local information of a lane such as sharp, edges, texture and color, can not provides distinctive features for lane detection"" local edges are not distinctive for lanes? Possibly local edges alone are not sufficient, but various lanes detection approaches rely on edge extraction as features. This statement therefore seems too strong. + * ""End-to-end CNNs always give better results than systems relying on hand-crafted features."". It is not possible to say that one type of classifier categorically better than another. The 'best' classifier depends on the problem at hand, valid assumptions that can be made, and the amount of training data avaiable, among others. For instance, ""How Far are We from Solving Pedestrian Detection?"" [Zhang,CVPR16] demonstrates that CNNs do not always give better results than hand-crafted features for some tasks and datasets. The paper should be more careful with such strong statements. + * ""Highly hand-craft features based methods can only deal with harsh scenarios."". I don't understand, is this statement intended as an argument against hand-crafted features? Isn't it good to deal especially with harsh scenarios? + * ""but less explored on Semantic Image Segmentation due to strong prior information is needed."" CNNs are extensively used for semantic image segmentation, e.g. see the well-known Cityscapes benchmark. + * ""recent methods have replaced the feature-based methods with model-based methods."". Not sure why the paper call CNNs ""model-driven methods"", but refer to the earlier classical methods with highly designed representations (Kalman filter, B-snakes, ...) as ""feature-based methods"". This seems diferent from what I typically see, where CNNs are referred to as 'data-driven methods', and the classical methods as ""model-driven"". +* Sec 1.2: Contributions + * ""First, reduced localization accuracy due to the weak performance of combining the local information and global information effectively and efficiently"". Instead of presenting a first contribution, the paper presents a problem. Do the authors mean that they ""tackle the problem of reduced localization accuracy ..."" ? That would still not make this contribution very concrete though ... 
+ * ""We make our attempts to rethink these IoU based methods."" → Please argue in favor of your new method. An in-depth comparison of evaluation methods, and why some metrics fail or could be redesigned would be good. However, the paper currently fails to present a new metric, and convince that it tackles shortcomings of established metrics. + +* Sec 2.: Multiple Encoder-Decoder Nets + * Figure 2: Is this the first paper to propose this multiple encoder-decoder net? Or is the idea taken from other work, and is the novelty to apply it to this problem? If this general architecture was already proposed (for semantic segmentation?), please add citations and discuss it as related work. If this network design is completely novel, I would expect more details on how the network is constructed (e.g. dimensions of each layer, non-linear activation function used, batch normalization, strides, etc.). + * ""the following loss function:"". Since it is a binary classification problem, and not a regression problem, why not use a (binary) cross entropy loss instead of a mean squared error? + +* Sec 3: Experiments + * Figure 3: What is the ""Baseline"" method ? Where are the references to the other works, or is the reader required to read [Pan'2017] to understand your figures? + * Figure 3: How are the colors in these figures determined? Is this also an instance segmentation problem? From your methodology section I though only binary classification was considered. Do you do some post-processing to separate individual lanes? I find this confusing, as I thought that the task was limited to binary segmentation. + * ""Recent works evaluated ..."" please cite the works you refer to. + * ""we have compared more than 500 probmaps of each level nets manually and count the accuracy of these probmaps as shown in figure 7."" So if I understand correctly, instead of using an objective evaluation metric, you have reverted to manual labor to visually judge lane detection quality. This is not really a metric, and not really a solution that 'rethinks IoU based methods.' Problems of your approach is that it is unclear on what criteria results are judged, your evaluation is not objectively reproducible by others, and does not scale well for novel future evaluations. Why is this even needed? E.g. why not use some chamfer distance or Gaussian smoothing of the edge map if you want to evaluate near coverage instead of hard boundaries? Or, fit a function through the boundary, and evaluate distance (in meters) to true lane. I find the proper discussion and motivation for manual evaluation over objective metric evaluation lacking. + * Figure 6: What are the Ground Truth images of each row ? E.g. in the fourth row from the top, should the right-most yellow lane be present or not? As it stands, I can't interpret the columns and see which x times is visually 'better'. +* Sec 3.4: + * ""To improve ability of the network, we propose a small quantity of channel to reduce overfitting by considering inter-dependencies among channels."" To improve relative to what? Where are the results comparing large amounts vs small amount of channels? Note that Figure 8 is not referred to in the text, and confusingly compares ""18 layers"" to ""1 layers"". Do you mean channels instead of layers? And, how many channels were to obtained the results in the preceding sections? 
+",4,4.0,ICLR2019 +S1zWvGXNg,3,Hyq4yhile,Hyq4yhile,Review,"This paper presents an approach for skills transfer from one task to another in a control setting (trained by RL) by forcing the embeddings learned on two different tasks to be close (L2 penalty). The experiments are conducted in MuJoCo, with a set of experiments being from the state of the joints/links (5.2/5.3) and a set of experiments on the pixels (5.4). They exhibit transfer from arms with different number of links, and from a torque-driven arm to a tendon-driven arm. + +One limitation of the paper is that the authors suppose that time alignment is trivial, because the tasks are all episodic and in the same domain. Time alignment is one form of domain adaptation / transfer that is not dealt with in the paper, that could be dealt with through subsampling, dynamic time warping, or learning a matching function (e.g. neural network). + +General remarks: The approach is compared to CCA, which is a relevant baseline. However, as the paper is purely experimental, another baseline (worse than CCA) would be to just have the random projections for ""f"" and ""g"" (the embedding functions on the two domains), to check that the bad performance of the ""no transfer"" version of the model is due to over-specialisation of these embeddings. I would also add (for information) that the problem of learning invariant feature spaces is also linked to metric learning (e.g. [Xing et al. 2002]). More generally, no parallel is drawn with multi-task learning in ML. In the case of knowledge transfer (4.1.1), it may make sense to anneal \alpha. + +The experiments feel a bit rushed. In particular, the performance of the baseline being always 0 (no transfer at all) is uninformative, at least a much bigger sample budget should be tested. Also, why does Figure 7.b contain no ""CCA"" nor ""direct mapping"" results? Another concern that I have with the experiments: (if/how) did the author control for the fact that the embeddings were trained with more iterations in the case of doing transfer? + +Overall, the study of transfer is most welcomed in RL. The experiments in this paper are interesting enough for publication, but the paper could have been more thorough.",7,3.0,ICLR2017 +HygHpb4qcr,2,SJezGp4YPr,SJezGp4YPr,Official Blind Review #2,"The paper analyses the convergence of TD-learning in a simplified setup (on-policy, infinitesimally small steps leading to an ODE). + +Several new results are proposed: +- convergence of TD-learning for a new class of approximators (the h-homogenous functions) +- convergence of TD-learning for residual-homegenous functions and a bound on the distance form optimum +- a relaxation of the Markov chain reversibility to a reversibility coefficient and convergence proof that relates the reversibility coefficient to the conditioning number of grad_V grad_V^T. + +While the theory applies to the ideal case, t provides some practical conclusions: +- TD learning with k-step returns converges better because the resulting Markov chain is more reversible +- convergence can be attained by overparmeterized function approximators, which can still generalize better than tabular value functions. + +The experiments corroborate the link between reversibility factor and TD convergence on an artificial example.",6,,ICLR2020 +Bkefd1gj9r,3,SJloA0EYDr,SJloA0EYDr,Official Blind Review #4,"This paper presents the search algorithm A*MCTS to find the optimal policies for problems in Reinforcement Learning. 
In particular, A*MCTS combines the A* and MCTS algorithms to use the pre-trained value networks for facilitating the exploration and making optimal decisions. A*MCTS refers the value network as a black box and builds a statistical model for the prediction accuracies, which provides theoretical guarantees for the sample complexity. The experiments verify the effectiveness of the proposed A*MCTS. + +In summary, I think the proposed A*MCTS algorithm is promising to push the frontier of studies of the tree search for optimal actions in RL. But the experiments should be improved to illustrate the reasons for the hyper-param setting. For example, in Sec. 6.2, the authors should give some explanations on why the depth of the tree is set as 10 and the number of children per state is set as 5. + +",6,,ICLR2020 +_tiCpFGx1NR,4,J40FkbdldTX,J40FkbdldTX,Official Blind Review #2,"This paper studies the relationship of correlation of ranking of networks sampled from SuperNet and that of stand-alone networks under various settings. They also study the how masking some operations in the search space and different ways of training effect the ranking correlation. + +Pros: +The paper has a lot of experiments to substantiate the claims. +Figure 3 where every operation is systematically masked, provides more insights about which operations are effective and how NAS behaves if one of the operation is masked. + +Cons: +Several other papers have already published similar findings. Overall the paper is very incremental. +More specifics in the questions + +Questions + +1. How is the SuperNet trained? +2. Figure2: Yu et al [1] have already explored the correlation of ranks of networks sampled from SuperNet and that of stand-alone networks. How is Figure 2 different from that? +3. RobustDarts [2] has explored the possibility of how subset of NASBENCH search spaces behave. FAIRDarts [3] also explored the influence of skip connection by running DARTS without skip connection, running random search by limiting skip connection to 2 etc. Figure 4 seems to be inspired by that. While it is interesting, this might be a slight extension to the work done by Yu et al [1] +4. Bender et al [4] postulate that the operations of a SuperNet are subject to co-adaptation and recommended techniques such as regularization, drop path etc to alleviate the same. RobustDarts also suggest some recommendations such as L2 regularization, drop path etc although in the context of DARTS. So while Figure 6 demonstrates this empirically, it is not a new finding. + +Overall, the empirical results in the paper are very useful for the NAS community. But the work is still very incremental. This might be better received as a workshop paper instead.",5,4.0,ICLR2021 +HJg15Lhgz,2,r1SnX5xCb,r1SnX5xCb,This paper proposes a novel method to solve the problem of active sensing,"This paper proposes a novel method to solve the problem of active sensing from a new angle (Essentially, the active sensing is a kind of method that decides when (or where) to take new measurements and what measurements we should conduct at that time or (place)). By taking advantage of the characteristics of long-term memory and Bi-directionality of Bi-RNN and M-RNN, deep sensing can model multivariate time-series signals for predicting future labels and estimating the values of new measurements. The architecture of Deep Sensing basically consists of three components: +1. Interpolation and imputation for each of channels where missing points exist; +2. 
Prediction for the future labels in terms of the whole multivariate signals (The signal is a time-series data and made up of multiple channels, there is supposed to be a measured label for each moment of the signal); +3. Active sensing for the future moments of each of the channels. + +Pros + +The novelty of this paper lies in using a neural network structure to solve a traditional statistical problem which was usually done by a Bayesian approach or using the idea of the stochastic process. + +A detailed description of the network architecture is provided and each of the configurations has been fully illustrated. The explanation of the structure of the combined RNNs is rigorous but clear enough of understanding. + +The method was tested on a large real dataset and got a really promising result based several rational assumptions (such as assuming some of the points are missing for evaluating the error of the interpolation & imputation). + +Cons + +How and why the architecture is designed in this way should be further discussed or explained. Some of the details of the design could be inferred indirectly. But somewhere like the structure of the interpolation in Fig.3 doesn't have any further discussion. For example, why using GRU based RNN, and how Bi-RNN benefits here. +",7,4.0,ICLR2018 +6iou19PmKSn,2,T3kmOP_cMFB,T3kmOP_cMFB,A good paper with new theoretical results and empirical evidence,"Summary: +The paper considers online optimization with zero-order oracle. Motivated by nonstationarity of the objective function, impracticality is underlined for the two-point feedback approach. Instead, staying in the one-point setting, the proposed approach reuses the objective value from the previous round of observations, which is called as residual feedback. The variance of the corresponding proxy for the subgradient is estimated under more relaxed assumptions than existing in the literature. The proposed approach leads to smaller variance and better regret bounds. Regret bounds are proved for smooth/non-smooth convex/non-convex cases, the non-convex case being analyzed for the first time in the literature. Numerical experiments show that the practical performance of the proposed gradient estimator is better than that of the existing one-point feedback methods and is close to the performance of the one-point approach with two observations per round. The latter approach can be impractical for some applications. + +Evaluation: +I believe that the paper contains new interesting results on zero-order methods with one-point feedback, which are supported both theoretically and numerically. So, I suggest accepting the paper. + +Pros: +1. New theoretical results which are significant for optimization and learning literature, as well as for applications. +2. Numerical results support theoretical findings. +3. The paper is overall clearly written and motivated. + +Cons: +1. There are several minor comments mainly on the clarity of presentation. See below. + + +Minor comments +1. Some related work seems to be missing +http://proceedings.mlr.press/v48/hazanb16.pdf (non-convex optimization with one-point feedback) +https://link.springer.com/article/10.1007%2Fs10107-014-0846-1 (non-convex stochastic optimization) +http://papers.nips.cc/paper/5377-bandit-convex-optimization-towards-tight-bounds.pdf +http://papers.nips.cc/paper/4475-stochastic-convex-optimization-with-bandit-feedback.pdf +2. Please consider writing explicitly on p.7 that Bach & Perchet (2016) use two function evaluations in each round. 
Also it would be nice to explain in more details, why their approach is impractical. For example, in the considered in Sect. 6.1 example, why one can not observe x_{k+1} two times (with different values w_k), and the evaluate the loss twice? +3. The proof of Lemma 2.5 does not completely correspond to the statement of the Lemma. In the proof more is derived than stated in the Lemma, but under additional assumptions. +4. In the first line of Appendix F, did you mean that $f_{\delta,t} \in C^{1,1}$? Also here Assumption 3.1 is used, which should be mentioned. + +",8,4.0,ICLR2021 +ByeyV10-TX,3,Hye9lnCct7,Hye9lnCct7,"A good idea, but suffers from lack of clarity","The paper suggests a method for generating representations that are linked to goals in reinforcement learning. More precisely, it wishes to learn a representation so that two states are similar if the policies leading to them are similar. + +The paper leaves quite a few details unclear. For example, why is this particular metric used to link the feature representation to policy similarity? How is the data collected to obtain the goal-directed policies in the first place? How are the different methods evaluated vis-a-vis data collection? The current discussion makes me think that the evaluation methodology may be biased. Many unbiased experiment designs are possible. Here are a few: + +A. Pre-training with the same data + +1. Generate data D from the environment (using an arbitrary policy). +2. Use D to estimate a model/goal-directed policies and consequenttly features F. +3. Use the same data D to estimate features F' using some other method. +4. Use the same online-RL algorithm on the environment and only changing features F, F'. + +B. Online training + +1. At step t, take action $a_t$, observe $s_{t+1}$, $r_{t+1}$ +2. Update model $m$ (or simply store the data points) +3. Use the model to get an estimate of the features + +It is probably time consuming to do B at each step t, but I can imagine the authors being able to do it all with stochastic value iteration. + +All in all, I am uncertain that the evaluation is fair. +",5,4.0,ICLR2019 +OhfiOPbsiDk,3,7t1FcJUWhi3,7t1FcJUWhi3,[Official Review]: A nice idea being somewhat oversold; weak experimental evaluation ,"This paper proposes an interesting and potentially quite impactful and valuable idea, which I believe is novel. +The idea is: instead of specifying invariances by hand in the architecture of a network, we can instead specify a set of possible invariances, and regularize the model to favor more invariance. +The authors describe how to structure and regularize a DNN in this way, and provide proof-of-concept experiments. +The experiments show that the proposed method outperforms networks that are fully-invariant or non-invariant when the true data is partially-invariant. + +Unfortunately, there are a number of weaknesses which lead me to recommend against acceptance.  In no particular order: +1) The crucial ""unseen is forbidden"" hypothesis is vague and seems to be a bit of a strawman.   +2) The framing of the paper seems to oversell the method in a way that makes the contribution less clear.  +3) The writing is not very clear. +4) The experiments seem to be only proof-of-concept in scenarios where the method is designed to work.   +5) The method seems to incur an exponential cost, but this is not discussed. 
+ +Elaborating: +1) The authors claim that, because DNN behavior is undefined on unseen datapoints, the ""unseen-is-forbidden learning hypothesis is currently preventing neural networks from assuming symmetric extrapolations without evidence.""  This claim is stated in various forms several times, but never made very precise, and it is crucial in motivating the authors' approach.  Roughly, I take the authors to be claiming that (i) the correct way to ""extrapolate"" is to assume that: transformations that were not observed to change the target distribution should be assumed to NOT change the target distribution, (ii) DNNs will not extrapolate in this way by default, and must be explicitly designed to do so.   +These claims (or whatever the authors actually mean) need(s) to be stated explicitly, and with appropriate modesty.  After all, both (i) and (ii) seem contentious.   +The claim about an ""economical data generating process"" supports (i), but is itself somewhat vague and dubious, and should be discussed in the introduction as motivation for (i). + +2) The authors claim that their method can discover invariances without any data supporting them.  And their abstract claims: ""Any invariance to transformation groups is mandatory even without evidence, unless the learner deems it inconsistent with the training data.""  But in reality, the authors specify a small number of possible invariances which the method selects among (in a soft way).  And the data is used to guide this selection process.  So in reality, the designer is in charge of specifying a (restricted) set of (possible) invariances.  So like previous works on enforcing invariances, it places a  burden on the designer to identify plausible invariances.  Overall, I found the framing in the work to be ""the model discovers invariances by itself without any data!"" whereas a more neutral version would be ""instead of enforcing a set of invariances, we propose a set of *possible* invariances, and assume that any input transformations that are not observed to affect the label should be enforced"" + +3) Besides the above issues (vagueness of ""unseen-is-forbidden"" and related discussion (1), overselling (2)), there were several other issues of clarity.  The paper is not poorly written overall, but is much harder to read and understand than it needs to be.  Some specific issues are:  +- The results in Section 4 are presented with insufficient context or intuition.  Theorems are stated without any proof intuition and should reference proofs in the appendix.  The intuition for the penalty arrived at (eqn13) is unclear. +- The flow is sometimes unclear.  For instance, ""Learning CG-invariant representations without knowledge of G_I. "" should be a subsection, not a (latex) paragraph, and should explain what the point of the subsection is before diving in.  The authors seem to be using (latex) paragraphs (i.e. beginning with bolded phrases) as subsections and paragraphs beginning with italicized phrases as (latex) paragraphs.  I suspect the paper was edited to fit into 8 pages without removing sufficient content.  This impedes the flow and sacrifices clarity. +- I think a graph showing the data generating process would be much clearer than the current explanations (e.g. eqn4/5)    +- it is unclear what equation 7 is saying... the text above makes it seem like a definition of a goal, but the following paragraph treats it as an assertion that the goal is possible to achieve.   
+...Overall, I recommend stripping out some of the mathematical details and using more words and diagrams in the main text to describe the underlying issues/motivations/methods. +The overall story should be made clearer (e.g. by addressing (1) and (2)), and more space should be devoted to linking each part of the paper into the overall story.   + +4) The experiments are synthetic tasks where the correct invariance group is included in the set of invariances being searched over.  I don't think that showing that this method can bring some benefits on a real task is an absolute requirement, given the novelty of the approach.  But without more meaningful results, the paper is held to a much higher standard.  Even for synthetic experiments, these are rather weak; for instance, it would be interesting to see whether/how the method degrades when we consider much larger sets of possible invariances. + +5) It seems like the method might require including a set of parameters for each of the possible 2^m invariances.  Is this in fact the case?  If not, why not?  If so, it should be discussed as a limitation.  + + +------------------------------- +Suggestions/Questions: +- In Section 4 paragraph 1, are G-invariance and G_I-invariance used interchangeably?  This was confusing. +- say what I and D are as soon as they are introduced (top of page 4). +- Typo: ""a somewhat a"" +- Why a ""nonpolynomial"" activation function? +- The definition of ""almost surely"" at the bottom of page 4 is not correct (it is possible to sample probability 0 events), and also it should say that samples of Gamma(X^(obs)/(cf)) (not X^(obs)/(cf)) are equal with probability 1 (these are not the same statement!). +- ""level of invariance"" and ""non-extrapolated validation accuracy"", and several other phrases are not defined and should probably be replaced by something more clear and explicit. +- It seems like you might need to assume that that different x^(hid) can't be used to generate the same x^(obs) or x^(cf).  If so, this should be explicit.",5,4.0,ICLR2021 +uyXhiMNwchh,2,H38f_9b90BO,H38f_9b90BO,Anonymous review,"This paper presents a label propagation based meta learning algorithm to address label noise. Label propagation helps re-label pseudo labels of noisy data and meta learning achieves aggregations. The method is evaluated on several node classification datasets and a custom version of Clothing1M image classification dataset. The comparison to multiple baselines shows better performance. + +Pros: +- The combination of graph neural network and meta learning is interesting and novel + +Cons: +- The paper is not well written. Some descriptions are unclear. For instance, the usage of $w$ is never defined. The usage of meta learning is not novel, which from a previous method - L2R, but it is never clearly mentioned. However, it tried to use several equations to explain the L2R method, which is not sufficient to be clear but makes unnecessary confusion. +- Experiments might not be that convincing. + - The method is only tested on several graph datasets with synthetic noises, which were uncommon in previous papers. However, most compared methods are only evaluated on common image datasets. The comparison may not be very fair. Moreover, from Table 2 and 3. The proposed method is marginally better than several other methods. + - Although the method tries to tackle a real-world dataset Clothing1M, the scalability issue of this algorithm makes it difficult to work so it is only tested on a custom toy version. 
So the results are not generally useful to compare to other methods which have tested on full Clothing1M. +",5,4.0,ICLR2021 +hLoDZODGAlP,3,MD3D5UbTcb1,MD3D5UbTcb1,A Unified View on Graph Neural Networks as Graph Signal Denoising,"Summary of the paper: In this paper, the authors make the following new argument: The aggregation processes of current popular GNN models such as GCN, GAT, PPNP, and APPNP can be treated as a graph denoising problem where the objective is to minimize a recovery error (a norm of noisy feature matrix, i.e. ||F-X||) plus a graph-based regularization (smoothness). This new view provides a way to build a GNN model, namely (Ada-)UGNN. Experimental results show the effectiveness of Ada-UGNN on the task of node classification and the task of preventing adversarial attacks on graphs. + +Strong points: 1) Theoretical contributions of the proposed framework are solid and interesting. These findings show that two basic operations of a GNN layer, feature transformation and feature aggregation can be viewed as a gradient descent step of minimizing a graph denoising function. 2) Experimental results demonstrate the effectiveness of Ada-UGNN. + +Weak points: 1) I think one weakness of this paper is that: Explanations are only focused on one layer (local). The theorems do not explain the relations between layers and how nonlinear activation functions affect these theoretical findings. For example, [1] and [2] treat the GNN as a procedure of encoding and decoding as a whole. However, it seems that the objective of GNN cannot be viewed as a simple combination of graph denoising problems. 2) The experiments do not explain well of theoretical findings: these connections are missing in experiments. I do see results of Ada-UGNN are promising on node classification and the task of preventing adversarial attacks. However, it would be better if there are some empirical evidence to explain these new theorems. + +Recommendation: Based on the above points, I tend to marginally accept this paper but have concerns (these weak points). + +Questions & other comments: +“The improvements of the proposed model compared with APPNP are marginal”, as shown in Table 1. Are these really improvements? Based on my understanding, these means of Ada-UGNN are higher than APPNP, but the variance is also high. Significance test is needed. +In Ada-UGNN, it approximately solves problem 2 and uses a special regularization term. How does the approximation affect the final performance? Is there any clear guidance on how to choose the regularization term in different problems? Are these regularizations problem-dependent? + +[1] Hamilton, William L., Rex Ying, and Jure Leskovec. ""Representation learning on graphs: Methods and applications."" arXiv preprint arXiv:1709.05584 (2017). +[2] Chami, I., Abu-El-Haija, S., Perozzi, B., Ré, C., & Murphy, K. (2020). Machine Learning on Graphs: A Model and Comprehensive Taxonomy. arXiv preprint arXiv:2005.03675. +",6,3.0,ICLR2021 +S1gpoI__hQ,1,HJf9ZhC9FX,HJf9ZhC9FX,a nice attempt to study implicit regularization of SGD but not sure whether the contribution is sufficient,"Optimization algorithms such as stochastic gradient descent (SGD) and stochastic mirror descent (SMD) have found wide applications in training deep neural networks. In this paper the authors provide some theoretical studies to understand why SGD/SMD can produce a solution with good generalization performance when applied to high-parameterized models. 
The authors developed a fundamental identity for SGD with least squares loss function, based on which the minimax optimality of SGD is established, meaning that SGD chooses the best estimator that safeguards against the worst-case disturbance. Implicit regularization of SGD is also established in the interpolating case, meaning that SGD iterates converge to the one with minimal distance to the starting point in the set of models with no errors. Results are then extended to SMD with general loss functions. + +Comments: + +(1) Several results are extended from existing literature. For example, Lemma 1 and Theorem 3 have analogues in (Hassibi et al. 1996). Proposition 8 is recently derived in (Gunasekar et al., 2018). Therefore, it seems that this paper has some incremental nature. I am not sure whether the contribution is sufficient enough. + +(2) The authors say that they show the convergence of SMD in Proposition 9, while (Gunasekar et al., 2018) does not. It seems that the convergence may not be surprising since the interpolating case is considered there. + +(3) Implicit regularization is only studied in the over-parameterized case. Is it possible to say something in the general setting with noises? + +(4) The discussion on the implicit regularization for over-parameterized case is a bit intuitive and based on strong assumptions, e.g., the first iterate is close to the solution set. It would be more interesting to present a more rigorous analysis with relaxed assumptions.",5,3.0,ICLR2019 +EIlDln-C5Ne,4,yvuk0RsLoP7,yvuk0RsLoP7,The paper analyzes the property of local and global data manifold for adversarial training. ,"The paper analyzes the property of local and global data manifold for adversarial training. In particular, they used a discriminator-classifier model, where the discriminator tries to differentiate between the natural and adversarial space, and the classifier aims to classify between them while maintaining the constraints between local and global distributions. The authors implemented the proposed method on several datasets and achieved good performance. They also compared with several whitebox and blackbox methods and proved superiority. + +This paper was, in general, well written. The authors provided a good visualization of their analysis. Using local and global information for adversarial training is intuitive. The authors provided a good theoretical background to establish their method. The empirical evaluations show promising results. + +Some major concerns are listed as follows: +1. It is not clear how equations 4 and 5 are realized using discriminator and classifier. +2. What kind of perturbations are chosen? It looks like all the experiments are with L-infinity. Does this observation hold for other ones? +3. If the attackers leverage the global and local data manifold, can they bypass this attack? ",7,4.0,ICLR2021 +rygwJPQ0FH,2,Hke_f0EYPH,Hke_f0EYPH,Official Blind Review #1,"The authors first demonstrate that many existing approaches are special cases of regularized objectives, and then provide a theoretical analysis on the relationship between the local minima of the original loss and the corresponding regularized loss. Afterwards, the authors propose a new regularizer inspired by the IBP regularizer by taking account of second order information. Through a large set of experiments the authors demonstrate that their approach achieves higher certified accuracy using CNN-Cert, compared to many previous approaches. 
+ +I have some concerns on the writing and experiments of this paper. + +- The paper seems to have two parts that are isolated from each other. The first part of this paper discusses some theoretical analyses on the relationship of local minima for regularized and unregularized losses. The second part of this paper proposes the DoubleMargin regularizer. However, I can't see why the theoretical analysis motivates the DoubleMargin regularizer. The only statement that tries to relate theoretical analyses and the proposal is ""the gradient of a regularizer rather than its bound validity determines its certified test loss. Therefore ... using an upper bound on the adversarial loss is not necessary to train certifiable models"". This is a super general and vague motivation, and is not specialized to the DoubleMargin regularizer. The argument can actually be used for justifying arbitrary regularizers... + +- Since the advantage of DoubleMargin is not motivated theoretically, the empirical performance becomes critical. However, I don't think the experiments are rigorous and the comparisons are fair. In Table 2 only certified accuracies from CNN-Cert are reported. However, CNN-Cert does not work well for models trained by IBP. For fair comparison, the authors should report the best result from a group of certification methods. The certified results of IBP using CNN-Cert seem to be much worse than the results reported in the original IBP paper (Gowal et al., 2018) , which were verified using IBP. In fact, in both table 4 and table 12, the authors show results that the IBP method outperforms the DoubleMargin approach when results are verified by IBP. ",3,,ICLR2020 +cww45XcaZrp,1,0migj5lyUZl,0migj5lyUZl,Blind Review #4,"The paper discusses different methods in robust reinforcement learning including TRPO and PPO then propose a variant of TRPO using a new type of regularizer to quantify the difference between two distributions. + +Strength: +- The paper is easy to understand and provide informative discussion on related methods. +- The point probability distance is simple but appears to work well. +- The numerical experiments are extensive which include discrete and continuous control tasks. + +Weakness: +- There is lack of justification why the proposed algorithm converges. +- Although POP3D appears to be better than PPO and others using the mean final scores metric, it seems to be comparable with PPO when using the all episodes mean scores which raise the concern about the clear improvement of POP3D. + +I have the following questions and comments: +Q1. As the point probability distance is bounded, there is an issue with the scale effect where different environment can produce different range of the cumulative reward (the first term in the objective). In this situation, choosing a common \beta that works for most environment appears to be challenging to me? How do you deal with this? + +Q2. As from Q1, is there a systematic way to select \beta? + +Q3. As the KL divergence is the upper bound of the total variation distance while the point probability distance is the lower bound, POP3D does not inherit from the theoretical result. What is your opinion about this? + +C1. Equation (10) appears to be different from the one in the TRPO paper? Are you using a different definition of \eta(.) function, it is better to state its definition for clarity. + +C2. 
It would be better to have figures illustrating the performance of 4 algorithms in some continuous tasks over the number of episodes collected to see their learning process. + + +Small comments: +- The paper needs to be proofread again for some small typos. +- If the hyper-parameter for POP3D is tuned, it is better if the hyper-parameter for other methods are tuned as well for fair comparison in continuous domain. +- The point probability distance actually depends on the action taken at that iteration. Do you think the formulation needs to be changed to account for that? + +Overall, I suggest a weak-reject decision based on the following reason: +- The paper lack theoretical discussion on the convergence of the new method. +- The numerical results do not show significant improvement over existing methods especially PPO as the authors claim. +- The authors claim that the new method removes the need of choosing preferable penalty coefficient while I still believe the parameter \beta needs to be suitably chosen for particular environment. + +However, I can adjust my decision once I see the authors’ response. + +",5,4.0,ICLR2021 +1JDVGBbykH1,1,wiSgdeJ29ee,wiSgdeJ29ee,Official Blind Review #3,"The authors take an attempt at offline RL thanks to a mix between behavioural policy regularization and model based policy optimization. They basically combine two algorithms: AWAX and, depending on the level of safety given an epistemic uncertainty evaluation, MOPO may be additionally used to fine tune the policy. + +Unfortunately the work suffers from several severe weaknesses: +- the writing is not good. See the typo and minor comments section. +- the positioning is biased and missing to many accounts. Out of 31 citations, 12 are from the same author. Even more problematic, most, if not all, the references on offline RL are from this author, and therefore lacks diversity. In particular Model-based offline RL [Iyengar2005,Nilim2005], and model-free offline RL [Thomas2015b] have a rich history. More specifically, Equation (3) is identical to that of the Reward-Adjusted MDP found in [Petrik2016]. The Safe Policy Improvement objective has been considered for instance in [Thomas2015b,Petrik2016]. Equation (7) proposes to optimize the policy under a constraint on the policy search (identical to online TRPO, which is evoked later) that is very similar to [Laroche2019], except that the constraint is not state based and therefore probably less efficient. +- ""If our initial policy does not achieve expert level performance, and we are confident that we can learn an effective model with the available data, then ..."" => it is unclear how these decisions are sorted out. Performing those safety tests are an area of research in themselves [Thomas2015a]. +- even if we assume that the algorithmic novelty is proven, it seems pretty incremental, since it amounts to perform a test to decide between two algorithms. +- Finally, the experimental results do not savethe day. We observe that ""Ours"" is always the max of AWAC and AWAC+MB2PO, which is a little suspicious, since we have no information on how the decision is made. In comparison with CQL, it is not better (but it is a strong baseline). So, it's not improving the state of the art. It would have been informative to show the behavioural performance in each setting. 
+ +Typo and minor comments: +- AWAC is used without citation or explanation first (2.1) +- ""The most common off-policy model-free algorithms are actor-critic algorithms that alternate between policy evaluation and policy improvement in order to learn an effective policy."" => this is not actor-critic but policy iteration. +- ""Otherwise, we use the fully trained AWAC policy. These results are reported in the column Ours in Table 1."" => otherwise what? +- Sec. 4.2: effect => affect +- Sec. 4.3: degredation => degradation + +[Iyengar2005] Iyengar, G. N. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005. +[Laroche2019] Laroche, R., Trichelair, P., & Tachet des Combes, R. T. (2019, May). Safe policy improvement with baseline bootstrapping. In International Conference on Machine Learning (pp. 3652-3661). +[Nilim2005] Nilim, A. and El Ghaoui, L. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005. +[Petrik2016] Petrik, M., Ghavamzadeh, M., & Chow, Y. (2016). Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems (pp. 2298-2306). +[Thomas2015a] Thomas, P. S., Theocharous, G., & Ghavamzadeh, M. (2015, February). High-confidence off-policy evaluation. In Twenty-Ninth AAAI Conference on Artificial Intelligence. +[Thomas2015b] Thomas, P., Theocharous, G., & Ghavamzadeh, M. (2015, June). High confidence policy improvement. In International Conference on Machine Learning (pp. 2380-2388).",3,4.0,ICLR2021 +keh7ZefetV,2,OJiM1R3jAtZ,OJiM1R3jAtZ,Lack of novelty,"This paper studies challenges of offline RL with online fine-tuning and proposes an off-policy actor-critic method to address these challenges. The proposed method uses a supervised learning style to update the model parameters and avoids the behavior model estimation. Empirical results show that the proposed method provides rapid learning with prior demonstration data and online experience. + +I have a major concern about the novelty of this paper. The major component of the proposed AWAC method, updating the model parameter using supervised learning, is exactly the same as AWR (Peng et. al., 2019). One minor difference seems that AWAC uses an off-policy policy evaluation, but this contribution is very marginal. + +This paper provides an analysis on challenges of combining offline RL with online improvement, which motivates this paper. However, most of the discussed challenges are quite well-known. For example, one major challenge is to estimate the behavior model in offline data. This paper does not discuss techniques/methods that address this challenge, like DualDICE (Nachum et al., 2019) and CQL (Kumaret al., 2020). A comparison to these methods is highly recommended. In addition, this paper may also include additional model-based offline RL methods, like MOPO. +",3,4.0,ICLR2021 +4ac3H-dj8LB,2,HfnQjEN_ZC,HfnQjEN_ZC,Official Review of Paper 1999 (ICLR 2021),"Authors propose an approach to perform classification of ballroom dance movements (called figures) captured by the sensing mechanism of a smartwatch and discriminated via different ANN architectures. The sequence of figures are modelled as a Marlov chain, which work in a generative+discriminative fashion to output the final prediction of the model. Authors also present a dataset collected specifically for this work, to perform the inference of the algorithms included in the evaluation. 
Results show a remarkable accuracy, but are not compared to any existing state of art due to limited related work. + +------------------------------------------------- +Contributions + +The paper addresses an application of HAR which I see can be of interest for the community due to its novelty. The dance recognition it’s not a common domain and its analysis in terms of feature representation can fit correctly within the scope of the conference. + +The paper is well written and easy to read. The state of art is adequate and gives enough background on the domain + +------------------------------------------------- +Points to improve + +The principal flaw of the work, and main reason why I do not recommend its acceptance in the proceddings is its weak experimental setup and overall evaluation. Other, but less important, reasons for my recommendation are the lack of comprehensive reproducibility details and the weak findings included in the conclusions. + +------------------------------------------------- +Recommendations + +A single dance (one type of sequence), performed by a single couple (lack of variability in the data collected) and a predefined sensor system (lack of multimodal/multilocation setup) is not sufficient to achieve solid results for the type of algorithm the authors want to evaluate. I understand it’s difficult to collect high quality data, so I’d suggest the authors to add data from related domains where the same representation learning can be employed or augment its own data expanding the sensory system (which in principle should be cheaper). + +The value of the paper could be not just to address a specific application (ballroom dance recognition) using well known feature representations and algorithmic approaches. Honestly a more interesting motivation for me would be to explore how the type of data representation employed may help overcome effects of limited information when exploring domains where only scarce data is available. Specifically in the domain of HAR, where quality datasets are very expensive to collect. For me it’d be very interesting to know which sensors, located in which part of the dances, and modelled under which representation can perform better. As I say an interesting approach could be to augment the sensory infrastructure to characterize the minimum amount of sensor data required based on different representations. + +Following my last comment, the Markov chain wrapping the figure classifier does not offer a significant contribution in my opinion. Its contribution is limited to just chain predictions from models which are translating sensor signals to human movements, that last part is where (in my opinion) the contribution is. The sequence modelling could be perfectly covered by the ANN using a different network topology. + +Also, when the authors claim to perform sensor classification “in real time” I expect to see some experiments covering that claim. I’m mean, there are no experiments addressing either the timing performance of the models or their power requirements. I’d recommend including in the experimentation some test of the performance of the methods, not just their accuracy. + +I know it’s not possible with the current dataset, but when cross validating in HAR it’s more correct to do the splitting by user. Cross-validation based on the subjects offer a stronger vision of the generalization power of the method + +To improve the reproducibility I’d recommend further info. 
The sensor signal was downsampled to 100 samples, but what was the original resolution? From the rotation sensor only yaw channel is used (because is “based on prior knowledge that roll and pitch are insignificant in the waltz figures included in the study), please improve this explanation. A figure explicitly addressing the input representation would be useful. + +The legend of the figures should be enough to understand the figure as a whole. Please improve theIt’s honestly quite surprising that decision trees can work so well with raw features. Derived features (max, min, avg, kurtosis, skew ) tend to work better for these type of algorithms and are more commonly used. + +------------------------------------------------- +Minor + +Using the same terminology from sampling rate (in “2.1 Data Collection” relates to the frequency domain) and what later becomes the observation/instance (size 4x100) of a figure is a bit confusing in my opinion. + + The probability of observing the next figure is really independente of past figures sequence? Does not a dance require to include a number of figures during its execution? I’ve the feeling that the accumulated number of figures also work as a prior for the next figure. +",4,5.0,ICLR2021 +bofvPzZiDla,1,sy4Kg_ZQmS7,sy4Kg_ZQmS7,interesting paper; experiments could be improved a bit.,"I liked reading this paper. It is reasonably well-written and tackles an important problem in an interesting way. I list the issues I had with the paper below. I believe one advantage of DFIV is that it does not rely on some relaxed optimization problem like DeepIV. The main idea in the paper is learning a set of basis functions such that the structural function is a linear combination on them; the learning itself relies on predicting the basic functions from the IV which ensures that confounding information is projected out. + +First, while I think the method is well motivated, I am confused by the discussion around 'forbidden regression'. I believe the statement made in the paper about 'high variance' may be misleading. 'Forbidden regression' is a statement about the mis-specification of the conditional density model which may lead to an identification failure. Can the authors point out the exact part of Angrist where high variance is called the 'forbidden regression' problem? + +I think the proposed method does not face the forbidden regression problem because of the linear relationship between the IV and the outcome implied by the proposed model. + +Second, I believe it would be in support of the method to expand the experiments to include the high-dimensional IV and treatment experiments from DeepGMM. + +I would like the authors to clarify the following: + +1. In 4.2, if the image is given as treatment to the model, isn't the confounder posY be specified as a function of the treatment? +2. In the OPE experiment, the DFIV (and DeepGMM to a lesser extent) does not exhibit monotonic behavior with noise increase. Could the authors explain if this is randomness? if it is not, could the authors provides results over a larger number of seeds? + +Finally, could the authors expand on any convergence issues during training due to the coupled optimization and how they fixed them? ",7,4.0,ICLR2021 +XFEnIrNbVF4,1,BXewfAYMmJw,BXewfAYMmJw,Interesting approach but the presentation needs improvement,"Deep neural network brittleness can be attributed to their tendency to latch on to spurious correlations in the training dataset. 
The proposal in the paper is to learn to generate samples where these correlations can be eliminated. To this end, the authors, distill trained conditional big gan into a transformation with explicit modules to capture the shape, texture of the foreground object, and the background. The distilled network is called Counterfactual Generator Network (CGN). Thus, an image can be generated with a background of one class, the shape of another class, and the foreground texture of a different class. Then a classifier with multiple heads is learned where each head predicts a class based on only one of the factors among shape, texture, and background. + +The proposed approach is motivated by the assumption of independent mechanisms where different modules of the causal data generating process are independent of each other. Once the decomposition of a training image into shape, texture, and background is obtained, any component can be swapped to generate counterfactual data. + +Pros: ++ The solutions provided to extract object masks, background and texture are interesting and scale to Imagenet dataset. ++ Shows that augmenting the training dataset with the generated counterfactual images can help improve robustness. + +Cons + Questions: +- The presentation of the paper can be improved. It is not always clear if the causal structure is assumed to known. In Sec 2.2 SCM is defined, but the SCM for the MNIST or Imagenet is not provided. Do all the nodes in the CGN share the same noise or exogenous variables? +- The proposed method appears to assume that the causal structure is known. In this, it assumes it is made up of three nodes shape, texture, and background, and thus can limit the counterfactual generation ability. Many semantic changes cannot be achieved as evidenced by the fact that the counterfactual images are not realistic. +- in the invariant MNIST classification task it appears that the results are based on the assumption that the invariant feature - shape is known apriori. In practice, this information is not available. IRM does not assume this knowledge, so it does not seem comparison with IRM is fair in this case. +- Some related work that seems to be missing [1][2] + +[1] Kocaoglu, Murat, et al. ""Causalgan: Learning causal implicit generative models with adversarial training."" arXiv preprint arXiv:1709.02023 (2017). + +[2] Kaushik, Divyansh, Eduard Hovy, and Zachary C. Lipton. ""Learning the difference that makes a difference with counterfactually-augmented data."" arXiv preprint arXiv:1909.12434 (2019). +",5,3.0,ICLR2021 +ryxHNSeUtS,1,HJlQ96EtPr,HJlQ96EtPr,Official Blind Review #1,"Summary: +The authors propose quantize the weights of a neural network by enabling a fractional number of bits per weight. They use a network of differentiable XOR gates that maps encrypted weights to higher-dimensional decrypted weights to decode the parameters on-the-fly and learn both the encrypted weights and the scaling factors involved in the XOR networks by gradient descent. + +Strengths of the paper: +- The method allows for a fractional number of bits per weights and relies of well-known differentiable approximations of the sign function. Indeed, virtually any number of bits/weights can be attained by varying the ratio N_in/N_out. +- The papers displays good results on ImageNet for a ResNet-18. + +Weaknesses of the paper: +- Some arguments that are presented could deserve a bit more precision. For instance, quantizing to a fractional number of bits per weights per layer is in itself interesting. 
However, if we were to quantize different layers of the same network with distinct integer ratio of bits per weights (say 1 bit per weight for some particular layers and 2 bits per weight for the other layers), the average ratio would also be fractional (see for instance ""Hardware-aware Automated Quantization with Mixed Precision"", Wang et al., where the authors find the right (integer) number of bits/weights per layer using RL). Similarly, using vector quantization does allow for on-chip low memory: we do not need to re-instantiate the compressed layer but we can compute the forward in the compressed domain (by splitting the activations into similar block sizes and computing dot products). +- More extensive and thorough experiments could improve the impact of the paper. For instance, authors could compress the widely used (and more challenging) ResNet-50 architecture, or try other tasks such as image detection (Mask R-CNN). The table is missing results from: ""Hardware Automated Quantization"", Wang et al ; ""Trained Ternary Quantization"", Zhu et al ; ""Deep Compression"", Han et al; ""Ternary weight networks"", Li et al (not an extensive list). +- Similarly, providing some code and numbers for inference time would greatly strengthen the paper and the possible usage of this method by the community. Indeed, I wonder what the overhead of decrypting the weights on-the-fly is (although it only involves XOR operations and products) +- Small typos: for instance, two points at the very end of section 5. + +Justification fo rating: +The proposed method is well presented and illustrated. However, I think the paper would need either (1) more thorough experimental results (see comments above, points 2 and 3 of weaknesses) or (2) more justifications for its existence (see comments above, point 1 of weaknesses).",3,,ICLR2020 +BJ3VB6_xG,2,SyPMT6gAb,SyPMT6gAb,"Interesting combination of off-policy learning from bandit feedback and f-GANs, with some weaknesses in theory and experiment","The paper proposes an interesting alternative to recent approaches to learning from logged bandit feedback, and validates their contribution in a reasonable experimental comparison. The clarity of writing can be improved (several typos in the manuscript, notation used before defining, missing words, poorly formatted citations, etc.). +Implementing the approach using recent f-GANs is an interesting contribution and may spur follow-up work. There are several lingering concerns about the approach (detailed below) that detract from the quality of their contributions. + +[Major] In Lemma 1, L(z) is used before defining it. Crucially, additional assumptions on L(z) are necessary (e.g. |L(z)| <= 1 for all z. If not, a trivial counter-example is: set L(z) >> 1 for all z and Lemma 1 is violated). It is unclear how crucially this additional assumption is required in practice (their expts with Hamming losses clearly do not satisfy such an assumption). + +[Minor] Typo: Section 3.2, first equation; the integral equals D_f(...) + 1 (not -1). + +[Crucial!] Eqn10: Expected some justification on why it is fruitful to *lower-bound* the divergence term, which contributes to an *upper-bound* on the true risk. + +[Crucial!] Algorithm1: How is the condition of the while loop checked in a tractable manner? 
+ +[Minor] Typos: Initilization -> Initialization, Varitional -> Variational + +[Major] Expected an additional ""baseline"" in the expts -- Supervised but with the neural net policy architecture (NN approaches outperforming Supervised on LYRL dataset was baffling before realizing that Supervised is implemented using a linear CRF). + +[Major] Is there any guidance for picking the new regularization hyper-parameters (or at least, a sensible range for them)? + +[Minor] The derived bounds depend on M, an a priori upper bound on the Renyi divergence between the logging policy and any new policy. It's unclear that such a bound can be tractably guessed (in contrast, prior work uses an upper bound on the importance weight -- which is simply 1/(Min action selection prob. by logging policy) ).",5,5.0,ICLR2018 +SJlGiJetcH,3,S1g_S0VYvr,S1g_S0VYvr,Official Blind Review #2," +# Rebutal Respons: + +Planning Horizon: +- I agree small horizons can speed up learning. However, if we want to drastically reduce the sample complexity, we need longer planning horizons. Therefore, we should be looking further in this direction but this is out of scope for this paper. + +Model Learning: +- Good idea to update the conclusion and treat the online model learning as future work. + +Figure 3: +You could plot axes a - f in a single row and remove the colorbar for each individual plot. Just do one colorbar for all plots. This would also improve the comparability of the different methods. Currently, the color-coding is different for each axis, which is bad practice. Axes g - i can be reshaped in a separate figure. Furthermore, figure 2 is not necessary as the 4 room environment is depicted in Fig. 3 a - f and and one could reference these axes. + +=> I keep my rating as weak accept. + +# Review: +Summary: +The paper proposes an adaptive scheme to adapt the horizon of the value function update. Therefore, the sample efficiency should be increased and the value function should be learned faster. + +Conclusion: +The problem of learning good policies from partially correct models is very interesting and important. The proposed approach is technically sound and reasonable. The experiments highlight the qualitative as well as quantitative performance. Furthermore, the quantitative performance is compared to state-of-the-art methods. I cannot comment on the related work as I am not familiar with the baselines MVE & STEVE. + +My Main Concerns are: + +- The roll-out horizon is 3-5 timesteps. This horizon is really short especially for problems with strong non-linearities and high sampling frequencies. For such systems 5 timesteps would only correspond to 0.5s (10Hz) or 0.05s (100Hz) and I am uncertain whether these short horizons really help for such problems. + +- The specialized online model learning algorithm seems quite hacky. It feels like it was introduced last minute to make it work. The overall question of how should we learn a model optimally is super important and should be addressed within a separate paper (and more thoroughly). I would even propose to remove the online learning section from this paper as it is just too hacky and without relation to prior work or context. + +- Could the authors please update figure 3 as the figure has too much whitespace. This whitespace could be used to enlarge the individual axis when the axis are rearranged. 
+",6,,ICLR2020 +rJgxbE8wcS,3,HJgCcCNtwH,HJgCcCNtwH,Official Blind Review #1,"The authors propose a new weight initialization method for sparse neural networks and develop a weight topology that satisfies desirable properties. Their derivation is data-free, and thus the analysis should generalize to arbitrary datasets. They demonstrate that their new topology outperforms existing approaches on a matrix reconstruction task. + +Overall I think this work is an interesting direction for designing static sparse neural network weight topologies, but it’s lacking in empirical evidence of their claims and could do better to tie themselves to existing literature in training sparse neural networks. + +If the authors could strengthen their results by a) experimenting with their newfound topology and initialization on standard sparsification benchmarks like CIFAR, ImageNet, and WMT EnDe b) comparing their approach to other static-sparse [1, 2, 3] and dynamic-sparse [4, 5] training algorithms this could be a good paper, but without more experimentation it’s unclear what can be taken away from this work. If the authors added results in this direction I would be willing to increase my score. + +Comments on Claims of the Paper: + +1. “Cascades” are a known trick in both dense & sparse neural networks [2, 6]. +2. The authors describe their motivation for developing a new sparse initialization method in the first paragraph of section 4. It would be nice to see some of this experimental data, such that we could understand the magnitude of the vanishing gradient problem in these tests and see that the newly derived initialization alleviates it. The reason for my skepticism is that Liu et al [7] used a similar scheme where they re-scale the standard deviation of the gaussian based on the fraction of nonzero weights, but later found that it made no difference in their results for unstructured pruning (which I learned through discussion with the authors). +4. The data-free derivation approach makes sense and I understand that this makes the approach theoretically applicable to arbitrary datasets, but the authors do not apply it to other datasets to show that it generalizes in practice. +5. The authors show that their topology outperforms on the matrix reconstruction task, but they don’t compare with other sparsification approaches used in deep learning like sparse evolutionary training [4], or SNIP [1] (note that these techniques also maintain a sparse network during training, as opposed to pruning approaches like magnitude pruning [8] that are dense during training but sparse for inference). + +Comments on the Results of the Paper: + +The authors appear to contextualize their work in deep neural networks, but all of their experimental results are on matrix reconstruction or linear models on MNIST. This is sufficient for analysis and motivation, but taking the developed approaches and applying them to a deep neural network and showing improvements would go a long way towards improving this paper. + +Assorted Notes: + +In the first paragraph of the introduction: + +“In other words, doubling the size of layer inputs and outputs quadruples the size of the layer. 
This causes majority of the networks to be memory-bound, making DNN training impractical without batching, a method where training is performed on multiple inputs at a time and updates are aggregated per batch.” + +It is correct that a matrix-vector product (i.e., neural network training with batch size 1) is typically memory bound on accelerators like GPUs, but it’s not clear why the quadratic growth in computational cost with input/output size has anything to do with this. The cause of memory bound-ness is lack of reuse of operands, which can be alleviated by increasing the batch size s.t. the computation becomes a matrix-matrix product. Batch size 1 training is also not desirable. Recent work has shown that large batch sizes do not degrade model quality with proper hyperparameter tuning [9], and larger batch sizes are desirable from a hardware perspective to achieve higher throughput. + +References: +1. https://arxiv.org/abs/1810.02340 +2. https://d4mucfpksywv.cloudfront.net/blocksparse/blocksparsepaper.pdf +3. https://arxiv.org/abs/1903.05895 +4. https://www.nature.com/articles/s41467-018-04316-3 +5. https://openreview.net/forum?id=ryg7vA4tPB +6. https://arxiv.org/abs/1811.10495v3 +7. https://arxiv.org/pdf/1810.05270v2.pdf +8. https://arxiv.org/abs/1710.01878 +9. https://arxiv.org/abs/1811.03600",3,,ICLR2020 +_WgSSu--O2q,1,o6ndFLB1DST,o6ndFLB1DST,"This paper proposes to obtain counterfactual explanations. Over previous work, the goal is to improve the quality of counterfactuals by ensuring they are close to the distribution of the target subclass. In order to ensure the counterfactuals are from the target class manifold, an auto-encoder scheme is proposed. The auto-encoder is trained by encouraging its latent space to be discriminative of the target label.","My main concern for this paper is around several factors that I will list from the most important to minor. + +i) It is unclear to me when and how generator training should be biased according to the class label, which the authors do by encouraging the latent space to be predictive of the class/target label. I have a few questions regarding this: + a) Why is it insufficient to train a conditional version of a VAE to create separation in the latent space that guarantees the counterfactual is appropriately separated and lies on the manifold of the target class. + +b) This specifically concerns the authors' claim that in the case of post-hoc training, input-output pairs of the model can be used to train the semi-supervised auto-encoder. Now this question is fundamental to me in terms of the goals of the counterfactual explanation. Is the goal to provide explanations in ways that allow us to identify issues with the classifier? Or is the goal of the counterfactual explanation to provide recourse to the end user. In the first case, I would rather see counterfactual explanations that are not in any way or form biased by the generative model, and hoping there is a perfect generative model that captures the entire distribution of the samples this model is supposed to generalize on. Now generating counterfactuals with such a perfect generator might actually give some hope of identifying issues worth debugging in a classifier like a bias toward specific classes etc. If the goal of the counterfactual explanation is to provide recourse, then simply hoping its a likely prototype from the target class is most definitely insufficient. 
Even if we assumed that it is in fact sufficient, the post-hoc training on input-output pairs of the classifier will in fact bias the generator heavily and really does not reveal much about the classifier at all. The authors need to strongly consider and clarify what the goal of their counterfactual explanation is, what kinds of biases different methods induce in their explanations, why a conditional generator is not sufficient, and the potential harms of a jointly trained model.

ii) This brings me to my second question around empirical evaluation.
a) Include a conditional baseline.
b) Compare to all the methods that train the generator separately, jointly, and as a semi-supervised model.
c) What happens if the support of the training data for the generative model is mismatched from the data used for training?
d) Should counterfactual explanations actually be designed to provide fidelity to the classifier behavior, or just be purely interpretable but not of great utility in actually improving the classifier? If the goal is not to improve the classifier but to audit it, I strongly feel that simply providing qualitative summaries of sparsity and measures of interpretability in terms of target class probability is insufficient. They are also insufficient to provide recourse.

All of these factors significantly diminish this contribution. I recommend the authors give critical thought to the goals of counterfactual explanations and the effects of joint vs disjoint training, in addition to merely semi-supervised and unsupervised training of the generative model, here an autoencoder.",4,5.0,ICLR2021
WTDDyFHGi15,4,Q4EUywJIkqr,Q4EUywJIkqr,"Paper addresses robustness of object recognition pipelines to distribution shift due to natural and synthetic variations. Interesting paper, but writing quality could be improved.","I like the main ideas articulated in the paper, but find the writing lacks some clarity:

Summary of paper: The paper takes as a starting point the study from Barbu et al., in which the ability of object recognition pipelines to handle distribution shifts is studied by testing ImageNet-trained architectures against ObjectNet. The main point in the current paper is that the performance degradation seen in Barbu et al. is due to the fact that the CNNs were processing the image with the entire image as context, and when one only provides a sub-window around the objects of interest, the resulting performance improves significantly. The paper also describes experiments with various synthetic distorted data and finally examines details of the ObjectNet dataset to illustrate that there are images that are hard to categorize even for humans. Thus, the paper concludes that object recognition on ObjectNet is still hard to solve.

It is clear that by using bounding boxes or even removing the background from those bounding boxes, the performance will be better (since the training was on ImageNet with single objects). So, in a way, they are kind of recreating the training distribution in order to improve the performance.

Main significance of the paper (Pros): Detailed study of the performance of object recognition and the empirical finding that figure-ground segmentation may improve recognition. Analysis of properties of ObjectNet and its challenges.

Originality/Novelty: The paper is largely empirical and has a good discussion of the relevant background literature analyzing object recognition systems.

Cons: It has incremental insights.
+

Clarity of paper: The description of the experiments is at times unclear. The structure of the paper could be simplified with a table or diagram that illustrates the logic behind the experimentation and conclusion. There are multiple datasets used; the model is sometimes trained on ImageNet and tested on ObjectNet, and sometimes trained on selected categories of ObjectNet and tested on the rest of ObjectNet.

",6,4.0,ICLR2021
5uGD54LYltl,2,ijJZbomCJIm,ijJZbomCJIm,Review,"1, Summary of contribution:
This paper claims that a pre-trained model trained adversarially can achieve better performance on transfer learning, and conducts extensive experiments on the efficacy of the adversarially trained pre-trained models.
Also, the paper conducts an empirical analysis of the trained models and shows that the adversarially pre-trained models use the shape of the images rather than the texture to classify the images.
Using the influence function (Koh 2017), the paper reveals that each influential image on the adversarially trained model is much more perceptually similar to its test example.


2, Strengths and Weaknesses:
The paper is well-written and organized, and the experiments look fair and well support the claim. The analysis is interesting and insightful.
Meanwhile, the transfer is done to a domain of lower complexity, and some important comparative ideas are not extensively investigated.


3, Recommendation:
While the paper's empirical results are solid, there seems to be substantial room left for comparative studies. More ablation studies shall be done for other regularization methods.
I believe that the paper is marginally above the acceptance threshold.

4, Reasons for Recommendation:
The reader will benefit more from the paper if the authors can justify their use of adversarial training as the regularization in the pretraining process. I believe that this research warrants some comparative study with dropout, weight decay, as well as random perturbations. I think the paper can be more insightful if it shows whether the other classical regularization methods perform better or worse on transfer learning than the proposed approach.

5, Additional feedback:
In addition to the suggestions made in 4, I also believe that a comparison shall be made against the model trained without pretraining.

\
---Post rebuttal---

Thank you for the response, and thank you for checking the performance comparison against the white-noise perturbation. It would be interesting to see future work involving means other than adversarial training (e.g. including other simple mechanisms like weight decay and dropout) to help reduce the overfitting effects in the pretraining phase. I would like to keep my score as is.
",6,3.0,ICLR2021
r10SA2ZVg,2,r1aPbsFle,r1aPbsFle,review,"This paper provides a theoretical framework for tying parameters between input word embeddings and output word representations in the softmax.
Experiments on PTB show significant improvement.
The idea of sharing or tying weights between input and output word embeddings is not new (as noted by others in this thread), which I see as the main negative side of the paper. The proposed justification appears new to me though, and certainly interesting.
I was concerned that results are only given on one dataset, PTB, which is now kind of old in that literature. I'm glad the authors tried at least one more dataset, and I think it would be nice to find a way to include these results in the paper if accepted.
+
Have you considered using character or sub-word units in that context?

",7,4.0,ICLR2017
r1eClp0acr,4,H1ldzA4tPr,H1ldzA4tPr,Official Blind Review #4,"The paper proposes a novel method for modelling dynamical systems over graphs. The main idea investigated by the authors is to combine Graph Neural Networks with an approximate Koopman embedding. The GNN encodes the input graph to what the authors call an ""object-centric embedding"", whose concatenation over all objects is de facto the approximate Koopman embedding of the system.
One of the key contributions is the reduction in parameters, achieved by assuming that the interactions between different objects in the Koopman space are limited to some fixed number of types; in other words, given the object-centric embedding, the Koopman matrix is a block matrix where each block can only be one of K matrices. In this way the number of parameters is fixed and does not scale with the number of objects, compared to the naive way where it would scale as N^2. In addition to the dynamical modelling, the paper adds an extra linear ""control"" input in the Koopman embedding space, which affects the dynamics of the system and allows for modelling systems where external control is applied. The models are then compared on three small-scale tasks, showing better results in mean squared error prediction compared to the three baseline approaches. Additionally, when used for control on the environments, the method outperforms the one baseline method it is compared to.


I'm quite borderline on whether the paper should be accepted or rejected, but currently I'm leaning towards a rejection. The main reason for this decision is that in my opinion the experiments presented are somewhat limited with respect to the baselines used, and I have some reservations regarding the results presented for IN and PN discussed below.


Detailed comments on paper:

1. I personally like the main idea of the paper, which is to use previous results on approximating the Koopman operator and combine them with GNNs for more accurate physical modelling of object-object interactions. Additionally, the idea of reducing the parameters is quite important.

2. Linear control theory - although it is quite natural to add the control as a linear effect in the latent space, and this has been done numerous times before in the literature, I don't recall there being any theory on Koopman embeddings when there is a control signal. Additionally, if the policy used in practice is stochastic, the resulting ""induced"" dynamical system also becomes stochastic, and to my knowledge, at least in theory, learning a Koopman embedding for such systems poses more challenges and requires certain assumptions about the true system, such as co-diagonalization and a few others. I think this should be discussed in more detail by the authors (and please do correct me if I'm wrong on any of these statements), as currently, for readers who are not too familiar, I think the text might come across as though the Koopman theory extends naively to these scenarios as well, which I do not think is the case.

3. It is not very clear how the ""metric"" loss affects the solution. I would encourage the authors to provide a comparison (only in terms of dynamical modelling, without active control) of whether this metric helps or has some negative effects on the prediction.
I think that, for instance, if the GNN has some form of weight regularization, then this indeed would have some non-trivial effect on the resulting representation. Also, it would be useful to have plots of how accurately the embedding preserves the distance to the true states in order to understand this better.



Comments on the experiments:

1. In the paper there is no discussion of what the actual observation spaces of the environments are; could these be clarified?

2. The block diagonal structure approach in general has been presented as working with multiple types of interactions. However, in practice it seems that the authors have only used two types of interaction -> object-same-object and object-other-object interaction. This, however, has never been discussed and may be false. Could you clarify these details?

3. The results shown in Figure 3 are somewhat in contrast to the results in the original PN paper; specifically, the PN paper states that it can achieve an MSE of 7.85 for 1000 time steps, and figure 6 of that paper shows about 0.05 MSE over 100 steps on a similar rope environment. These results, compared to the ones presented here in Figure 3, make me wonder how well the authors actually managed to reimplement the IN and PN papers. Could there be any comments on this, as it makes many of the claims that the proposed method is better questionable and makes it hard to understand its significance in relation to previous work?

4. For the control tasks, it would have been useful to have more than just the single baseline used. There are plenty of algorithms for Reinforcement Learning that could have been used in order to put the method in perspective. E.g. one can apply MPC with a ground-truth model (e.g. the simulator) to show the discrepancy with an ideal case. In the RL literature there are plenty of methods for solving smaller problems, parametric and non-parametric: Q-learning, PPO, etc. I think this is very important from the reader's perspective.


PS: Please refer to the discussion below with the authors as I have increased my score from 3 to 6.",6,,ICLR2020
H16Rrvtlz,1,ryUlhzWCZ,ryUlhzWCZ,Depth is supervised learning,"I do like the demonstration that including learning of auxiliary tasks does not interfere with the RL tasks but even helps. This is also not so surprising with deep networks. The deep structure of the model allows the model to first learn a good representation of the world on which it can base its solutions for specific goals. While even early representations do of course depend on the task performance itself, it is clear that there are common first stages in sensory representations, like the need for edge detection, etc. Thus, training by additional tasks will at least increase the effective training size. It is of course unclear how to adjust for this to make a fair comparison, but the paper could have included some more insights, such as the change in representation with and without auxiliary training.

I still strongly disagree with the implied definition of supervised or even self-supervised learning. The definition of unsupervised is learning without external labels. It does not matter if this comes from a human or, for example, from an expensive machine that is used to train a network so that a task can be solved later without this expensive machine. I would call EM a self-supervised method where labels are predicted from the model itself and used to bootstrap parameter learning.
In this case you are using externally supplied labels, which is clearly a supervised learning task!
",5,4.0,ICLR2017
IpQDbo0YV-T,1,X5ivSy4AHx,X5ivSy4AHx,Enhanced First and Zeroth Order Variance Reduced Algorithms for Min-Max Optimization,"This paper proposes an enhanced variant of the SREDA algorithm (Lou et al 2020), called SREDA-Boost, that improves SREDA on two aspects: the initial complexity and the step-size. The algorithm achieves the same complexity as the original SREDA scheme.
The main contributions of this paper are perhaps the following:
-- C1: Improving the initial complexity for finding a starting point y0 of the SREDA.
-- C2: Proposing a larger stepsize for SREDA compared to the original one.
-- C3: Injecting a zero-order approximation step for stochastic gradients.
The authors also claimed that their analysis is new and different from SREDA. However, in my opinion, this seems to be minor since their proof also relies on the bounds of gradient errors (like variance) as well as the delta_t quantity. There are of course some technical details and steps, but those are not the major contribution.
In my opinion, the theoretical contribution of this paper is incremental. Indeed, the idea of using SARAH to improve oracle complexity has been widely studied in the literature, including Spider, SpiderBoost, ProxSARAH, etc. Since model (1) is nonconvex-strongly concave, it can be reformulated into (2) as a stochastic optimization problem. Several methods can be used to solve (2). The idea of using multiple loops is also widely used.
The first contribution (C1) is not really new. The authors simply replace the step of computing y0 by iSARAH to reduce the computational cost. Because the problem is strongly convex, this step can be done by several methods, including accelerated variance-reduced schemes to further improve its complexity. Since this step is not essential, many previous methods just simply skip it.
The second contribution (C2) is also minor since the idea of using a large batch size to obtain a large step size has been used in the literature, such as in Spider-Boost (Wang et al 2019) or ProxSarah (Pham et al 2020). Of course, this enhancement helps SREDA have better practical performance. However, given the previous work, this contribution is very incremental.
The use of the zero-order oracle (C3) is not new either, since it is another way of approximating the gradients, and it has been widely used in the literature based on Nesterov's idea. Hence, this step seems to be unnecessary if we assume that the underlying function is L-smooth. In most applications, we can directly compute the stochastic gradient components without using finite difference approximation. Unless the authors can provide compelling examples showing that this is an important step, it is not really convincing.
In terms of the algorithm, SREDA/SREDA-Boost is also a multiple-loop algorithm, with at least three loops, making it challenging to implement in practice and requiring a lot of careful tuning and parameter choices. There are some recent algorithms that can solve the same problems but with a single loop. The authors may want to compare with them, though these are very recent works (see, e.g. https://arxiv.org/abs/2008.08170 and the references therein).
In addition to the above major comments, the following are some concrete comments:
--- In Table 1, why is NA put in the ""Initial Complexity"" column for the other methods?
I believe that some paper may not describe this clearly, but if they use the exact solution (or up to a given accuracy approximation) of the strongly-concave max-problem, then the complexity is the best one since we can use the best algorithm to find it. +-- The relation between the gradient of $\Phi$ and the KKT point of (1) should be clarified. +-- I do not see where v_t is used in Algorithm 2. It seems that the output of Algorithm 2 should include v_t to use in Algorithm 1. +-- It is not clear what is the problem solved in the numerical experiments? + +",4,5.0,ICLR2021 +SJzYnEqef,3,SyJ7ClWCb,SyJ7ClWCb,Review," The paper investigates using input transformation techniques as a defence against adversarial examples. The authors evaluate a number of simple defences that are based on input transformations such TV minimization and image quilting and compare it against previously proposed ideas of JPEG compression and decompression and random crops. The authors have evaluated their defences against four main kinds of adversarial attacks. + +The main takeaways of the paper are to incorporate transformations that are non-differentiable and randomised. Both TV minimisation and image quilting have that property and show good performance in withstanding adversarial attacks in various settings. + +One argument that I am not sure would be applicable perhaps and could be used by adversarial attacks is as follows: If the defence uses image quilting for instance and obtains an image $P$ that approximates the original observation $X$, it could be possible to use a model based approach that obtains an observation $Q$ that is close to $P$ which can be attacked using adversarial attacks. Would this observation then be vulnerable to such attacks? This could perhaps be explored in future. + +The paper provides useful contributions in forming model agnostic defences that could be further investigated. The authors show that the simple input transformations advocated work against the major kind of attacks. The input transformations of TV minimization and image quilting share varying characteristics in terms of being sensitive to various kinds of attacks and therefore can be combined. The evaluation is carried out on ImageNet dataset with large number of examples.",7,3.0,ICLR2018 +mLzmYfChPw,3,RuUdMAU-XbI,RuUdMAU-XbI,"Nice idea, but limited novelty and experimental validation","This work proposes a novel method, called Dynamic Graph Network (DG-Net), for optimizing the architecture of a neural network. Building on the previous work introduced by (Xie et al., 2019), the authors propose to consider the network as a complete directed acyclic graph (DAG). Then, the edge weights of the DAG are generated dynamically for each input of the network. At each node of the network, the authors introduce an extra-module, called router, to estimate the edge weights as function of the input features. + +The proposed method addresses the problem of optimizing the connectivity of neural networks in an interesting way, where the architecture is not fixed but it depends on the input instances. Moreover, I think that a strong advantage of the proposed technique is that the optimization of the architecture comes with a negligible extra cost both in terms of parameters and computational complexity. Overall, the paper is well written and easy to follow. + +My only serious concern is the degree of novelty with respect to (Yuan et al., 2020), which was published at ECCV 2020. 
The main difference seems to be that in the proposed method the graph is dynamic (i.e., it depends on the input instances), instead in (Yuan et al., 2018) the graph is learned but fixed for all the input samples. In the experimental results, I would have expected a deeper ablation study on the importance of the dynamic graph, since this is the main contribution of the paper. Instead, there is only one experiment in the appendix (Table 6). Therefore, the impact of the dynamic graph in the performance of the proposed method is not clear and it is difficult to evaluate the importance of this contribution. + +Other comments: + +- In Sec. 3.1 the authors say that the ResNet architecture can be represented with a DAG where the set of edges is defined as E={(i,j)|j=i+1,i+2}. This is not true: if you unroll the definition of the ResNet architecture, as done in Eq. (4)-(6) in [1], and compare it with what you obtain using Eq. (1), it is easy to see that the two resulting functions are different. + +- The definition of the convolutional block is not clear, is it a ReLU-conv-BN triplet as in (Xie et al., 2019)? + +- The use of a DAG with edge weights for representing the architecture is not novel, it was already introduced in (Xie et al., 2019). + +- In Sec. 4.3, Table 5 shows a comparison with state-of-the-art NAS-based methods. DG-Net is implemented using RegNet-X and RegNet-Y as the basic architecture, however in Table 5 the performance of the basic architectures (without the dynamic graph optimization) is not reported, this would be useful to evaluate the gain provided by the optimization of the architecture. + + +[1] Veit et al., Residual Networks Behave like Ensembles of Relatively Shallow Networks, NIPS 2016 + +############################################################################################### + +After the discussion period: + +I thank the authors for their responses and for updating the paper. The authors have added a deeper analysis on the impact of the dynamic graph, however I still believe that the novelty of the paper is a bit limited. I have slightly increased my score to 6.",6,3.0,ICLR2021 +dV5cuVxttBa,1,t4EWDRLHwcZ,t4EWDRLHwcZ,Fast graph learning based on interesting mathematical formulation,"Summary and significance: Learning a graph from data is an important, yet less studied, problem. The proposed algorithm (GRASPEL) is based on a graphical Lasso formulation with the precision matrix restricted to be a graph Laplacian. The algorithm starts with a sparse kNN graph, and recursively adds critical edges (identification of these critical edges based on Lasso and spectral perturbation analysis is the main contribution of the paper). +The outcome is a highly scalable that learns a graph in nearly linear time (ignoring log factors and number of recursions). The scalability of the algorithm makes the contributions significant. + +Originality: The basic formulation and idea of selecting spectrally critical edges seem original and interesting, although the reviewer is not an expert in related methods. The authors should note that the there are graph learning methods based on solution of graphical lasso [Pavez, Ortega, ICASSP 2016; Kumar et al, Neurips 2019]. Beyond this step, the authors employ several existing techniques to make GRASPEL scalable (although this part is not highly novel, the overall method is original). + +Quality and clarity: The theory in the paper is technically sound. 
The only exception (this is also an issue about clarity) is the assumption that $U_N^T e_{pq} \approx U_r^T e_{pq}$. In general, this is not valid and hence it should be clarified when this assumption is reasonable. The experimental section is well executed. +The paper presented is mostly good. It would help to include Algorithm 1 and Table 1 in the main paper. + +Typo: On page 6 (Phase A), perhaps the authors meant (r=2) instead of (r=1). Also reference of Carey is currently in all caps +",6,3.0,ICLR2021 +TceFVSQQX7n,4,akgiLNAkC7P,akgiLNAkC7P,Official Blind Review | Reviewer #4,"#### Summary + +The submission focuses on a variant of inverse reinforcement learning, where the learner knows the task reward but is unaware of hard constraints that need to be respected while completing the task. The authors provide an algorithm to recover these constraints from expert demonstrations. The proposed algorithm builds upon a recent technique (Scobee & Sastry 2020) and addresses problems with large and continuous state spaces. + +++++++++++++++++++++++++++++++++++ +#### Reasons for score + +Strengths: +* The problem considered is interesting and relevant to the ICLR community. +* The technique developed (Algorithm 1) is novel and well motivated. +* The experiments provide adequate evidence to back the claims. +* The paper is very well written and organized. + +Weaknesses: +* Justification for the policy loss function (Equation 9) is unclear. +* Comparison with prior art is lacking. +* Discussion of related work is sparse and can be more detailed. + +Based on the above-mentioned strengths, I vote for accepting. My concerns (further detailed below) potentially can be addressed during the rebuttal phase. + +++++++++++++++++++++++++++++++++++ +#### Major Comments + +1. (page 2) The requirement of ‘ability to modify the environment’ is listed as a limitation of prior art (Scobee & Sastry 2020). However, like the current approach, the prior art adds the constraints / modifies the environments only conceptually (and not physically). Further, both the current and prior work focus on the case of hard constraints. Please clarify this limitation of the prior art vis-à-vis proposed approach. + +2. (page 2) The rationale behind the objective (Equation 7) of the prior art and the proposed approach is identical. Please clarify, then, if the current algorithm is also greedy. + +3. (Equation 9) Please provide additional details for the inclusion of the entropy term in the policy loss function. + - The principle of maximum entropy is used to arrive at Eq. 4, the loss function of theta (since Eq. 4 uses the term derived in Eq. 2, which in turn is obtained from the maximum entropy principle). Given this, it is unclear why the entropy term is also included in Eq. 9. Is it used as a regularizer? + - Alternatively put, consider the unconstrained version of Equation 9. In this unconstrained case, the problem is analogous to MaxEnt IRL (Ziebart et al.). In MaxEnt IRL, given the reward $\theta$, the policy $\phi$ is computed by value / policy iteration and without the extra entropy term. + - Further, adding both $J$ and $H$ in the loss seem counterintuitive as they have different ‘units’. J is cumulative reward, while H is dimensionless entropy. Why is the entropy term normalized by $\beta$? How is the normalization constant chosen? + +4. (Section 4) While not all domains considered in the Experiments can be captured by the prior art (Scobee & Sastry 2020), the first three can be (as they have discrete state, action spaces). 
Please benchmark the proposed approach with prior art for these three domains. Time permitting, also consider utilizing one of the recent high-dimensional techniques (see below) as another baseline. + +5. (Section 6) Space permitting, please include a discussion of following related works. + + - Constrained IRL for high-dimensional problems: + + * Chou, Glen, Necmiye Ozay, and Dmitry Berenson. ""Learning parametric constraints in high dimensions from demonstrations."" Conference on Robot Learning. PMLR, 2020. + + * Park, Daehyung, et al. ""Inferring Task Goals and Constraints using Bayesian Nonparametric Inverse Reinforcement Learning."" Conference on Robot Learning. PMLR, 2020. Notes: Extends beyond the proposed approach to consider constraints which may not be global (i.e., locally active constraints). + + * Chou, Glen, Necmiye Ozay, and Dmitry Berenson. ""Learning constraints from locally-optimal demonstrations under cost function uncertainty."" IEEE Robotics and Automation Letters 5.2 (2020): 3682-3690. + + - Inverse reward / policy learning frameworks that incorporate prior knowledge of reward / policy: + + * Ramachandran, Deepak, and Eyal Amir. ""Bayesian Inverse Reinforcement Learning."" IJCAI. Vol. 7. 2007. + + * Michini, Bernard, and Jonathan P. How. ""Bayesian nonparametric inverse reinforcement learning."" Joint European conference on machine learning and knowledge discovery in databases. Springer, Berlin, Heidelberg, 2012. + + * Unhelkar, Vaibhav V., and Julie A. Shah. ""Learning models of sequential decision-making with partial specification of agent behavior."" Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019. + + * Jeon, Wonseok, Seokin Seo, and Kee-Eung Kim. ""A bayesian approach to generative adversarial imitation learning."" Advances in Neural Information Processing Systems. 2018. + + - Learning features (which can be in the form of logical constraints) for IRL: + + * Choi, Jaedeug, and Kee-Eung Kim. ""Bayesian nonparametric feature construction for inverse reinforcement learning."" Twenty-Third International Joint Conference on Artificial Intelligence. 2013. + +++++++++++++++++++++++++++++++++++ +#### Questions for Rebuttal Phase + +Please address comments 1-4. + +++++++++++++++++++++++++++++++++++ +#### Minor Comments + +- (typo) In the Introduction, Scobee & Sastry is used as singular noun, where in fact it is plural. + +- (Equation 5) Beta is missing in the log exponential term. + +- (Page 4, below Equation 7) The statement ‘Notice that … essentially tries to match’ is ambiguous, since the gradient by itself does not try to match the two values. Please consider rephrasing to say that this matching occurs at the minima (where the gradient is zero). + +- (Page 5, Section 3.3) Please denote the range $[0,1]$ as $(0,1)$, since 0 and 1 are not in the range of $\zeta$. + +- (Equation 9) Consider distinguishing the loss functions in Equation 5 and 9 (say through superscript or subscript). Due to $L$ being overloaded, at first glance, I misunderstood the loss function in Eq 9 as a continued derivation of Eq 5. + +- (Section 7, typo) (2) -> Eq. (2) + +++++++++++++++++++++++++++++++++++ +",7,4.0,ICLR2021 +GcprFO47VMR,2,FOyuZ26emy,FOyuZ26emy,Review,"This paper studies the flaws associated with extending subspace clustering methods to the nonlinear manifolds scenario. In particular, the authors demonstrate that the optimization problem solved due to the extension can be ill-posed and thus lead to solutions which are degenerate/trivial in nature. 
The paper also showed that the performance benefits often associated with the Self-Expressive Deep Subspace Clustering techniques are potentially due to post-processing steps and other factors rather than due to the efficacy of these methods themselves. + +Overall the outline of the paper is good and I found the discussed problem informative. Few aspects were unclear to me (refer below). + +1) In section 2, the authors discuss positively-homogeneous functions i.e. (leaky) Rectifier Linear Units and how it effects the statement in Proposition 1. I would like to understand how Proposition 1 relates to other activation functions i.e. the Hyperbolic Tangent (tanh) and Sigmoid functions for example. In general, given the latter activation functions are not positively-homogeneous and have disparate saturation/non-saturation regions, can we extend the analysis to these activation functions and make the theoretical parts of the paper more generic ? + +2) In section 2.1, the authors talk about how Auto-encoders do not necessarily impose significant constraints on the geometric alignment of points. Are the authors aware of techniques like TopoAE and other state-of-the-art VAE/GAN and generative model variants which use some form of regularization to allow for topology modeling etc. and/or can produce results which achieve the above ? Did the authors consider this ? + +3) How do the authors quantify which encoder-decoder architectures are ""reasonably expressive"" ? Does any constraint as part of the objective function hamper this or are they referring to more specific constraints ? + +4) In sections 2.1 - 2.2, the authors mention that to remove trivial solutions the magnitude of the representations should be greater than some minimum value. What is this minimum value and how can we compute it efficiently ? + +5) In section 2.2, the authors briefly talk about how optimal solutions to the optimization problem may exist which have zero mean. How often do the computed optimum values fall in this category or in general are degenerate or trivial ? My point is existence of degenerate solutions need not mean that our optimization process will actually end up with these degenerate/trivial solutions. We typically can add constraints (which act like a prior and factor in the objective) and thus push our final solution away from degenerate/trivial solutions. + +I felt that the paper points out an important issue but if the authors could provide a general solution which helps ameliorate the issue (some general directions or even providing rudimentary results for one of the these directions) could have made the their contribution much more stronger. Overall I like the paper and the arguments made even though I have some inhibitions with regards to the scope of the contribution given the problem addressed is so specific.",7,4.0,ICLR2021 +WoLmfN_JBay,3,kyaIeYj4zZ,kyaIeYj4zZ,Review,"## After author responses +Based on the revision of the draft and the authors' responses to the review, I am raising my score to 7 from 6. + +--- +## Overall summary +The paper proposes a pre-training method useful for training neural semantic parsing models that translate natural language questions into database queries (text-to-SQL). The authors manually write down and then sample from synchronous context-free grammars that can generate SQL queries along with corresponding natural language (but stilted) utterances that have the same meaning. The authors also collect questions about databases and tables from various sources. 
The combination of the two is used for fine-tuning RoBERTa using an auxiliary objective, then using the fine-tuned RoBERTa in semantic parsing tasks. The authors show strong results on four benchmarks. + +## Strengths +- The authors obtain state-of-the-art results on four different benchmarks which convincingly shows that the proposed methods can help with empirical results. +- The method is simple and straightforward to implement. +- The paper addresses a problem domain with significant practical applications and community interest. + +## Weaknesses +- It is unclear how much effort is needed to construct the SCFG. Since the SCFG was constructed directly by examining examples in Spider, the kinds of natural language questions and queries it generates may also be unfairly biased towards the distribution used in Spider. +- There could have been more quantitative analyses of the method and its component parts in the paper. + +## Recommendation +I am giving a rating of 6 considering the strong results but the relative lack of analysis of the method (ablations, etc). I think further revisions of the paper would benefit from greater analysis of the method. + +## Questions +- What would happen if you use the synthetic Spider data and use it to train the semantic parsing model for Spider? +- Does the grammar provide for multiple natural language utterances that will translate to the same SQL? +- The introduction states that ""Our approach dramatically reduces the training time and GPU cost."" Is there evidence to show this beyond Figure 2? There, the gains don't seem so dramatic. Considering that fine-tuning GRAPPA took 10 hours on 8 V100 GPUs, it seems unlikely that there will GPU cost will be reduced especially if the fine-tuning cost is also included. +- What happens if the task-specific training data is also used with the MLM or SSP objectives in pre-training? https://arxiv.org/abs/2004.10964 gives evidence that it can be useful to fine-tune RoBERTa on the downstream task's data using the pre-training objectives. + +## Miscellaneous comments +The papers for Tapas, Overnight, and TaBERT are duplicated in the references.",7,4.0,ICLR2021 +H16Rrvtlz,1,ryUlhzWCZ,ryUlhzWCZ,Lack of rigor and claim (that k=1 does not allow the RL agent to outperform the expert) not justified,"This work proposes to use the value function V^e of some expert policy \pi^e in order to speed up learning of an RL agent which should eventually do better than the expert. The emphasis is put on using k-steps (with k>1) Bellman updates using bootstrapping from V^e. + +It is claimed that the case k=1 does not allow the agent to outperform the expert policy, whereas k>1 does (Section 3.1, paragraph before Lemma 3.2). + +I disagree with this claim. Indeed a policy gradient algorithm (similar to (10)) with a 1-step advantage c(s,a) + gamma V^e(s_{t+1}) - V^e(s_t) will converge (say in the tabular case, or in the case you consider of a rich enough policy space \Pi) to the greedy policy with respect to V^e, which is strictly better than V^e (if V^e is not optimal). So you don’t need to use k>1 to improve the expert policy. Now it’s true that this will not converge to the optimal policy (since you keep bootstrapping with V^e instead of the current value function), but neither the k-step advantage will. + +So I don’t see any fundamental difference between k=1 and k>1. The only difference being that the k-step bootstrapping will implement a k-step Bellman operator which contracts faster (as gamma^k) when k is large. 
But the best choice of k has to be discussed in light of a bias-variance discussion, which is missing here. So I find that the main motivation for this work is not well supported. + +Algorithmic suggestion: +Instead of bootstrapping with V^e, why not bootstrap with min(V^e, V), where V is your current approximation of the value function. In that way you would benefit from (1) fast initialization with V^e at the beginning of learning, (2) continual improvement once you’ve reached the performance of the expert. + +Other comments: +Requiring that we know the value function of the expert on the whole state space is a very strong assumption that we do not usually make in Imitation learning. Instead we assume we have trajectories from expert (from which we can compute value function along those trajectories only). Generalization of the value function to other states is a hard problem in RL and is the topic of important research. + +The overall writing lacks rigor and the contribution is poor. Indeed the lower bound (Theorem 3.1) is not novel (btw, the constant hidden in the \Omega notation is 1/(1-gamma)). Theorems 3.2 and 3.3 are not novel either. Please read [Bertsekas and Tsitsiklis, 96] as an introduction to dynamic programming with approximation. + +The writing could be improved, and there are many typos, such as: +- J is not defined (Equation (2)) +- Why do you call A a disadvantage function whereas this quantity is usually called an advantage? +- You are considering a finite (ie, k) horizon setting, so the value function depend on time. For example the value functions defined in (11) depend on time. +- All derivations in Section 4, before subsection 4.1 are very approximate and lack rigor. +- Last sentence of Proof of theorem 3.1. I don’t understand H -> 2H epsilon. H is fixed, right? Also your example does not seem to be a discounted problem. +",3,5.0,ICLR2018 +H1uFgwqeM,2,SJQHjzZ0-,SJQHjzZ0-,review,"This paper proposes using divergence and distance functions typically used for generative model training to evaluate the performance of various types of GANs. Through numerical evaluation, the authors observed that the behavior is consistent across various proposed metrics and the test-time metrics do not favor networks that use the same training-time criterion. + +More specifically, the evaluation metric used in the paper are: 1) Jensen-Shannon divergence, 2) Constrained Pearson chi-squared, 3) Maximum Mean Discrepancy, 4) Wasserstein Distance, and 5) Inception Score. They applied those metrics to compare three different GANs: the standard DCGAN, Wasserstein DCGAN, and LS-DCGAN on MNIST and CIFAR-10 datasets. + +Summary: +—— +In summary, it is an interesting topic, but I think that the paper does not have sufficient novelty. Some empirical results are still preliminary. It is hard to judge the effectiveness of the proposed metrics for model selection and is not clear that those metrics are better qualitative descriptors to replace visual assessment. In addition, the writing should be improved. See comments below for details and other points. + +Comments: +—— +1. In Section 3, the evaluation metrics are existing metrics and some of them have already been used in comparing GAN models. Maximum mean discrepancy has been used before in work by Yujia Li et al. (2016, 2017) + +2. In the experiments, the proposed metrics were only tested on small scale datasets; the authors should evaluate on larger datasets such as CIFAR-100, Toronto Faces, LSUN bedrooms or CelebA. + +3. 
In the experiments, the authors noted that “Gaussian observable model might not be the ideal assumption for GANs. Moreover, we observe a high log-likelihood at the beginning of training, followed by a drop in likelihood, which then returns to the high value, and we are unable to explain why this happens.” Could the authors give explanation to this phenomenon? The authors should look into this more carefully. + +4. In algorithm 1, it seems that the distance is computed via gradient decent. Is it possible to show that the optimization always converges? Is it meaningful to compare the metrics if some of them cannot be properly computed? + +5. With many different metrics for assessing GANs, how should people choose? How do we trust the scores? Recently, Fréchet Inception Distance (FID) was proposed to evaluate the samples generated from GANs (Heusel et al. 2017), how are the above scores compared with FID? + +Minor Comments: +—— +1. Writing should be fixed: “It seems that the common failure case of MMD is when the mean pixel intensities are a better match than texture matches (see Figure 5), and the common failure cases of IS happens to be when the samples are recognizable textures, but the intensity of the samples are either brighter or darker (see Figure 2).” +",4,3.0,ICLR2018 +Hkxk2bKt2X,2,ryx3_iAcY7,ryx3_iAcY7,"Interesting idea, but the improvement over the baseline is not significant."," +[Summary] +This paper proposes “a role interaction layer” (briefly, RIL) that consists of context-dependent (latent) role assignments and role-specific transformations: Given an RIL layer, different dimensions of an embedding vector are “interacted” based on Eqn. (5), Eqn. (7), etc. The authors work on IWSLT De->En and WMT En->De, En->Fi to verify their proposed algorithm with case study included. + +[Pros] +(+) I think the idea/thought of using a “role interaction layer” is interesting. The case study in Section 5.3 demonstrates different “roles”. Also, different RIL architectures are designed. +(+) The paper is easy to follow. + +[Cons & Details] +(1) As stated in the abstract, “…, but that the improvement diminishes as the size of data grows, indicating that powerful neural MT systems are capable of implicitly modeling role-word interaction by themselves…” (1) The main concern is that, considering RIL does not obtain significant gain on large datasets, then we cannot say that the proposed algorithm is better than the baseline. (2) Why the NMT systems trained on large dataset can “implicitly modeling role-word interaction”, while small dataset cannot? Any analysis? + +(2) For the “matched baseline”, page 5, you increase the dimensionality of the models. But an RIL is an additional layer, which makes the network deeper. Therefore, a baseline with an additional layer should be implemented. +",5,4.0,ICLR2019 +TNukuwPmX33,1,LIOgGKRCYkG,LIOgGKRCYkG,Not sure if this is an effective defense ,"This paper proposed an ad hoc defense mechanism against white-box attacks, by duplicating the training data with original samples or adversarial samples and the number of prediction classes. The authors claim that this method achieves better results compare to baseline methods. + +I admit that this method could potentially defend against gradient-based attacks like CW and PGD if the attacker have no knowledge of the defense mechanism. 
However, since this method is defending the white-box threat model, I believe a simple attack could break it: + +To generate adversarial sample for test data $x$ whose correct label is $y$ or $y+k$, run PGD algorithm with the objective function $\hat{x}= argmax_{x} l(f(x),y)+l(f(x),y+k)$, where $\hat{x}$ is the adversarial example, $k$ is the number of classes (before duplication) and $f$ is the network trained by target training. Namely, this is just maximize the loss for both the real class $y$ and the duplicated class $y+k$. +I'm not sure why the authors did not include any adaptive attacks like this one in section 5. + +Other flaws: + +-The writing is confusing, a lot of details are omitted. For example, what is perturbation size for CIFAR10 under PGD attack? + +-In section 4.1, the authors say ""Thus, the undefeated Adversarial Training defense cannot be used as a baseline because +it uses adversarial samples during training for all types of attack."" I don't buy it. The choice of baseline method should be based on the threat model, not the algorithm or training data used. + +-The authors say ""Adversarial training assumes that the attack is known..."" I believe this is not true. + +-Black-box attack ZOO shows only 81.5% accuracy on unsecured classifier, which basically means this is not an effective black-box attack. How could the authors use an ineffective attack to demonstrate the effectiveness of their defense method? + +In summary, given the execution of the experiment, I'm not convinced this is an effective defense against white-box attacks.",5,5.0,ICLR2021 +IPIoGgCo6Fv,1,tJz_QUXB7C,tJz_QUXB7C,"A solid contribution, but with sparse results","This work presents FOSAE++, an end-to-end system capable of producing ""lifted"" action models provided only bounding box annotations of image pairs before and after an unknown action is executed. Building on recent work in the space, the primary contribution of this work is to generate PDDL action rules. To accomplish this, the authors introduce novel 'params' function that use the Gumbel-Softmax function to implement a differentiable mechanism for selecting which entities are relevant to the current action and feeds them into the new 'bind' and 'unbind' functions that select those elements in the tensor predicting their relevance. Overall, this work is a meaningful contribution in the direction of generated lifted action models without direct labeled data. + +The biggest flaw in the paper is the rather sparse results section. Additionally, I do somewhat take issue with the statement in 4.2 that ""due to the time constraint"" planning experiments beyond the 8-Puzzle domain. Though I appreciate the author's candor in this regard, at the moment the paper is rather weakened by the lack of inclusion of additional planning experiments, particularly since the 8-Puzzle domain is arguably the easiest domain from which one might generated a lifted action representation, due to the black background surrounding the digits. Seeing planning experiments in the Sokoban domain would greatly strengthen the paper. The reconstructed Sokoban environments in Figure 4 look rather good, though the slightly shifted tiles in the reconstructed scene results in black lines/gaps between the cells, raising questions about the ability of the system to perform planning in these domains. + +Relatedly, though I appreciate that the authors are space-constrained at the moment, the results section (and in particular the planning section) is quite short and lacking in detail. 
More detail, including discussion of the limitations of this approach, would strengthen the paper. + +The Appendix is incredibly thorough and a welcome addition to the paper. It provides helpful additional content that, while not necessary for understanding the paper, aid in understanding and implementation. + +Smaller comments: +- The caption for Fig. 2 should be extended or more annotations should be added to the figure itself. Right now, it is only clear from the body text how these components are used or where these models were introduced in other papers. This change would help clarity. +- A passage addressing the limitations of the proposed system would be a welcome addition as well. In particular, there seems to be a general assumption that only the regions within the provided bounding boxes will change after an action is executed, something this is not generally true (nor is true in the Photo-realistic Blocksworld domain, in which shadows can change outside the bounding boxes). +- The caption in Fig. 3 is (I think) incorrect: the final note should read that `move` was simplified to `?from, ?to`. +- Fig. 12 in the appendix is a duplicate of another figure earlier in the paper. It should be replaced with the _actual_ figure before publication. +",6,3.0,ICLR2021 +rJlcMLoPnm,1,SyfXKoRqFQ,SyfXKoRqFQ,"Decent paper, but little added insight beyond ""Active Bias""","This paper attempts to speed up convergence of deep neural networks by intelligently selecting batches. The experiments show this method works moderately well. + +This paper appears quite similar to the recent work ""Active Bias"" [1]. +The motivation for the technique and setting appear very similar, while the details of the techniques are different. Unfortunately, this is not mentioned in the related work, or even cited. + +When introducing a new method, it is important that design choices are principled, have theoretical guidance, or are experimentally verified against similar design choices. Without one of these, the methods become arbitrary and it is unclear what causes better performance. Unfortunately, this paper makes several choices, about an uncertainty function, the probability distribution, the discretization, and the algorithm (when to update) that appear rather arbitrary. For instance, the uncertainty function is a signed standard deviation of the softmax output. While there are a variety of uncertainty functions, such as entropy and margin, a new seemingly arbitrary uncertainty function is introduced. + +The experiments are good but could be designed a bit better. For instance, it is unclear if the gains are because of lower asymptotic error or because of faster convergence. The learning curves are stopped too early, while the test error is still dropping quickly. + +In summary, it is not clear if this paper adds any insight beyond ""Active Bias"". + +[1] Active Bias: Training More Accurate Neural Networks by Emphasizing High Variance Samples. 2017. Haw-Shiuan Chang, Erik Learned-Miller, Andrew McCallum.",5,4.0,ICLR2019 +rVMW_pFElt2,3,9Y7_c5ZAd5i,9Y7_c5ZAd5i,Excellent theoretical paper,"Summary +------- + +The authors introduce new algorithms to solve two-players zero-sum Markov games, as well as two-players Markov game in the reward-free setting. The approach is model-based, based on successive episods of planning and counting for updating the model estimate. 
It involves solving a matrix game at each iteration, looking for a notion of equilibria that is computable in polynomial time (unlike Nash equilibria) An extension to multi-player games is proposed for both reward and reward free setting (in the appendix). + +The sample complexity of the zero-sum Markov game algorithm (Nash-VI), has a +sample complexity that is closer to known lower-bounds than existing work, and which is better in the horizon dependency than the best model-free algorithm. + +In this zero-sum setting, the improvement to existing work (VI-UCLB) are the following: +- use of an auxiliary bonus that is proportional to the gap of upper-confidence and lower-confidence value function, that allows to use much-smaller standard Hoeffding/Bernstein bonuses +- use of a relaxed notion of equilibria when looking for the next policy given the Q functions of both player. This relaxed notion is introduced in Xie et al. 2020 + +The reward-free setting is simpler, in that it simply use greedy policies for each player, with each player maintaining artifical rewards based on Hoeffding bonuses. + +Review +------ + +The paper is very well written and presents some exciting results. It is completely theoretical, but provides two different algorithms for two different settings, in both the two-player and n-player setting. The improvement in existing bounds is significant. I have only slight concerns: + +- The algorithmic contribution (Alg. 1) could be seen as incremental, as it changes two elements in known algorithm, one of which (coarse correlated equilibria), having been used for a similar purpose in a previous paper. Similarly, Alg 2. is at the end of the day a rather naive extension of zero-reward exploration in single player MDP. Yet the notion of auxiliary bonus is very original, and the reduction of complexity non-trivial. + +- The absence of experiments, even in toy setting, is regrettable. It is +especially true as the use of coarse correlated equilibria may be expensive, and +I would have appreciated seing Alg. 1 implemented. As this is ICLR, extensions +to function approximations would also be interesting. In particular, comparison +with model-free approaches would be welcome, as constants before the sample +complexities may vary. + +- Some parts of the text could be further explained: in particular the +intuitions behind coarse correlated equilibria, which is introduced only mathematically. + +- The paper theoretical content is rather heavy, which may make this manuscript more suitable for a journal venue, where it would be more thoroughly reviewed. I must admit that I could not proof read the entire appendix. + + I have several questions, as follow: + +- I do not understand the note ""Our results directly generalize to randomized reward +functions, since learning the transition is more difficult than learning the reward."" Could you elaborate on this aspect. + +- Is there any reason why we would like to use Hoeffding bonuses instead of Bernstein bonuses in the rewarded case ? + +- In the multi-player, rewarded case, it appears that using coarse correlated equilibrium instead of Nash equilibrium yields non-product policies, which is unfortunate. Is there any way that we could obtain product policies solving a relaxed notion of equilibria that would be computable in polynomial time ? Similarly, is there a foreseeable way in which $\Pi A_i$ could be transformed in a sum ? 
Lower-bounds in this case are not discussed in this case, is there any ?",8,4.0,ICLR2021 +SJeUt8H62X,3,BygMAiRqK7,BygMAiRqK7,"Interesting connection, but lacks clarity","Summary +The authors notice that entropy regularized optimal transport produce an upper bound of a certain model likelihood. Then, the authors claim it is possible to leverage that upper bound to come up with a measure of 'sample likelihood', the probability of a certain sample under the model. + +Evaluation +The idea is certainly interesting and novel, as it allows to bridge two distinct worlds (VAE and GANs). However, I am concerned about the message (or lack of thereof) that is conveyed in the paper. Particularly, the following two points makes me be reluctant to recommend an acceptance: + +1)There is no measure on the tightness of the lower bound. How can we tell if this bound isnt tight? All results are dependent on the bound being close to the true value. No comments about this are given. +2)The sample likelihoods are dependent on a certain ""model"". Here the nomenclature is confusing because I thought GANS were a probabilistic model, but now there is an additional model regarding a function f. How these two relate? What happens if I change f? to which extent the results depend on f? +3)related to 2): the histograms in figure 2 are interesting, but they are not conclusive that the measure that is being proposed is a 'bona fide' sample likelihood. + +",5,3.0,ICLR2019 +ryxRUPb83X,1,HkM3vjCcF7,HkM3vjCcF7,"Authors extends stacked hourglass network with inception-resnet-A mudules and a multi-scale approach for human pose estimation in still RGB images. Given a RGB image, a pre-processing module generates feature maps in different scales which are fed into a set of serial stack hourglass modules each responsible for a different scale. Authors propose an incremental adaptive weighting formulation for each stack-scale-joint. They evaluate proposed architecture on LSP and MPII datasets.","Authors extends stacked hourglass network with inception-resnet-A mudules and a multi-scale approach for human pose estimation in still RGB images. Given a RGB image, a pre-processing module generates feature maps in different scales which are fed into a set of serial stack hourglass modules each responsible for a different scale. Authors propose an incremental adaptive weighting formulation for each stack-scale-joint. They evaluate proposed architecture on LSP and MPII datasets. + +positive: +- Having an adaptive weight strategy is a necessary procedure in multi-loss functions where cross-validation or manual tuning of fixed weights are expensive. While the weights are functions of the loss, it is not analyzed and evaluated thoroughly, e.g. evolution of weights for each joint-stack-scale. Even it is not given in the section 5.2.1. So it is hard to judge effectiveness of the proposed loss. + +negative: +- In general experiments section is the most weakness of the paper. I comment some points in the following: +a) Multi-scale image processing is not a novel idea in computer vision and specifically in human pose estimation. The authors have not compared their methods with most recent papers in the literature and I believe the results are not state-of-the-art (see [1] for instance which is a multi-scale approach). +b) What is the effect of each scale in the results and for each joint? This must be analyzed and shown visually or numerically. + +- The number citations is not enough. + +- The writing must be improved. 
+ +overall: +Regarding mentioned comments, I believe the paper needs a huge extra effort (both analytically and numerically) to be adequate for publication. Therefore, I recommend rejection. + + +[1] Yang, W., Li, S., Ouyang, W., Li, H., Wang, X.: Learning feature pyramids for human pose estimation. In: ICCV. (2017)",3,5.0,ICLR2019 +4vyb_Jq8POh,3,KLH36ELmwIB,KLH36ELmwIB,interesting idea to separate the two roles of skip connection to prevent model collapse,"The paper reveals two roles of skip connection to prevent model collapse: stabilize the supernet training, and as a candidate operation to build the final network. Intuitively, the skip connection playing the first role should only be there during the training phase. Therefore, the authors propose to add an auxiliary skip connection to play that role. This auxiliary skip connection is decayed gradually during training. The paper provides an interesting theoretical analysis that this can help prevent the gradient vanishing problem. +Compared to other approaches, the proposed one has the advantage that it does not rely on any heuristic indicator. Indeed, it is found that the existing indicator based on Hessian eigenvalue can discard good models. +In the experiments, this method - DART-, is compared with DART and several other approaches, and show that the proposed method can outperform the others, however, by small margins. + +pros: +The assumption that skip connections play two different roles is very interesting and inspiring. It is indeed the case that such skip connections not only can stabilize the training, but are part of the model. The idea to separate the two roles is well motivated. +The paper provides some interesting theoretical analysis to show the potential impact of the auxiliary skip connection on the gradient vanishing problem. +The experiments are extensive, using several datasets and comparing several existing approaches. The experimental results provide some evidence that the idea works. + +cons: +While the basic idea is well motivated, one could question about the specific architecture to add the auxiliary skip connection. The latter is intended to play the role of stabilizing the training. It is unclear why this specific architecture is appropriate for it. The paper does not provide a strong explanation for it. +While DART- outperforms many of the baselines, it is only marginally better than some recent methods (namely P-DART). For example, in Table 2, DART- is equivalent to P-DART on CIFAR-10. The experimental evident is moderately strong to show that DART- is better than the existing methods. + +Recommendation to the authors: It would be useful to justify the choice of the architecture of DART-, namely to explain why the auxiliary skip connection added can truly play the role of stabilizer. Would it be possible to use a different architecture as well?",6,3.0,ICLR2021 +Byxnr1pYhQ,2,ByxAOoR5K7,ByxAOoR5K7,Nice introduction/summary to rate-distortion in RL with illustrative examples,"(score raised from 6 to 7 after the rebuttal) +The paper explores the application of the rate-distortion framework to policy learning in the reinforcement learning setting. In particular, a policy that maps from states to actions is considered an information theoretic channel of limited capacity. This viewpoint provides an interesting angle which allows modeling/learning of (computationally) bounded-rational policies. 
While capacity-limitation might intuitively seem to be a disadvantage, intriguing arguments (based on solid theoretical foundations, rooted in first principles) can be made in favor of capacity-limited systems. Two of the main-arguments are that capacity-limited policies should be faster to learn and be more robust, i.e. generalize better. After thoroughly introducing these arguments on a less formal level and putting them into perspective with regard to reinforcement learning and related work in the literature, the paper demonstrates these properties in a toy grid-world example. When compared against a vanilla actor-critic (AC) algorithm, the capacity-limited version is shown to converge faster and reach better final policies. The paper then extends the basic version of the algorithm, which requires knowledge of the optimal value function, towards simultaneously learning the value function. While any theoretical guarantees are lost, the empirical results are still in line with the theoretical benefits, outperforming vanilla AC and producing better results in previously unencountered variations of the grid-world environment. + +The paper is very well written and the toy-examples illustrate the theoretical advantages in a very nice and intuitively understandable way. The topic of modeling capacity-limited RL agents and exploring how capacity-limitation is an advantage, rather than a “bug” is very timely and important. In particular, rate-distortion theory might provide key-insights into building agents that generalize well, which is among the major open problems in reinforcement learning. The paper is thus very timely and highly relevant to a broad audience. + +The main weakness of the paper is that it is of quite limited novelty and that the brute-force approach towards using Blahut-Arimoto in RL is unlikely to scale to large, complex state-/action-spaces without major additional work. Continuous state-/action-spaces are in principle covered by the theory, but they come with additional caveats and subtleties (I appreciate the authors using discrete notation with sums instead of integrals). Additionally, when simultaneously learning the value function (in the online setting), any guarantees about Blahut-Arimoto convergence are lost. However, solving either of these issues is hard and many attempts have been made in the communications community. Despite these weaknesses I argue for accepting and presenting the paper at the conference for the following reasons: +- modelling capacity-limited agents via ideas from rate-distortion theory (which is very closely related to free-energy optimization, such as ELBO maximization, Bayesian inference and the MDL principle) is an underrated topic in reinforcement learning. On a conceptual level, the strong idea is that moving away from strict optimization and infinite-capacity systems is not a shortcoming but can actually help building agents that perform better and generalize better. This is not a well established idea in the community. The paper does a good job at introducing the general idea, illustrating it intuitively with toy examples and pointing out relevant literature. +- Simultaneously learning the value function is necessary in the RL setting, but breaks quite a bit of the theory. 
However, very similar ideas seem to work quite well empirically in other settings, such as for instance ELBO maximization in VAEs, where the “value function” is the log-likelihood (under the decoder), which is learned simultaneously while learning a “policy” (the encoder) under capacity limitation (the KL term). Similar arguments can be made for modern InfoBottleneck-style objectives in deep learning. Based on this empirical observation, it is not unlikely that simultaneous learning of the value function works reasonably well without catastrophically collapsing in other settings and tasks. +- While achieving a solution that strictly lies on the rate-distortion curve might be crucial in communications, it might be of lesser significance for building RL agents that generalize well - slight sub-optimalities (solutions that lie off the RD curve) should still yield interesting agents. Therefore, losing theoretical guarantees might be less severe for simply exploring how much the idea can be scaled up empirically. + +Minor issues: +1) While the paper, strictly speaking, introduces a novel algorithm and the Bellman loss function (which requires knowledge of the optimal value function), I think that the main contribution is a clear and well-focused introduction of rate-distortion theory in the context of RL, including very illustrative toy examples. I do consider this an important contribution. + +2) Transfer to novel environments. The final example (Fig. 4) does show that the capacity limited agent performs better in novel environments. However, I’m not entirely convinced that this demonstrates “superior transfer to novel environments” (from the abstract). While the latter might very well arise from capacity-limitations, I think that in the example in the paper there is not too much transfer going on, but the capacity-limited agent simply has a more stochastic policy which helps if unknown walls are in the way. After all, the average accumulated reward of the capacity limited agent does also decrease significantly in the novel environment - it simply does a slightly better random walk than the AC (correct me if I’m wrong, of course). On page 7, last paragraph this is phrased as: “agents retain knowledge of exploratory actions”. In my opinion this wording is a bit too strong to simply describe increased stochasticity. + +3) Since the paper does provide a good overview over the literature, I think it would help to mention that the current main approach towards generalizing (deep) RL is via hierarchical RL (options framework, etc) and provide a good reference. + +4) At the very end of the intro you might also want to mention that rate-distortion has been used before in the context of decision-making (not RL), for instance under the term rational inattention. + +5) Page 5, last paragraph: the paper mentions that one Blahut-Arimoto iteration is enough. This is an empirical observation, justified by the toy experiments. However, the wording sounds like this is a generally known fact. Please rephrase to emphasize that this must not necessarily hold true in general and that convergence behavior might crucially depend on this. + +6) It would be good to give readers some guidance towards choosing beta if doing an exhaustive grid-search is infeasible. I am aware that there is no good general rule or recipe, but perhaps something can be added to the discussion (even if it is just mentioning that there is no good heuristic, etc. 
- however, there should be plenty of research in communications that deals with estimating the RD curve from as few points as possible). + +7) Please consider adding this reference - it has a very similar objective function (but for navigating towards multiple goals) and is very much in line with some of the theoretical arguments. +Informational Constraints-Driven Organization in Goal-Directed Behavior - Van Dijk, Polani, 2013. +",7,4.0,ICLR2019 +BkYfM_Rgz,3,rkw-jlb0W,rkw-jlb0W,Marginally below acceptance threshold,"It is clear that the problem studied in this paper is interesting. However, after reading through the manuscript, it is not clear to me what the real contributions made in this paper are. I also failed to find any rigorous results on generalization bounds. In this case, I cannot recommend the acceptance of this paper. ",5,1.0,ICLR2018 +gErsWdf0h8c,1,C4-QQ1EHNcI,C4-QQ1EHNcI,Good PoC but the main experiment raises questions,"The paper presents a new way of approximating posteriors in Bayesian DNNs. The network is split into two subnets. One uses only point estimates while the other uses a full (non-diagonal) Gaussian approximation. The structure of that subnet is found by taking the largest second derivatives of the Hessian of the linearized DNN (the authors call it the generalized Gauss-Newton (GGN) matrix). The authors show that under very specific conditions such a choice corresponds to minimization of the Wasserstein-2 distance between their approximation and the true posterior. In the experimental part they provide a set of explorative experiments showing that it may be better to use their approximation for inference in a large network than either using standard (simple) approximations in a large network or full Bayesian inference in a small network. This is very nice methodologically and I welcome such a demonstration, but this can only be considered a (good) proof of concept. The flagship experiment, however, looks very unconvincing (see below). + +Pros. +1. Interesting idea of finding a better approximation for the posterior on a subset of parameters. +2. Methodologically nice PoC. +3. Thorough comparison with alternative similar techniques such as SWAG. + +Cons. +1. The authors claim that they theoretically characterize the discrepancy and derive an optimal strategy (see contribution 3), but they (0) consider a linearized approximation of the DNN (they admit this); (1) do this ONLY for the regression problem although their flagship experiment is on classification problems; (2) the method they derive is based on the unnatural assumption that the covariance matrix is diagonal (if the matrix is diagonal there is no way to approximate it with a full submatrix anyway). I would not call it an optimal strategy - it rather looks like a reasonable heuristic. +2. My major concern is section 5.3. The authors claim that their method estimates uncertainty better than all baselines including deep ensembles (DE). I am afraid that in its current form the comparison is not fair. They use only a DE of size 5 while their method requires approx. (42K)*(42K) ≈ 1.756B parameters, which is enough to keep in memory about 160 initial networks. So it would be fair to compare against a DE of size 160, since we know that larger DEs estimate uncertainties better. It is not that surprising that the suggested method outperforms other baselines, since all of them require much less memory. 
So I would recommend that the authors compare (1) their current model against a DE that requires a similar amount of memory; and (2) their reduced model (which requires approximately the same amount of memory as the baselines) against the other baselines, to check whether the proposed algorithm can still estimate uncertainty better given the same memory budget.",5,5.0,ICLR2021 +rJlCmGvTFH,1,rkgrbTNtDr,rkgrbTNtDr,Official Blind Review #1,"This paper tackles the Image-to-Image translation task via a simplified yet more effective training procedure. + +Compared to the direct baseline BicycleGAN, the training procedure proposed in this paper replaces the simultaneous training of the encoder E and the generator G with a staged training that alternately trains E and G and then fine-tunes them together. Although this appears to be a simple modification, the empirical results on generalization and reconstruction quality demonstrate the effectiveness of the proposal. + +It would be better to provide more intuition on why this pretraining phase helps the results generalize better and yields better performance. The current presentation of the paper mostly consists of detailed descriptions of the proposed training procedure, without an interesting discussion of why this pretraining makes the problem easier. For instance, I'm interested in seeing, on some toy distributions, how the training progress (measured quantitatively) of the proposed method compares with that of traditional BicycleGAN. + +Although the results look nice, with the current presentation there is not much inspiration one can draw from the paper. I encourage the authors to make some adjustments, and I will reconsider the score. ",3,,ICLR2020 +HJ-3KJzEe,1,BkGakb9lx,BkGakb9lx,RenderGAN: Generating Realistic Labeled Data,"The submission proposes an interesting way to match synthetic data to real data in a GAN-type architecture. +The main novelty is a set of parametric modules that emulate different transformations and artefacts, allowing the output to match the natural appearance. + +Several points were raised during the discussion: + +1. The proposed method is more model-driven than previous GAN models. But does it pay off? How would a traditional GAN approach perform? The mentioned effects like blur, lighting and background could also potentially be modelled by an upsampling network that directly predicts the image. I would assume that blur and lighting can be modelled by convolutions, and transformations to some extent by convolutions - or spatial transformer networks. +The authors' answers only partially address this point. The key proposal of the submission seems to be parameterised modules that can be trained to match the real data distribution, but it remains unclear why a more generic parameterisation could not also do the job, e.g. a neural network - as done in regular GANs. The benefit of introducing a stronger model is unclear. Using a render engine to generate the initial sample appearance is of limited novelty. + +2. How does it compare to traditional data augmentation techniques, e.g. noise, dropout, transformations? You are linking to Keras code - where data augmentation is readily available and could be tested (ImageDataGenerator). +The authors reply that plenty of such augmentation was used and that more details are going to be provided in the appendix. It would have been appreciated if such information had been included directly in the revision, so that the procedure could be checked. Right now, this remains a point of uncertainty. + +3. 
How do the different stages (\phis) affect performance? Which are the most important ones? +The authors do evaluate the effect of hand-tuning the transformation stages vs. learning them. It would be great to also include results of including/excluding stages completely - and also to report how much the initial jittering of the data helps. + +While there is an interesting idea of (limited) novelty in the paper, there are some concerns about evaluations and comparisons as outlined above. In addition, success is shown on only a single dataset/task. Yet the task is interesting and seems challenging. Overall, this makes for only a weak recommendation for acceptance.",6,4.0,ICLR2017 +SyiRxi7El,1,rJY0-Kcll,rJY0-Kcll,Strong paper but presentation unclear at times,"In light of the authors' responsiveness and the updates to the manuscript -- in particular to clarify the meta-learning task -- I am updating my score to an 8. + +----- + +This manuscript proposes to tackle few-shot learning with neural networks by leveraging meta-learning, a classic idea that has seen a renaissance in the last 12 months. The authors formulate few-shot learning as a sequential meta-learning problem: each ""example"" includes a sequence of batches of ""training"" pairs, followed by a final ""test"" batch. The inputs at each ""step"" include the outputs of a ""base learner"" (e.g., training loss and gradients), as well as the base learner's current state (parameters). The paper applies an LSTM to this meta-learning problem, using the inner memory cells in the *second* layer to directly model the updated parameters of the base learner. In doing this, they note similarities between the respective update rules of LSTM memory cells and gradient descent. Updates to the LSTM meta-learner are computed based on the base learner's prediction loss for the final ""test"" batch. The authors make several simplifying assumptions, such as sharing weights across all second layer cells (analogous to using the same learning rate for all parameters). The paper recreates the Mini-ImageNet data set proposed in Vinyals et al 2016, and shows that the meta-learner LSTM is competitive with the current state-of-the-art (Matching Networks, Vinyals 2016) on 1- and 5-shot learning. + +Strengths: +- It is intriguing -- and in hindsight, natural -- to cast the few-shot learning problem as a sequential (meta-)learning problem. 
The text is ambiguous and seems to suggest training two separate models, one for slope, one for duration, which is particularly puzzling considering that predicting them jointly is in fact much easier (just two output variables instead of one), makes more sense, and is entirely feasible with the current method. + +Other parts of the text can be improved too. For instance, the authors can vastly compress the generic description of standard convnet and LSTM equations in section 4, while the preprocessing of the time series needs to appear much earlier. + +4) Conclusion + +Although the architecture seems promising, the current experiments are too preliminary to validate its usefulness, in particular to existing alternatives like LRCN, which are not compared to.",4,4.0,ICLR2017 +vOoGdFWcQL0,3,FyucNzzMba-,FyucNzzMba-,"Thorough evaluation, but the novelty is a bit limited, and the observations may be very specific to the testing environments.","=== Summary + +This paper investigates the performance of several state-of-the-art forward-prediction models in the complex physical-reasoning tasks of the PHYRE benchmark. The authors have provided thorough evaluations of the models by ablating on different ways of representing the state (object-based or pixel-based), forms of model class (Interaction network, Transformers, Spatial transformer networks, etc.), and specific evaluations settings (within-template or cross-template), from which they made several interesting observations. For example, forward predictors with better pixel accuracy do not necessarily lead to better physical-reasoning performance. Their best-performing model also sets a new state-of-the-art on the PHYRE benchmark. + + +=== Strengths + +This paper targets a very challenging task, the PHYRE benchmark, where the simulation results can be very sensitive to small changes in the action (or the world), and their best-performing model from this paper outperforms the prior state-of-the-art methods on this benchmark. + +This paper provides a thorough evaluation of the forward-prediction models by considering different state representation and model class, resulting in some interesting observations about the relationship between the modeling choices and the physical-reasoning performance. + +The videos in the supplemental materials are very illustrative and provide good qualitative comparisons between the methods. + + +=== Weaknesses + +My primary concern of this paper is the novelty is a bit limited. This paper does not propose any new method, but mainly focus on comparing several existing forward-prediction approaches by assessing their ability to perform physical reasoning on the PHYRE benchmark. Although the best model achieves a new state-of-the-art, I would not consider it to be particularly novel. + +The proposed method seems to be very specific to the PHYRE benchmark, which, although challenging but only contains open-loop tasks with rigid objects of simple shapes in 2d space. It is hard to know whether this paper's observations are still valid in more diversified and complicated environments. For example, this paper suggests that ""pixel-based models are more helpful in physical reasoning"" than object-based models, which may not be true if we apply the methods in three-dimensional environments where pixel-based models could suffer from occlusions and a poorer estimation of the 3d location and geometry of an object. + +Even if the forward model is reasonably accurate, the control problem can still be very challenging. 
Given that the results are very sensitive to small variations in the initialization, the task-solution model and the search strategy proposed in this paper may require extensive samples to find a suitable solution to circle around bad local minimums. What is the distribution of the sampling space? How does the performance change with respect to the number of samples? How long does the process take? + +When training the forward-prediction model, using a fully-connected graph to model the contacting events between the objects can sometimes lead to unsatisfying results. This is because neural networks are essentially a continuous function, whereas contacts are inherently discontinuous. As shown in the videos accompanying this paper, IN sometimes misses or smoothes the contacts, which is also the reason why some other works choose to add edges between constituent components only when they are close enough [1,2]. The object-based model may have a better performance if the graph is built dynamically. + +When constructing the training set, the authors ""sample task-action pairs in a balanced way: half of the samples solve the task and the other half do not."" How did the authors specify the sampling space? How likely will a random sample solve the task, given the sensitivity of the simulation results to the input variations? + +In Figure 5, are there any intuitive explanations of why the red curve first decreases and then increases? + + +[1] A Compositional Object-Based Approach to Learning Physical Dynamics, ICLR 2017 + +[2] Learning to simulate complex physics with graph networks, ICML 2020 + + +=== Post rebuttal + +Having read the rebuttal and the reviews from other reviewers, my rating remains the same (5: Marginally below acceptance threshold). Many of the reviewers share similar concerns: + +(1) The novelty is a bit limited as the paper did not introduce any novel technique approaches. While not every paper needs to propose a new method, a more in-depth analysis of the benchmarked approaches may be needed to provide insights into how existing methods fail and how we can improve them. + +(2) The scope of this paper is a bit narrow where the authors only evaluated the methods on PHYRE that is fully-observable and only contains open-loop tasks with rigid objects of simple shapes in 2D space. + +(3) The conclusion that the pixel-based model does better than the object-based counterparts may be a bit controversial. The observation is very specific to the methods and environments the authors were using and may not hold when generalizing to more complicated partially-observable or 3D scenarios.",5,4.0,ICLR2021 +XsYE7DtikRt,4,7pgFL2Dkyyy,7pgFL2Dkyyy,Several major weaknesses,"# Summary + +This paper claims to have 3 main contributions. + +C1: Understanding/Theory. It explains why the two tricks work in zero-shot learning (ZSL): (i) normalization + scaling in the compatibility function of the class features and the attributes, and (ii) attribute unit normalization. + +C2: Method. It proposes a “class normalization” scheme (Eq. 9 and 10) and Fig. 3. +C2.1 From a “theoretical explanation” of C1 (ii), this fixes (ii) in a “deep” ZSL model. +C2.2 It improves “smoothness” of a “irregular” loss landscape in ZSL. + +C3: Experiments. It demonstrates strong accuracy and training speed of the proposed approach in standard generalized ZSL. It also considers continual ZSL (Sect. 4), in which the proposed method is evaluated via mean accuracy (over timesteps) accuracy metrics and a forgetting metric. 
+ +### + +# Strengths + +S1. Simple method. This is a simple feature-attribute scoring function via scaled cosine similarity (with normalization). + +S2. Strong empirical results (on both accuracy and training speed). See Table 2. + +# Weaknesses + +W1. Clarity + +The organization of the paper is such that the reader has to refer to the appendix a lot. My biggest concern on clarity is on the “theoretical” results which are not rigorous and at times unsupported. Further, some statements/claims are not precise or clear enough for me to be convinced that the method is well-motivated and is doing what it is claimed to be doing. + +W2. Soundness + +I have a lot of concerns and questions here as I read through Sect. 3. At a high-level, I don’t see a clear connection between “improved variance control of prediction y^ or the smoothness of loss landscape” and “zero-shot learning effectiveness.” Details below. This is in part due to poor clarity. + +W3. Experiments + +IMO, if the main claim is really about the effectiveness of the two tricks and the proposed class normalization, then the experiments should go beyond one zero-shot learning starting point --- 3-layer MLP (Table 2). + +- If baseline methods already adopt some of these tricks, it should be made clear and see if removing these tricks lead to inferior performance. +- If baseline methods do not adopt some of these tricks, these tricks, especially class normalization, could be applied to show improved performance. If it is difficult to apply these tricks, further explanation should be given (generally, also mention applicability of these tricks.) + +This is done to some degree in the continual setting. + +W4. Related work + +As I mentioned in W3, it is unclear which methods are linear/deep, and which methods have already benefited from existing/proposed tricks. + +### + +# Detailed comments (mainly to clarify my points about weaknesses) + + +## Statement 1 + +The main claim for this part is that this statement provides “a theoretical understanding of the trick” and “allows to speed up the search [of the optimal value fo \gamma].” + +However, I feel that we need further justifications on the correlation between Statement 1 (variance of y^_c, “better stability” and “the training would not stale”) and the zero-shot learning accuracy for this to be the “why normalization + scaling works.” My understanding is that the Appendix simply validates that Eq. (4) seems to hold in practice. + +Moreover, is the usual search region [5,10] actually effective? Do we have stronger supporting empirical evidence than the three groups of practitioners (Li et al 2019, Zhang et al. 2019, Guo et al. 2020), who may have influenced each other, used it? + +Finally, can the authors comment on the validity of multiple assumptions in Appendix A? To which degrees does each of them hold in practice? + + +## Statement 2 and 3 + +Why wouldn’t the following statement in Sect. 3.3 invalidate Statement 1? +“This may create an impression that it does not matter how we initialize the weights — normalization would undo any fluctuations. However it is not true, because it is still important how the signal flows, i.e. for an unnormalized and unscaled logit value” + +It is unclear (at least not from the beginning) why understanding attribute normalization has to do with initialization of the weights. + +Similar to my comments to Statement 1, why should we believe that the explanation in Sect. 3.3 and Sect. 3.4 is the reason for zero-shot learning effectiveness? 
In particular, the authors again claim that the main bottleneck in improving zero-shot learning is “variance control” (the end of Sect. 3.3). + +I also have a hard time understanding some statements in Appendix H, which is needed to motivate the following statement in Sect. 3.3: “And these assumptions are safe to assume only for z but not for a_c, because they do not hold for the standard datasets (see Appendix H).” +H1: Would this statement still be true after we transform a_c with an MLP? +H2: Why is it not “a sensible thing to do” if we just want zero mean and unit variance? +H3: Why is “such an approach far from being scalable”? +H4: What if these are things like word embeddings? +H5: Fig. 12 and Fig. 13 are not explained. +H6: Histograms in Fig. 13 look quite normal. + +How useful is Statement 2? Why is the connection with Xavier initialization important? + +Why is “preserving the variance between z and y~” in Statement 3 important for zero-shot learning? + + +## Improved smoothness + +The claim “improved smoothness” at the end of Sect. 3 and Appendix F is really hard to understand. +F1: How do the authors define “irregular loss surface”? +F2: “Santurkar et al. (2018) showed that batch-wise standardization procedure decreases the Lipschitz constant of a model, which suggests that our class-wise standardization will provide the same impact.” This is not very precise and seems unsupported. Please make it clear how. If this is a hypothesis, please make it clear. + +Similarly to my comments to Statement 1-3, how is improved smoothness related to zero-shot learning effectiveness? + + +## Other more minor comments +1. Abstract: Are the authors the one to “generalize ZSL to a broader problem”? Please tone down the claim if not. +2. After Eq. (2): Why does attribute normalization look “inconsiderable” (possibly this is not the right word?) or why is it “surprising” that this is preferred in practice? Don’t most zero-shot learning methods use this (see for example Table 4 in [A])? +3. Suggestions for references for attribute normalization. This can be improved; I can trace this back to much earlier work such as [A] and [B] (though I think this fact is stated more explicitly in [A]). +4. Under Table 1 “These two tricks work well and normalize the variance to a unit value when the underlying ZSL model is linear (see Figure 1), but they fail when we use a multi-layer architecture.”: Could the authors provide a reference to evidence to support this? I think it is also important to provide a clear statement of what separates a “linear” or “multi-layer” model. +5. The first paragraph of Sect. 3: Could you provide references for motivations for different activation functions? Further, It is unclear that all of them perform normalization. +6. The second paragraph of Sect. 3: What exactly limits “the tools” for zero-shot learning vs. supervised learning? Further, it would also be nice to separate traditional supervised learning where classes are balanced and imbalanced; see, e.g., [C]. +7. What is the closest existing zero-shot model to the one the authors describe in Sect. 3.1? Why is the described model considered/selected? 
+ +[A] Synthesized Classifiers for Zero-Shot Learning + +[B] Zero-Shot Learning by Convex Combination of Semantic Embeddings + +[C] Class-Balanced Loss Based on Effective Number of Samples + +",3,4.0,ICLR2021 +B1eXpM0L6m,3,r1luCsCqFm,r1luCsCqFm,"Some basic intuition, but very handwavy, unclear paper, with dubious experimental significance.","This paper suggests a source of slowness when training a two-layer neural networks: improperly trained output layer (classifier) may hamper learning of the hidden layer (feature). The authors call this “inverse” internal covariate shift (as opposed to the usual one where the feature distribution shifts and trips the classifier). They identify “hard” samples, those with large loss, as being the impediment. They then propose a curriculum, where such hard samples are identified at early epochs, their loss attenuated and replaced with a requirement that their features be close to neighboring (in feature space) samples that are similarly classified, but with a more comfortable margin (thus “easy”.) The authors claim that this allows those samples to contribute through their features at first, without slowing the training down, then in later epochs fully contribute. Some experiments are offered as evidence that this indeed helps speedup. + +The paper is extremely unclear and was hard to read. The narrative is too casual, a lot of handwaving is made. The notation is very informal and inconsistent. I had to second guess multiple times until deciphering what could have possibly been said. Based on this only, I do not deem this work ready for sharing. Furthermore, there are some general issues with the concepts. Here are some specific remarks. + +- The intuition of the inverse internal covariate shift is perhaps the main merit of the paper, but I’m not sure if this was not mostly appreciated already. + +- The paper offers some experimental poking and probing to find the source of the issue. But that part of the paper (section 3) is disconnected from what follows, mainly because hardness there is not a single point’s notion, but rather that of regions of space with a heterogeneous presence of classes. This is quite intuitive in fact. Later, in section 4, hard simply means high loss. This isn’t quite the same, since the former notion means rather being near the decision boundary, which is not captured by just having high loss. (Also, the loss is not specified.) + +- Some issues with Section 3: the notions of “task” needs a more formal definition, and then subtasks, and union of tasks, priors on tasks, etc. it’s all too vague. The term “non-computable” has very specific meaning, best to avoid. Figure 2 is very badly explained (I believe the green curve is the number of classes represented by one element or more, while the red curve is the number of classes represented by 5 elements or more, but I had to figure it out on my own). The whole paragraph preceding Figure 3 is hard to follow. I sort of can make up what is going, especially with the hindsight of Section 4, since it’s basically a variant of the proposed schedule (easy to hard making sure all clusters, as proxy to classes, are represented) without the feature loss, but it needs a rewriting. + +- It is important to emphasize that the notion of “easy” and “hard” can change along the training, because they are relative to what the weights are at the hidden layer. Features of some samples may be not very separable at some stage, but they may become very separable later. 
The suggested algorithm does this reevaluation, but this is not made clear early on. + +- In Section 4, the sentence where S_t(x) is mentioned is unclear. I assume “surpass” means achieving a better loss. Also later M_t (a margin) is used, when I think what is meant is S_t (a set). The whole notation (e.g. “topk”, indexing that is not subscripted, non-math mode math) is bad. + +- If L_t is indeed a loss (and not a “performance” like it’s sometimes referred to, as in minus loss), then I assume larger losses means that the weight on the feature loss in equation (3) should be larger. So I think a minus sign is missing in the exponent of equation (2), and also in the algorithm. + +- I’m not sure if the experiments actually show a speedup, in the sense of what the authors started out motivating. A speedup, for me, would look like the training progress curves are basically compressed: everything happens sooner, in terms of epochs. Instead, what we have is basically the same shape curve but with a slight boost in performance (Figure 4.) It’s totally disingenuous to say “this is a great boost in speed” (end of Section 5.2) by saying it took 30 epochs for the non-curriculum version to get to its performance, when within 4 epochs (just like the curriculum version) it was at its final performance basically. + +- So the real conclusion here is that this curriculum may not have sped up the training in the way we expect it at all. However, the gradual introduction of badly classified samples in later epochs, while essentially replacing their features with similarly classified samples for earlier epochs, has somehow regularized the training. The authors do not discuss this at all, and I think draw the wrong conclusion from the results. +",2,5.0,ICLR2019 +SygnB7pe5S,2,HJx-3grYDB,HJx-3grYDB,Official Blind Review #1,"Authors propose a new method for multi-agent reinforcement learning by using nearly decomposable value functions. The main idea is to have local (agent specific) value function and one global value function (for all agents). They also try minimizing the required communication need for multi-agent setup. To this end they deploy variational inference tools and support their method with an experimental study. + +I found the paper interesting and empirical evaluation is good. However, my knowledge in the field is quite limited. +",6,,ICLR2020 +BkenmE9a_r,1,r1l-5pEtDr,r1l-5pEtDr,Official Blind Review #3,"This paper introduces a new step-size adaptation algorithm called AdaX. AdaX +builds on the ideas of the Adam algorithm to address instability and non-convergence issues. +Convergence of AdaX is proven in both convex and non-convex settings. The paper +also provides an empirical comparison of AdaX against its predecessors +(SGD, RMSProp, Adam, AMSGrad) on a variety of tasks. + +I recommend the paper be rejected. I believe the convergence results could be a significant contribution, but the +quality of the paper is hampered by its experimental design. The paper felt generally unpolished, containing +frequent grammatical errors, imprecise language, and uncited statements. + +My main issue with the paper is the experimental design. I am not convinced that we can +draw valid conclusions from the experimental results for the following reasons: + - The experiments are lacking important details. How many independent runs of the experiment + were the experimental results averaged over? All of the experiments have random initial conditions + (e.g. 
initialization of the network), and should be ran multiple times, not just once. + There's no error bars in any of the plots, so it's unclear whether AdaX really + does provide a statistically significant improvement over the baselines. + Similarly, the data in all the tables is quite similar, so without indicating the + spread of these estimates its impossible to tell whether these results are significant or not. + + - How were the hyperparameters and step-size schedules chosen? The performance of Adam, AMSGrad, and + RMSProp are quite sensitive to their hyperparameters, and the optimal hyperparameters are problem-dependent. + Some of the experiments just use the default hyperparameters; this is insufficient when trying to directly + compare the performance of these methods, as their performance can vary greatly with different values of + these parameters. I'm not convinced that we should be drawing conclusions about the relative + performance of these algorithms from any of the experiments for this reason. + +Of course, meaningful empirical results are not necessarily characterized by statistically outperforming the baselines. +Well designed experiments can highlight important ways in which the performances differ, providing the community +with a deeper understanding of the methods investigated. I would argue that the experiments in the paper do not +achieve this either; the experiments do not provide any new intuition or understanding of the methods, showing +only the relative performances in terms of learning curves on a somewhat random collection of supervised learning problems. Why were these specific problems chosen? What makes these problems ideal for showcasing the performance of AdaX? If AdaX is an improvement over Adam, why? What exactly is happening with it's effective step-sizes that leads +to the better performance? Can you show how their step-sizes differ over time? + +Statements that need citation or revision: + - ""Adaptive optimization algorithms such as RMSProp and Adam... as well as weak performance + compared to the first order gradient methods such as SGD"" (Abstract). This needs a citation. + Similarly, ""AdaX outperforms various tasks of computer vision and natural language processing and can catch + up with SGD""; as above, I'm unaware of work (other than theoretical) that shows that SGD significantly + outperforms Adam in deep neural networks. + - ""In the era of deep learning, SGD ... remains the most effective algorithm in training deep neural + networks"" (Introduction). What are you referring to here? Vanilla SGD? Or are you including Adam etc here? + As above, this should have a citation. Adam's popularity is largely due to its effectiveness in training + deep neural networks. + - ""However, Adam has worse performance (i.e. generalization ability in testing stage) compared with SGD"" + (Introduction). Citation needed. + - In the last paragraph of the Introduction, you introduced AdaX twice: ""To address the above issues, we propose a + new adaptive optimization method, termed AdaX, which guarantees convergence..."", and, ""To address + the above problems, we introduce a novel AdaX algorithm and theoreetically prove that it converges..."" +",1,,ICLR2020 +rJgiuVx-6Q,3,BJxYEsAqY7,BJxYEsAqY7,Potentially lack of true novelty,"I do not necessarily see something wrong with the paper, but I'm not convinced of the significance (or sufficient novelty) of the approach. 
+ +The way I understand it, a translator is added on top of the top layer of the student, which is nothing but a few conv layers that project the output to potentially the size of the teacher (by the way, why do you need both a paraphraser and translator, rather than making the translator always project to the size of the teacher which basically will do the same thing !? ) +And then a distance is minimized between the translated value of the students and the teacher output layer. The distance is somewhat similar to L2 (though the norm is removed from the features -- which probably helps with learning in terms of gradient norm). + +Comparing with normal distillation I'm not sure how significant the improvement is. And technically this is just a distance metric between the output of the student and teacher. Sure it is a more involved distance metric, however it is in the spirit of what the distillation work is all about and I do not see this as being fundamentally different, or at least not different enough for an ICLR paper. + +Some of the choices seem arbitrary to me (e.g. using both translator and paraphraser). Does the translator need to be non-linear? Could it be linear? What is this mapping doing (e.g. when teacher and student have the same size) ? Is it just finding a rotation of the features? Is it doing something fundamentally more interesting? + +Why this particular distance metric between the translated features? Why not just L2? + +In the end I'm not sure the work as is, is ready for ICLR. +",5,3.0,ICLR2019 +SyiRxi7El,1,rJY0-Kcll,rJY0-Kcll,Strong paper but presentation unclear at times,"In light of the authors' responsiveness and the updates to the manuscript -- in particular to clarify the meta-learning task -- I am updating my score to an 8. + +----- + +This manuscript proposes to tackle few-shot learning with neural networks by leveraging meta-learning, a classic idea that has seen a renaissance in the last 12 months. The authors formulate few-shot learning as a sequential meta-learning problem: each ""example"" includes a sequence of batches of ""training"" pairs, followed by a final ""test"" batch. The inputs at each ""step"" include the outputs of a ""base learner"" (e.g., training loss and gradients), as well as the base learner's current state (parameters). The paper applies an LSTM to this meta-learning problem, using the inner memory cells in the *second* layer to directly model the updated parameters of the base learner. In doing this, they note similarities between the respective update rules of LSTM memory cells and gradient descent. Updates to the LSTM meta-learner are computed based on the base learner's prediction loss for the final ""test"" batch. The authors make several simplifying assumptions, such as sharing weights across all second layer cells (analogous to using the same learning rate for all parameters). The paper recreates the Mini-ImageNet data set proposed in Vinyals et al 2016, and shows that the meta-learner LSTM is competitive with the current state-of-the-art (Matchin Networks, Vinyals 2016) on 1- and 5-shot learning. + +Strengths: +- It is intriguing -- and in hindsight, natural -- to cast the few-shot learning problem as a sequential (meta-)learning problem. 
While the authors did not originate the general idea of persisting learning across a series of learning problems, I think it is fair to say that they have advanced the state of the art, though I cannot confidently assert its novelty as I am not deeply familiar with recent work on meta-learning. +- The proposed approach is competitive with and outperforms Vinyals 2016 in 1-shot and 5-shot Mini-ImageNet experiments. +- The base learner in this setting (simple ConvNet classifier) is quite different from the nearest-neighbor-on-top-of-learned-embedding approach used in Vinyals 2016. It is always exciting when state-of-the-art results can be reported using very different approaches, rather than incremental follow-up work. +- As far as I know, the insight about the relationship between the memory cell and gradient descent updates is novel here. It is interesting regardless. +- The paper offers several practical insights about how to design and train an LSTM meta-learner, which should make it easier for others to replicate this work and apply these ideas to new problems. These include proper initialization, weight sharing across coordinates, and the importance of normalizing/rescaling the loss, gradient, and parameter inputs. Some of the insights have been previously described (the importance of simulating test conditions during meta-training; assuming independence between meta-learner and base learner parameters when taking gradients with respect to the meta-learner parameters), but the discussion here is useful nonetheless. + +Weaknesses: +- The writing is at times quite opaque. While it describes very interesting work, I would not call the paper an enjoyable read. It took me multiple passes (as well as consulting related work) to understand the general learning problem. The task description in Section 2 (Page 2) is very abstract and uses notation and language that is not common outside of this sub-area. The paper could benefit from a brief concrete example (based on MNIST is fine), perhaps paired with a diagram illustrating a sequence of few-shot learning tasks. This would definitely make it accessible to a wider audience. +- Following up on that note, the precise nature of the N-class, few-shot learning problem here is unclear to me. Specifically, the Mini-ImageNet data set has 100 labels, of which 64/16/20 are used during meta-training/validation/testing. Does this mean that only 64/100 classes are observed through meta-training? Or does it mean that only 64/100 are observed in each batch, but on average all 100 are observed during meta-training? If it's the former, how many outputs does the softmax layer of the ConvNet base learner have during meta-training? 64 (only those observed in training) or 100 (of which 36 are never observed)? Many other details like these are unclear (see question). +- The plots in Figure 2 are pretty uninformative in and of themselves, and the discussion section offers very little insight around them. + +This is an interesting paper with convincing results. It seems like a fairly clear accept, but the presentation of the ideas and work therein could be improved. I will definitely raise my score if the writing is improved.",8,4.0,ICLR2017 +H1xzzmnIFr,1,HJlY_6VKDr,HJlY_6VKDr,Official Blind Review #2,"This paper proposes a defense against black box adversarial attack. The authors train an ensemble of deep networks, and output a null label when the ensemble disagree. Success of adversarial attack is defined as the ensemble outputs an incorrect label that is not null. 
The authors experimentally show improved robustness to adversarial attack. + +The idea is itself new, but very similar ideas are well known in the literature, and it is difficult to conclude that the proposed approach is superior. Several examples are: + +Defense by majority vote with ensembles has appeared several times in the literature (e.g. Pang et al 2019). The pro is that this paper proposes a novel way to create an ensemble by applying random linear transformations and rescaling to the input. But it is not clear this is superior compared to existing methods. + +Randomized smoothing (Cohen et al, 2019) guarantees smoothness of the classifier (and thus robustness to perturbation attack under certain norms). Note that randomized smoothing provides certified guarantee against a stronger attack model (white box); it also guarantees the size of the margin (or buffer as the authors call it). The intuition of this paper is based on similar ideas so it seems necessary to at least compare with randomized smoothing. + +Outputting a null label is a major workhorse of adversarial defense. For example, previous work use a generative model to detect out of distribution samples; use a calibrated classifier to output null when low confidence. + +Because the idea is only mildly interesting, good experimental support becomes crucial. However, I think there are several short-comings with the experiments: + +The experiment contains only one (fairly old) attack method. Several recent alternatives such as SimBA (Guo et al 2019) can make the experiments more convincing. + +The architecture is no longer the same for the target model (which is an ensemble with an additional random transformations) and surrogate model. It is unclear if the improvement is simply because of the difference in architecture. + +The comparison to baselines seem unfair because it seems that the compared baselines do not have the option of outputting the null label. For example, a simple baseline of randomized smoothing + output the null label if the logit scores are below a threshold can make the story much stronger. + +Minor comments: + +Several suggestions on writing: the introduction contains much technical detail and even experimental results, and these are repeated again in later sections. The experimental section has many minor implementation details that could go into the appendix. +",1,,ICLR2020 +yU4Ykzr57s1,3,jz7tDvX6XYR,jz7tDvX6XYR,Review,"########################################################################## + +Summary: + +The paper proposed a new way for training models that stack the same basic block for multiple times -- share the weights first and then untie the weights. The author tried to provide theoretical insights on why it can be better. Also, ablation study shows that the proposed algorithm has marginal improvement over the baseline. + +########################################################################## + +Reasons for score: + + +The improvement over the baseline method, which just trains BERT for more steps, is quite marginal. In addition, after the weights are untied, the training process becomes exactly the same as that in BERT. It's highly likely that both methods can have similar performance after being trained for a large number of steps. Also, I do not think the theoretical results on a deep linear network can help explain the phenomena we see in BERT training. BERT uses a much more complicated building block that involves layer normalization, attention and non-linearity. 
Due to these reasons, I vote for rejection. + +########################################################################## + +Pros: + + +1. The idea of first train the model with shared weights and then untie the weights is interesting. + + +########################################################################## + +Cons: + + +1. The improvement over the baseline method is not very substantial. Basically, it is possible that both methods perform similarly after being trained for a longer number of steps (e.g., 1M, 1.5M). + +2. I'm not convinced by the theoretical analysis. Building block in BERT is much more complicated that the building block discussed in Section 3. + +3. The criteria of changing from sharing to unsharing is largely heuristic. + + +########################################################################## + +Questions during rebuttal period: + + +Please address and clarify the cons above + + +######################################################################### + +Typos: + +(1) Page 3, ""To be best of"" should be ""To the best of"" +",4,3.0,ICLR2021 +Zgc4Qsv7xng,4,krz7T0xU9Z_,krz7T0xU9Z_,"Very special case of inductive bias, very well presented.","(a) This belongs to the literature of implicit bias/inductive bias, which has +gained a great deal of attention among theoretical enthusiasts with an +optimization leaning. + +(b) The paper is carefully laid out and argued, and is at a nice level +of clarity and precision. + +(c) The mathematical argumentation seems to me correct; +however I haven't checked line-by-line. + +(d) The situation being studied is very very special +and doesn't much correspond to the big kahuna +deep learning. Nevertheless the intellectual clarity +of this special case is quite appealing. + +(e) The implied conclusion seems rather special +as well. From one viewpoint it says that if you start +from the get-go with perfect +separation of a particular strong form, +then the future evolution of the training +can never spoil things. This is a very weak statement, +but I suppose if we can't get results here in a very special case +that we can understand well, then the +general situation is truly hopeless. + +Specific Comments. + +(1) Why is this max-margin if your constraints only consider one class. It seems to be more of a finding a minimum-norm vector aligned with all training data of that class. Its unclear why the concept of separating margins comes in. + +(2) “Theory III: Dynamics and Generalization in Deep Networks” by Banburski et al. also considers general deep relu networks and shows that the resulting margins are max-margin—requiring only separability, not orthogonal separability. In addition, that paper uses traditional DE methods rather than relying on lesser-known extremal sector techniques. Can you discuss or highlight why the simpler example in this paper might lead to insights not found in the other paper. + +(3) While the paper says that it is not directly applicable to deep nets, it draws motivation from the popularity of that literature. In that spirit, to justify such an evocation, can you show at least one experiment on a non-synthetic dataset such as MNIST/CIFAR/etc (perhaps even simplified with hand-engineered preprocessed features and subsetted to two-classes) that would support the potential connection to deep learning? + +(4) Can you provide any evidence why datasets would become orthogonally separated? Is there some feature +engineering procedure that tends to produce orthogonal separation? 
+ +(5) In Figure 1, variance is strange: shows one big outlier, but the plotted projection shows two roughly-equal-magitude directions of variation. + +(6) It is unclear how Definition 2 relates to strict extremal directions as defined by the sign patterns. + +(7) G should be clarified: What G is and what it represents should be explained to make the results more insightful + + + + 16m 50s + +Type a message +",8,3.0,ICLR2021 +SyxpLO9_2m,2,Byx1VnR9K7,Byx1VnR9K7,"Good but generic model, contribution limited "," +This paper proposes a VAE for modelling state-action sequences using a single latent variable rather than one per timestep. The authors empirically demonstrate that this model works on toy 2D examples and a simplified 2D Minecraft-like environment. Although I am unaware of other works that use a VAE in this setting, the model is still quite generic, thus requires further application to justify its significance. This paper is clear and well written. + +The current contribution of this paper is limited, however it could be improved in a number of ways. The main component lacking from this paper is a meaningful comparison to other related works. Its unclear what the advantage of this model is over other models and so a thorough comparison to other sequence models would really help this paper. As mentioned in the conclusion, another direction for this work would be to bootstrap reinforcement learning. If this bootrapping was demonstrated then it would make this paper’s contribution stronger. Finally, another important direction for improvement for this paper would be to demonstrate its usefulness on more complex environments, instead of only 2D examples. + +Pros: +- clear and well written +- model works on toy examples +Cons: +- lack of baseline comparisons +- lack of contributions + + +",4,4.0,ICLR2019 +SyeFu0fRqr,3,rklJ2CEYPH,rklJ2CEYPH,Official Blind Review #2,"The paper proposes a new intensity-free model for temporal point processes based on continuous normalizing flows and VAEs. Intensity-free methods are an interesting alternative to standard approaches for TPPs and fit well into ICLR. + +The paper is written well and is mostly good to follow (although it would be good to integrate Appendix A.1 into the main text). The paper proposes interesting ideas to learn non-parametric distributions over event sequences using CNFs and the initial experimental results are indeed promising. However, I found the presentation of the new framework and the associated contributions somewhat insufficient. + +The proposed approach seems to consist mostly of applications of existing techniques and of only few technical contributions. There is also no real theoretical analysis of the advantages of the new approach beyond general statements. In addition, the experimental analysis is missing comparisons to +- other intensity-free methods (e.g., [1, 2]) +- other NeuralODE based methods (e.g, [3, 4]) +and would also benefit from a closer analysis of the models advantages and/or additional tasks. While each of these points on its own would not be very severe, I found that the combination of all of them is problematic in the current version of the paper. I hope that the authors can address this in their response or future revision. + +Further comments: +The results on Breakfast of the competing methods seem quite lower than the results published in (Mehrasa 2019). What is the cause for the differences here? 
For instance, APP-VAE in (Mehrasa 2019) would outperform the results of PPF-P both in terms of LL and MAE (142.7 vs 204.9)?

[1] Xiao et al: Wasserstein Learning of deep generative point process models, 2017.
[2] Xiao et al: Learning conditional generative models of temporal point processes, 2018.
[3] Chen et al: Neural Ordinary Differential Equations.
[4] Jia et al: Neural Jump Stochastic Differential Equations.",3,,ICLR2020
+SJlLFPEjFr,1,rJxtgJBKDr,rJxtgJBKDr,Official Blind Review #2,"This paper attempts to tackle the transfer learning and lifelong learning problems by subscribing to knowledge via channel pooling. The channel pooling actually selects a subset of the feature map such that the prediction accuracy of the delta model is maximized. Experiments show the effectiveness of the proposed method.

Pros:
Overall, this paper is well written and easy to follow. The technique is sound and the problem studied in this paper is significant.

Cons:
1. I do not think that the model proposed in this paper is able to tackle the lifelong learning problem. The main reason is that lifelong learning basically requires only one model that will continue to learn from new tasks. After learning several new tasks, people hope this model can still perform well on the previous tasks as well as the current ones. However, in this paper, not just a single model is learned; instead, new models appear when new tasks are given, which does not meet the definition or requirement of lifelong learning. It only meets the requirement of transfer learning. The experimental results also validate my opinion, since only one new task is given, while in lifelong learning new tasks will keep coming and the original model should perform well on all of them as well as on the old tasks.
2. In Figure 4, the legend in the first picture will confuse the readers. I suggest the authors put it outside all the figures. Besides, the proposed method in the last picture is not the best. What do the authors want to convey by this picture?

",3,,ICLR2020
+Hyg3Zr60FS,3,r1gfQgSFDr,r1gfQgSFDr,Official Blind Review #2,"This paper puts forth adversarial architectures for TTS. Currently, there aren't many examples (e.g. Donahue et al., Engel et al., referenced in the paper) of GANs being used successfully in TTS, so papers in this area are significant.

The architectures proposed are convolutional (in the manner of Yu and Koltun), with increasing receptive field sizes taking into account the long-term dependency structure inherent in speech signals. The inputs to the generator are linguistic and pitch signals, extracted externally, and noise. In that sense, we are working with a conditional GAN.

I found the discriminator design very interesting. As the comment below notes, it is a sort of patch GAN discriminator (See pix2pix, and this comment from Philip Isola - https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix/issues/39) and that could be quite significant in that it classifies at different scales. In the image world, having a single discriminator for the whole model would not take into account the local structure of the images. Likewise, perhaps we can imagine something similar in the case of audio at varying scales - in fact, audio dependencies are even more long range. That might be one reason why the variable window sizes work here.

The paper also presents analogues to the image metrics FID and KID, with the features being taken from DeepSpeech2.
+ +I found the speech sample presented very convincing. In general, the architectures are also presented quite clearly, so it seems that we might be able to reproduce these experiments in our own practice. It is also promising that producing good speech could be achieved by a non-autoregressive or attention based architecture. + +The authors mention that they hardly encounter any issues with training stability and mode collapse. Is that because of the design of the multiple discriminator architecture? +",8,,ICLR2020 +SJeRLFcth7,2,Bkl87h09FX,Bkl87h09FX,This paper presents an extremely comprehensive comparison of sentence representation methods.,"Only a handful of NLP tasks have an ample amount of labeled data to get state-of-the-art results without using any form of transfer learning. Training sentence representation in an unsupervised manner is hence crucial for real-world NLP applications. +Contextualized word representations have gained a lot of interest in recent years and the NLP and ML community could benefit from such detailed comparison of such methods. + +This paper's biggest strength is the experimental setting. The authors cover a lot of ground in comparing a lot of the recent work, both qualitatively and quantitatively -- there are a lot of experiments. +I do understand the computational limitations of the authors (as they mention on HYPERPARAMETER TUINING) and I do agree with their statement “ The choice not to tune limits our ability to diagnose the causes of poor performance when it occurs”. +Extensive hyper-parameter tuning can make a substantial different when dealing with NN models, maybe the authors should have considered dropping some of the tasks (the article has more than enough IMHO) and focus on a smaller sub set of tasks with proper hyper-parameter tuning. +Table 2 is very interesting, the results suggesting that we are indeed very far from fully robust sentence representation method. +",7,4.0,ICLR2019 +KAiTv-9pk4,1,rsf1z-JSj87,rsf1z-JSj87,Review of End-to-end adversarial text-to-speech," +## Summary + +The authors propose EATS, a method for TTS from unaligned audio and text data, directly to the waveform. Previous work either use aligned phonetic features, or output spectrograms that are later converted to a waveform by a deep vocoder model. + +In order to achieve this, the authors had to use several tricks, some already existing, for instance taken from the GAN TTS architecture, and some novel. The three key novelties are: +- differentiable monotonic attention with gaussian kernel and length prediction. +- dynamic time warping for the spectrogram loss. +- using both spectrogram and waveform domain discriminators + +The authors provide a comprehensive ablation study with MOS score, although their model is under the state of the art by a significant margin. + +## Review + +This paper builds on GAN TTS, and tries to make it trainable end-to-end without aligned features. + +The two main contributions, namely dynamic time warping and monotonic attention with gaussian kernel are both elegant, and can likely be used for many other applications related to time series with heterogeneous time scales. In particular, the time warping loss allows to accomodate both for the natural irregularities in spoken speech, as well as providing sufficient signal for the monotonic attention to work. + +The rest of the architecture is very similar to GAN TTS except for the spectrogram domain discriminator that was added. 
+ +While the model is under the state of the art for TTS, the samples are already quite convincing. The authors conduct a thorough ablation study, both with MOS and audio samples. + +Overall I think this is a really good paper, that is likely to prove quite useful for the development of end to end speech synthesis solution. As I already mentioned, I also believe that the approach of using dynamic time warping and monotonic attention can be used for other kind of time series. + + +## Remarks and questions to the authors + +- Table 1, MOS for Tacotron 2 would be very informative. All the baselines are trained on aligned data while Tacotron is a legitimate contender for EATS as it can be trained on the same data. The point of the authors is that their methods is simpler because the training is in one stage. However, given the large number of losses and components in their model, with their respective hyper-parameters to tune, I'm not entirely sold on the simplicity argument. The tacotron 2 paper reports a MOS of 4.5 but on a private dataset. +- Section 3, [1] used the same simple L1 + log spectrogram loss as used here. +- I was surprised by the bad performance of the transformer attention, in particular in the audio samples, the output for this model is garbage towards the end of the signal. Any clue on why this would happen? +- It would be interesting to have a benchmark, in particular, can the model generate speech in real time on GPU and on CPU? + +[1] SING: Symbol-to-Instrument Neural Generator, Defossez et al. Neurips 2018.",8,4.0,ICLR2021 +rylsLCoBnX,1,ByGuynAct7,ByGuynAct7,Deep weight prior,"This paper considers learning informative priors for convolutional neural network models based on fits to data sets from similar problem domains. For trained networks on related datasets the authors use autoencoders to obtain an expressive prior on the filter weights, with independence assumed between different layers. The resulting prior is generative and its density has no closed form expression, and a novel variational method for dealing with this is described. Some empirical comparisons of the deep weight prior with alternative priors is considered, as well as a comparison of deep weight samples for initialization with alternative initialization schemes. + +This is an interesting paper. It is mostly clearly written, but there is a lack of detail in Section 4 that makes it hard for me, at least, to understand exactly what was done there. I think the originality level of the paper is high. The issue of informative priors in these complex models seems wide open and the authors provide an interesting approach both conceptually and computationally. I did wonder whether there was any link between the suggested priors and the idea of modelling the current and related data sets used in constructing the prior jointly, with data set specific parameters given an exchangeable prior? This would be a standard hierarchical modelling approach. Such an approach would not be computationally attractive, I just wondered if there is some conceptual link with the current method being an approximation of that approach in some sense. In Section 4.1, it seems that for the trained networks on the source datasets, point estimates of the filter weights are treated as data for learning the variational autoencoder - is that correct? Could you model dataset heterogeneity here as well? Presumably the p_l(z) density is N(0,I)? Details of the inference and reconstruction networks are sketchy. 
In Section 4.2, you say that the number of filters is proportional to the scale parameter k and that you vary k. What scale parameter do you mean? + + +",7,3.0,ICLR2019 +H1lWSY0fKH,1,SygcCnNKwr,SygcCnNKwr,Official Blind Review #2,"Summary + +Strength: +1. This topic studied in this paper is interesting and is helping to promote the following developing of algorithms with compositional generalization ability. +2. The experimental results show the effectiveness of the proposed metric for measuring compositional diversity. + + +comments: +1. How to control the trade-off between the atom and compositional divergence? It's interesting to show how different +compositional divergence can affect the performance of different models. +2. Many previous works are proposed for improving the generalization ability of the seq2seq models[1]. More experiments need to +be conducted using these previous methods. + +[1]Compositional generalization in a deep seq2seq model by separating syntax and semantics +",6,,ICLR2020 +P3360aJi8gQ,3,Cn706AbJaKW,Cn706AbJaKW,This paper focuses on understanding and analyzing the reviewing process for a large conference such as ICLR and understand the reproducibility of the review process.,"This paper focuses on understanding and analyzing the reviewing process for a large conference such as ICLR and understand the reproducibility of the review process through Monte Carlo simulations. Further, the authors also aim to study the impact of factors such as institutional bias, gender and study if higher review scores ensure more number of citations. + +Comments: +1. How were the gender labels produced? Was it done through a manual process? With regard to gender bias analysis, one missing type of analysis in terms of reviewers' scores is to take into consideration the arxiv/ resubmission effect. This would provide better insights. +2. The timing of this study is important and some of the findings in this paper raise concerns about the overall review process. Also, a big thanks to the authors of this work for providing the entire codebase to reproduce their results +3. ""As more reviewers are added, the high level of disagreement among area chairs remains constant, while the standard error in mean scores falls slowly"" --> How is the high level of disagreement quantified? or is that an assumption? And if there is high-level of disagreement, how does the logistic regression model take that into consideration? +4. Figure 4 is really interesting in terms of showing the impact of making submissions available earlier. This raises concerns about how making papers available online is biasing the reviewers in terms of the scores provided. +5. I have concerns over the way interrater reliability is calculated in section 4. The reviewers are randomized and this may not lead to the right number of samples per review and may affect the interrater reliability. Also, the assumptions mentioned by the authors seems to be wrong for this process. + + Suggestions: +1. ""then"" should be corrected to ""the"" in the last line of the first paragraph of section 2. +2. Can the mention of NIPS conference be changed to Neurips conference even though the change only happened in 2018? + +",6,3.0,ICLR2021 +Hkg2dz7Kh7,3,rkgoyn09KQ,rkgoyn09KQ,Method is not novel but results seem to be solid," +Cons: +The proposed method is not novel. For example, Lauly et al., 2017 have proposed a similar way of combining LM and DocNADE. 
This paper does not provide some motivations or theories behind such artificial combination (i.e., just linearly combine their hidden state) to explain why it works better than other alternatives (e.g., what about adding some linear layers before combining h_i^{DN} and h_i^{LM}). + +Pros: +However, the results seem to be solid and significantly better than the previous state-of-the-art methods. I think some recent neural topic models such as [1,2,3] are still missing even though there are already many tables in the paper (I am not an expert on neural topic modeling or embedding for IR tasks, so there might be others missing state of the arts which I am not aware of). In addition, why does Table 5 only compares perplexity between 3 methods and Table 6 only compares coherence between 4 or 5 methods, while there are 9 or 12 methods are compared in IR task (Table 3 and 4). What's the difficulty of comparing the coherence and perplexity of all different topic models (including [1,2,3])? +I will vote for acceptance if the mentioned baselines are also compared or there are good reasons why they cannot be compared. + + +Writing and presentation: +The quality of writing should be improved. Here are several examples. +1. In the abstract, the following sentence needs to be rewritten and the rule of capitalization should be consistent. ""(2) Limited Context and/or Smaller training corpus of documents: In settings with a small number of word occurrences (i.e., lack of context) in short text or data sparsity in a corpus of few documents, the application of TMs is challenging."" +2. I do not understand what's the purpose of the right figure in Figure 1. I think the paper does not do any matching like that. +3. In the 3rd paragraph of the introduction, ""topmost"" -> top most +4. The paper should have a related work section. In addition to the related work discussion scattered in the introduction, authors should discuss the difference between this work and Lauly et al., 2017. Authors should also include some related work such as [1,2,3]. +5. Just below (1), ""where,"" -> , where +6. In the last sentence of the paragraph after (1), you mentioned ""v_{En and En>F), and the results indicate that there is minor quality loss (measured by BLEU) when using vMF. One huge limitation of the approach is the lack of a beam search-like algorithm; as such, the model is compared to greedy softmax+CE decoders (I would like to see numbers with a standard beam search model as well just to emphasize the quality drop from the state-of-the-art systems). With that said, I found this approach quite exciting and it has potential to be further improved, so I'm a weak accept. + +comments: +- is convergence time the right thing to measure when you're comparing the two different types of models? i'd like to see something like flops as in the transformer paper. +- relatedly, it's great that you can use a bigger batch size! this could be very important especially for non-MT tasks that require producing longer output sequences (e.g., summarization). +- it looks like the choice of pretrained embedding makes a very significant difference in BLEU. i wonder if contextualized embeddings such as ELMo or CoVE could be somehow incorporated into this framework, since they generally outperform static word embeddings. ",6,4.0,ICLR2019 +plCcrnD9rW3,1,bIQF55zCpWf,bIQF55zCpWf,"This paper proposes an effective Pani to regularize the networks. The proposed Pani can be applied to input or feature space for better performance. 
Based on the Pani, it also proposes Pani VAT, Pani MixUP and Pani MixMatch for the generalization performance improvement. The provided results also show the effect of the proposed method.","The proposed Pani seems to be novel. It can explore the information of the neighboring relationship between samples and can be regarded as the meta-regularization. For the general formulation of Pani, how does the number of the nearest neighbor patch graphs affect the results? + +As I am not familiar with this topic of the paper, there are lots of regularization methods, it would be better to add more details about the regularization methods that do not neighboring relationship among samples. +",6,1.0,ICLR2021 +r1gTx_WOcH,3,Syxp-1HtvB,Syxp-1HtvB,Official Blind Review #3,"The paper proposes an approach to analyze the latent space learned by recent GAN approaches into semantically meaningful directions of variation, thus allowing for interpretable manipulation of latent space vectors and subsequent generated images. The approach is based on using pre-trained classifiers for semantic attributes of the images at a variety of levels, including indoor room layout, objects present, illumination (indoor lightining, outdoor lighting), etc. By forming a decision boundary in the latent space for each of these classifiers, the latent code is then manipulated along the boundary normal direction, and re-scored by the classifiers to determine the extent to which the boundary is coupled to the semantic attribute. + +By taking advantage of the structured composition of the latent space into per-layer contributions in the StyleGAN approach, experiments are performed to show that different levels of semantics are captured at different layers: layout being localized in lower layers, object categories in middle layers, followed by other scene attribute, and lastly the color scheme of the image in the highest layers. A user study shows that human judgments of the coupling between layers and semantic attribute being manipulated are consistent with this observation. A set of qualitative experiments demonstrate manipulation along several axes. Another set of experiments demonstrate that the importance of different semantic attribute dimensions for different scene categories varies in an interpretable way, and also that certain attribute dimensions influence each other strongly (e.g. ""indoor lighting"" and ""natural lighting""), whereas other ones are decoupled (e.g. ""layout"" and other dimensions). + +I am somewhat positive with respect to acceptance of the paper. On the one hand, the key idea is simple, and has been demonstrated compellingly with a broad set of experiments. On the other hand, the insight gained is fairly superficial, boiling down to the statement that the learned latent code has structure that corresponds to semantically meaningful axes of variation, and that such structure is localized to particular levels of the layer hierarchy for particular semantic axes. + +There are a few small issues with the clarity of the paper that would be good to fix: +- Fig 3a: the interpretation of the vertical axis here was not clearly described in the caption or the main text +- Fig 4 caption: typo ""while lindoor"" -> ""while indoor"" +- Fig 5: the construction of the pixel area flow visualization is not explained in the caption, and needs a bit more clarity in the main text (e.g., how are multiple instances of the same class handled?) 
+- Fig 6: the caption could use a bit more explanation for making these plots interpretable: e.g. say what value the vertical axis is reporting +- Fig 8: same issue as above +- p8 typo: ""that contacts the latent vector"" -> ""that concatenates the latent vector"" +",6,,ICLR2020 +rkli05svnm,1,r1znKiAcY7,r1znKiAcY7,Idea is reasonable; work is preliminary,"Edited: I raised the score by 1 point after the authors revised the paper significantly. + +-------------------------------------------- + +This paper proposes a regularization approach for improving GCN when the training examples are very few. The regularization is the reconstruction loss of the node features under an autoencoder. The encoder is the usual GCN whereas the decoder is a transpose version of it. + +The approach is reasonable because the unsupervised loss restrains GCN from being overfitted with very few unknown labels. However, this paper appears to be rushed in the last minute and more work is needed before it reaches an acceptable level. + +1. Theorem 1 is dubious and the proof is not mathematical. The result is derived based on the ignorance of the nonlinearities of the network. The authors hide the assumption of linearity in the proof rather than stating it in the theorem. Moreover, the justification of why activation functions can be ignored is handwavy and not mathematical. + +2. In Section 2.2 the authors write ""... framework is shown in Figure X"" without even showing the figure. + +3. The current experimental results may be strengthened, based on Figures 1 and 2, through showing the accuracy distribution of GAT as well and thoroughly discussing the results. + +4. There are numerous grammatical errors throughout the paper. Casual reading catches these typos: ""vertices which satisfies"", ""makes W be affected"", ""the some strong baseline methods"", ""a set research papers"", and ""in align with"". The authors are suggested to do a thorough proofreading. + +",5,4.0,ICLR2019 +H1eu0MZ2tS,3,Sye2s2VtDr,Sye2s2VtDr,Official Blind Review #3,"In the paper, the authors proposed CrossGO, an algorithm for finding crossing features useful for prediction. +In CrossGO, one trains a neural network that captures feature crossing implicitly. +Then, possible crossing features are estimated using the gradient-based saliency. +The idea here is that, if a feature has a crossing with some other features, its contribution in the saliency can vary across different inputs. +Thus, by looking at the variation of the saliency, one can find candidates features for feature crossing. +CrossGO greedily selects candidate crossings based on the idea above. +In the last step, a simple logistic regression is trained using the candidate crossings, and the effective crossings are selected using a forward greedy feature selection. + +I found the paper well-written and the idea is easy to follow. +My concern, however, is the lack of Factorization Machines (FM) in the experiments. +In Introduction, the authors mention to the deep version of FM and stated ""(deep FMs are) not able to generate interpretable cross features"". +But, as the authors are aware of, non-deep FMs are able to handle feature crossings in a interpretable way. +Thus, it would be essential to adopt non-deep FMs as the baseline in the experiments. +Because the important baseline is missing, I found the results are not convincing enough to claim the effectiveness of the proposed method. 
+ + +### Updated after author response ### +The authors have partially addressed my concern by adding FM/HOFMs as the experiment baselines, which I greatly appreciate. +However, I found the current paper misses some other possible baselines for high-order interaction models [Ref1,2]. +As the authors mentioned in the response, FMs find the feature crossing as a kind of embedded representations, which may not be suitable for modeling sparse interactions. +Thus, the sparse interaction models need to be taken into consideration as well. + +[Ref1] Safe Feature Pruning for Sparse High-Order Interaction Models +[Ref2] Selective Inference for Sparse High-Order Interaction Models",3,,ICLR2020 +rkggOp9Y37,2,S1gWz2CcKX,S1gWz2CcKX,"Official Review: a multiplayer environment. Lacks comparison with related settings, many arbitrary choices, needs rewriting.","The paper presents a new evaluation platform based on massive multiplayer games, allowing for a huge number of neural agents in persistent environments. +The justification evolves from MMO as a source of complex behaviours arguing that these settings have some characteristics of life on earth, being a “competitive game of life”. However, there are many combinations with completely different insights and implications. The key characteristics for the setting in this paper seem to be: +1. Cognitive evolution with learning, rather than physical or just genetic evolution (all bodies and architectures are equal) +2. Changing environments (tasks), between parameter updates +3. Survival-oriented rewards +And for some experiments some agents share policy parameters to simulate “species”. +From the introduction and the rest of the paper, it’s not clear whether the same platform can be used with agents that are not neural, or even agents that are hardcoded (for the sake of diversity or to analyse specific behaviours). This is an important issue, as other platforms allow for the definition of some baseline agents, including random agents, agents with simple policies, etc. +The background and related work section covers MMO and artificial life, but has some important omissions, especially those ideas in the recent literature that are closest to this proposal. +First, why can’t Yang et al., 2018 be extended with further tasks? +Second, conceptually, the whole setting is very similar to the Darwin-Wallace setting proposed in Orello et al. 2011: +@inproceedings{hernandez2011more, + title={On more realistic environment distributions for defining, evaluating and developing intelligence}, + author={Hern{\'a}ndez-Orallo, Jos{\'e} and Dowe, David L and Espa{\~n}a-Cubillo, Sergio and Hern{\'a}ndez-Lloreda, M Victoria and Insa-Cabrera, Javier}, + booktitle={International Conference on Artificial General Intelligence}, + pages={82--91}, + year={2011}, + organization={Springer} +} + +The three characteristics mentioned before are the key elements of this evaluation setting, which changes environments between generations. Also, the setting is presented in the context of evaluation and experimentation, as this manuscript. + +Third, regarding multi-agent evaluation setting, Marlo over Minecraft (Malmo) is covering this niche as well. + +https://marlo-ai.github.io/ + +Although it is episodic and the number of agents is limited, this should be compared too. + +Nevertheless, the authors should make a more convincing argument about why we need *massively* multiplayer settings. 
Why is it the case that some behaviours and skills appear with thousands of agents but cannot appear with dozens of examples? In evolutionary game theory, for instance, some complex situations emerge from very few agents.
Finally, the use of agents that have to survive with “health, food and water” as an experimental setting can also be found in Strannegård et al. 2018.
https://www.degruyter.com/downloadpdf/j/jagi.2018.9.issue-1/jagi-2018-0002/jagi-2018-0002.pdf
Figures are not very helpful. Especially the captions do not really explain what we see in the figures. For instance, Figure 2 doesn’t show much. Figure 3 left and middle show some weird dots and patterns, but they are not explained. Also, the one on the right tries to show “ghosting”, but colours and their meaning are not explained. Similarly, it is not clear what the agents see and process. I assume it is a local grid like the one seen in Figure 4. But this is quite an aerial view, and other grid options might do the job as well.
Similarly, some actions are mentioned (it seems that N, S, E, W and “Pass”?, plus some attack options, but they are not described). In the end, I understand many choices have to be made for any evaluation setting, but many choices are very arbitrary (end of section 3 and especially experiments) and there is a lot of tuning, so it’s unclear whether some of the observations happen just in a particular combination of choices or are more general. The authors end up with many inconclusive observations and doubts (“perhaps”) about small changes, at the end of section 5.
Other things such as the “spawn cap” and the “server merge” are poorly explained, lacking clear definitions and proper justification of their role. Similarly, I’m not sure about how reproduction takes place or not, and if so, whether weights are inherited or reinitialised. Something related is said about species.
I found the statement about multiagent competition being a curriculum magnifier, not a curriculum itself, very interesting, but is this really shown in the paper or elsewhere?
In general, I miss many details and justifications for the whole architecture and mechanism of this neural MMO.
Pros:
- Designed to be scalable
- Goes in the right direction of benchmarks that can capture generally variable (social) behaviour.
Cons:
- Poor comparison with existing platforms and similar ideas.
- Too many arbitrary decisions for the setting and the experiments to make it work or show complex behaviours.
- The paper needs extensive rewriting, clarifying many details, with the figures really helping the understanding.
Typos and minor things:
- “Susan Zhang 2018” is named a couple of times, but the reference is missing. Also, it is quite unusual to use the given name for this researcher while this is not done for any other of the references.
- “as show in Figure 2” -> shown
- “impassible” -> “impassable”
****************************
I've read the new comments from the authors and the new version of the paper. I think that the paper has improved significantly in terms of presentation and coverage of related work. I still see that the contribution is somewhat limited, but I'm updating the score to better account for this new version of the paper.",5,2.0,ICLR2019
+rklAz2j92m,2,HyxhusA9Fm,HyxhusA9Fm,A challenging new task and dataset,"The paper introduces a new task called ""Talk the Walk"", where a tourist and a guide have to communicate in natural language to reach a common goal. It also introduces strong baselines for the task.
The descriptions are thorough and clear. My only worry is that the task is too hard and has too many complexities to be a stand alone task. Future work will probably focus on sub-parts of the task.",7,4.0,ICLR2019 +B1xA7Po2tS,2,BJlahxHYDS,BJlahxHYDS,Official Blind Review #2,"Overview: +This paper introduces a new method for uncertainty estimation which utilizes randomly initialized networks. Essentially, instead of training a single predictor that outputs means and uncertainty estimates together, authors propose to have two separate models: one that outputs means, and one that outputs uncertainties. The later one consists of two networks: a randomly initialized “prior” which is fixed and is not trained, and a “predictor”, which is then trained to predict the output of the randomly initialized “prior” applied to the training samples. +Authors show that under some reasonable assumptions the resulting estimates are conservative and concentrated (i.e. bounded and converge to zero with more data). + +Writing quality: +Overall, the paper is relatively well-written, although it might be at times hard to follow, especially for someone who is not familiar with the original work that used randomized prior functions (Burda’18, Osband ‘18, ‘19). + +Evaluation: +The method is experimentally evaluated on a task of out-of-distribution detection on CIFAR+SVHN, and seems to perform on-par or better than the baselines (including “standard” deep ensembles and dropout networks). In addition, there are experiments that demonstrate that the model is performing relatively well in terms of calibration (whether the model predictive behaviour makes sense as the model confidence changes). + +Decision: +I find the core idea behind the paper quite interesting, however, as indicated by authors themselves, it has already been studied in a slightly different context (RL, works by Burda et. al, Osband et. al). That said, authors do provide additional insides for the supervised settings, and also analyse theoretically the behaviour of uncertainty estimates. +Overall, I cannot say I am fully convinced that the paper should be accepted as is (also see questions below), but generally I am positive about this work, and hence the final score: “weak accept”. + +Additional comments / questions: +(somewhat minor) p1: “While deep ensembles …, where the individual ensembles are trained on different data“ - here and related text, it should probably be “individual models” / “individual networks”. Generally, I am not convinced that these are strong arguments against deep ensembles. + +(minor) p2-p3: “2. Preliminaries” - I am not sure if this section adds much to the understanding, it would seem more natural to spend more time explaining the intuitions behind the net + +(kind of major) p3. “prior” - The explanation of why using a randomly initialized network makes sense is not very strict. I kind of get the general idea, but it is not clear to me why not use something less expensive, e.g. just random projections, and why do we actually need a full network. Intuitively it seems quite strange to waste a lot of capacity to fit to essentially fit a set of random weights: is it something that allows the network to avoid easily learning the “random prior”? And, more generally, can this also be considered as a “trick” to de-correlate individual predictors? I believe these points should be discussed in more detail. + + +I would like to thank authors for verbose response and the revised version: it is a bit more clear. 
+I stand by my original rating. + + + + + + + +",6,,ICLR2020 +rkEX3x_Nx,3,rywUcQogx,rywUcQogx,Unclear about the contribution ,"It is not clear to me at all what this paper is contributing. Deep CCA (Andrew et al, 2013) already gives the gradient derivation of the correlation objective with respect to the network outputs which are then back-propagated to update the network weights. Again, the paper gives the gradient of the correlation (i.e. the CCA objective) w.r.t. the network outputs, so it is confusing to me when authors say that their differentiable version enables them to back-propagate directly through the computation of CCA. +",3,4.0,ICLR2017 +rke4w8yIKr,1,rklHqRVKvH,rklHqRVKvH,Official Blind Review #3,"Summary: + This paper develops a method for taking advantage of structure in the value function to facilitate faster planning and learning. The key insight is that MDPs with low rank Q^* matrices can be solved more expediently using matrix estimation methods, both for classical dynamic programming methods (value iteration) and for learning in rich environments using recent model-free deep RL techniques. Thorough empirical analysis is conducted both for value iteration in tabular MDPs and for deep RL in rich environments. These experiments highlight new findings about the role the rank of the Q matrix plays in planning convergence and learning rates. + + I view this paper as containing several key contributions: first, the analysis on the role of Q rank in planning and learning---experiments conducted indicate that even complicated environments tend to have low rank Q matrices (when approximated). Highlighting the role of this rank and the corresponding empirical analysis estimating it in benchmark RL and control tasks is, to my knowledge, novel. Second, and perhaps the most significant contribution, is ""Structured Value RL"" (SV-RL), an easy-to-apply method that can be incorporated into many Q-based deep RL methods with little overhead. The empirical results are compelling: across three different variations of DQN-like architectures, the SV RL augmentation tends to improve learning. Presentation of results is rigorous, too, and provide strong evidence that the method works. + + As the paper mentions, theoretical analysis on the impact of Q-rank on dynamic programming (and perhaps learning) would be of great interest to the community. I take this analysis to be out of scope for this paper, but could see the work motivating future investigation into these questions. + +Verdict: Overall, I take this paper to present many novel insights, establish solid motivation with good writing and examples, and offers compelling evidence about the strength of SV RL. I recommend accepting the paper. + +Comments: + C1: The visuals throughout the paper are helpful! + C2: The paper is well written: the use of examples was effective in developing the motivation. + C3: Section 2 is helpful for understanding the ideas developed in the paper. However, there are many well developed planning frameworks for MDPs that trade-off optimality with computational efficiency. It might be worth discussing some of these methods up front. For instance, Bounded Real-Time Dynamic Programming (McMahan et al. 2005) explicitly uses value function structure to improve planning speed, with performance guarantees, as does Focused RTDP (Smith and Simmons 2006). 
I don't take the computational complexity improvements of the proposed method to be the primary contribution, so just a brief discussion to contextualize the work against other planning literature would be helpful. + C4: While the ""rank"" studied here is of a different form, some discussion of the Bellman Rank work (Jiang et al. 2017) might be useful for differentiating the two notions of ""rank"" at play, and how they are each used to expedite learning. The Bellman Rank is used as a measure of complexity of an MDP---Jiang et al. develop an RL algorithm that has sample complexity that depends on this measure. It is not strictly necessary, but I could see multiple uses of ""rank"" appearing in the RL literature as a means of exploiting structure for faster learning being confusing. If space (perhaps in the appendix if not), a sentence or two differentiating the two ideas might be helpful to readers. Additionally, the study of sparsity in value function representation was studied by Calandriello et al. 2014. If space permits, the paper might benefit from some discussion of the relation to this work. + +Questions: + Q1: In the inverted pendulum results, I am curious about the effect of the discretization on plan quality. Specifically: how were 2500 states and 1000 actions chosen? Were different orders of magnitude (for both values) considered? How did this impact SVP? Does the rank change as the discretization becomes more or less coarse? I don't think this is strictly critical for the paper, but a few sentences clarifying this point would be informative. + + Q2: Figure 4 provides nice insights into how to scale these ideas to deep RL. How were the four games chosen? Is there anything special that motivated their selection? + + Q3: Additionally, I am curious about whether the results from Figure 4 are the consequence of algorithmic decisions, rather than the environment. Is it possible to determine whether different value based methods (or different choices of hyperparameters) lead to different outcomes? For instance, I could imagine a more shallow network, or a tighter bottleneck, leading to Q evaluations that produce higher rank. + + +Typos and Writing Suggestions: + +[Abstract] + - This sentence is quite long, and I had a hard time following it as a result: ""As our key contribution, by leveraging..."". Consider dividing into two sentences. + +[Intro] + - Oxford comma: ""control, planning and reinforcement learning""::""control, planning, and reinforcement learning + - ""the structured dynamic""::""the structure in the dynamics"" + - ""where much fewer samples""::""where fewer samples"" + - Consider rewording: ""almost the same policy as the optimal one"". Is it that the policies are in fact the same? Or that their values are close? Perhaps: ""a policy with near optimal value"". + - When introducing Double DQN and Dueling DQN for the first time it would be appropriate to cite each (end of Section 1). + +[Sec. 2: Warm Up] + - ""understand the structures""::""understand the structure"" + - ""give a strong evidence for""::""provide evidence that"" + - ""exploit the structures for""::""exploit structure in the value function for"" + - I think the italicized statement at the top of page 3 could be sharpened. The antecedent currently stating ""why not"" is quite a soft statement compared to the motivation the section develops. 
Consider changing: ""...why not enforcing such a structure throughout the iterations?""::""...then enforcing such a structure throughout planning can improve the rate of convergence"". + +[Sec. 3: Structured ... Planning] + - ""even non-convex optimization approaches (...""::""even non-convex optimization approaches to solving this problem (..."" + - ""offer a sounding foundation for future""::""offer a sound foundation for future"" + +[Sec. 4: Structured ... RL] + - ""Previously, we start by""::""Previously, we started by"" + - ""which in deep scenarios""::""which in scenarios with large state spaces"", or perhaps: ""which in deep scenarios""::""which in scenarios where value function approximation is used"" + + +References: + +Calandriello, Daniele, Alessandro Lazaric, and Marcello Restelli. ""Sparse multi-task reinforcement learning."" Advances in Neural Information Processing Systems. 2014. + +Jiang, Nan, et al. ""Contextual decision processes with low Bellman rank are PAC-learnable."" Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017. + +McMahan, H. Brendan, Maxim Likhachev, and Geoffrey J. Gordon. ""Bounded real-time dynamic programming: RTDP with monotone upper bounds and performance guarantees."" Proceedings of the 22nd international conference on Machine learning. ACM, 2005. + + +Smith, Trey, and Reid Simmons. ""Focused real-time dynamic programming for MDPs: Squeezing more out of a heuristic."" AAAI. 2006. +",8,,ICLR2020 +BJxST0itnm,2,r1gVqsA9tQ,r1gVqsA9tQ,"interesting idea, insufficiently fleshed out","The authors propose `ChainGAN, a GAN architecture where the generator is supplemented with a series of editors that iteratively improve image quality. In practice, the algorithm also uses multiple critics (discriminators), although this is not explained until the Experiments section. + +The paper contains the germ of a powerful idea. However, it feels as if the authors haven't yet come to grip with their own idea and architecture. Currently, the role of the editors feels underspecified: it is unclear (and unexplored?) what architectures make for good editors; exactly how editors should interact with the various losses; and what the role of the critics (ideas are proposed in related work) should be. In the experiments, the editors sharpen image quality, but the tradeoffs are not explored. Are more editors always better? When does it saturate? Why? Adding a few editors and critics makes the architecture more parameter-efficient, but increases the number of losses. What happens to wall-clock training time? Moreover, the paper is conflicted about the role of the critic(s). Is the core idea to have multiple generators, discriminators, or both? What is moving the needle? + + +",4,4.0,ICLR2019 +H1gshz42KB,1,B1xoserKPH,B1xoserKPH,Official Blind Review #1,"Summary: This paper looks at privacy concerns regarding data for a specific model before and after a single update. It discusses the privacy concerns thoroughly and look at language modeling as a representative task. They find that there are plenty of cases namely when the composition of the sequences involve low frequency words, that a lot of information leak occurs. + +Positives: The ideas and style of research is nice. This is an important problem and I think this paper does a good job investigating this in the context of language modeling. I do hope the community (and I think the community is) moving towards being aware of these sorts of privacy issues. 
+ +Concerns: I don't know how generalizable these results would be on really well-trained language models (rnn, convolution-based, or transformer-based). The related work section doesn't seem particularly well put together, so its difficult to place the work in appropriate context and gauge its impact. + +Other Thoughts: I'd like more thorough error analysis looking at exactly what kinds of strings/more nuanced properties of sequences that get a high differential score. + +Overall I think this work is interesting and I would encourage the authors to try and add as much quantitative evaluation as possible, but also try and include qualitative information regarding specific sequences after prodding the models. Those could go a long way in strengthening the paper.",6,,ICLR2020 +a95_PE9aJG0,5,2NHl-ETnHxk,2NHl-ETnHxk,Confusion over point of method,"The paper is well written, and seems to solve the proposed problem well. I am very confused however about the point of the problem you are trying to solve. From what I understand you want to take a voxelized version of an MRI scan and distort it such that the observed face is no longer identifiable with the original patient, but the useful information in the MRI is preserved. Part of your method requires or extracts the brain from the MRI. In either case you claim that that is information is available and sufficient for downstream tasks, and everything can be removed. Would a superior method here not just be to take this brain information directly. The De-Identification Quality metric which you use would obviously get the optimal 20% accuracy here as the we have no way of associating images of brains with images of faces, and even if we do, you have not demonstrated that you method would perform well on it either. I see no point in altering the face of the scans as the face information will be useless for any downstream task, and all other information is apparently preserved, so why not just take this preserved data as the privacy preserving MRI information? While this may fail to perform well under SIENAX, I would say this is a problem with using this software as a metric, as the brain would clearly be preserved. + +If I am mistaken here, I apologies, and am happy to change by rating and comments given sufficient reasoning. However, from the knowledge I posses in the field I do not think the approach has merit due to superior, trivial baselines. + +I also have an issue with the experiment for assessing privacy preservation. I would say that the mechanical turk experiment is valuable, but only if you also demonstrate that deep learning based approaches cannot identify the associate the faces as well. Without this I have no way of knowing if your method fails to a simple CNN. + +",3,4.0,ICLR2021 +Cgp0XpUAS8j,1,uR9LaO_QxF,uR9LaO_QxF,"Interesting idea, more visually complex environments would strengthen the paper","Summary: + +The paper proposes a method for ""actor-latency constrained"" settings: Recently, transformers have been shown to be powerful models in RL which, in particular, exhibited better sample complexity in settings in which long-term credit assignment in partial observability was required (e.g. the T-maze). However, they are computationally expensive. Consequently, the authors propose to train transformers on the learner, supported by hardware acceleration, but also train a smaller LSTM agent which can be efficiently executed on the actors. 
+ +Positive points: +(+) I think this is a relevant problem setting +(+) Clear description of the algorithm which seems well thought-out + +Possible weakness: +(-) Experimental evaluation. + +I'm currently recommending a weak rejection. I believe the idea and execution of the paper (in particular the method part) is good. However, a stronger experimental section which also evaluates on more complex visual domains would greatly strengthen the paper. Doing so on tasks with an equal complexity on the required long-term memory might be computationally challenging, however, I personally would already be satisfied with showing that the proposed algorithm doesn't perform worse than the baselines, even on tasks which don't require (as much) memory. + +In other words: The current experiments clearly show why and when ALD can provide an advantage. What is missing for me is a visually more complex experiment that shows me that the actor/learner split and associated off-policy-ness doesn't create additional problems on (visually) more complex environments. If those environments don't require as much memory, there is no reason to expect that ALD can outperform the baseline, which would be ok for me, as long as it doesn't underperform it. + +One additional question (unrelated to evaluation of the paper) I was wondering: The authors mention Teh et al. What I am wondering is why, when updating pi_A, you only use the distillation loss and not also the RL loss? Did you try? + +Other minor point (no impact on evaluation): I would have found a brief description of HOGWILD, e.g. in the appendix, to be helpful. ",5,4.0,ICLR2021 +XlRlf2Pb_M,4,-6vS_4Kfz0,-6vS_4Kfz0,Well-written paper on memory mapping method that outperforms native NNP-I compiler by 28-78%,"The paper describes a machine learning method for mapping computational nodes in a neural network onto different levels of the memory hierarchy (DRAM, LLC, and SRAM) to minimize latency of inference. The proposed approach builds on CERL, combining policy gradient, reinforcement learning, and graph neural network, achieving 28-78% speed-up over the native NNP-I compiler on vision and language benchmarks (ResNet-50, ResNet-101, and BERT). + +Overall, the paper was well-written, targets an impactful problem, and the reported improvements (28-78% over native compiler) are impressive. + +In the related work section, I did have a concern, as the authors state “For example, previous work with manual grouping (sic) operate at most in 5^280 \~= 10^196 dimensional action space (Mirhoseini et al., 2020), compared to 10~ 20^358 for our BERT problem”. However, Mirhoseini et al., 2020 (“Chip placement with deep reinforcement learning”) places “a few thousand clusters” (>=2000 nodes) onto a grid with “an average of 30 rows and columns” (~900 cells), so wouldn’t the action space be at least 900^2000? Also, didn’t that work use a heuristic grouper (hMETIS), but maybe that’s close enough to “manual”? + +The authors only look at three benchmarks, but they were well-chosen (two representative vision models and one large language model). It’s also good that they compare against PG and EA alone as a form of ablation, given that their method is effectively a combination of these two. It would have been better if they also had compared with prior state-of-the-art (e.g. HDP, REGAL, Placeto, or (the unmentioned) GDP / GO), but it is somewhat understandable given that their code does not seem to be open-sourced. 
+ +I liked that the authors report mean and standard deviation for the five runs, and measured “true” reward by running on hardware. I also thought they did a good job motivating their method (aside from the questionable statements about action spaces in prior work), and of analyzing and visualizing its performance. + +Nits: +In the Method section, “the compiler rectifies them and outputs a modified map, M_c, that is fully executable (Line 6).” It would probably be good to add “in Algorithm 1” so as not to confused the reader. + +“It comprises of a single PG learner” -> “It is comprised of…” + +“Both methods are known to produce highly performant and stable solutions but are also significantly slow compared to Deep RL” (“significantly slower than”?) + +“While the transferred policies are clearly underperform those from scratch” -> “underperforming”",5,5.0,ICLR2021 +SJxIVpkZM,2,BJIgi_eCZ,BJIgi_eCZ,state of the art on SQuAD with FusionNet,"The primary intellectual point the authors make is that previous networks for machine comprehension are not fully attentive. That is, they do not provide attention on all possible layers on abstraction such as the word-level and the phrase-level. The network proposed here, FusionHet, fixes problem. Importantly, the model achieves state-of-the-art performance of the SQuAD dataset. + +The paper is very well-written and easy to follow. I found the architecture very intuitively laid out, even though this is not my area of expertise. Moreover, I found the figures very helpful -- the authors clearly took a lot of time into clearly depicting their work! What most impressed me, however, was the literature review. Perhaps this is facilitated by the SQuAD leaderboard, which makes it simple to list related work. Nevertheless, I am not used to seeing comparison to as many recent systems as are presented in Table 2. + +All in all, it is difficult not to highly recommend an architecture that achieves state-of-the-art results on such a popular dataset.",8,3.0,ICLR2018 +hsUFWTeEg5k,3,pavee2r1N01,pavee2r1N01,"An slight improvement over an existing regularization method for neural network robustness, but more convincing experiments needed.","The authors exploit the piecewise linear nature of ReLU neural networks to design a new regularizer that improves the robustness of the neural network. It can be viewed as a alternative to the regularizer proposed in Croce et al. (2018) -- the current regularizer uses the analytic center, whereas the previous work (named MMR) guarantees that are straightforward consequences of the geometric interpretation. The experimental setup is similar to the one in Croce et al. (2018), which compares the result of different adversarial defense methods on MNIST, F-MNIST, and CIFAR10 on small shallow networks. These computational results generally slightly outperform MMR on MNIST. + +The paper is polished and well-written, and the use of the analytic center is reasonably motivated via the geometry of the neural network. + +However, the paper has a few major drawbacks that need to be addressed before acceptance. + +(1) Novelty: The paper can be seen as a twist on Croce et al., working off some of the same intuition and using very similar experiments. The theoretical results are a simple consequence of using the analytic center. Hence, this work can be viewed as incremental. 
+ +(2) Experimental Results: Given that the set of regularization approaches has significantly grown since the prior paper was published, the authors should at least compare their methods with some of these. This is especially important given that the results are not much better than MMR (and worse in some cases). + +(3) Computational Considerations: The computation effort needed to incorporate this regularizer should be discussed explicitly so the reader can understand the trade-offs between incorporating this regularizer vs. other defense methods. + +Side note: +The authors should update the citation for Croce et al. to refer to the conference version of the paper. ",5,4.0,ICLR2021 +x1IcJIMoO4d,3,3jjmdp7Hha,3jjmdp7Hha,Review,"The paper describes a method to improve NMT training with backtranslation. + +Rather than using a fixed t->s model to translate target monolingual data in order to augment the training set for the s->t model, the proposed approach first pretrains the t->s model as usual then jointly trains it with the forward s->t model using a meta-learning approach: the s->t model is trained on the syntetic backtranslated data and a ""meta-validation"" loss is computed on a paralled dataset, which is used to update the t->s model using REINFORCE. +The approach is similar to the DualNMT model by Xia et al, but rather than updating on monolingual data based on LM and reconstruction scores, it uses a reward based on the cross-entropy on parallel data. +The paper also proposes a way to adapt this method to a multi-lingual setting. + +Experiments are performed on WMT-14 En->De and En->Fr, and on 4 IWSLT-2018 language pairs. The authors report small but consistent improvements. Additional analyses are also reported. + +Overall the method seems valid, although it is described at a very high level and no code release is mentioned. In my experience successfully implementing RL-based method is strongly dependent on getting hyperparameters and implementation details right, so it could be hard to reproduce this work without the code or a more detailed description. Also it's not entirely clear what is being used as ""meta-validation"" data here, I suppose it's all the parallel training data, but the paper doesn't make it clear. + +Minor issues: the ""Tagged backtranslation"" paper by Caswell et al. 2019 contrasts the claim that improvements with sampling backtranslation are due to increased diversity. It should be referenced as relevant work. The Xia et al., 2016 Dual NMT paper is referenced multiple times in the text but not in the bibliography section",7,5.0,ICLR2021 +SkvdG35xz,3,H18uzzWAZ,H18uzzWAZ,Review of Correcting Nuisance Variation using Wasserstein Distance,"This contribution deal with nuisance factors afflicting biological cell images with a domain adaptation approach: the embedding vectors generated from cell images show spurious correlation. The authors define a Wasserstein Distance Network to find a suitable affine transformation that reduces the nuisance factor. The evaluation on a real dataset yields correct results, this approach is quite general and could be applied to different problems. + +The contribution of this approach could be better highlighted. The early stopping criteria tend to favor suboptimal solution, indeed relying on the Cramer distance is possible improvement. + +As a side note, the k-NN MOA is central to for the evaluation of the proposed approach. A possible improvement is to try other means for the embedding instead of the Euclidean one. 
+ +",7,3.0,ICLR2018 +HJlIc6pdYH,2,HkejNgBtPB,HkejNgBtPB,Official Blind Review #3,"This paper proposes Variational Template Machine (VTM), a generative model to generate textual descriptions from structured data (i.e., tables). VTM is derived from the variational autoencoder, where the input is a row entry from a table and the output is the text associated with this entry. The authors introduce two latent variables to model contents and templates. The content variable is conditioned on the table entry, and generates the textual output together with the template variable. The model is trained on both paired table-to-text examples as well as unpaired (text only) examples. Experiments on the Wiki and SpNLG datasets show that models generate diverse sentences, and the overall performance in terms of BLEU is only slightly below the best baseline Table2Seq model that does not generate diverse sentences. The results also show that additional losses for preserving contents and templates introduced by the authors play an important role in the overall model performance.  + +I have several questions regarding the experiments: +- For the Table2Seq baseline, how was the beam size chosen? Did it have any effect on the performance of the baseline model? +- Did the authors try other sampling methods for Table2Seq? (e.g., top-K or nucleus sampling) +- VTM is only able to achieve comparable performance to Table2Seq in terms of BLEU after including the unlabeled corpus, especially on the Wiki dataset. A way to incorporate this unlabeled data to Table2Seq is by first pretraining the LSTM generator on it before training it on pairwise data (or in parallel). How would this baseline model perform in comparison to VTM? +- In the conclusion section, the authors mentioned that VTM outperforms VAE both in terms of diversity and generation quality. What does this VAE model refer to? The experiments show that VTM is comparable to Table2Seq in terms of quality and is better in terms of diversity.  + +Generating text from structured data is an interesting research area. However, I am not convinced that the proposed method is a significant development based on the results presented in the paper. There are also many grammatical errors in the paper (e.g., ... only enable to sample in the latent space ..., and many others), so I think the writing of the paper can be improved.",3,,ICLR2020 +rJxqaMwTtS,1,rkxNelrKPB,rkxNelrKPB,Official Blind Review #2,"This paper performs a general analysis of sign-based methods for non-convex optimization. They define a new norm-like function depending on the success probabilities. Using this new norm-like function and under an assumption, they prove exponentially variance reduction properties in both directions and small mini-batch sizes. + +I am not convinced about assumption 1, which plays the key role of the proof. It assumes that success probabilities are always large or equal to 1/2. + +How can we guarantee this property hold for an algorithm? I suggest the authors provide some real learning examples, under which it will satisfy the condition. I may revise my rating according to this. +",3,,ICLR2020 +J6VnSfEbZ60,1,y13JLBiNMsf,y13JLBiNMsf,"Nice idea, but details unclear and presentation needs polish","The paper describes a simple extension to the location-only monotonic GMM attention mechanism from Graves (2013), which takes the source/key context into account when computing attention weights. 
The proposed method improves ASR performance over model using the baseline GMM attention which does not take source-content into account, generalizes better to input sequences much longer than those seen during training, while also obtaining competitive performance to other streaming seq2seq ASR models on ""matched"" test sets. + +## Pros: + +1. Incorporating source content is an obvious and useful extension to monotonic GMM attention, combining the strengths of content-based approaches such as additive or dot-product attention. + +2. It improves performance and generalization while being simpler than existing techniques in the literature (e.g. Mocha, CTC/transducer models which have more complex loss functions). + +## Cons: + +1. The description of the proposed mechanism is inconsistent with existing literature and is very unclear and confusing in parts. + +2. Experiments are somewhat incomplete/missing important comparisons, e.g. comparing baseline GMM attention to the proposed ""source-aware"" variant in Tables 1,2,5, and comparing to other location-based attention mechanisms, even if non-monotinic, e.g. from http://papers.nips.cc/paper/5847-attention-based-models-for-speech-recognition.pdf + +3. Overall the writing/language use could use improvement. + +## Detailed comments + +At the high level, the idea of incorporating the source keys K into GMM attention is a good one. The proposed method seems to work, and be simpler to implement than alternative monotonic alignment mechanisms used in seq2seq ASR models. However, given the current state of the text, with many confusing details, I feel that the paper is not yet ready for publication without significant revisions. + +Many details in the paper, especially Section 2, are unclear: + +- Sec 2.2. and throughout the paper: The described ""GMM"" and ""SA-GMM"" attention always use a single component, so don't really count as a Gaussian *mixture*. Using multiple components would explicitly allow for multimodal attention weights for each output step. Moreover, since the mixing weights are generally computed independently at each step, using multiple components makes it possible for the base GMM attention to ""discard the non-informative tokens"". This mechanism, which would be more precisely called ""Gaussian attention"", is strictly less flexible than the base GMM attention mechanism that was originally described in Graves, 2013. + +- This claim is repeated in paragraph 2 of Sec 3: ""uni-modal similar to conventional GMM attention"". When using multiple mixture components, GMM attention is not unimodal. + +- Eq 8: The notation here is unclear. Why is there a softmax over $\psi_i^h$ (a scalar AFAICT)? Is the softmax computed over all attention heads? Why is this necessary? It seems to impair the training of some heads, at least for SAGMM-tr according to paragraph 4 of Sec. 4.2 + +- Figure 1 is difficult to interpret. The two plots have difference horizontal axes and therefore don't seem to be directly comparable... It's not clear what the ""key width"" in Figure 1b is trying to convey since there is always going to be a single weight per (discrete) encoder step j. + +- Sec 2.3. There is no particular motivation given for the proposed method for integrating source keys. Why not include $K_j$ in the computation for the standard deviation $\Sigma_i$ as well? And why is the same weight $\delta_j$ used as a scaling factor (eq (12)) and the mean offset in eq (10)? These design choices deserve more explanation, and possibly empirical justification. 
+ +- Sec 2.4: Is it possible to train SAGMM-tr from scratch? Or does it need to be first trained using SAGMM and then fine-tuning with truncation enabled? + +Experiments: + +- Are the different GMM attention variants used encoder self-attention layers as well? Or does the encoder use conventional ""soft attention""? + +- As above, it seems unfair not to include any experiments using multiple components when comparing different variants of GMM attention. + +- Sec 4.2, Table 1: Please clarify the differences between the three models labeled (Ours). Is the difference only in the encoder-decoder attention layer? + +English usage. Just a few examples of grammar errors and unclear text, as there are too many to list. + +- page 1, ""attend subset of long sequences"" is missing a preposition, e.g., ""attend to a subset"". It seems that ""long sequences"" is meant to refer a single source sequence. + +- page 1, ""mismatch between attention parameters from decoders and information distribution in encoder outputs"". This sentence is difficult to parse. What is the mismatch here? Why would the source encoding and decoder query vector need to ""match"", especially in a purely location-based attention scheme? + +- page 4: ""fixed length with hyperparameters"" What are they hyperparameters being referred to here? +- page 5, Sec. 4: ""enables early inference"". What does ""early inference"" mean? +- page 5, Sec. 3: ""for the inference"" -> ""for inference"" +- page 5, Sec 4.1: ""1 second speeches"" -> ""1 second long utterances"" +- page 5, Sec 4.1: ""from 30 vocabulary"" -> ""from a vocabulary of 30 words"" + + +Other comments: + +- Sec 2.2: Sutskever et al., 2014 did not use content-based attention. +",5,4.0,ICLR2021 +SJeiEX3YnX,1,rkeT8iR9Y7,rkeT8iR9Y7,Contribution not entirely clear,"Summary: This work provides an analysis of the directional distribution of of stochastic gradients in SGD. The basic claim is that the distribution, when modeled as a von Mises-Fisher distribution, becomes more uniform as training progresses. There is experimental verification of this claim, and some results suggesting that the SNR is more correlated with their measure of uniformity than with the norm of the gradients. + +Quality: The proofs appear correct to me. + +Clarity: The paper is generally easy to read. + +Originality & Significance: I don't know of this specific analysis existing in the literature, so in that sense it may be original. Nonetheless, I think there are serious issues with the significance. The idea that there are two phases of optimization is not particularly new (see for example Bertsekas 2015) and the paper's claim that uniformity of direction increases as SGD convergence is easy to see in a simple example. Consider f_i(x) = |x-b_i|^2 quadratics with different centers. Clearly the minimum will be the centroid. Outside of a ball of certain radius from the centroid all of the gradients grad f_i point in the same direction, closer to the minimum they will point towards their respective centers. It is pretty clear, then that uniformity goes up as convergence proceeds, depending on the arrangement of the centers. + +The analysis in the paper is clearly more general and meaningful than the toy example, but I am not seeing what the take-home is other than the insight generated by the toy example. The paper would be improved by clarifying how this analysis provides additional insight, providing more analysis on the norm SNR vs uniformity experiment at the end. 
+ +Pros: +- SGD is a central algorithm and further analysis laying out its properties is important +- Thorough experiments. + +Cons: +- It is not entirely clear what the contribution is. + +Specific comments: +- The comment at the top of page 4 about the convergence of the minibatch gradients is a bit strange. This could also be seen as the reason that analysis of the convergence of SGD rely on annealed step sizes. Without annealing step-sizes, it's fairly clear that SGD will converge to a kind of stochastic process. + +- The paper would be stronger if the authors try to turn this insight into something actionable, either by providing a theoretical result that gives guidance or some practical algorithmic suggestions that exploit it. + +Dimitri P. Bertsekas. Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization: A Survey. ArXiv 2015.",4,3.0,ICLR2019 +H1gIAox8tH,1,Syetja4KPH,Syetja4KPH,Official Blind Review #3,"Deep Randomized Least Squares Value Iteration +========================================================= + +This paper proposes a method for exploration via randomized value functions in Deep RL. +The algorithm performs a standard DQN update, but then acts according to an exploration policy sampled from a posterior approximation based on a last layer linear rule. +The authors show that this algorithm can perform well on a toy domain designed to require efficient exploration, together with some results on Atari games. + + +There are several things to like about this paper: +- The problem of efficient exploration in Deep RL is a pressing one, and there is no clearly effective method out there widely used. +- The proposed algorithm is interesting, and appears to have some reasonable properties. One nice thing is that it requires only relatively minor changes to the DQN algorithm. +- The general flow of the paper and structured progression is nice. +- The algorithm generally appears to bring superior exploration and outperform epsilon-greedy baseline. + + +However, there are some other places the work could be improved: +- I think that the name ""Deep RLSVI"" is a little imprecise... actually RLSVI could already be a ""deep"" algorithm as defined by the JMLR paper: http://jmlr.org/papers/volume20/18-339/18-339.pdf (Algorithm 4). I see that you mean this as an extension to the linear case for RLSVI... but I do think it would be better to call it something more explicit like ""Last-layer RLSVI for DQN"". +- Related to the above, the comparison to other similar methods for exploration via ""randomized value functions"" is not very comprehensive. I'm not sure what the pros/cons are of this method versus BootDQN or the very similar work from Azizzadenesheli? +- It would be good to compare these methods more explicitly, particularly on the domains designed specifically for testing exploration. To this end, I might suggest bsuite https://github.com/deepmind/bsuite and particularly the ""deep sea"" domains? +- Something feels a little off about the Atari results, particularly the curves for ""rainbow""... these appear to be inconsistent with published results (look at Breakout). + +Overall I think there is interesting material here, and I'd like to see more. 
+However, I do have some concerns about the treatment/comparison to related work and I think without this it's not ready for publication.",3,,ICLR2020 +NU4cggIX4DC,4,heqv8eIweMY,heqv8eIweMY,interesting and novel idea of non-local GNN for disassortative graphs,"This paper points out an interesting and important issue of GNNs, i.e., local aggregation is harmful for some disassortative graphs. It further proposes non-local GNNs by first sorting the nodes followed by aggregation. The paper is well written and easy to follow. + ++ Positives +1. The paper studies an important problem. The proposed Non-local GNNs by first sorting the nodes followed by aggregation is interesting and makes sense. +2. The paper is well written and easy to follow. +3. Experiments well support the claim of the paper. The results demonstrated the effectiveness of the proposed method for disassortative graphs for node classification. In addition, the authors show the running time to demonstrate its efficiency and analyze the sorted nodes to demonstrate that the proposed method can learn non-local graphs. + +-Negative +1. It seems that for some disassortative graphs such as Actor, Cornell, Texas and Wisconsin, using the node attributes to build the non-local graph is much effective than using the attributed graph. The authors may also need to compare a baseline that simply use MLP to learn node embedding, then construct the graph by calculating pairwise node similarity, followed by GNN for node classification. This can be treated as a variants of the proposed NLMLP to show that sorting the nodes is more efficient and more effective. +",7,5.0,ICLR2021 +DUpl0854on-,4,cotg54BSX8,cotg54BSX8,"Important theoretical + empirical results for model extraction attacks, which is helpful and insightful for general NLP interpretability/probing work as well. ","Summary: + +This paper proposes a range of algebraic model extraction attacks (different from the prevalent learning-based approaches) for transformer models trained for NLP tasks in a grey-box setting i.e., an existing, public, usually pretrained encoder, with a private classification layer. Through attacks on different sizes of models and a range of downstream tasks, they observe that only a portion of the embedding space forms a basis of the tuned classification layer’s input space, and using a grey-box method, this can be algebraically computed. The pretraining-finetuning experiments on different tasks also show the smallest number of dimensions needed for high-fidelity extraction, and also that the model extraction attacks effectiveness decreases with fine-tuning the larger models base layers---which is an insight that is very useful for a lot of interpretability/probing work. + + +Reason for score: + +I think this paper is very well-formulated---both theoretically and empirically with promising results that will be useful not just for grey-box adversarial attacks, but also for works interesting in the effects of pretraining-finetuning (which at this point encompasses nearly all NLP tasks). The empirical results look promising---however I would like to see this demonstrated on more than just 2 datasets (and maybe even a GPT-like model, instead of just BERT) to see if (1) the results hold empirically and (2) if there any insights to be gleaned about adversarial attacks from different task structures and model types. + + +Positive points + questions: + +1. The transformation of the raw logits for recovering information is really interesting. 
In the experiments for the random set of n embeddings chosen to form a basis of the last layer’s input space---are there any insights on what those embeddings amount to semantically; and also what a ground truth selection of embeddings (e.g., that an oracle adversary would compute) should be? It would be helpful to have a discussion and examples of those. + +2. Is there a difference in extraction results when using in-distribution queries vs. random? Most of the results say “extraction is possible with both” which is good to see, but a more finer-grained analysis/explanation of benefits/pitfalls of each would really help clarity. + +3. It’s nice that both a single-sentence and pairwise-sentence (SST-2 vs. MNLI) task are used to evaluate effects for the fine-tuning experiments in big transformer models. + +4. The results look very promising and these insights are extremely helpful even for general probing/interpretability works (especially the learning rate finetuning effects) and also hold up to existing BERT-finetuned results. + +5. Unlike previous work, this algebraic model extraction words even with non-linear activation layers---and this is helpful given the current standard of fine-tuning large transformer models e.g., with simple MLP/softmax classifiers. + +6. Slightly different from previous work, not only can this work when attacks require embeddings to be chosen, but also when selecting (e.g., random/or from a distribution) needs to be done as well. + + +Negative points + questions: + +1. For the fine-tuning/learning rate experiments it would be good to evaluate this on more than just 2 tasks (e.g., maybe a range of different tasks in GLUE) not only to see if the trend still holds, but also to see if task “type” or characteristics of the task/fine-tuning affect the extraction fidelity. + +2. The *extracted model accuracy of BERT-base with MNLI seems to be quite static (almost no effect on increasing or decreasing learning rate)---and it would be really helpful to see how statistically significant those results are and what they look like over different seeds. + +3. Is there a comparison between the algebraic approach and a learning-based approach for the same tasks? (I think the paper is novel and useful enough in itself, but it would be helpful to see a side-by-side comparison). + +4. Is there a comparison between extracting only a single layer or going beyond to having multiple layers of target/finetuned classifiers? Is this approach feasible and similarly beneficial as a grey-box attack in that scenario? It would be really helpful to have a discussion on what that would require for future work. + + +Additional minor comments: + +This is really well written and placed in literature, no minor nitpicks re: writing! +",5,4.0,ICLR2021 +B1x6Bf-XnX,1,rklEUjR5tm,rklEUjR5tm,Interesting idea but the analysis and the writing need to be improved,"This paper suggests a continuous-time framework consisting of two coupled processes in order to perform derivative-free optimization. The first process optimizes a surrogate function, while the second process updates the surrogate function. This continuous-time process is then discretized in order to be run on various machine learning datasets. Overall, I think this is an interesting idea as competing methods do have high computational complexity costs. However, I’m not satisfied with the current state of the paper that does not properly discuss notions of complexity of their own method compared to existing methods. 
+ +1) “The computational and storage complexity for (convex) surrogates is extremely high.” The discussion in this paragraph is too superficial and not precise enough. +a) First of all, the authors only discuss quadratic models but one can of course use linear models as well, see two references below (including work by Powell referenced there): +Chapter 9 in Nocedal, J., & Wright, S. J. (2006). Numerical optimization 2nd. +Conn, A. R., Scheinberg, K., & Vicente, L. N. (2009). Global convergence of general derivative-free trust-region algorithms to first-and second-order critical points. SIAM Journal on Optimization, 20(1), 387-415. +I think this discussion should also be more precise, the authors claim the cost is extremely high but I would really expect a discussion comparing the complexity of this method with the complexity of their own approach. As discussed in Nocedal (reference above) the cost of each iteration with a linear model is O(n^3) instead of O(n^4) where n is the number of interpolation points. Perhaps this can also be improved with more recent developments, the authors should do a more thorough literature review. +b) What is the complexity of the methods cited in the paper that rely on Gaussian processes? +(including (Wu et al., 2017) and mini-batch (Lyu et al., 2018)). + + +2) “The convergence of trust region methods cannot be guaranteed for high-dimensional nonconvex DFO” +Two remarks: a) This statement is incorrect as there are global convergence guarantees for derivative-free trust-region algorithms, see e.g. +Conn, A. R., Scheinberg, K., & Vicente, L. N. (2009). Global convergence of general derivative-free trust-region algorithms to first-and second-order critical points. SIAM Journal on Optimization, 20(1), 387-415. +In chapter 10, you will find global convergence guarantees for both first-order and second-order critical points. +b) The authors seem to emphasize high-dimensional problems although the convergence guarantees above still apply. For high-order models, the dimension does have an effect, please elaborate on what specific comment you would like to make. Finally, can you comment on whether the lower bounds derived by Jamieson mentioned depend on the dimension. + +3) Quadratic loss function +The method developed by the authors rely on the use of a quadratic loss function. Can you comment on generalizing the results derived in the paper to more general loss functions? It seems that the computational complexity wouldn’t increase as much as existing DFO methods. Again, I think it would be interesting to give a more in-depth discussion of the complexity of your approach. + +4) Convergence rate +The authors used a perturbed variant of the second-order ODE defined in Su et al. 2014. The noise added to the ODE implies that the analysis derived in Su et al. 2014 does not apply as is. In order to deal with the noise the authors show that unbiased noise does not affect the asymptotic convergence. I think the authors could get strong non-asymptotic convergence results. In a nutshell, one could use tools from Ito calculus in order to bound the effect of the noise in the derivative of the Hamiltonian used in Lemma 1. See following references: +Li, Q., Tai, C., et al. (2015). Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251. +Krichene, W., Bayen, A., and Bartlett, P. L. (2015). Accelerated mirror descent in continuous +and discrete time. In Advances in neural information processing systems, pages 2845–2853. 
+Of course, the above works rely on the use of derivatives but as mentioned earlier, one should be able to rely on existing DFO results to prove convergence. If you check Chapter 2 in the book of Conn et al. (see reference above), you will see that linear interpolation schemes already offer some simple bounds on the distance between the true gradient of the gradient of the model (assuming Lipschitz continuity and differentiability). + +5) Noise +“The noise would help the system escape from an unstable stationary point in even shorter time” +Please add a relevant citation. For isotropic noise, see +Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points-online stochastic gradient for tensor decomposition. +Jin, C., Netrapalli, P., and Jordan, M. I. Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456, + +6) Figure 2 +Instead of having 2 separate plots for iteration numbers and time per iteration, why don’t you combine them to show the loss vs time. This would make it easier for the reader to see the combined effect. + +7) Empirical evaluation +a) There are not enough details provided to be able to reproduce the experiments. Reporting the range of the hyperparameters (Table 2 in the appendix) is not enough. How did you select the hyperparameters for each method? Especially step-size and batch-size which are critical for the performance of most algorithms. +b) I have to admit that I am not extremely familiar with common experimental evaluations used for derivative-free methods but the datasets used in the paper seem to be rather small. Can you please justify the choice of these datasets, perhaps citing other recent papers that use similar datasets? + +8) Connection to existing solutions +The text is quite unclear but the authors seem to claim they establish a rigorous connection between their approach and particle swarm (“In terms of contribution, our research made as yet an rigorous analysis for Particle Swarm”). This however is not **rigorously** established and needs further explanation. The reference cited in the text (Kennedy 2011) does not appear to make any connection between particle swarm and accelerated gradient descent. Please elaborate. + +9) SGD results +Why are the results for SGD only reported in Table 1 and not in the figure? Some results for SGD are better than for P-SHE2 so why are you bolding the numbers for P-SHE2? +It also seem surprising that SGD would achieve better results than the accelerated SGD method. What are the possible explanations? + +10) Minor comments +- Corollaries 1 and 2 should probably be named as theorems. They are not derived from any other theorem in the paper. They are also not Corollaries in Su et al. 2014. +- Corollary 2 uses both X and Z. +- Equation 5, the last equation with \dot{V}(t): there is a dot missing on top of the first X(t) +“SHE2 should enjoy the same convergence rate Ω(1/T) without addressing any further assumptions” => What do you mean by “should”? +- There are **many** typos in the text!! e.g. “the the”, “is to used”, “convergeable”,... please have someone else proofread your submission. +",3,5.0,ICLR2019 +BJl4_UpTYH,1,rylnK6VtDH,rylnK6VtDH,Official Blind Review #2,"This paper presents multiplicative interaction as a unified characterization for representing commonly used model architecture design components (e.g. gating, attention layers and hypernetworks). 
Multiplicative interactions can be viewed as an effective way of integrating contextual information in a network. Through a series of thorough empirical experiments, this paper demonstrates superior performance on a variety of tasks (RL, sequence modeling) when a such multiplicative interaction module is incorporated. + +The framework seems applicable for learning tasks where a latent variable or context embedding presents. And the paper hypothesizes that multiplicative interactions can help introduce desirable inductive biases, therefore leading to improved generalization performance. However, the theoretical intuition and justification for such hypothesis is missing, which weakens the contribution of the paper. + +In Table 1, the model LSTM with Multiplicative Decoder has more parameters (105M) than the vanilla LSTM in comparison (88M). It’d be good to provide a more fair comparison by slightly increasing the capacity of baseline LSTM model (e.g., using larger output dimension to match the capacity). This will rule out the cofounding factor that the improved performance is due to the algorithm instead of increased model capacity. + +In Section 8, it’d be convincing if the authors can also report results for replacing default LSTM cell with multiplicative interactions. In paragraph 3, the authors only discussed the feasibility conceptually, without providing experimental results to support this argument. +",6,,ICLR2020 +eIQH-FMqcz-,3,RSn0s-T-qoy,RSn0s-T-qoy,Clarifications on mutual information is needed,"This paper proposes definition and conditions for unsupervised multi-view disentanglement providing general instructions for disentangling representations between different views. The authors also provide a novel objective function to explicitly disentangle the multi-view data into a shared part across different views and a (private) exclusive part within each view. I have the following comments on the paper. + +major comments +1. In literature, there are several ways to estimate mutual information, such as lower bound of JS divergence (Hjelm 2019, Federici 2020), InfoNCE etc. Also there are other types of mutual information estimators that maximize similarities between two views. Could you provide any indication which of them work better than the other in you model? + +2. I think the proposed model presented in the paper is inspired by the paper by Gonzalez-Garcia in 2018, where the authors didn't use the maximization and minimization of mutual information. Why a similar criteria based on mutual information performs better than the distance metric? + +3. Maximizing mutual information is often considered as a difficult task and needs complicated sampling strategy (negative, hard negative etc). How do you handle those issues in your setting? Some details will be helpful for the community. + +4. Without a bottleneck, i.e. assuming unbounded capacity, maximizing mutual information can be trivially solved by setting the underlying function to identity (Xu Ji et al. ICCV 2019). Have you tried any bottlenecking in your case? If not, how do you ensure that the mutual information maximization does not results in a degenerated solution. Do you think including a bootlenecking would improve the results? + +5. The experimental evaluation only shows experimental comparison with some baseline approaches. I think it is also worth to compare the proposed methods with the prior works (for example, Gonzalez-Garcia et al. NeurIPS 2018). 
+ +Based on my current understanding and the above comments, I currently recommend the paper as ""marginally below acceptance threshold"". I would like to hear clarification on the proposed models and if satisfied would be happy to increase my recommendation. + +minor comments +1. In ICLR 2020, there were few works that proposed to learn mutual information from diverse domains. I think it is worth to provide to have a discussion on them. +(i) M. Federici et al., Learning Robust Representations via Multi-View Information Bottleneck, ICLR, 2020. +(ii) M. Tschannen et al., On Mutual Information Maximization for Representation Learning, ICLR, 2020. +2. I think it worth providing some details on the implementation and architectures in the paper. I would also recommend to share the code.",5,4.0,ICLR2021 +Y5WFAROzcx,4,MBdafA3G9k,MBdafA3G9k,Good results in a challenging setting: one-shot imitation from visual demonstration ,"The approach described in the paper uses a relatively complicated architecture, with multiple losses and a very processed training set, but it achieves strong results when compared to two of the most similar published methods - TCNs (Time-Contrastive Networks) and GAIfO (Generative adversarial imitation from observations). A set of ablations is presented which is not exhaustive, but which is adequate to understand the comparative importance of the different parts of the approach. Overall, I think that there are some rough edges to the paper which could be improved, but the contribution of the paper is enough to warrant publication. + +One problem is that the method is poorly explained in some places. Some of the terms need to be explained earlier in the paper - Siamese Network or Siamese Loss is never properly explained and TCN is used in the abstract without reference or definition. At the beginning of section 3, it is stated that the 'Siamese network triplet loss' is used in the proposed method, but Eq. 3 shows a contrastive loss with margin, not a triplet loss. Adding to the problems with clarity, there are a number of obvious typos which are distracting to the reader and make it clear that this is not a fully polished submission: + +typos - oultine -> outline; Advisarial -> Adversarial; primariliy -> primarily; subsquent -> subsequent; independanlty -> independently; kinematiccally -> kinematically +fragments: 'Including recent Sim2Real quadreped robots and a huanoid with 38 DoF, which is a +particularly challenging problem domain.'; 'Where states and actions are discrete.' + +The paper claims that the method can train an agent to do imitations from noisy visual data from single demonstrations, but this is not clearly shown by the experiments. The domains where the agent performs well (judging by the videos available on the linked website) are simple and cyclic and have little or no variation. It is unclear whether the agent would be able to 1-shot imitate a novel demonstration which was not within the training set already. On the more challenging domains, the videos show that the agent imitates the expert for a very short timespan before diverging. However, visual imitation is very difficult and the proposed method does perform better than the other baselines. + +The paper would have more impact if the authors had shown whether the trained approach could be used for better performance or learning speed on new tasks where expert data was not available, perhaps by training a new policy on top of the LSTM, or distilling to a new agent and then finetuning. 
+ +Overall, the paper is exciting because it shows the value of the complex architecture for solving a challenging problem. The experiments are convincing and show that this is a promising approach which may be useful in a real-world domain. + +Pros: +- The method is well-designed and is a natural extension on existing work (TCNs, Siamese nets, GAIfO). There is enough novelty for the work to be published. +- The results show that the approach works well on a number of different continuous control environments +- The comparisons to other published methods, the baselines and ablations, are well chosen. +- The supplementary contains additional details on training, and additional analysis, which is valuable + +Cons: +- The text is poorly written in places - Acronyms and terms need to be explained, and spelling and grammar need to be proofed. However, this is an easy fix that should not prevent the paper from being published. +- The authors have not adequately explained whether the model is actually capable of 1-shot imitation. It is not clear from the domains whether the agent has simply memorized the full data distribution. +- The model is limited if there is no way to reuse the model without demonstrations.",6,4.0,ICLR2021 +Hyey5xwaYB,2,HygbQaNYwr,HygbQaNYwr,Official Blind Review #2,"This paper proposes perturbation biases as a counter-measure against adversarial perturbations. The perturbation biases are additional bias terms that are trained by a variant of gradient ascent. The method imposes less computational costs compared to most adversarial training algorithms. In their experimental evaluation, the algorithm achieved higher accuracy on both clean and adversarial examples. + +This paper should be rejected because the proposed method is not well justified either by theory or practice. Experiments are weak and do not support the effectiveness of the proposed method. + +Major comments: +Since the evaluations of defense algorithms are often misleading [1], it requires throughout experiments or theoretical certifications to confirm the effectiveness of defense methods. However, the experiment configuration in this paper is not satisfactory to demonstrate the robustness of defended networks. The followings are a list of concerns. +1) Experiments are limited to small datasets and networks. Since some phenomena only appear in larger datasets [2], there is a concern that the proposed method also works on other datasets. +2) The attack algorithm used for the evaluation is weak. We can confirm this by observing the label leakage [2] in the experimental results. It is hard to judge which defenses are most effective, even within the tested datasets and models. +3) The ""adversarial training"" baseline used in the experiment is weird. Adversarial training typically generates adversarial examples during the process of the neural networks' optimization instead of using precomputed adversarial examples. Baseline methods should be stronger, for example, adversarial training with PGD [3]. + +[1] Athalye et al. ""Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples."" ICML 2018 +[2] Kurakin et al. ""ADVERSARIAL MACHINE LEARNING AT SCALE."" ICLR 2017 +[3] Madry et al. 
""Towards Deep Learning Models Resistant to Adversarial Attacks."" ICLR 2018",3,,ICLR2020 +rklj1r55Yr,2,Hyez1CVYvr,Hyez1CVYvr,Official Blind Review #3,"This work proposes a new loss function to train the network with Outlier Exposure(OE) [1] which leads to better OOD detection compared to simple loss function that uses KL divergence as the regularizer for OOD detection. The new loss function is the cross entropy plus two more regularizers which are : 1) Average ECE (Expected Calibration Error) function to calibrate the model and 2) absolute difference of the network output to $1/K$ where $K$ is the number of tasks. The second regularizer keeps the softmax output of the network uniform for the OE samples. They show adding these new regularizers to the cross-entropy loss function will improve the Out-distribution detection capability of networks more than OE method proposed in [1] and the baseline proposed in [2]. + + +Pros: +The paper is written clearly and the motivation of designed loss functions are explained well. + +Cons: +1- The level of contributions is limited. + +2- The variety of comparison is not enough. The authors did not show how the approach is working in compared to the other OOD methods like ODIN[3] and the proposed method in [4]. + +3- The experiments are not supporting the idea. First, the paper claims that the KL is not a good regularizer for OOD detection as it is not a distance metric. But there is no experiment or justification in the paper that supports why this claim is true. Then the second contribution claims that the calibration term that is added to the loss function improves the OOD detection as well as calibration in the network, but the experiments are not designed to show the impact of each regularizer term separately in improving the OOD detection rate. Figure 2 also does not depict any significant conclusion. It only shows that the new loss function makes the network more calibrated than the naive network. This phenomenon was reported before in [1]. It would be better if the paper investigated the relation between the calibration and OOD detection by designing more specific experiments for calibration section. + +Overall, I think the paper should be rejected as the contributions are limited and are not aligned with the experiments. + +References +[1]Hendrycks, Dan, Mantas Mazeika, and Thomas G. Dietterich. ""Deep Anomaly Detection with Outlier Exposure."" arXiv preprint arXiv:1812.04606 (2018). + +[2] A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks, ICLR2016. + +[3] Liang, Shiyu, Yixuan Li, and Rayadurgam Srikant. ""Enhancing the reliability of out-of-distribution image detection in neural networks."" arXiv preprint arXiv:1706.02690 (2017). + +[4] Lee, Kimin, et al. ""A simple unified framework for detecting out-of-distribution samples and adversarial attacks."" Advances in Neural Information Processing Systems. 2018.",1,,ICLR2020 +ByV--6Xbz,3,rJIgf7bAZ,rJIgf7bAZ,"Interesting, but not impactful","This paper proposes what is essentially an off-policy method for learning options in complex continuous problems. The idea is to use policy gradient style algorithms to update a suite of options using relatively + +On the positive side, I like the core idea of this paper. The idea of updating multiple options at once is a good one. I think the authors should definitely continue to investigate this line of work. I also appreciated that the authors took the time to try and visualize what was learned. 
The paper is generally well-written and easy to read. + +On the negative side: ultimately, the algorithm doesn't seem to work all that well. Empirically, the method doesn't seem to perform substantially better than other algorithms, although there seems to be some slight advantage. A clearly missing comparison would be something like TRPO or DDPG. + +Figure 1 was helpful in understanding marginalization and the forward algorithm. Thanks. + +Was there really only 4 options that were learned? How would this scale to more? +",4,4.0,ICLR2018 +H1gK_Y3J6Q,2,BJGVX3CqYm,BJGVX3CqYm,Neural Architecture Search Approach to Network Quantization,"In this work the authors introduce a new method for neural architecture search (NAS) and use it in the context of network compression. Specifically, the NAS method is used to select the precision quantization of the weights at each layer of the neural network. Briefly, this is done by first defining a super network, which is a DAG where for each pair of nodes, the output node is the linear combination of the outputs of all possible operations (i.e., layers with different precision quantizations). Following [1], the weights of the linear combination are regarded as the probabilities of having certain operations (i.e., precision quantization), which allows for learning a probability distribution over the considered operations. Differently from [1], however, the authors bridge the soft sampling in [1] (where all operations are considered together but weighted accordingly to the corresponding probabilities) to a hard sampling (where a single operation is considered with the corresponding probability) through an annealing procedure based on the Gumbel Softmax technique. Through the proposed NAS algorithm, one can learn a probability distribution on the operations by minimizing a loss that accounts for both accuracy and model size. The final output of this search phase is a set of sampled architectures (containing a single operation at each connection between nodes), which are then retrained from scratch. In applications to CIFAR-10 and ImageNet, the authors achieve (and sometime surpass) state-of-the-art performance in model compression. + +The two contributions of this work are +1) A new approach to weight quantization using principles of NAS that is novel and promising; +2) New insights/technical improvements in the broader field of NAS. While the utility of the method in the more general context of NAS has not been shown, this work will likely be of interest to the NAS community. + +I only have one major concern. The architectures are sampled from the learnt probability distribution every certain number of epochs while training the supernet. Why? If we are learning the distribution, would not it make sense to sample all architectures only after training the supernet at our best? +This reasoning leads me to a second question. In the CIFAR-10 experiments, the authors sample 5 architecture every 10 epochs, which means 45 architectures (90 epochs were considered). This is a lot of architectures, which makes me wonder: how would a “cost-aware” random sampling perform with the same number of sampled architectures? + +Also, I have some more questions/minor concerns: + +1) The authors say that the expectation of the loss function is not directly differentiable with respect to the architecture parameters because of the discrete random variable. 
For this reason, they introduce a Gumbel Softmax technique, which makes the mask soft, and thus the loss becomes differentiable with respect to the architecture parameters. However, subsequently in the manuscript, they write that Eq 6 provides an unbiased estimate for the gradients. Do they here refer to the gradients with respect to the weights ONLY? Could we say that the advantage of the Gumbel Softmax technique is two-fold? i) make the loss differentiable with respect to the arch parameters; ii) reduce the variance of the estimate of the loss gradients with respect to the network weights. + +2) Can the author discuss why the soft sampling procedure in [1] is not enough? I have an intuitive understanding of this, but I think this should be clearly discussed in the manuscript as this is a central aspect of the paper. + +3) The authors use a certain number of warmup steps to train the network weights without updating the architecture parameters to ensure that “the weights are sufficiently trained”. Can the authors discuss the choice on the number of warmup epochs? + +I gave this paper a 5, but I am overall supportive. Happy to change my score if the authors can address my major concern. + +[1] Liu H, Simonyan K, Yang Y. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055. 2018 Jun 24. + +----------------------------------------------------------- +Post-Rebuttal +--------------------------------------------------------- +The authors have fully addressed my concerns. I changed the rating to a 7. +",7,3.0,ICLR2019 +B1ez1LvJcB,2,HyeX7aVKvr,HyeX7aVKvr,Official Blind Review #3,"This paper presents a method for adapting a model that has been trained to perform one task, so that it can perform a new task (potentially without using any new training data at all—i.e., zero-shot learning). In some ways the presented work is a form of meta-learning or *meta-mapping* as the authors refer to it. The premise of the paper is very interesting and the overall problem is definitely of high interest and high potential impact. + +I believe that the presentation of the proposed method can be significantly improved. The method description was a bit confusing and unclear to me. The experimental results presented were all done on small synthetic datasets and it’s hard to evaluate whether the method is practically useful. Furthermore, no comparisons were provided to any baselines/alternative methods. For example, in Sections 4 and 5 I was hoping to see comparisons to methods like MAML. Also, I felt that the proposed approach in Section 5 is very similar to MAML intuitively. This makes a comparison with MAML even more desirable. Without any comparisons it’s hard to tell how difficult the tasks under consideration are and what would amount to good performance on the held-out tasks. + +In summary, I feel the paper tackles an interesting problem with an interesting approach, but the content could be organized much better. Also, this work would benefit significantly from a better experimental evaluation. For these reasons I lean towards rejecting this paper for now, but would love to see it refined for a future machine learning conference. + +Also, the work by Platanios, et al. on contextual parameter generation is very relevant to this work as it tackles multi-task learning using HyperNetworks. It may be worth adding a short discussion/comparison to that work as it also considers zero-shot learning. 
+ +Minor comments: +- Capitalize: “section” -> “Section”, “appendix” -> “Appendix”, “fig.” -> “Figure”. Sometimes these are capitalized, but the use is inconsistent throughout the paper. +- “Hold-out” vs “held-out”. Be consistent and use “held-out” throughout.",3,,ICLR2020 +H1gP7OO5h7,3,H1gRM2A5YX,H1gRM2A5YX,Very interesting consolidation paper on the analysis of dynamic neural networks,"I really liked this paper and believe it could be useful to many practitioners of NLP, conversational ML and sequential learning who may find themselves somewhat lost in the ever-expanding field of dynamic neural networks. + +Although the format of the paper is seemingly unusual (it may feel like reading a survey at first), the authors propose a concise and pedagogical presentation of Jordan Networks, LSTM, Neural Stacks and Neural RAMs while drawing connections between these different model families. + +The cornerstone of the analysis of the paper resides in the taxonomy presented in Figure 5 which, I believe, should be presented on the front page of the paper. The taxonomy is justified by a thorough theoretical analysis which may be found in appendix. + +The authors put the taxonomy to use on synthetic and real data sets. Although the data set taxonomy is less novel it is indeed insightful to go back to a classification of grammatical complexity and structure so as to enable a clearer thinking about sequential learning tasks. + +An analysis of sentiment analysis and question answering task is conducted which relates the properties of sequences in those datasets to the neural network taxonomy the authors devised. In each experiment, the choice of NN recommended by the taxonomy gives the best performance among the other elements presented in the taxonomy. + +Strength: +o) The paper is thorough and the appendix presents all experiments in detail. +o) The taxonomy is clearly a novel valuable contribution. +o) The survey aspect of the paper is also a strength as it consolidates the reader's understanding of the families of dynamic NNs under consideration. + +Weaknesses: +o) The taxonomy presented in the paper relies on an analysis of what the architectures can do, not what they can learn. I believe the authors should acknowledge that the presence of Long Range Dependence in sequences is still hard to capture by dynamic neural networks (in particular RNNs) and that alternate analysis have been proposed to understand the impact of the presence of such Long Range Dependence in the data on sequential learning. I believe that mentioning this issue along with older (http://ai.dinfo.unifi.it/paolo/ps/tnn-94-gradient.pdf) and more recent (e.g. http://proceedings.mlr.press/v84/belletti18a/belletti18a.pdf and https://arxiv.org/pdf/1803.00144.pdf) papers on the topic is necessary for the paper to present a holistic view of the matter at hand. +o) The arguments given in 5.2 are not most convincing and could benefit from a more thorough exposition, in particular for the sentiment analysis task. It is not clear enough in my view that it is true that ""since the goal is to classify the emotional tone as either 1 or 0, the specific contents of the text are not very important here"". One could argue that a single word in a sentence can change its meaning and sentiment. +o) The written could be more polished. + +As a practitioner using RNNs daily I find this paper exciting as an attempt to conceptualize both data set properties and dynamic neural network families. 
I believe that the authors should address the shortcomings I think hinder the paper's arguments and exposition of pre-existing work on the analysis of dynamic neural networks.",7,3.0,ICLR2019 +fzRqAN1nl0,2,1OQ90khuUGZ,1OQ90khuUGZ,"This work proposed an algorithm called action guidance to solve the sparse reward problem. However, I don't think this paper is fully prepared for submission as the method is not novel enough and there exist some possible issues that need to be discussed and resolved. ","This work proposed an algorithm called action guidance that trains the agent to eventually optimize over sparse rewards while maintaining most of the sample efficiency that comes with reward shaping. The authors examine three sparse reward tasks with a range of difficulties to prove the effectiveness of action guidance. However, I don't think this paper is fully prepared for submission as the method is not novel enough and there exist some possible issues that need to be discussed and resolved. + +Below are the detail comments. + +About the method. The key idea behind action guidance is to create a main agent that trains on the sparse rewards, and creating some auxiliary agents that are trained on shaped rewards. And the main agent follows the instructions of auxiliary agents in the initial stage and the probability of it decreases during the following training. A concern is that if there exist several auxiliary agents, how do you arrange the shaped rewards to each auxiliary agent? If there is a conflict between the shaped rewards for the training and guidance of the agent, will the main agent still be trained well? Besides,the method itself is like using imitation learning to obtain initial policy parameters and continues to optimize using sparse reward, the novelty of the method is not sufficient enough. + +About the experiments. The baselines use PPO to train agents with sparse rewards or shaped rewards respectively and there are no other SOTA methods designed for sparse rewards compared in the experiments, which is not convinced. Besides, in the environment ProduceCombatUnits, the shaped rewards include the reward for each combat unit the agent produces, which is exactly the sparse reward. Is it means that the agent using shaped rewards has the same optimization direction as the one using sparse rewards? I'm not sure if this is fair enough as the effectiveness of action guidance is not clear in this setting. Lastly, the random opponents in the experiments are not strong, I'm wondering about the agent's performance in a harder setting. + +About the writing. The paper is well-written and self-contained. However, the figures to show the typical learned behavior of agents are not clear enough. For example, it's a little bit hard to recognize the enemy units as the blue borders are too thin. + +Overall, I vote for a rejection. + + +",4,4.0,ICLR2021 +7Sq9efum4J,3,_ptUyYP19mP,_ptUyYP19mP,"Well motivated, strong empirical results","Summary + +The authors propose a novel intrinsic reward based on the difference of inverse visitation counts for consecutive states. This reward encourages the agent to explore beyond the boundary of already explored regions. Using a few simple examples, they show that the proposed intrinsic reward mitigates the problems of detachment and short-sightedness which are common for count-based methods. The method shows superior performance on a number of tasks from two procedurally-generated benchmarks, MiniGrid and NetHack. 
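+For concreteness, my reading of the proposed reward (in my own notation, paraphrased from the verbal description rather than copied from the paper) is roughly $r(s_t, s_{t+1}) \propto \max(1/N(s_{t+1}) - 1/N(s_t), 0)$, with $N(\cdot)$ the (approximate) visitation count and the bonus given only on the first visit to $s_{t+1}$ within an episode; the exact form and clipping may differ in the paper.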
The paper also contains comparisons with a few strong baselines (including SOTA on these benchmarks), analysis of the learned behavior and intrinsic reward, as well as ablations of the proposed approach. + + +Strengths + +Overall, I really liked this paper. The proposed method is simple, well motivated, clearly explained, significantly outperforms the SOTA, and solves hard exploration tasks that were previously out of reach. I found the empirical evaluation to be very thorough and comprehensive, including multiple baselines and ablations. I particularly liked the careful motivation of this approach together with the quantitative analysis in Section 5.2 (which compares the visitation counts and intrinsic rewards of different methods). In addition, the authors support their claims regarding the short-sightedness and detachment issues with well designed experiments and metrics. + + +Weaknesses + +One thing that wasn’t clear to me after reading the paper was what are the teacher and student networks used to approximate the visitation counts. Are these equivalent to the predictor and random network used by RND? This is an important detail so please clarify. + +Could the network phi (used to estimate visitation counts) “forget” previously visited states and thus lead to short-sighted behavior in a similar way as count-based methods do (i.e. oscillate between two state regions / corridors)? Have you ever observed this behavior in practice and if not, do you have any intuition why? + +Have you tried scaling the IR reward by the inverse of the episodic state counts (like RIDE does) instead of only using the episodic restriction? It might be interesting to add it as an ablation and give some intuition on which (when) one is preferable. + +I think it would be valuable to include all the baselines in the analysis section i.e. Tables 1, 2, and Figures 4, 5 (they can be in the appendix if there isn’t enough space). + +Can you specify how many seeds you used for computing the mean and std in the plots? I could not find information and it is important in order to understand the significance of the results. + + +Minor Points + +Is there any reason for which in Figure 4 you don’t show the visitation counts for both models for the same number of environment steps? I think that would be a more clear way of presenting those results. + +There is some inconsistency in the notation used. In Figure 1, you use c(s) to denote visitation counts, while in Section 3, you use N(s). + +Typos: +Table 2: “entropy of the” +Figure 2: “MRN6” → MRN6-SX? +Citations in various places contain parentheses where they should not (i.e. in the middle of a sentence). + + +Recommendation + +This paper presents a novel and effective method for an important problem, and it also provides insightful analysis to better understand the limitations of different approaches. Thus, I think it would be a valuable contribution and I recommend it for acceptance. + + + + + + + + + +",7,4.0,ICLR2021 +eQVq6oD4lBl,1,dnKsslWzLNY,dnKsslWzLNY,On the Universal Approximability and Complexity Bounds of Deep Learning in Hybrid Quantum-Classical Computing,"The problem studied in this work is of interest in the quantum machine learning community, as the power of small and noisy quantum computers for machine learning problems is far from being understood. Therefore, it is important to study the expressivity of quantum neural networks as function approximators. 
This work uses the model introduced by Tensorflow Quantum, where different neurons can be implemented on either quantum or classical computers.
+
+However, it is unclear how this result applies to current topologies of QNN/variational circuits used in the current literature. From my knowledge of quantum variational circuits, the architecture proposed is different. To my understanding, this work addresses specifically the model proposed for TensorFlow Quantum, and this should probably be made more explicit, as it is somewhat different from the current literature, where the quantum neural network is the sole computational node and no computation is performed classically (besides the optimization of the parameters of the variational circuit). The model described in this work is called the ""prologue, acceleration, epilogue"" model.
+
+If I understood the work properly, the role of the quantum computer is to evaluate the ""binary"" part of a Binary Polynomial Neural Network (this is what the authors call the acceleration phase), after the data has been loaded as the initial quantum state with a log-depth circuit in the prologue phase. Then, in the epilogue phase, a nonlinearity, such as a ReLU function, is applied classically.
+
+The comparison with the approximation capabilities of neural networks on classical computers is really interesting. Perhaps a more extensive (and recent) literature review on the quantum expressiveness of other variational circuits could be added.
+
+I checked some of the proofs in the manuscript, and they are correct. The paper is nicely written, but it might benefit from some more clarity, especially in the proofs in the appendix.
+
+Overall, the submission would have benefited from experiments showing that a QNN built with the architecture proposed in this work can achieve high accuracy in classification/regression tasks. This could be done either on small quantum computers or simulated on GPUs or large classical computers. Also, it would have been beneficial to write section 3.3 more clearly, perhaps with an example of how the classical optimizer is meant to choose the parameters of this circuit in a machine learning problem, i.e. how this architecture is meant to be used in practice.
+
+Other remarks are the following:
+- Please use a consistent notation for multiplication. If my understanding is correct, in Proposition 3.2 and the subsequent lines you use the notation $x \times y$, $xy$, and $x \cdot y$ to denote the same operation.
+
+- In section 4.1 it's not clear to me what this sentence means:
+ At the end of these operations, both zero states $\ket{0}$ in $Q_1$ and $Q_2$ are y, and the $\ket{0..}$ state in the combination of these two systems, $Q_{1,2}$, will be $y^2$. How can the zero register be $y^2$?
+
+- What is the $p$ in the Discussion section, when discussing the result of Yarotsky?
+
+- I think in some parts of the paper the authors use $n$ for the number of qubits, and elsewhere $\log n$.
+
+- Figure 1 could be split into two figures (left and right), and the notation in the figure could be better explained, as the figure is referenced many times in the paper.
+
+- In the section where the prologue phase is described, it is written: ""As pointed by Bravo-Prieto et al. (2020), unitary matrix A can be decomposed to the quantum circuit with gate complexity of O(log n), where log n is the number of qbits."" This is only true if the matrix $A$ is of the kind specified before, i.e. it's just one single column. 
+ +- The fact that the depth of BPNN in hybrid quantum-classical computing can be of $O(1)$ is a strong result that perhaps should be compared more with the literature on the power of quantum shallow circuits or constant depth circuits. + +Also, I think the work should conform to the widespread and standard notation of using qubits and not qbits. + + Some typos: + - Proof Proposition 4.2. ""the Of"" should be ""the of"". + - After lemma 3.3 ""which is introduced in the following texts"" -> which is introduced next (or in the following section). + - ""To take the advantages of high-parallel in quantum computing, we made an observation on the network structure of BPNN as described in the following Property."" Might be improved. It might be changed into ""high-parallelism"" and rephrased the whole sentence. + -Section 4.2 + ""d input variables and f has weak derivative"" should be derivativeS. + - . This brings flexibility in implementing functions (e.g., ReLU), while at the same calls for interface for massive data transfer between quantum and classical computers. Perhaps you wanted to say ""at the same time calls for fast interfaces""",4,4.0,ICLR2021 +BJeqs6mSiH,4,Bylthp4Yvr,Bylthp4Yvr,Official Blind Review #3,"The paper is well written, but severely overestimates the core contributions embedded in section 3. + +Firstly, the idea of using drop out for matrix sensing seems to be a somewhat trivial extension of the work on dropout for matrix factorization -- http://proceedings.mlr.press/v84/cavazza18a/cavazza18a.pdf. I am surprised that this work is not cited with its due credit in the writing. I hope to see some concrete statement about the difference between the authors' contribution and the existing literature. It is fine to propose an incremental improvement as long as the original work receives the required credit. + +Dropout has been an extensive area of theoretical research for the past few years. The overly simplistic statement about the limitation of our understanding of how dropout works are a little disappointing, particularly so when the authors themselves list this literature in the related work section. That dropout training can be perceived as an adaptive regularization is also very well known and researched. These arguments weaken the second and third contribution of the paper. I do understand though that extending the results of deep linear networks to a single hidden layer RELU network is non-trivial. The derivations also suggest so and the authors deserve credit for attempting these derivations. + +I am also quite doubtful about the setting of the experiments in section 4.1. That changing the batch-size or learning rate does not significantly influence the eventual performance is very counter-intuitive. + +Overall, the paper appeared quite promising in the beginning, but the claims in the introduction are not well supported through the rest of the paper. +",3,,ICLR2020 +Skgbfz3Z6Q,3,SJxJtiRqt7,SJxJtiRqt7,I think the problem is ill-posed; the image generations from image features are not great; baselines from class labels would have worked beter; lacks motivation.,"PROS: +* The paper was well-written and explained the method and the experiments well + +CONS: +* The problem seems ill-posed to me. Sound is temporal and the problem should probably be sound-to-video conversion not sound-to-image. +* A link to generated images from sounds where one could actually evaluate the generations would be useful. Currently the only way to evaluate the results is via labels. 
+* Similarly, a baseline where images are generated given the classification labels of the sounds would probably produce better looking images. Such baseline is not provided, and it is not clear to me what a multi-modal feature extraction is providing on top of this. For example, in the case of StackGAN, the GAN that was converting text to images, the text was describing something about the image that one could quantify in the resulting generation (eg a blue bird as opposed to a yellow one). Here such an advantage is not clear and if there is one, it should be clearly stated and discussed. +* The results in Fig. 3 seem particularly poor and on par with current GAN generations. I think this part of the model should be improved before attempting to improve the rest. +* In Figures 6 and 7, it is not clear what we are expected to see. Also, the labels do not correspond to the real images in many of hte cases (eg pajama, wing, volcano etc). + + +Finally in the discussion, DiscoGAN is mentioned as something to look into for future work. I should note that DiscoGAN is converting samples between domains of the same modality (vision), in the context of domain adaptation, similarly to other works. +",3,4.0,ICLR2019 +H1lXb9okiX,1,H1g2NhC5KQ,H1g2NhC5KQ,"Impressive experiments, but hard to determine how much is methodologically new here","The paper proposes ""style transfer"" approaches for text rewriting that allow for controllable attributes. For example, given one piece of text (and the conditional attributes associated with the user who generated it, such as their age and gender), these attributes can be changed so as to generate equivalent text in a different style. + +This is an interesting application, and somewhat different from ""style transfer"" approaches that I've seen elsewhere. That being said I'm not particularly expert in the use of such techniques for text data. + +The architectural details provided in the paper are quite thin. Other than the starting point, which as I understand adapts machine translation techniques based on denoising autoencoders, the modifications used to apply the technique to the specific datasets used here were hard to follow: basically just a few sentences described at a high level. Maybe to somebody more familiar with these techniques will understand these modifications fully, but to me it was hard to follow whether something methodologically significant had been added to the model, or whether the technique was just a few straightforward modifications to an existing method to adapt it to the task. I'll defer to others for comments on this aspect. + +Other than that the example results shown are quite compelling (both qualitatively and quantitatively), and the experiments are fairly detailed. +",6,3.0,ICLR2019 +LdMS5eyWK-g,1,qda7-sVg84,qda7-sVg84,Theoretically well-motivated approach and promising empirical results,"Motivated by the issue of RL policy generalization, this paper explores improving generalization via a learned contrastive representation to embed states in, rather than through data augmentation or regularization alone. They demonstrate that their approach of policy similarity embeddings (PSEs) leads to improved generalization on several benchmarks, building on the method of policy bisimulation. + +Strengths: ++ I think this approach is theoretically well-motivated, and the authors also give good intuition, especially in light of the latest work on self-supervised learning. 
++ Experimentally, I thought the controls were well-chosen and illustrated the utility of their method on each task they considered (Figures 4 and Tables 1 and 3). + +Weaknesses: +- I would be curious to see how this approach works on more scaled up tasks with larger action spaces (Section 6.2 is in that direction, for instance), but the fact that it outperforms data augmentation techniques (including bisimulation transfer) in these simpler domains is promising. +- In the jumping task, no intuition is given as to why PSE underperforms in the “narrow grid” (Table 1, first row) compared to bisimulation transfer, and it would be helpful to explain the failure cases of their method in more detail generally. + +As it stands, I think the ideas of this paper are interesting and novel, but I would like to see tasks with larger action spaces or more naturalistic inputs than the ones considered here, and in the cases where the method does not perform well, documentation and potential intuition for why that might be. Therefore, I recommend a weak accept.",6,3.0,ICLR2021 +BkgFAhVU3m,3,B1GHJ3R9tQ,B1GHJ3R9tQ,Review,"This paper proposes a technique for learning a distribution over parameters of a neural network such that samples from the distribution correspond to performant networks. The approach effectively encourages sampled parameters to have low loss on the training set, and also uses an adversarial loss to encourage the distribution of parameters to be Gaussian distributed. This approach can improve performance slightly by using ensembling and can be useful for uncertainty estimates for out-of-distribution examples. The approach is tested on a few simple problems and is shown to work well. + +I am definitely in favor of exploring adversarial divergences (using a critic as a differentiable loss to compare two distributions) in unusual settings, and this paper certainly does this. The idea of transforming samples from a prior such that the transformed sample corresponds to useful network parameters is interesting. The results also seem promising. However, currently the mathematical description of this method is completely unclear and ridden with many errors. I can understand at a reasonable level what the approach is doing from Figure 1, but the definitions and equations given in Equation 3 are at times nearly incomprehensible ""mathiness"". I'm giving the paper a borderline accept because the idea is interesting and the results are OK; I will raise my score if Section 3 is dramatically improved. I give some specific examples of issues with Section 3 in my specific comments below. I'd also note that the paper does a somewhat poor job comparing to existing work - only section 4.2 includes a comparison to existing ""uncertainty"" methods. This should also be improved - the authors should implement the existing methods and use them as a point of comparison in all of their experiments. As a final high-level note, the approach is described at various points as an ""autoencoder"" particularly in reference to the adversarial autoencoder. However, the approach does not ""autoencode"" anything - there is no reconstruction term, or input apart from the noise samples. The only thing it has in common with the adversarial autoencoder is the use of a critic to enforce a distributional constraint. Calling it, or comparing it to, an autoencoder is confusing and misleading. + +Specific comments: + +- You mention fast weights in related work. 
I believe Hinton and Plaut were the first to propose fast weights in ""Using Fast Weights to Deblur Old Memories"", and I'd also suggest mentioning ""Using Fast Weights to Attend to the Recent Past"" which is a more recent demonstration that fast weights can be useful on modern problems. +- The are some issues with your description of Equation 1: First, I don't believe you define G(z) (I assume it is the ""decoder"" network; please define it). Second, in practice I don't believe you actually use JSD or MMD for D_z; you use a critic architecture which in some limit approximates some statistical divergence but in practice they typically don't (see e.g. Arjovsky and Bottou 2017; Fedus et al. 2017; Rosca et al. 2018). Third, writing Q_z \sim Q(z | x) seems strange to me - Q_z is a distribution, and I don't believe that Q(z | x) is a distribution over distributions, so how are you sampling a distribution (Q_z) from Q(z | x) as suggested by the use of the \sim notation? I think you simply mean that Q_z is Q(z | x) approximately marginalized over x. +- Equation 2 is also not clear. First, the sentence before starts ""Suppose the real parameters \theta^* \sim \Theta..."" The equation itself does not include \theta^* or \Theta so I don't see what this is referring to. Second, the expression for an m-dimensional is written \mathcal{N}(0, \sigma^2, I_m). It's not clear why there is a comma before I_m, and I_m is not defined (though I assume it is the m \times m identity matrix) - did you mean to multiply I_m by \sigma^2? Third, it looks like you actually define P_z twice, once as ""an m-dimensional isotropic Gaussian"" and again as ""a Kd-dimensional isotropic Gaussian""; am I to infer that m = Kd? Why use both? Fourth, you mention the joint P(x, y) but the expectation is taken over P_x and P(y | x). Why call it P_x and not P(x)? And why not compute the expectation over P(x, y)? Fifth, you write ""Here the encoder..."" -- you never define that Q(z) or G is ""the encoder"", I assume Q(z). It is strange to take the expectation over Q(z) (I assume sampling z \sim Q(z)) but then have the term Q(z) appear in (2). How are Q(z) and Q_z related? On that note, I don't see how (2) is an autoencoder, since there is no Q(z | x) term. It appears instead that you are sampling z from Q(z) which doesn't condition on x. So what is being autoencoded? Related, you write ""all the q_k (that will generate different layers) will be correlated, unlike dimensions of z which are drawn to be independent from each other."" But if Q(z) = [q_1, ..., q_K] then doesn't the secont term in (2) suggest that they are being enforced to be similar to the prior P_z, and therefore uncorrelated? Note that you also say later on ""The job of the regularizer D_z(P_z, Q_z) is to force each embedding q_n to approximate P_z."" Frankly at this point I will stop pointing out issues with this equation and discussion since they are so widespread. +- In your definition of your ensemble scoring rule, you are taking the sum over N + 1 elements (n = 0 to N) but dividing by N. +- In 4.2, do you use the same model architecture/training/regularization etc. as in previous studies? If not I think comparing the different methods will be conflated by differences in training procedures. Since you do not report results in many experimental settings, I assume you don't. +- In Figure 3, why not plot the true standard deviation around the true function? It appears you are only plotting +/- 3 stndard deviations for the learned function. 
+- Why not include 100 models L2 on Figure 3? +- It's not clear to me why you define your ""disagreement d"" when it appears the same as the entropy score you used in 4.4. +- A stronger and more convincing attack would be to attack the ensemble of models, instead of attacking a single model and testing on the ensemble.",6,5.0,ICLR2019 +qiypVXHx8FG,4,YTWGvpFOQD-,YTWGvpFOQD-,"Interesting experimental study of improved, differentially private image classification","The paper considers ways of improving private versions of SGD in the context of image classification. The main finding is that providing ""hand crafted"" features can significantly improve the privacy/accuracy trade-off. In some cases, even a linear model built on top of such features (like those produced by ScatterNet), can improve over differentially private SGD. A plausible explanation for this phenomenon is that extra features can reduce the number of iterations required in SGD, resulting in better privacy and/or less noise. (It is also argued that having much more data similarly improves the trade-off, but this is unsurprising and, it seems, has been observed before by McMahan et al.) + +The paper is quite well-written, and I found it easy to follow even though this is not my area of expertise. I also like that it presents a number of possible directions for further improving private SGD, including transfer learning from related, public data sets, and second-order optimization. + +A possible criticism is that in principle the ""hand crafted"" features may have been built based on empirical work on MNIST and CIFAR-10, and the same goes for the architecture choices, so in theory there could be some privacy leakage from these choices. It would have been more impressive to demonstrate effectiveness of a newer data set, not known when ScatterNet and the used CNN architectures were proposed. + +Two final comments: +- ""Unlearned"" usually means that you have (deliberately) forgotten something, so it is not the same as ""not learned"". +- It would be interesting to consider the setting where just the image *label* is private. Has DP SGD been considered in that setting?",7,2.0,ICLR2021 +BJjnIbMVe,2,HJGODLqgx,HJGODLqgx,Review,"Putting the score for now, will post the full review tomorrow.",7,3.0,ICLR2017 +Byjs3NyZz,2,rk9kKMZ0-,rk9kKMZ0-,The authors propose a method that uses an embedding network trained with magnet loss for adaptively sampling and feeding the student network that is being trained for the actual task,"While the idea is novel and I do agree that I have not seen other works along these lines there are a few things that are missing and hinder this paper significantly. + +1. There are no quantitative numbers in terms of accuracy improvements, overhead in computation in having two networks. +2. The experiments are still at the toy level, the authors can tackle more challenging datasets where sampling goes from easy to hard examples like birdsnap. MNIST, FashionMNIST and CIFAR-10 are all small datasets where the true utility of sampling is not realized. Authors should be motivated to run the large scale experiments. + +",4,4.0,ICLR2018 +Byln3WvatS,1,Hyg9anEFPS,Hyg9anEFPS,Official Blind Review #1,"The submission proposes a method to perform neural rendering. From a set of images taken of a static object under constant illumination, the proposed method first selects the four viewpoints nearest to the requested novel viewpoint. 
Then, the images corresponding to those viewpoints are blended together using an encoder-decoder network to produce the novel view image. The key contribution of the submission over previous work is the handling of view-dependent effects such as specular highlights. Those view-dependent effects are first removed from the nearest retrieved images using the proposed EffectsNet. The resulting estimated diffuse images are then re-projected to the target viewpoint and view-dependent effects from this viewpoint are added before blending the images together. + +The proposed method improves the robustness of existing neural rendering methods to materials that are not roughly diffuse. + +The process to select the 20 reference images is based on a coverage scheme that is never presented, hindering reproducibility. Would it be possible to describe this scheme? Similarly, the hyperparameters used for the adversarial loss are not explicitly stated, are they exactly the same as the cited Pix2Pix? Would the source code be shared publicly? + +I would have appreciated an ablative experiment where the proposed pipeline is kept as-is, but a state-of-the-art intrinsic decomposition technique (as discussed in sec. 2) was applied instead of EffectsNet. + +I would recommend using subsections to prevent confusion in references (for example, sec. 6, p. 5 referring to sec. 6). + +In fig. 6, the quotient image is hard to interpret as it is not linear. A colormap representing the percentage of error wrt. the ground truth might be easier to interpret. + +EffectsNet is used for both removal and addition of view-dependent effects, which makes part of the paper confusing. Maybe adding the mathematical symbols of eq. 1 to fig. 2 might help the reader to understand the training steps? + +The impact of the regularizer weight of 0.01 (p. 5) is not discussed. I suspect this value might be important, as too much regularization might give underestimated view-dependent effects and too little might provide strong and incoherent effects. An analysis of this hyperparameter would be welcome. + +In my opinion, the proposed abstract is slightly hard to read, maybe it would benefit from being shorter? + +Minor details +- p. 3 “An extensive overview is give[n] in [...]” +- p. 8 I believe the “z” of the unit “Hz” is wrongly stylized as a mathematical variable. +",6,,ICLR2020 +H1g9GJoN37,1,HJedho0qFX,HJedho0qFX,"Limited contribution. In addition, it's hard to identify the contribution of this paper.","This paper extends the previous work (Dharmaretnam & Fyshe, 2018), which provided a analytic tool for understanding CNNs through word embeddings of class labels. By analyzing correlations between each CNN layers and class labels, enables to investigates how each layer of CNNs work, how much it performed well, or how to improve the performance. + +I felt it is little hard to read this paper. Although the short summary of contributions of this paper in the Introduction, I could not easily distinguish contributions of this paper from the ones of the previous work. It's better to explicitly explain which part is the contributions of this paper in detail. For example, ""additional explorations of the behavior of the hidden layers during training"" is not clear to me because this expression only explain what this paper do briefly, not what this paper is actually different from the previous work, and how this difference is important and crucial. 
+
+Similarly, I could not understand why adding concepts, architectures (FractalNet), and datasets (CIFAR-100) is so important. Although this paper states these changes are one of the contributions, it is unclear whether these changes lead to significant insights and findings which the previous work could not find, and whether these findings are important enough to count as contributions of this paper. Again, I think it is better to describe what the main contributions of this paper are in more detail.",4,2.0,ICLR2019
+_ZC206F-Kb,2,qoTcTS9-IZ-,qoTcTS9-IZ-,Interesting story; but the stability conditions defined in this paper are not justified to be important ,"This paper proposed a general framework to derive different stable limiting behaviors of the dynamics of two-layer neural networks under different parameterizations of the hyper-parameters. For certain choices of hyper-parameters, this recovers the mean-field limit and the NTK limit. This paper also proposed certain properties of the limiting dynamics and showed that, using these properties as the classification criteria, there are only a finite number of distinct models in the limit. This paper also proposed a novel initialization-corrected mean-field limit that satisfies all properties.
+This paper tells an interesting story. The question is whether the problem solved by the story is important.
+The main consideration of this paper is to find regimes of hyper-parameters such that the limiting dynamics are stable in some sense. For this purpose, the authors advocate the IC-MF regime, such that the limiting dynamics are stable with respect to all the conditions that the authors proposed. This story seems to be self-contained on its own. However, I believe that two more important criteria for good training algorithms are optimization and generalization efficiency, which are not discussed by the authors.
+Optimization and generalization efficiency are much more important than the stability conditions in practice. There could be some regimes of hyper-parameters that do not satisfy the stability conditions but have good optimization and generalization efficiency. Although the IC-MF regime proposed in this paper seems to satisfy additional stability conditions, intuitively its generalization efficiency seems not as good as that of the MF regime: it adds an additional noisy function $f_{ntk, \infty}^{(0)}$ to the mean-field prediction function, and this additional noise will intuitively hurt generalization.
+I believe some of these stability properties should have some connection to optimization and generalization. 
For example, if some simple stability property is violated, the algorithm cannot generalize well. I feel that the authors should try to build connections of stability properties to optimization and generalization, to justify the importance of condition 1 and condition 2 defined in the paper. +Above all, I feel that this paper is interesting in its own criteria. However, it didn't justify that its criteria are important. So I feel that this paper is on the borderline. + +Minor issues: + 1. Some notations are easy to get readers confused. Eq. (7), $\sigma(d) = \sigma^*(d / d^*)^{q_\sigma}$. Here $\sigma$ is a function of $d$ while $\sigma^*$ is a scaler (not as a function of $d / d^*$). It takes me while to understand this. + 2. Typos: page 18: in this case $1 + \tilde q + 2 q_\sigma$. + 3. The notations of this paper looks very complicated, especially the superscripts and subscripts. +",5,3.0,ICLR2021 +SyeFixSscr,3,rklklCVYvB,rklklCVYvB,Official Blind Review #5,"This paper introduces a particular learnable vector representation of time which is applicable across problems without the use of a hand-crafted time representation. Their representation makes use of a feed-forward layer with sine activations which operates on time data. As it is a vector representation, it combines well with other deep neural network methods. They motivate their problem well, explaining why time data is important to a variety of problems and situate their solution as an orthogonal approach to many current solutions in the literature. They make reference to fourier analysis as motivation for their representation. Finally, they provide experimental results to support their claims using fabricated and real-world time series datasets, as well as ablation studies to support their design decisions. + +While I think this work has the potential to be a significant contribution, I rate this a weak reject because the theoretical motivation and analysis of the experimental results are lacking the depth of evidence I would expect for an ICLR paper. If you provide a deeper discussion of the provable claims about the power of your model via Fourier analysis and provide a table of test accuracy/recall@K with/without your representation for more than one other state of the art algorithm for these datasets, I would be convinced to strong accept. + +Specific comments: + +* p.3 third paragraph: you repeat yourself in math notation a few times here. Repeated equations usually indicate that there is something new happening, but all of these are just restatements of your theta sin(omega tau + phi) term. I would introduce the notation for t2v(tau) upfront and use that to define a(tau, k)[j] and f_j +* p.3 A clearer explanation of the theory here would help, as I think Fourier's theorem nicely supports your claims. +* p.4 first paragraph you claim that this method responds well to data which exhibits seasonality, but none of your datasets deal with data that would exhibit seasonality. There are plenty of simple real-world datasets available which show multi-scale periodic phenomena (activity or location data, weather data, travel data, etc.). In fact, segmentation and recognition of wearable device activity would be a great application for this method. +* p.4 third paragraph: Your claim of invariance to time rescaling is technically correct, but I am not convinced that a model can learn the correct omega values for an arbitrary rescaling (e.g. if the period is smaller than the time unit). 
You show that this works for a rescaling from 2pi/7 to 2pi/14, but it would be nice if there was experimental confirmation of this property with frequency > 1. +* p.6 Showing accuracy/recall across training epochs is not sufficient evidence to show that this is a useful representation. There should be some kind of comparison with test set results from other state-of-the-art work on these datasets. If adding your representation to the SOTA model improved test set performance (or at least sped up training without hurting test set performance), then that would be better evidence. If LSTM+T is the SOTA, say so and restate the author's test performance compared to yours. If this is what these graphs show, consider using a different visualization to make it clearer that you're improving the final performance, not just the training process. +* p.8 I think sine functions make optimization harder because they make the gradient function periodic with respect to the weights, creating infinitely many local extrema. Historically this may have been an issue, but deep neural networks have so many local minima it might not matter. Still, it would be good to show that trained performance doesn't depend on the initialization values more than a standard LSTM+T model. +* You have an interesting corner case where your neural network parameters are interpretable: you can interpret the omega values from your model as frequencies and investigate their values to see which kinds of periodicity your model uses. You do something like this on p.7, but it would be neat to see a histogram like the one you have for EventMNIST for one of the real-world datasets to see if it learns the domain-relevant time knowledge you claim that it should learn.",3,,ICLR2020 +S1gbNt0nYr,2,S1eYKlrYvr,S1eYKlrYvr,Official Blind Review #3,"This paper has two main contributions. First, the authors perform an extensive study to understand the source of what they refer to as 'environment bias', which manifests itself as a gap in performance between environments used for training and unseen environments used for validation. The authors conclude that of the three sources of information provided to the agent (the natural language instruction, the graph structure of the environment, and the RGB image), the RGB image is the primary source of the overfitting. The second contribution is to use semantic information, compact statistics derived from (1) detected objects and (2) semantic segmentation, to replace the RGB image and provide input to the system in a way that maintains state-of-the-art performance but shrinks the performance gap between the seen and unseen data. + +This paper has some pretty exhaustive treatment diagnosing the source of the agent's 'environment bias' (which, as I discuss below, I believe is more accurately referred to as 'overfitting') in Sec. 4. To me, this is this highlight of the paper, and some interesting work; the investigation of the behavior of the system is interesting and informative. It provides a framework for thinking about how to diagnose this behavior and identify its source. The authors use this rather extensive study to motivate the need for new features (semantic features) to replace the RGB image that their investigation finds is where much of this 'environment bias' is located. Unfortunately, it is here that the paper falls flat. 
The authors' proposed methods perform nominally better on the tasks being investigated, but much of the latter portion of the paper continues to focus on the 'improvement' in the metric they use to diagnose the 'bias'. As I mention below, the metric for success on these tasks is performance on the unseen data, and, though an improvement on their 'bias' metric is good anecdotal evidence that their proposed methods are doing what they think, the improvements in this metric are largely due to a nontrivial decrease in performance on the training data. Ultimately, this is not a compelling reason to prefer their method. I go into more detail below about where I think some of the other portions of the paper could be improved and include suggestions for improvement.
+
+High-level comments:
+- I am uncertain that 'bias' is the right word to describe the effect under study. In my experience, environment bias (or, more generally, dataset bias) usually implies that the training and test sets (or some subset of the data) are distinct in some way, that they are drawn from different distributions. The learning system cannot identify these differences without access to the test set, resulting in poor performance on the 'unseen' data. In the scenario presented here, the environments are selected to be in the train/test/validation sets at random. As such, the behavior described here is probably more appropriately described as 'overfitting'. The shift in terminology is not an insignificant change, because using 'bias' to describe the problem incorrectly suggests that the data collection procedure is to blame, rather than a lack of data or an over-parametrized learning strategy; I imagine that more data in the training set (if it existed) could help to reduce the gap in performance the paper is concerned with. That being said, I imagine some language changes could be done to remedy this.
+- Perhaps the biggest problem with the paper as written is that I am not convinced that the 'performance gap' between the seen and unseen data is a metric I should want to optimize. This metric is instructive for diagnosing which component of the model the overfitting is coming from, and Sec. 4 (devoted to a study of this effect) is an interesting study as a result. However, beyond this investigation, reducing the gap between these two is not a compelling objective; ultimately, it is the raw performance on the unseen data that matters most. The paper is written in a way that very heavily emphasizes the 'performance gap' metric, which gets in the way of its otherwise interesting discussion diagnosing the source of overfitting and some 'strong' results on the tasks of interest. The criteria should be used to motivate newer approaches, rather than being the metric we value for its adoption. This narrative challenge is the most important reason I cannot recommend this paper in its current state.
+- Using semantic segmentation, rather than the RGB image, as input seems like a good idea, and the authors do a good job of motivating the use of semantics (which should show better generalization performance) rather than a raw image. However, the implementation in Sec. 6.3 raises a few questions. First (and perhaps least important) is that 6.3 is missing some implementation details. In this section, the authors mention that 'a multilayer perceptron is used' but do not provide any training or structure details; these details should be included in an appendix. 
More important is the rather significant decrease in performance on the seen data (11% absolute) when switching to the learned method. Though the performance on the unseen data does not change much, it raises some concerns about the generalizability of the learning approach they have used: in an ideal world with infinite training data, the network would perfectly accurately reproduce the ground truth results, and there should be no difference between the two. Consequently, the authors should comment on the discrepancy between the two and the limits of the learned approach, which I worry may limit its efficacy if more training data were added. + +Smaller comments: +- I do not fully understand why the 'Touchdown' environment was included in Table 1, since the learned-semantic agent proposed in the paper was not evaluated. The remainder of the experiments are sufficient to convince the reader that this gap exists, and I would recommend either evaluating against the proposed technique or removing this task from the paper. +- Figure captions should be more 'self-contained'. Right now, they describe only what is shown in the figure. They should also describe what I, as a reader, should take away or learn from the figure. This is not always necessary, but in my experience improves readability, so that the reader does not need to return to the body of the text to understand. +- The use of a multilayer perceptron for the Semantic Segmentation learned features, trained from scratch, stands out as a strange choice, when there are many open source implementations for semantic segmentation exist and could be fine-tuned for this task; a complete investigation (which may be out of scope for the rebuttal period) may require evaluating performance of one of these systems.",3,,ICLR2020 +S1gglldXqB,3,SJlVY04FwH,SJlVY04FwH,Official Blind Review #3,"1) Summary +The manuscript presents a theoretical convergence analysis of gradient-based saddle point algorithms to solve min-max problems with a bilinear objective function. In particular, the analysis covers block updates and joint updates. + +2) Quality +The paper -- being a theoretical analysis rather than a new algorithm -- seems mathematically rigorous but lacks motivation and also explanation. + +3) Clarity +The notation is pretty clear and the results seem convincing but the mathematical formulation in eq. 2.1 and related assumptions (E being invertible, b and c being) need to be better justified in order to make the paper accessible to a broader audience. + +4) Reproducibility +The data is mainly synthetic and the codes for the update schemes are available. This should render the results reproducible. + +5) Evaluation +The evaluation is on synthetic settings and the results on the convergence seem convincing but a subtantial optimization problem remains for future work. + +6) Questions/Issues + A) Not sure about the implications, isn't the set of saddle points in eq (2.2) equivalent to x=y=0? If this is intended, there needs to be some explanation. + B) It is not clear how well the results of the bilinear setting are applicable to the general case of bivariate functions. E.g. below eq (2.5) and below eq (5.1). There needs to be more justification. + C) The start of Section 2 kicks off a little rough i.e. the formal setting could be better motivated. 
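+ (A small sanity check for A), using my own notation for eq (2.1) rather than the manuscript's: for a bilinear objective $f(x,y) = x^\top E y + b^\top x + c^\top y$, the first-order conditions are $E y + b = 0$ and $E^\top x + c = 0$; with $b = c = 0$ and $E$ invertible this forces $x = y = 0$, which is why I suspect the saddle-point set in eq (2.2) collapses to the origin.)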
+ +7) Details + a) Section 2, below eq 2.2: ""biliner games"" -> ""bilinear games"" + b) Capitalization in references: Nash, GAN, Potenzreihen, Einheitskres + c) References: ""méthode iterative de résolution d'une équation variationelle""",3,,ICLR2020 +BytMf6WVe,2,rJ8uNptgl,rJ8uNptgl,interesting experimental evaluation of variable bit-rate CNN weight compression scheme,"This paper proposes a novel neural network compression technique. +The goal is to compress maximally the network specification via parameter quantisation with a minimum impact on the expected loss. +It assumes pruning of the network parameters has already been performed, and only considers the quantisation of the individual scalar parameters of the network. +In contrast to previous work (Han et al. 2015a, Gong et al. 2014) the proposed approach takes into account the effect of the weight quantisation on the loss function that is used to train the network, and also takes into account the effect on a variable-length binary encoding of the cluster centers used for the quantisation. + +Unfortunately, the submitted paper is 20 pages, rather than the 8 recommended. The length of the paper seems unjustified to me, since the first three sections (first five pages) are very generic and redundant can be largely compressed or skipped (including figures 1 and 2). Although not a strict requirement by the submission guidelines, I would suggest the authors to compress their paper to 8 pages, this will improve the readability of the paper. + +To take into account the impact on the network’s loss the authors propose to use a second order approximation of the cost function of the loss. In the case of weights that originally constitute a local minimum of the loss, this leads to a formulation of the impact of the weight quantization on the loss in terms of a weighted k-means clustering objective, where the weights are derived from the hessian of the loss function at the original weights. +The hessian can be computed efficiently using a back-propagation algorithm similar to that used to compute the gradient, as shown in cited work from the literature. +The authors also propose to alternatively use a second-order moment term used by the Adam optimisation algorithm, since it can be loosely interpreted as an approximate Hessian. + +In section 4.5 the authors argue that with their approach it is more natural to quantise weights across all layers together, due to the hessian weighting which takes into account the variable impact across layers of quantisation errors on the network performance. +The last statement in this section, however, was not clear to me: +“In such deep neural networks, quantising network parameters of all layers together is more efficient since optimizing layer-by-layer clustering jointly across all layers requires exponential time complexity with respect to the number of layers.” +Perhaps the authors could elaborate a bit more on this point? + +In section 5 the authors develop methods to take into account the code length of the weight quantisation in the clustering process. +The first method described by the authors (based on previous work), is uniform quantisation of the weight space, which is then further optimised by their hessian-weighted clustering procedure from section 4. +For the case of nonuniform codeword lengths to encode the cluster indices, the authors develop a modification of the Hessian weighted k-means algorithm in which the code length of each cluster is also taken into account, weighted by a factor lambda. 
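+As I read it, the modified clustering objective is roughly $\sum_i h_{ii}\,(w_i - c_{k(i)})^2 + \lambda \sum_i |b_{k(i)}|$, where $h_{ii}$ is the (diagonal) Hessian weight of parameter $w_i$, $c_k$ are the cluster centers, $k(i)$ the cluster assignment, and $|b_k|$ the codeword length of cluster $k$; this notation is my own reconstruction from the description above, not taken from the paper.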
Different values of lambda give rise to different compression-accuracy trade-offs, and the authors propose to cluster weights for a variety of lambda values and then pick the most accurate solution obtained, given a certain compression budget. + +In section 6 the authors report a number of experimental results that were obtained with the proposed methods, and compare these results to those obtained by the layer-wise compression technique of Han et al 2015, and to the uncompressed models. +For these experiments the authors used three datasets, MNIST, CIFAR10 and ImageNet, with data-set specific architectures taken from the literature. +These results suggest a consistent and significant advantage of the proposed method over the work of Han et al. Comparison to the work of Gong et al 2014 is not made. +The results illustrate the advantage of the hessian weighted k-means clustering criterion, and the advantages of the variable bitrate cluster encoding. + +In conclusion I would say that this is quite interesting work, although the technical novelty seems limited (but I’m not a quantisation expert). +Interestingly, the proposed techniques do not seem specific to deep conv nets, but rather generically applicable to quantisation of parameters of any model with an associated cost function for which a locally quadratic approximation can be formulated. It would be useful if the authors would discuss this point in their paper. +",7,3.0,ICLR2017 +vlVBbgRebL_,4,d_Ue2glvcY8,d_Ue2glvcY8,"The paper proposed a text generation method which can utilize the language structures from character-level, word-level to sentence-level structure.","The paper proposed a text generation method which can utilize the language structures from character-level, word-level to sentence-level structure. The proposed model, structure-aware transformer (SAT), explicitly incorporates multiple types of multi-granularity structure information to guide the text generation with corresponding structure. + +Pros: +1. The paper is clearly written. The experiments show the effectiveness of the proposed method. +2. The proposed method can explicitly incorporate multiple types of multi-granularity structure information to guide the text generation. +3. The proposed method shows its advantages in structure control. + +Cons: + +1. The method just incorporates structure information in the encoder part rather than the decoder. It's hard to guarantee the quality of structure control. Moreover, this way to incorporate structure information is just adding extra features and not a new framework. +2. The structure information used in the paper is only segmentation and POS information. Other information, such as syntax, also should be considered. Segmentation and POS information are just shallow structure information. +3. BLEU score for POS or BMES is not suitable. Since the structure information is given, the generated text should be evaluated with other measures, such as F1 score. +4. The fluency could be due to the pre-trained language model GPT2. An experiment should be performed without PLMs. +5. Some related references are missing. + +Questions: +1. The title of paper is ""Structure Controllable Text Generation"", but the proposed method is just to infuse structure information as features. Therefore, the proposed method is more like ""structure-infused"" rather than ""structure controllable"". + +Missing References: +1. Zhang X, Yang Y, Yuan S, et al. Syntax-infused variational autoencoder for text generation[J]. 
arXiv preprint arXiv:1906.02181, 2019. +2. Casas N, Fonollosa J A R, Costa-jussà M R. Syntax-driven Iterative Expansion Language Models for Controllable Text Generation[J]. arXiv preprint arXiv:2004.02211, 2020. +3. Wu S, Zhou M, Zhang D. Improved Neural Machine Translation with Source Syntax[C]//IJCAI. 2017: 4179-4185. + +",5,4.0,ICLR2021 +PaQjUjJCTJ0,1,Ua6zuk0WRH,Ua6zuk0WRH,Great theoretical and experimental results but missing time equalized comparisons with simple baselines ,"### Summary + +The authors propose to use the kernel feature map self-attention formulation introduced in [1] to efficiently approximate the softmax attention. The main contribution of the paper lies in the proposed _positive random features_ that can approximate softmax with a strictly positive feature map without which the training is unstable. The authors also show that an approximation of softmax is not necessary for good performance and actually use ReLU random features to achieve their best results when training from scratch. + +### Strengths + +- The paper deals with a very pressing and important issue, that of the scalability of self-attention. +- The positive random features are also useful outside of the context of self-attention for efficiently approximating softmax. +- The experimental results provide strong evidence about the performance of training transformers with kernel feature-maps. + +### Weaknesses + +The biggest weakness of the paper in my opinion is the lack of comparison with a simple feature-map as proposed in [1]. Since the authors also use the ReLU random features, we establish that approximating softmax is not necessary for good performance. + +1. What would the performance be if a simple deterministic feature map was used? +2. What would it be if the computational cost was equalized either by adding more layers or by increasing the dimensionality of the queries and keys? + +The second weakness of the paper concerns the evaluation of the practical softmax approximation capabilities. I find the theoretical results interesting and important but I would like more experimental evidence. Without fine-tuning, the authors provide evidence that the approximation does not work (Fig 5). + +1. What would happen, for instance, in a toy task where the Lipschitz constant of the transformer layers was kept low? How big would the feature map need to be in order for the approximation to work in such a simple case? +2. What is being approximated in Fig 4? Is it a randomly initialized attention? What is the rank of the attention matrix? +3. How good would the approximation be for an attention matrix that is almost full rank and how many features would we need then? + +### Reasons for my recommendation + +I am recommending acceptance because I believe that the positive random features is an important contribution both for transformers and for kernel approximation. In addition, the experimental results are impressive and show that fast kernelized attention indeed works in practice. My only reservation for a higher score is, as mentioned in the weaknesses section, the lack of comparison with simpler feature maps under equalized computation time. + +[1]: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention",7,5.0,ICLR2021 +4lVSNUxzPq,2,Px7xIKHjmMS,Px7xIKHjmMS,Several parts unclear,"Overview:  +======== +The paper suggests a modification to graph neural networks, which is claimed to overcome GNN expressiveness issues recently shown by Loukas (ICLR 2020). 
+
+Comments:
+=========
+I had multiple difficulties in following the content of the paper, some are detailed next.
+
+-- ""Unfortunately, [GNNs] require large depth as proved in Theorem 6 above""
+The lower bound in Theorem 6 is for approximation alpha < 1.5, but here you discuss approximation O(log n), doesn't this void the lower bound? Why cannot usual GNNs produce an O(log n) approximation (and specifically Bourgain's embedding)?
+
+-- ""While there do exist semidefinite programming based algorithms (Linial et al., 1995) for computing the embeddings required for Bourgain's theorem, they are not suitable for implementation via efficient neural architectures. Instead [...] we adapt the sketch based approximate shortest path algorithms of Das Sarma et al. (2010)""
+
+In fact, the algorithm you implement is the one from Bourgain and Linial et al., not Das Sarma et al. All those works use roughly the same sketching algorithm, which measures the distance of each point to randomly chosen clusters. However, the distance estimation procedure you implement (specifically ~d_G(s,t) = max_i|v_s^(i) - v_t^(i)|) is Bourgain's (this is just an ell-infinity embedding). Das Sarma et al.'s estimation procedure is different and (building on the well-known work of Thorup-Zwick) relies on computing the common nearest neighbors of the given pair s,t in the random clusters. Note that this bears on the correctness of the proof of Theorem 8 (I think the statement still holds due to Matousek's analysis of Bourgain's embedding, but not for the reason you cite). Also note that in Theorem 8 (as well as all aforementioned results) c needs to be an integer (in particular, it is known that any approximation less than 3 is impossible with less than ~Omega(n^2) parameters).
+
+-- For min-cut, you write that traditional GNNs require Omega(sqrt(n)) depth/rounds, citing Loukas (2020). But doesn't that lower bound entail both the depth and the width (d*sqrt(w) = ~Omega(sqrt(n)))?
+
+-- I am unable to follow the proof of Theorem 9. Could you please explain the correctness of your construction?
+
+(On this note, the sentence ""Karger & Stein (1996) implies that with probability at least 1/n^2 there exists a prefix L' of L such that..."" seems like an unfortunate inaccuracy; the ""prefix"" exists deterministically, and their guarantee is that the iterative random contraction algorithm finds it with probability at least ~1/n^2.)
+
+Conclusion:
+=========
+I am currently unable to recommend accepting this paper, due to what seems like multiple inaccuracies, misinterpretations of prior work, unclear statements, and possibly technical correctness issues. I will await clarifications from the authors on the points detailed above.
+
+Post discussion update
+=========
+After discussion with the authors, I have calibrated my score upward to 4, since the authors seemed willing to engage in discussion and correct/improve the paper, which I appreciate, but I still recommend not accepting the paper.
+
+The authors generally acknowledged (though have not yet fixed) the issue of wrong attribution of the APSP algorithm. This isn't just a matter of citing B instead of A; the paper (still) contains a lengthy discussion of why A is not suitable, so instead they must resort to B, even though in reality they just use A (and B remains unused). 
This is glaring since these papers are famous classics, widely taught in graduate courses, their content is well known, and it is puzzling how a diametrically incorrect representation of them made its way into the paper.
+
+The reason I dwell on this is that it signifies a larger issue with the paper. The original version was peppered with formal statements which were at best inaccurate, and even though the authors fixed (or said they would fix) the ones I pointed out, I remain unable to trust the overall technical soundness of the paper. The review time frame doesn't allow a reviewer to carefully verify every statement (nor would I want to); there must be some commitment of due diligence on the part of the authors, that up to a small inevitable fraction of inaccuracies, the formal content is rigorously correct. I'm afraid the current version of the paper is quite off this mark.
+
+Putting formal soundness aside, my present understanding of the idea of the paper is the following: The authors observe that many basic computations on graphs can be parallelized into a few computations of small width and depth. Usual GNNs can implicitly implement this if their width is large enough, but this poses a computational burden, and there are obvious advantages to explicitly building this parallelism into the architecture. This seems like a sensible and potentially empirically useful observation, but the experimental section still seems too thin to make the case properly. That said, perhaps I have not fully understood the paper, since its frequent inaccuracies and fuzzy statements made it a bit hard for me to follow.
+
+In conclusion, I think the paper should undergo a substantial revision:
+1. Clean up the theory part and ensure its formal soundness,
+2. Crystallize the point of the paper (in particular, rather than just presenting GNN+, I hope a revised version would include a more thorough comparison with usual GNNs - not just dismiss them with some citations of prior works which allegedly prove limitations - this leaves doubts about the exact model and assumptions, particularly since as discussed above, the prior work is not always cited accurately),
+3. Possibly expand the experimental section.
+",4,3.0,ICLR2021
+ryLYFULlM,1,S1JHhv6TW,S1JHhv6TW,see detailed review below.,"This paper theoretically validates that interconnecting networks with different dilations can lead to expressive efficiency, which indicates an interesting phenomenon that connectivity is able to enhance the expressiveness of deep networks. A key technical tool is a mixed tensor decomposition, which is shown to have representational advantage over the individual hierarchical decompositions it comprises.
+
+Pros:
+
+Existing work has focused on understanding the depth of the networks and established that deep networks are expressively efficient with respect to shallow ones. On the other hand, this paper focused on the architectural feature of connectivity. The problem is fundamentally important and its theoretical development is solid. The conclusion is useful for developing new tools for deep network design.
+
+Cons:
+
+In order to show that the mixed dilated convolutional network is expressively efficient w.r.t. the corresponding individual dilated convolutional network, the authors prove it in two steps: Proposition 1 and Proposition 2. However, in the proof of Proposition 2 (see Theorem 1), the authors only focus on a particular case of convolutional arithmetic circuits, i.e., $g(a,b)= a*b$. 
In the experiments, see line 4 of page 9, the authors instead used ReLU activation $g(a, b)= max{a+b, 0}$. Can authors provide some justifications of such different choices of activation functions? It would be great if authors can discuss how to generate the activation function in Theorem 1 to more general cases. + + +",7,4.0,ICLR2018 +rkxiYmDq2m,1,rJliMh09F7,rJliMh09F7,Interesting idea with good experimental validation,"The paper proposes a method for generating diverse outputs for various conditional GAN frameworks including image-to-image translation, image-inpainting, and video prediction. The idea is quite simple, simply adding a regularization term so that the output images are sensitive to the input variable that controls the variation of the images. (Note that the variable is not the conditional input to the network.) The paper also shows how the regularization term is related to the gradient penalty term. The most exciting feature about the work is that it can be applied to various conditional synthesis frameworks for various tasks. The paper includes several experiments with comparison to the state-of-the-art. The achieved performance is satisfactory. + +To the authors, wondering if the framework is applicable to unconditional GANs.",7,5.0,ICLR2019 +Sye2efgAtH,1,S1x2PCNKDB,S1x2PCNKDB,Official Blind Review #3,"The paper proposes an extension of the adversarial imitation learning framework where the discriminator is additionally incentivized not to distinguish frames that are different between the expert and the agent in irrelevant ways. The method relies on manually identifying a set of irrelevant frames. + +The paper correctly identifies an important shortcoming of GAIL and proposes a sensible and generic way to overcome it. The experiments are well designed and corroborate the claims made in the paper. + +EDIT: I acknowledge reading the other reviews and the author response and stand by my initial assessment.",8,,ICLR2020 +SyKUVctlM,2,S1Euwz-Rb,S1Euwz-Rb,Looks good but needs clarification,"This paper proposes a recurrent neural network for visual question answering. The recurrent neural network is equipped with a carefully designed recurrent unit called MAC (Memory, Attention and Control) cell, which encourages sequential reasoning by restraining interaction between inputs and its hidden states. The proposed model shows the state-of-the-art performance on CLEVR and CLEVR-Humans dataset, which are standard benchmarks for visual reasoning problem. Additional experiments with limited training data shows the data efficiency of the model, which supports its strong generalization ability. + +The proposed model in this paper is designed with reasonable motivations and shows strong experimental results in terms of overall accuracy and the data efficiency. However, an issue in the writing, usage of external component and lack of experimental justification of the design choices hinder the clear understanding of the proposed model. + +An issue in the writing +Overall, the paper is well written and easy to understand, but Section 3.2.3 (The Write Unit) has contradictory statements about their implementation. Specifically, they proposed three different ways to update the memory (simple update, self attention and memory gate), but it is not clear which method is used in the end. + +Usage of external component +The proposed model uses pretrained word vectors called GloVE, which has boosted the performance on visual question answering. 
This experimental setting makes fair comparison with the previous works difficult as the pre-trained word vectors are not used for the previous works. To isolate the strength of the proposed reasoning module, I ask to provide experiments without pretrained word vectors. + +Lack of experimental justification of the design choices +The proposed recurrent unit contains various design choices such as separation of three different units (control unit, read unit and memory unit), attention based input processing and different memory updates stem from different motivations. However, these design choices are not justified well because there is neither ablation study nor visualization of internal states. Any analysis or empirical study on these design choices is necessary to understand the characteristics of the model. Here, I suggest to provide few visualizations of attention weights and ablation study that could support indispensability of the design choices. +",6,4.0,ICLR2018 +HtEQzXdUw4,3,jHefDGsorp5,jHefDGsorp5,novel approach for molecular design using explainability,"### Initial Review + +The paper proposes a new algorithm for de-novo molecular design, which uses a model to extract explanatory subgraphs from a set of support molecules which ""explain"" high scores wrt a scoring function, and a generative model which is conditioned on these subgraphs to produce full molecules. + +All in all, I think this is an interesting combination of several existing approaches in generative models for molecules (all referenced in the paper). The aspect of explainability is novel. The presentation of the paper is mostly clear. The approach is quite geared to molecule generation, but can potentially also inspire applications in other domains, which makes it interesting from a general ML perspective as well. + +I like the paper from the theoretical side, which alone warrants acceptance of the paper at ICLR in my opinion. + +However, I have a few minor concerns / comments: + +I am a bit on the fence with the validation method. Since almost every new paper in the field proposes a new validation approach, it has become pretty much impossible to assess what the state of the art of the field is (or if the concept of SOTA is even something meaningful), and this paper is no exception in this regard. But I assume the authors will disagree here. + +In practice, generating 20k molecules is a lot. Looking at the statistics of the top100 molecules would probably be sufficient. + +Also, I find it somewhat surprising that some of the baseline algorithms (in particular the Winter et al MSO model), which are less constrained than algorithm presented here, are not achieving higher scores, in particular when the algorithms can query the scoring function 5 M times. Maybe this is something the authors could comment on in the rebuttal. + +As an additional baseline, I would suggest to report the ""best in dataset"", I.e. run the scoring function on the seeds and all molecules used to train the generator, and pick the top molecules. + + +Related work: + +I would suggest to additionally cite https://arxiv.org/abs/1701.01329 which was the first paper to apply neural models to molecule generation in drug discovery, and the first of such papers which has been prospectively validated in laboratory experiments by scientists not affiliated with the authors. 
+ +### Update 1 after discussion: +Score raised.",7,5.0,ICLR2021 +HylX_BRscS,4,HkxU2pNYPH,HkxU2pNYPH,Official Blind Review #4,"The authors propose several approaches to making a data-to-text generation system more precise, that is, less prone to hallucination. In particular, they propose an attention score, which attempts to measure to what degree the model is relying on its attention mechanism in making a prediction. This attention score is used to weight a mixture distribution (a ""confidence score"") over the generation model's next-word distribution and the next-word distribution of an unconditional language model. The learned conditional distribution can then be calibrated to the confidence score. The authors also propose a variational-inference inspired objective, which attempts to allow the model to ignore certain tokens it isn't confident about. The authors evaluate their approach on the WikiBio dataset, and find that their approaches make their system more precise, at the cost of some coverage. + +This paper is well motivated, timely, and it presents several interesting ideas. However, I think parts of the proposed approach need to be better justified. In particular: + +- What justifies defining the attention score A_t in this way? First, is there an argument (empirical or otherwise) for using the magnitude of the attention vector (rather than some other statistic)? Is it obvious that if the attention vector has a high magnitude then it ought to be trusted? Note that this might be a reasonable assumption in the case of a pointer-generator style model, where a single attention vector is used both for attending and for copying, but in a model where attention isn't constrained in this way the magnitude of the attention vector may be misleading. + +- The variational objective seems difficult to justify. First, I don't understand what it means for p(y | z, x) to be assumed to 1. Is this for any z (in which case y is independent of z)? Otherwise, how can it be removed from the objective? (Put another way: Equation (17) is not in general true; it neglects an expected log likelihood term). I'm also not entirely clear on how Equation (12) is modeled: do the z's really only rely on the other sampled z's? Doesn't that require a different model than the one that calculates P^{\kappa}? + +- Somewhat minor: the claim that optimizing the joint objective needn't hurt perplexity relies on kappa being 0; have you confirmed empirically that when it isn't zero the perplexity improves over the baseline model? + +- Finally, I'm not sure I understand why there needs to be a stop-gradient in equation (4). It would be nice to also verify empirically that this is important. + +",3,,ICLR2020 +HJxiqVt3tr,2,HkgqmyrYDH,HkgqmyrYDH,Official Blind Review #1,"Summary: + +This paper proposes to predict word sequences for Amharic language-- a language spoken in Eastern Africa. It proposes to use HMMs with POS tags and morphological features to perform this prediction task. + + +The paper is just 3 pages, contains 1 paragraph of methodology, and no experiments section. It is clearly a very early stage work and not in the scope of ICLR. This paper should have been desk-rejected as it needs more work before it is fit for publication. There is lot of work on word sequence prediction and HMMs are no longer the state-of-the-art. 
The authors should consider looking at RNN-based methods such as LSTMs for this task.",1,,ICLR2020 +ryhZnCEEx,3,SJqaCVLxx,SJqaCVLxx,hard to understand what is going on,"The authors seems to have proposed a genetic algorithm for learning the features of a convolutional network (LeNet-5 to be precise). The algorithm is validated on some version of the MNIST dataset. + +Unfortunately the paper is extremely hard to understand and it is not at all clear what the exact training algorithm is. Neither do the authors ever motivate why do such a training as opposed to the standard back-prop. What are its advantages/dis-advantages? Furthermore the experimental section is equally unclear. The authors seem to have merged the training and validation set of the MNIST dataset and use only a subset of it. It is not clear why is that the case and what subset they use. In addition, to the best of my understanding, the results reported are RMSE as opposed to classification error. Why is that the case? + +In short, the paper is extremely hard to follow and it is not at all clear what the training algorithm is and how is it better than standard way of training. The experimental section is equally confusing and unconvincing. + +Other comments: +-- The figures still say LeCun-5 +-- The legends of the plots are not in english. Hence I'm not sure what is going on there. +-- The paper is riddled with typos and hard to understand phrasing. ",3,5.0,ICLR2017 +BkgYgVIIh7,1,BJgGhiR5KX,BJgGhiR5KX,"Limited novelty, strong evaluation, other languages and tasks?","The paper presents an intuitive architecture for learning cross-lingual sentence representations. I see weaknesses and strengths: + +(i) The approach is not very novel. Using parallel data and similarity training (siamese, adversarial, etc.) to facilitate transfer has been done before; see [0] and references therein. Sharing encoder parameters across very different tasks is also pretty standard by now, going back to [1] or so. +(ii) The evaluation is strong, with a nice combination of standard benchmark evaluation, downstream evaluation, and analysis. +(iii) While the paper is on cross-lingual transfer, the authors only experiment with a small set of high-resource languages, where transfer is relatively easy. +(iv) I think the datasets used for evaluation are somewhat suboptimal, e.g.: +a) Cross-lingual retrieval and multi-lingual STS are very similar tasks. Other tasks using sentence representations and for which multilingual corpora are available, include discourse parsing, support identification for QA, extractive summarization, stance detection, etc. +b) Instead of relying on Agic and Schluter (2017), why don’t the authors use the XNLI corpus [2]? +c) Translating the English STS data using Google NMT to evaluate an architecture that looks a lot like Google NMT sounds a suspicious. +(v) While I found the experiment with eigen-similarity a nice contribution, there is a lot of alternatives: seeing whether there is a linear transformation from one language to another (using Procrustes, for example), seeing whether the sentence graphs can be aligned using GANs based only on JSD divergence, looking at the geometry of these representations, etc. Did you think about doing the same analysis on the representations learned without the translation task, but using target language training data for the tasks instead? 
The question would be whether there exists a linear transformation from the sentence graph learned for English while doing NLI, to the sentence graph learned for German while doing NLI. + +Minor comments: +- “Table 3” on page 5 should be Table 2. +- Table 2 seems unnecessary. Since the results are not interesting on their own, but simply a premise in the motivating argument, I would present these results in-text. + +[0] http://aclweb.org/anthology/W18-3023",6,5.0,ICLR2019 +SJloae9gcS,3,Skeh1krtvH,Skeh1krtvH,Official Blind Review #2,"This paper re-organized the high dimensional 1-D raw waveform as 2-D matrix. This method simulated the autoregressive flow. Log-likelihood could be calculated in parallel. Autoregressive flow was only run on row dimension. The number of required parameters was desirable to synthesize high-fidelity speech with the speed faster than real time. Although this method could not achieve top one in ranking in every measurements, the resulting performance was still obtained with the best average results. + +In general, this paper is clearly written, well organized and easy to follow. The authors carried out sufficient experiments and analyses, and proposed some rules of thumb to build a good model. On one hand, we may catch the contributions. But, on the other hand, the contributions were not clearly explained. The results were averaged but were not clearly explained. + +The authors suggested to specify a bigger receptive field than the squeezed height. The property of getting better performance using deeper wavenet was ""not"" clearly explained and investigated. In the experiments, a small number of generative steps was considered. This is because short sequence based on autoregressive model was used. + +This paper mentioned that using convolution queue could improve the synthesis speed. But, the synthesis speed has been fast enough because it is almost 15 times faster than real time. In practical applications, 100x faster is almost the same as 15x faster for humans. But, the task isn’t interacted with human. It is suggested to focuse on reducing the number of parameters or enhancing the log likelihood.",3,,ICLR2020 +ryloTLBC2m,3,rylKB3A9Fm,rylKB3A9Fm,Review,"This paper presents a new benchmark for studying generalization in deep RL along with a set of benchmark results. The benchmark consists of several standard RL tasks like Mountain Car along with several Mujoco continuous control tasks. Generalization is measured with respect to changes in environment parameters like force magnitude and pole length. Both interpolation and extrapolation are considered. + +The problem considered in this paper is important and I agree with the authors that a good set of benchmarks for studying generalization is needed. However, a paper proposing a new benchmark should have a good argument for why the set of problems considered is interesting. Similarly, the types of generalization considered should be well motivated. This paper doesn’t do a good job of motivating these choices. + +For example, why is Mountain Car a good task for studying generalization in deep RL? Mountain Car is a classic problem with a two-dimensional state space. This is hardly the kind of problem where deep RL shines or is even needed at all. Similarly, why should we care whether an agent trained on the Cart Pole task can generalize to a pole length between 2x and 10x shorter than the one it was trained on without being allowed to update its policy? 
Both the set of tasks and the distributions of parameters over which generalization is measured seem somewhat arbitrary. + +Similarly, the restriction to methods that do not update its policy at test time also seems arbitrary since this is somewhat of a gray area. RL^2, which is one of the baselines in the paper, uses memory to adapt its policy to the current environment at test time. How different is this from an agent that updates its weights at test time? Why allow one but not the other? + +In addition to these issues with the proposed benchmark, the baseline results don’t provide any new insights. The main conclusion is that extrapolation is more difficult than interpolation, which is in turn more difficult than training and testing on the same task. Beyond that, the results are very confusing. Two methods for improving generalization (EPOpt and RL^2) are evaluated and both of them seem to mostly decrease generalization performance. I find the poor performance of RL^2-A2C especially worrisome. Isn’t it essentially recurrent A2C where the reward and action are fed in as inputs? Why should the performance drop by 20-40%? + +Overall, I don’t see the proposed tasks becoming a widely used benchmark for evaluating generalization in deep RL. There are just too many seemingly arbitrary choices in the design of this benchmark and the lack of interesting findings in the baseline experiments highlights these issues. + +Other comments: +- “Massively Parallel Methods for Deep Reinforcement Learning” by Nair et al. introduced the human starts evaluation condition for Atari games in order to measure generalization to potentially unseen states. This should probably be discussed in related work. +- It would be good to include the exact architecture details since it’s not clear how rewards and actions are given to the RL^2 agents. +",3,5.0,ICLR2019 +Bkein8QWaQ,4,ryfcCo0ctQ,ryfcCo0ctQ,"Interesting theoretical work, but missing key previous literature","The authors frame value function estimation and policy learning as bilevel optimization problems, then present a two-timescale stochastic optimization algorithm and convergence results with non-linear function approximators. Finally, they relate the use of target networks in DQN to their two-timescale procedure. + +The authors claim that their first contribution is to ""unify the problems of value function estimation and policy learning using the framework of bilevel optimization."" The bilevel viewpoint has a long history in the RL literature. Are the authors claiming novelty here? If so, can they clarify which parts are novel? + +The paper is missing important previous work, SBEED (Dai et al. 2018) which shows (seemingly much stronger) convergence results for a smoothed RL problem. The authors need to compare their approach against SBEED and clearly explain what more they are bringing. Furthermore, the Fenchel trick used in SBEED could also be used to attack the ""double sampling"" issue here, resulting in a saddle-point problem (which is more specific than the bilevel problem). Does going to the bilevel perspective buy us anything? + +===== + +In response to the author's comments, I have increased my score. +The practical implications of this theoretical work are unclear. It's nice that it relates to DQN, but it does not provide additional insight into how to improve existing approaches. 
The authors could significantly strengthen the paper by expanding in this area.",5,3.0,ICLR2019 +rJeiwZE4pm,3,H1lPUiRcYQ,H1lPUiRcYQ,Experiments need to be improved,"In response to the authors' rebuttal, I have increased my ratings accordingly. I strongly encourage the authors to include those ablative study results in the work. I also strongly recommend an ablative study on importance sampling so as to provide more quantitative results, in addition to Fig. 4. Finally, I hope the authors can consider more advanced importance sampling techniques and explore whether it helps you get better results in even higher dimensions. + +================================= +This paper proposes several enhancements to a neural network method for computing committor functions so that it can perform better on rare events in high-dimensional space. The basic idea is using a variational formulation with Dirichlet-like boundary conditions to learn a neural committor function. The authors claim to improve a previous neural network based method by i) using a clever parameterization of the neural committor function so that it approximately satisfy the boundary condition; ii) bypassing the difficulty of rare events using importance sampling; and iii) using collective variables as feature engineering. + +Generally I feel this paper is well written and easy to understand, without requiring too much background in physics and chemistry. The application is new to most people in the machine learning community. However, +the main contributions of this paper are empirical, and I found the experiments not very convincing. Here are my main concerns: + +1. There is almost no ablation study. The parameterization of committor function satisfies the Dirichlet boundary condition, which is aesthetically pleasing. However, it's unclear how much this improves the regularization used in the previous method. Similarly, without importance sampling, will the results actually become worse? What changes if the collective variables are removed? There is even no comparison with the previous neural network based method on computing committor functions, though the authors cited it. + +2. In the experiment on extended Mueller potentials, authors use the FEM results as the ground truth. However, it is not clear how accurate those FEM solutions are. Without this being clarified, it is unclear to me that the RMSE and MAE results in Table 1 are meaningful. Maybe try some simpler problem where the committor functions can be computed exactly? + +3. In experiments the authors often argue that results will improve when networks become deeper. However, all network architectures used in the paper are narrow and shallow when viewed from the perspective of modern deep learning. If the authors want to stress this point, I would expect to see more experimental results on neural network architectures, where you vary the depth of the network and report the change of results. + +4. ""Then we use the result as the initial network to compute the committor function at T = 300K"" => Did you first train a neural committor on samples of T = 800K and use its weights as initialization to the neural committor for T = 300K? Please clarify this more. + +5. Finally, I think the importance sampling technique proposed in this paper can be improved by other methods, such as annealed importance sampling. 
The largest dimension tested in this paper is only 66, which is still fairly small in machine learning, and I don't expect the vanilla importance sampling can work in higher dimensions.",6,4.0,ICLR2019 +fq72SEFNxaB,2,GH7QRzUDdXG,GH7QRzUDdXG,Review,"The paper performs the analysis of the GAN latent spaces from the geometric perspective, inducing a metric tensor in the latent space from the LPIPS distance in the image space. The main authors' finding is that under such metric, the latent spaces of typical GANs are highly anisotropic, which can be exploited for more effective GAN inversion. Furthermore, the authors show that eigen vectors of the metric tensor often correspond to interpretable latent transformations. + +Pros: + +1) The paper is exceptionally well-written and provides a very interesting read. While the performed analysis is simple and natural, it does reveal several interesting findings about typical latent spaces: LPIPS-anisotropy, global consistency of the metric tensor. + +2) The authors confirm the usefulness of their analysis by providing immediate practical benefits: more effective GAN inversion, which accounts for the latent space anisotropy. + +Cons: + +1) Missing work on interpretable GAN directions: + +[A] The Hessian Penalty: A Weak Prior for Unsupervised Disentanglement, ECCV 2020 + +2) In my opinion, the authors do not provide enough support for their claim ""This finding unifies previous unsupervised methods that discover interpretable axes in the GAN space"". + +- While the proposed method does seem to generalize both Ha ̈rko ̈nen et al., 2020 and Shen & Zhou, 2020, I do not see, how it captures Voynov & Babenko, 2020 and Pebbles et al (see above [A]). Furthermore, I believe that such claims should be supported by the experiments. Could the authors experimentally confirm that their method results in the same set of directions as the existing methods? + +Overall, I am positive about this submission, since the main analysis is both interesting and practically useful. My main criticism is that in terms of discovery of interpretable directions, the methods should be experimentally compared to existing alternatives. If it does provide a super-set of directions, obtained by existing methods, this would make the submission much stronger. Otherwise, the claim about unification should be toned down in my opinion. + +======== AFTER REBUTTAL ======== + +I appreciate the authors' efforts on additional thorough comparison to existing works on interpretable axes discovery. From the updated manuscript, however, it is not clear what method is superior and the authors' approach appears to be a yet another method for this task rather than generalization of previous ones. Overall, I am still on the positive side since the observed findings deliver a clear profit for GAN inversion. But I am not increasing my score given that the ""interpretable axes"" part has become less impressive (in terms of weaker claims and conclusions) and the competing SIGGRAPH work. ",6,4.0,ICLR2021 +S1esrDQ0YS,2,B1xwv1StvS,B1xwv1StvS,Official Blind Review #1,"The authors propose a new neural network model, called as Dissimilarity Network, to improve the few-shot learning accuracy. +Overall the idea is well motivated that by emphasizing the difference among classes, the model can achieve more accurate predictions for classes where only limited data points are available for training. +However, the paper is not quite well written. 
+Firstly, much of the work is built upon previous work including attention mechanisms, episodic training for few-shot learning. Such components are the core of this work because the attention mechanisms implement the class-awareness, and the episodic training facilitates the LSTM structure. Yet these are not well explained and not much context is provided, thus making the paper hard to follow. +Secondly, some terms are fairly overloaded, or not clearly defined. For example, the “prior” as mentioned in both the abstract and the introduction doesn’t refer to the commonly interpreted term as in the Bayesian settings, but rather as a hand-waiving term to indicate the model design. Also, the terms, “score”, “metric”, “dissimilarity” are mentioned in the paper but the paper is not really learning the metric, to my understanding. Thus the details of the paper is quite hard to grasp. +Lastly, the idea of designing the global embedding and the task aware embedding is interesting but shouldn’t really be restricted to few-shot learning. It would be interesting to test the idea on general classification tasks, for example in a simple cross validation settings. +Thus I think the paper would be stronger if the above are addressed and it’s not ready for publishing yet in its current form. + +Below are some more detailed comments: +1) In the abstract, the “newly introduced dataset H-CIFAR” is not precise to me; my understanding is that the paper proposes such an experiment design for testing how well a classifier can predict the labels with hierarchy. The current writing refers to that the authors comprises a completely new dataset with new labels. +2) In the last sentence of the second paragraph in Introduction, the question is asked “what prior” should be reasonable. Since the authors didn’t really add any priors in a Bayesian settings but rather designed an architecture, I suggest to reword something like “how to explicitly encode hierarchies into the model structure”. +3) In Section 2.1, some more description for “episodic training” would be nice: why should it be used? How is it used and why it makes sense in the few-shot learning context? +4) In Section 2.2, it would be nice to add the mathematical definition of “prototype”. +5) In Section 2.2.1, it would be nice to define “H”. +6) In Section 2.2.2, is M required to be fixed given it’s episodic training? Also it would be nice to add more details about the attention mechanism. +7) In the result section, it would be nice to discuss when the proposed method is doing better than other methods, for example RelationNet, as well as when it’s worse since different datasets show different results. +",3,,ICLR2020 +k4LmsCHVohi,3,VMAesov3dfU,VMAesov3dfU,Initial Review,"review: +This paper addresses the effects of gradient descent methods onto compositionality and compositional generalization of models. The authors claim that the optimization process imposes the models to deviate compositionality, which is defined with conditional independence among random variables of input, predicted output and the ground-truth. Since compositionality is one of important features of human intelligence, it has been interested widely in the field of AI/ML such as vision, language, neuro-symbolic approaches, common sense reasoning, disentangled representation, and the emergence conditions of compositionality. As it has been not much focused on the relationship with optimizers, it is fresh and interesting. 
However, it is not easy to figure out the position of this paper from two reasons: (1) the definitions on compositionality in this paper are not so compatible with recent related works, which mostly consider certain structures in models [ICLR19, JAIR20] or representative problems such as visual reasoning [CVPR17] and Raven progressive matrices [PNAS17]. (2) The authors do not consider quantitative approaches such as compositionality [ICLR19] or compositional generalization [ICLR20]. + +In this paper, the main claim is very broad argument. To verify this claim, the authors provide supports of both theoretical and experimental aspects. Theoretically, they try to show that reducing loss values in the optimization process induces utilizing other input variables including useful information based on mutual information. Experimentally, they show the gaps between several settings of accuracy curves with the MNIST dataset (vision) and the SCAN dataset (language). With both aspects, theoretical steps are vague and weak, and the experimental results are little persuasive and convincing. +Some steps in theoretical derivation seem to be wrong. +I recommend ‘trivial and wrong’ for this paper. + +Pros: +They deal with the relationship among compositionality, compositional generalization and gradient descent. It is interesting and novel question as far as I know. + +Concerns: +- It is not clear the assumptions on models is covered in the main claim. Some arguments have readers guess the claim only on neural networks. Currently, it is not explicit. What if a model is naïve Bayes classifier which assumes conditional independence? Does it have compositional generalization? If the classifier is trained with gradient descent, the key argument of the paper has counterexamples, which becomes wrong. +- Theorem 1 should show more clearly Markov chain structure among X, Y and Z. X -> Y -> Z (as written in Cover 1999 p.34) +- What is the relationship between Y and X in Proposition 1? +- The proof in Proposition 2 seems not valid. Is the Markov chain among Y hat, X, and Y still valid? Without any constraints of X and Y, the equation in the middle of Proposition 2 seems not an identity (consider joint probability models with discrete values), and the derivation process is not trivial. The validity of this result is a factor that also affects subsequent verification. +- There is no quantitative analysis with measurable cases as mentioned above. + + +[CVPR17] Johnson et al., CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning, CVPR 2017. + +[PNAS17] Duncan et al., Complexity and compositionality in fluid intelligence, PNAS 2018. + +[ICLR19] Jacob Andreas, Measuring compositionality in representation learning, ICLR 2019. + +[ICLR20] Keysers et al., Measuring compositional generalization: a comprehensive method on realistic data, ICLR 2020. + +[JAIR20] Hupkes et al., Compositionality decomposed: how do neural networks generalise?], JAIR 2020.",1,4.0,ICLR2021 +rkebus9xqB,3,ByeMPlHKPH,ByeMPlHKPH,Official Blind Review #1," + +This paper claims to propose an extension of the Transformer architecture specialized for the mobile environment (under 500M Mult-Adds). +The authors propose their method called ""Long-Short Range Attention (LSRA),"" which separates the self-attention layers into two different purposes, where some heads focus on the local context modeling while the others capture the long-distance relationship. 
+They also demonstrate consistent improvement over the transformer on multiple datasets under the mobile setting. +It also surpasses the recently developed comparative method called ""Evolved Transformer"" that requires a far costly architecture search under the mobile setting. + +This paper is basically well written and easy to follow what they have done. +The experimental results look good. + +However, I have several concerns that I listed as follows. + +1, +I am not sure whether my understanding is correct or not, but it seems that the proposed method, LSRA, is not a method specialized for mobile computation. +In fact, in the paper, they say, ""To tackle the problem, instead of having one module for ""general"" information, we propose a more specialized architecture, Long-Short Range Attention (LSRA), that captures the global and local information separately."" + +There is no explicit discussion that LSRA is somehow tackling the mobile setting. +There is a large mismatch (gap) between the main claim and what they have done. +In other words, LSRA can be simply applied to standard-setting (non-mobile setting). Is there any reason that the proposed method cannot be applied to the standard-setting? +If my understanding is correct, the paper must be revised and appropriately reorganized to clear this gap. + +2, +I am not convinced of the condition of the so-called ""mobile setting (and also extremely efficient constraint)."" +Please provide a clear justification for it. + +",6,,ICLR2020 +SBHkrIl_LY0,4,NGBY716p1VR,NGBY716p1VR,A useful paper needing more theoretical interpretations,"The authors claimed in this paper that as the most empirically successful approach to defending adversarial examples, PGD-based adversarial training, is computationally inefficient. Fast adversarial training could mitigate this issue by training a model using FGSM attacks initialized with large randomized perturbations, but the underlying reason for its success remains unclear and it may still suffer from catastrophic overfitting. The authors conducted a series of experiments to figure out the key to the success and properties of fast adversarial training. The experimental results showed that fast adversarial training cannot avoid catastrophic overfitting, but could be able to recover from catastrophic overfitting quickly. Based on all of the observations, the authors proposed a simple method to improve fast adversarial training by using PGD attack as training instead of R+FGSM attack (proposed in fast adversarial training) when overfitting happens, or using fast adversarial training as a warmup. The proposed methods could achieve slightly better performance than the current state-of-art approach while reducing the training time significantly. + +##################################################################### + +Overall, I vote for weak reject (marginally). I like the idea of exploring the properties of adversarial training, the experiments may also be inspiring. But my major concern is that the interpretation about the ‘catastrophic overfitting’ is not clear, and the interpretation about the effectiveness of R-FGSM and PGD against overfitting is also not clear. Hopefully, the authors can address my concern in the rebuttal period. + +Pros: +#####################################################################Pros: + +1.Attempting to interpret the successful reason for a previous work is interesting. And the exploratory experiments may be inspiring for other researchers. + +2. 
Overall, the paper was well written. All the motivations and conjectures are easy to follow and understand. + +3. This paper provides a lot of experiments to show the effectiveness of the proposed methods which appeared slightly better than the SOTA PGD-training while reducing training time significantly. +##################################################################### + + +Cons: + +1. Although the authors attempted to explain the key to the success of fast adversarial training, it might be still not clear theoretically: + +(1) Why R-FGSM and PGD could guide the model to recovery from ‘catastrophic overfitting’, but FGSM could not? Does it mean that stronger attacks could guide the model to recovery? + +(2) Why the ‘catastrophic overfitting’ happened a lot of times when using FGSM training, but R-FGSM and PGD could mitigate it? Does it mean that stronger attacks could mitigate it? + +2. As I understand, Figure 3(c) should be the result of proposed FastAdv+. From Figure 3(c), it can be observed that there are ‘catastrophic overfitting’ in FastAdv+, but this phenomenon could not be seen from Figure 4. Do you have any idea to explain it? + +3. As concerned in my 1.(1) and (2), if weaker attacks lead to more ‘catastrophic overfitting’ and could not guide the model to recovery, why the FastAdvW 4-8 using a weaker attack as a warmup could outperform FastAdv+ and FastAdvW 8-8. + +4. Though the proposed methods appear useful, they may be a bit straightforward and have a limited novelty (using PGD attacked samples when ‘catastrophic overfitting’ happens) +",5,4.0,ICLR2021 +BJxqaF2WqB,3,r1evOhEKvH,r1evOhEKvH,Official Blind Review #2,"This paper presents a semi-supervised classification method for classifying unlabeled nodes in graph data. The authors propose a Graph Inference Learning (GIL) framework to learn node labels on graph topology. The node labeling is based of three aspects: 1) node representation to measure the similarity between the centralized subgraph around the unlabeled node and reference node; 2) structure relation that measures the similarity between node attributes; and 3) the reachability between unlabeled query node and reference node. + +The authors propose to use graph convolution to learn the node representation and random work on graph to evaluate the reachability from query node to reference node. + +The presentation of the paper is clear and easy to follow. The idea of the paper seems straightforward and the experimental results seems promising in semi-supervised classification for nodes in graph data. + +Just a few concerns on model performance: + +1) The node labeling is based on the comparison with reference nodes. Would such method get biased toward major classes in the data, if the data is imbalanced among different classes? + +2) This method adopts three different modules for node labeling. It would be helpful if the authors can add some results to show the contribution of the different modules, i.e., what would be the performance if reachability or consistency of local graph topology structure is not considered in the classification? + +One typo: in Table 1, it should be ""Label rate"".",6,,ICLR2020 +2QSZC-kXVHt,1,9MdLwggYa02,9MdLwggYa02,"Interesting problem, unconvincing solution.","** Summary ** + +This paper focuses on issues in the popular PBT algorithm for hyperparameter optimization. 
It investigates the 1) step size (which is typically a constant multiplier) 2) the variance induced by better weights and 3) the greediness of the algorithm, which they refer to as short-term vs. long term effects. These issues are well motivated, and it is intuitive that they are flaws in the original algorithm. The proposed approach is to use Differential Evolution which the authors claim makes the hyperparameter selection more robust. The paper also introduces a new library for online hyperparameter tuning. + +** Primary Reason for Score ** + +The strengths of this work are that it identifies and discusses some interesting issues with PBT, a commonly used algorithm. However, as someone who frequently uses variants of the PBT algorithm, the evidence provided in this work is not sufficient for me to adopt their recommendations. The method is based on heuristics and the experiments are unfortunately not rigorous: the gains are small and it is a single seed. To increase my score, I would need to see more robust results that make these heuristics convincing, for example multiple seeds with clear outperformance (ideally statistically significant). It would also be important to see ablation studies for the newly introduced parameters (e.g. m). In addition, some demonstration of the phenomena described having an influence on the performance would be helpful. + +** Strengths ** + +1) The issues the paper addresses are well motivated, and well described. +2) The topic of the paper (PBT) is one that I think has not been sufficiently addressed by the community. In particular, the present PBT algorithm is commonly used but none of the improvements since 2017 have been widely adopted. It seems like a fruitful direction for research. +3) I appreciate the discussion of the results, which do not claim SoTA but instead go into detail on possible drivers of performance. + +** Weaknesses ** + +1) The main contribution of the work is not convincing. It simply replaces one heuristic for another. While the results show an improvement, it is not clear. +2) Experiments are only run a single time, and this is surely a noisy process. Given that, the gains vs. PBT seem small. It is entirely possible that this small gain is reversed in a second run. If the TransformerXL is too expensive, then a smaller experiment which can be repeated multiple times would be a stronger piece of evidence for the method’s efficacy. +3) The authors claim to reduce the meta-parameters, yet introduce new parameters (F_1, F_2 and m). Also how was the size of the PBT step chosen? For the transformer experiment it goes from 1 epoch -> 10 epochs. Some ablation studies for these parameters would be needed for a reader to fully understand how to use this method on a new task. +4) The library is presented as a second major contribution, but it is not clear why the reader would choose to use it over existing libraries such as ray tune, which are popular and widely used. There is no comparison or discussion here, other than just saying that the new library is better. I also couldn’t find the library anywhere, the supplementary material is just a two page pdf, and there is no anonymized link. Please correct me if I missed this. + +** Minor issues ** + +i) The ICLR 2020 template was used (rather than 2021). + +ii) Bottom of page 7, “raw” -> “row”",4,4.0,ICLR2021 +SJgxyIYqnX,1,HJG1Uo09Fm,HJG1Uo09Fm,Lack of clarity in core sections,"This paper describes a meta-RL algorithm through imitation on RL policies. 
While the paper builds nicely up to the core part, I find essential details missing about the imitation setup. By glancing at previous BC papers (some of which are cited), the quantity for supervised imitations, etc., were clearly defined. + +It will be useful for this reviewer if the authors can provide more clarity in explaining the BC task involved in their algorithm.",5,2.0,ICLR2019 +S1xjmZzVg,2,Bk8aOm9xl,Bk8aOm9xl,Review,"This paper provides a surprise-based intrinsic reward method for reinforcement learning, along with two practical algorithms for estimating those rewards. The ideas are similar to previous work in intrinsic motivation (including VIME and other work in intrinsic motivation). +As a positive, the methods are simple to implement, and provide benefits on a number of tasks. +However, they are almost always outmatched by VIME, and not one of their proposed method is consistently the best of those proposed (perhaps the most consistent is the surprisal, which is unfortunately not asymptotically equal to the true reward). The authors claim massive speed up, but the numerical measurements show that VIME is slower to initialize but not significantly slower per iteration otherwise (perhaps a big O analysis would clarify the claims). +Overall it's a decent, simple technique, perhaps slightly incremental on previous state of the art.",6,3.0,ICLR2017 +SJlYT1Uw27,1,rkxtl3C5YX,rkxtl3C5YX,"heavy on notations, limited impact applicability / experimental results","The paper proposes a formal framework to claim that Alpha Zero might converges to a Nash equilibrium. The main theoretical result is that the reward difference between a pair of policy and the Nash policy is bounded by the expected KL of these policy on a state distribution sampled from the Nash policies. + +The paper is quite heavy on notations and relatively light on experimental results. The main theoretical results is a bit remote from the case Alpha Zero is applied to. Indeed the bound is in 1/(1-/gamma) while Alpha Zero works with gamma = 1. Also + +Casting a one player environment as a two player game in which nature plays the role of the second player makes the paper very heavy on notations. + +In the experimental sections, the only comparison with RL types algorithm is with SARSA, it would be interesting to know how other RL algorithms, perhaps model free, would compare to this, i.e. is Alpha Zero actually necessary to solve this tasks? + + +--- +p 1 + +' it uses the current policy network g_theta' : policy and value network. + +p 2 / appendix +No need to provide pseudo code for alpha zero the original paper already describes that? + +p2 (2). It seems a bit surprising to me that the state density rho does not depend upon pi but only on pi star? + +p4: +Not sure why you need to introduce R(pi), isnt it just V_pi (s_0) ? Also usually the letter R is used for the return i.e. the sum of discounted reward without the expectation, so this notation is a bit confusing? + +p5: +paragraph2: I don't quite see the point of this. + +p8: +""~es, because at most on packet can get serviced from any input or output port.~"" typo ? + + + +",5,3.0,ICLR2019 +EpWm4yBroiW,2,3F0Qm7TzNDM,3F0Qm7TzNDM,"Nice idea, but no experimental comparisons to other algorithms","**Summary of paper** + +The authors introduce an algorithm called VBSW to re-weight a training data set in order to improve generalization. In summary, VBSW sets the weight of each example to be the sample variance of the labels of its k nearest neighbors. 
The nearest neighbors are chosen in the embedding space from the second-to-last layer of a pre-trained neural network. The last layer of the pre-trained model is then trained with these new weights. + +This approach is quite simple in practice and seems to be theoretically justified. + +The authors demonstrate that VBSW achieves better test accuracy than not using VBSW on 3 toy datasets and 5 real-world datasets. + +**Conclusions** + +Quality: The authors did not experimentally compare VBSW to any alternative algorithms. I find this omission inexcusable. The problem of re-weighting examples to achieve better accuracy has been studied for decades; there are many other algorithms to compare against. + +Clarity: The paper is generally well-structured and well-written, although with a few typos and grammatical errors. + +Originality: I am not familiar enough with the related work to say whether this idea is novel. However, seems quite simple and potentially very similar to existing published techniques. + +**Comments** + +Section 3.1 seems to assume that the labels have no noise. For example, if two examples have the same input features, their labels seem to be generated by the same function f, which would always produce the same label for both examples. This seems unrealistic. + +Section 2 describes the author's VBSW algorithm as being applied ""prior to the training"", but the algorithm actually requires a neural network to be pre-trained before the reweighting procedure. I felt the beginning of the paper was misleading in this regard. + +**Minor comments** + +Table 1: The difference between the mean accuracies of VBSW vs. basline on the BC dataset seems statistically insignificant; I believe VBSW should not be bolded.",3,4.0,ICLR2021 +BygJyY52FS,2,SJlJSaEFwS,SJlJSaEFwS,Official Blind Review #3,"The paper does not bring anything novel to the field of cross-lingual representation learning: it just revisits some older ideas (from the period of 2013-2015), now revamped, given the fact that more sophisticated and more effective methods are used to model exactly the same intuitions. I see this work as largely incremental, and it just further supports what has been known before, and it further supports recent findings (which are all quite straightforward) from the work of Ormazabal et al. (ACL 2019). The actual model implementation is a straightforward extension of the Sent2Vec model to cross-lingual scenarios, inspired by previous work (e.g., the work on TransGram and BiVec), so the paper is also very incremental from the methodological perspective. + +I am puzzled why MUSE is selected as the unsupervised baseline given that fact that: 1) previous work showed its non-robustness for many language pairs, 2) the VecMap model of Artetxe et al. has been proven as the most robust unsupervised cross-lingual word embedding model in several recent empirical analyses - see e.g., Glavas et al. (ACL 2019), Heyman et al. (NAACL 2019), or the original VecMap work. Also, I am puzzled why the paper overstates the rekindled interest towards TransGram, given that TransGram and especially BiVec are well-known models that learn from parallel data. + +Another note related to evaluation: to really establish how different cross-lingual embeddings compare to each other, a wider set of experiments and downstream evaluation is definitely required, see the work of e.g., Glavas et al. (ACL 2019). + +Most importantly, the paper evaluates only on very similar language pairs. 
The main reason why much recent work has focused on alignment-based/projection-based methods was quite pragmatic: we need such weakly supervised methods where we cannot assume the abundance of parallel data to enable cross-lingual transfer in resource-poor settings. If parallel data exists, it is quite intuitive and obvious (and also empirically validated before) that joint modeling is a better choice than a weakly supervised method that just uses 1K or 5K translation pairs. In fact, I am not even sure that it is fair to compare models that rely only on 1K translation pairs with models that draw their strength from 1M or 2M parallel sentences. This paper just shows that, if we have parallel data (which we do for many resource-richer language pairs), it is better to do joint modeling instead of learning simple alignments, but that is a pretty trivial finding imho. + +Are the results on MLDoc really state-of-the-art? The results are actually quite mixed, and the advantage of Bi-Sent2Vec is its quicker training. However, what about more recent methods such as XLM which rely on exactly the same resources as Bi-Sent2Vec to do the zero-shot classification task? + +Table 2: it is a well-known fact that multilingual training can improve performance in monolingual supervision: see e.g., the work of Faruqui and Dyer (EACL 2014, not cited). Alignment-based approaches that apply the Orthogonal Procrustes mapping cannot improve on monolingual word similarity simply because the orthogonality constraint preserves the topology of the original space. Therefore, evaluating different embedding methods on the intrinsic word similarity task is not a sound evaluation protocol imho - it would be much more informative to plug the embeddings as features in a classification or a parsing task (or something else). + +Figure 2: corpus size. Based on the results presented, it seems that the performance saturates by adding more parallel data, but the authors fail to fully understand their evaluation data in the first place. For instance, there are multiple problems with the MUSE datasets, as discussed in the recent work of Kementchedjhieva et al. (EMNLP 2019) - it evaluates mostly high-frequent word (actually - noun) translation, and of course that this saturates more quickly. It doesn't by any means imply that joint training therefore requires less data to reach peak performance: this is true only with the MUSE dataset, and is not a general truth. + +Minor: +- The work of Artetxe et al. (ACL 2017) should be cited when talking about bootstrapping alignment-based methods from limited bilingual supervision (instead of the work of Artetxe and Schwenk which concerns learning multilingual sentence embeddings). +- Many very relevant and historically important papers are omitted from the related work section: e.g., Hermann and Blunsom's work, Chandar et al., Soyer et al., Vulic and Moens, Gouws et al., Levy et al., to name only a few. +- I am not sure that the statement that BiVec is not needed in the presence of TransGram is true in general: it mostly suggests that there are some deficiencies with the evaluation protocol.",3,,ICLR2020 +ByqjAeGEg,2,SyxeqhP9ll,SyxeqhP9ll,Interesting well written paper on improving the stability of discriminators in GANs.,"The authors present a method for changing the objective of generative adversarial networks such that the discriminator accurately recovers density information about the underlying data distribution. 
In the course of deriving the changed objective they prove that stability of the discriminator is not guaranteed in the standard GAN setup but can be recovered via an additional entropy regularization term. + +The paper is clearly written, including the theoretical derivation. The derivation of the additional regularization term seems valid and is well explained. The experiments also empirically seem to support the claim that the proposed changed objective results in a ""better"" discriminator. There are only a few issues with the paper in its current form: +- The presentation albeit fairly clear in the details following the initial exposition in 3.1 and the beginning of 3.2 fails to accurately convey the difference between the energy based view of training GANs and the standard GAN. As a result it took me several passes through the paper to understand why the results don't hold for a standard GAN. I think it would be clearer if you state the connections up-front in 3.1 (perhaps without the additional f-gan perspective) and perhaps add some additional explanation as to how c() is implemented right there or in the experiments (you may want to just add these details in the Appendix, see also comment below). +- The proposed procedure will by construction only result in an improved generator and unless I misunderstand something does not result in improved stability of GAN training. You also don't make such a claim but an uninformed reader might get this wrong impression, especially since you mention improved performance compared to Salimans et al. in the Inception score experiment. It might be worth-while mentioning this early in the paper. +- The experiments, although well designed, mainly convey qualitative results with the exception of the table in the appendix for the toy datasets. I know that evaluating GANs is in itself not an easy task but I wonder whether additional more quantitative experiments could be performed to evaluate the discriminator performance. For example: one could evaluate how well the final discriminator does separate real from fake examples, how robust its classification is to injected noise (e.g. how classification accuracy changes for noised training data). Further one might wonder whether the last layer features learned by a discriminator using the changed objective are better suited for use in auxiliary tasks (e.g. classifying objects into categories). +- Main complaint: It is completely unclear what the generator and discriminators look like for the experiments. You mention that code will be available soon but I feel like a short description at least of the form of the energy used should also appear in the paper somewhere (perhaps in the appendix). +",7,5.0,ICLR2017 +nqpB8mapTJD,1,5i4vRgoZauw,5i4vRgoZauw,"Review for ""Wiring Up Vision: Minimizing Supervised Synaptic Updates Needed to Produce a Primate Ventral Stream""","This paper presents an empirical study that elucidates potential mechanisms through which models of adult-like visual streams can ""develop"" from less specific/coarser model instantiations. In particular, the authors consider existing ventral stream models whose internal representations and behavior are most brain-like (amongst several other models) and probe how these fair in impoverished regimes of available labeled data and model plasticity (number of ""trainable"" synapses). 
They introduce a novel weight initialization mechanism, Weight Compression (WC), that allows their models to retain good performance even at the beginning of training, before any synaptic update. They also explore a particular methodology for fine-tuning, Critical Training (CT), that selectively updates parameters that seem to yield the most benefit. Finally, they explore these methods/algorithms' transfer performance from one ventral stream model (CORnet-S) to two additional models (ResNet-50 and MobileNet). + +Pros: +The problem that the authors present is an interesting one and undoubtedly useful for many applications. Deep neural networks such as the CORnet-S, ResNet-50, and MobileNet are data-hungry, and obtaining labeled data is an expensive process (and perhaps even implausible in many cases). Techniques to condense these models in terms of parameters and alleviate the need for vast amounts of labeled data while maintaining desirable traits (such as brain-like representations) are important for the machine learning community. Though a bit far-fetched at this point, tracking the developmental trajectories of these neural networks can also have other scientific implications in the form of data-driven hypothesis testing. + +The most exciting part of the study is the transfer experiment (from CORnet-S to ResNet and MobileNet). This seems like an interesting and novel way to construct model taxonomies. For instance, sampling from the CORnet-S weight clusters works well for ResNets potentially because these two models can be construed as ""recurrent"" in a way. MobileNets, on the other hand, are purely feedforward and thus are not significantly influenced by knowledge from the CORnet-S weights. + +Moreover, the authors conduct a series of numerical experiments to identify ""when"" their proposed methods are most useful. The finding that WC+CT is more advantageous in regimes where data is scarce (as opposed to regimes where data is plenty) is not surprising but a good one to report. I say ""not surprising"" because WT distills knowledge from a fully trained model, and CT only updates a fraction of the parameters (updating more parameters would require more data to prevent overfitting). + +Cons: +The authors take the analogy between ""a developing visual system"" and ""training a model"" a bit too far. They operate under the premise that visual circuitry develops purely via ""supervised"" learning. Is there conclusive evidence for this? It is also surprising that discussions of reinforcement learning mechanisms never feature, given that these are more biologically plausible. + +The novelty (and utility; for ex: Fig 2b) of the proposed initialization technique is marginal. It is not articulated how their method (WC) overcomes the critiques they raise against Frankle et al. 2019. Moreover, claiming that WC achieves decent performance with ""zero"" synaptic updates is not fair. This seems to be closer to restoring pre-trained weights than to random initialization (like KN). + +For CT, the authors choose ""critical"" layers to update. Is there a rationale (or a statistical metric) that justifies choosing these specific layers? + +The WC kernel cluster center visualization analysis (Fig. 5c) seems out of place and poorly discussed. What can be gleaned from the 3x3 kernels shown here? + +Minor: +By ""supervised updates,"" the authors refer to the number of available labels and not the number of parameter updates that happen. This terminology is non-canonical. 
+ +Employing Gabor priors for the first convolutional layer: Doesn't orientation selectivity emerge in the primary visual areas from experience, rather than structurally hard-coded? + +The authors allude to the possibility of using ""local"" learning rules on a subset of layers identified by CT. However, this is speculation from the point of view of the current manuscript. All the conclusions drawn are from ""global"" gradients. + +Ambiguous sentence (Pg. 6, Sec 6): ""Reducing the number of supervised updates minimizes required updates by a smaller number of epochs and images."" + +(Pg. 8) ""synaptic updates primarily take place in higher cortical regions"": Is there evidence for this? + +Numerical imprecisions: +(i) The authors claim that the performance of CORnet-S_wc is 54% (relative to the fully trained model). However, in Fig 2b (mean) and Fig 3c (top) the markings seem to be closer to 50%? +(ii) (Fig. 4a) The performance of MobileNet seems to be slightly better than CORnet-S, which contradicts the initial claim that CORnet-S is currently the best available model of adult primate visual processing.",6,4.0,ICLR2021 +SJlU2stRYH,1,r1gBOxSFwr,r1gBOxSFwr,Official Blind Review #3,"This paper proposes a way to compress Bert by weight pruning with L1 minimization and proximal method. This paper is one of the first works aiming at Bert model compression. +The authors think the traditional pruning ways can not work well for Bert model, so they propose Reweighted Proximal Pruning and conduct experiments on two different datasets. According to their results, they successfully compress 88.4% of the original Bert large model and get a reasonable accuracy. + +Strong points: +1. The authors propose a new method RPP for Bert model compression. +2. The authors design experiments to show their RPP can get a very good prune ratio with reasonable accuracy. + +Weak points: +1. The authors should provide a detailed and rigorous explanation for the drawback of existing pruning methods. +2. In the experiments, the authors only compare RPP with self-designed method NIP instead of any existing pruning method. The reason they said is “these methods do not converge to a viable solution''. It would be better if they are also compared and analyzed in detail. +3. In the CoLA and QNLI datasets of Bert_large experiments, RPP can get a higher accuracy even than the original Bert_large model without pruning? This is counter-intuitive. +4. About the metrics, the authors use F1 score and accuracy, the standard metrics in the GLUE benchmark for different tasks, except for CoLA. It might make sense to also keep the metrics for CoLA consistent with GLUE benchmark for better comparison. +5. It is not clear what the authors want to express in Figure 2. The generation of the figure needs more explanation, and the results need to be better interpreted. +",6,,ICLR2020 +HkWEVNk4g,1,r1rz6U5lg,r1rz6U5lg,"Interesting ideas, not sure it belongs @ ICLR","Two things I really liked about this paper: +1. The whole idea of having a data-dependent proposal distribution for MCMC. I wasn't familiar with this, although it apparently was previously published. I went back: the (Zhu, 2000) paper was unreadable. The (Jampani, 2014) paper on informed sampling was good. So, perhaps this isn't a good reason for accepting to ICLR. + +2. The results are quite impressive. The rough rule-of-thumb is that optimization can help you speed up code by 10%. 
The standard MCMC results presented on the paper on randomly-generated programs roughly matches this (15%). The fact that the proposed algorithm get ~33% speedup is quite surprising, and worth publishing. + +The argument against accepting this paper is that it doesn't match the goals of ICLR. I don't go to ICLR to hear about generic machine learning papers (we have NIPS and ICML for that). Instead, I go to learn about how to automatically represent data and models. Now, maybe this paper talks about how to represent (generated) programs, so it tangentially lives under the umbrella of ICLR. But it will compete against more relevant papers in the conference -- it may just be a poster. Sending this to a programming language conference may have more eventual impact. + +Nonetheless, I give this paper an ""accept"", because I learned something valuable and the results are very good. ",7,4.0,ICLR2017 +Syx3uzKGnQ,2,HkxOoiAcYX,HkxOoiAcYX,An interesting paper but the observations from the experiments could be stated more clear.,"Response to author comments: + +I would like to thank the authors for answering my questions and addressing the issues in their paper. I believe the edits and newly added comments improve the paper. + +I found the response regarding the use of your convergence bound very clear. It is a very reasonable use of the bound and now I see how you take advantage of it in your experimental work. However, I believe the description in the paper, in particular, the last two sentences of Remark 1, could still be improved and better explain how a reasonable and computationally feasible n was chosen. + +To clarify one of my questions, you correctly assumed that I meant to write the true label, and not the output of the network. + + +*********** + +The paper revises the techniques used in Tishby’s and Saxe et al. work to measure mutual information between the data and a hidden layer of a neural network. The authors point out that these previous papers’ measures of mutual information are not meaningful due to lack of clear theoretical assumptions on the randomness that arises in DNNs. + +The authors propose to study a perturbed version of a neural network to turn it into a noisy channel making the mutual information estimation meaningful. The perturbed network has isotropic Gaussian noise added to each layer nodes. The authors then propose a method to estimate the mutual information of interest. They suggest that the mutual information describes how distinguishable the hidden representation values are after a Gaussian perturbation (which is equivalent to estimating the means of a mixture of Gaussians). Data clustering per class is identified as the source of compression. + +In addition to proposing a way to estimate a mutual information of a stochastic network, the authors analyze the compression that occurs in stochastic neural networks. + +It seems that the contribution is empirical, rather than theoretical, as the theoretical result cited is going to appear in a different article. After reading that the authors “develop sample propagation (SP) estimator”, I expected to see a novel approach/algorithm. However, unless I missed something, the proposed method for estimating MI for this Gaussian channel is just doing MC estimation (and no guarantees are established in this paper). The convergence bounds for the SP estimator are presented(Theorem 1), however, the result is cited from another article of the authors, so it is not a contribution of this submission. 
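To spell out what I mean by plain MC estimation here: for a layer T = f(X) + Gaussian noise, the mutual information reduces to a mixture-of-Gaussians entropy that can be estimated by averaging over samples. A minimal version of that computation (my own illustration of the generic estimator, not the authors' exact SP implementation) is:

```python
import numpy as np
from scipy.special import logsumexp

def mc_mutual_information(fx, sigma):
    # fx: (n, d) noiseless layer outputs f(x_i); the noisy layer is T = f(X) + N(0, sigma^2 I)
    n, d = fx.shape
    t = fx + sigma * np.random.randn(n, d)    # one noisy sample per input
    # log of the mixture density (1/n) * sum_j N(t_i ; f(x_j), sigma^2 I)
    sq_dists = ((t[:, None, :] - fx[None, :, :]) ** 2).sum(-1)          # (n, n)
    log_kernel = -sq_dists / (2 * sigma ** 2) - 0.5 * d * np.log(2 * np.pi * sigma ** 2)
    log_mix = logsumexp(log_kernel, axis=1) - np.log(n)
    h_t = -log_mix.mean()                                                # MC estimate of h(T)
    h_t_given_x = 0.5 * d * np.log(2 * np.pi * np.e * sigma ** 2)        # exact Gaussian entropy
    return h_t - h_t_given_x                                             # estimate of I(X; T) in nats
```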
+ +Since the authors have this convergence bound stated in Theorem 1, it would be great to see it being used - how many samples are needed/being used in the experiments? What should the error bars be around mutual information estimates in the experiments? If the bound is too loose for a reasonable number of samples, then what’s the use of it? + +The authors perform two types of experiments on MNIST. The first experiment demonstrates that no compression is observed per layer and the mutual information only increases during training (as measured by the binning approach, which is supposed to track the mutual information of the stochastic version of the network). The second experiments demonstrates that deeper layers perform more clustering. + +Regarding the first experiment, could the authors clarify how per unit and per entire layer compression estimation differs? + +Also, in my opinion, more clustered representations seem to indicate that the mutual information with the output increases. Could the authors comment on how the noise levels in this particular version of a stochastic network affects the mutual information with the output and the clustering? Do more clustered representations lead to increased mutual information of the layer with the output? + +I found it fairly difficult to summarize the experimental contribution after the first read. I think the presentation and summary after each experiment could be improved and made more reader friendly. For example, the authors could include a short section before the experiments stating their hypothesis and pointing to the experiment/figure number supporting their hypothesis.",7,4.0,ICLR2019 +BJeuOnp19S,3,r1lZgyBYwS,r1lZgyBYwS,Official Blind Review #2,"This paper proposes a method for lossless image compression consisting of a VAE and using a bits-back version of ANS. The results are very impressive on a ImageNet (but maybe not so impressive on the other benchmarks). The authors also discuss how to speed up inference and present some frightening runtime numbers for the serial method, and some better numbers for the vectorized version, though they're nowhere close to being practical. + +I think this paper should be accepted. It has a better description of the BB ANS algorithm than I have read before, and it's a truly interesting direction for the field, despite the lack of immediate applicability. + +If we are to accept this paper, I suggest the authors put a full description of the neural network used (it's barely mentioned). I think the authors also need to disclose how long it took to compress an average imagenet image (looking at the runtime numbers for 128x128 pixels is scary, but at least we'd get a better picture on the feasability). + +Overall, due to the fact that the authors pledge to open source the framework, I think some of the details will be found in the code, once released. I think this is an important step because there are so many details in this paper that one cannot reasonably reproduce the work by simply reading the text of this paper.",6,,ICLR2020 +BkgiaQoTFH,1,Bygadh4tDB,Bygadh4tDB,Official Blind Review #2,"***Score updated to weak accept after the rebuttal.*** + +Straight-Through is a popular, yet not theoretically well-understood, biased gradient estimator for Bernoulli random variables. The low variance of this estimator makes it a highly useful tool for training large-scale models with binary latents. However, the bias of this estimator may cause divergence in training, which is a significant practical issue. 
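For reference, the estimator under discussion amounts to the following few lines (a standard textbook formulation written by me for illustration, not code from the paper): sample the Bernoulli in the forward pass and pretend the sampling was the identity map in the backward pass, which is exactly the source of the bias this paper sets out to analyse.

```python
import torch

class StraightThroughBernoulli(torch.autograd.Function):
    @staticmethod
    def forward(ctx, p):
        # forward: draw a hard 0/1 sample from Bernoulli(p)
        return torch.bernoulli(p)

    @staticmethod
    def backward(ctx, grad_output):
        # backward: pass the gradient straight through, as if the sample were p itself
        return grad_output

# usage: z = StraightThroughBernoulli.apply(torch.sigmoid(logits))
```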
The paper develops a Fourier analysis of the Straight-Through estimator and provides an expression for the bias of the estimator in terms of the Fourier coefficients of the considered function. Motivated by this expression, the paper proposes two modifications of Straight-Through which may reduce the bias of the estimator, at the cost of the variance. The experimental results show advantage of this improved estimator over Gumbel-Softmax and DARN estimator. + +While I really like the premise of the paper, I feel that it needs a significant amount of additional work. The text is currently fairly hard to read. The theoretical part of the paper does not quantify the variance of the estimator. The experiments are a bit unfinished and do not include ablations of the proposed modifications of Straight-Through. Most importantly, I think that in the current form the theoretical and the empirical parts of the papers are not well-connected. Because of this, I believe that the paper should currently be rejected, but I encourage the authors to continue this line of work. + +Pros: +1. Theoretical analysis and empirical improvement of the Straight-Through estimator is an important avenue of work. +2. The paper makes a solid contribution of deriving the Fourier expansion of the Straight-Through estimator bias. +3. Based on this expansion, the paper proposes an algorithm with reduced bias. The algorithm is simple to implement, practical and appears to work slightly better than DARN. + +Cons: +1. The key weakness of the theoretical part of the paper is that it focuses on the bias of the estimator, but does not quantify the variance, especially after the modifications. If reducing the bias was the only goal, one could use unbiased (but high-variance) estimators such as REINFORCE or VIMCO. +2. The final algorithm appears to be the DARN estimator combined with relaxation by uniform noise (“Bernoulli splitting uniform”) and scaling. The paper does not have an ablation showing how the uniform noise and scaling perform on their own. +3. There are a few incorrect statements that I’ve noticed. +* “As a side contribution, we show that the gradient estimator employed with DARN (Gregor et al., 2013), originally proposed for autoregressive models, is a strong baseline for gradient estimation.” - MuProp paper compared to this estimator under the name 1/2-estimator +* In Lemma 1 the “REINFORCE gradient” is just the exact gradient of the expectation, not a stochastic REINFORCE gradient. +* “ To the best of our knowledge, FouST is the first gradient estimate algorithm that can train very deep stochastic neural networks with Boolean latent variables.” This paper uses up to 11 latent variable layers, while [1] has trained models with >20 latent variable layers (although their “layers” have just one unit). +4. The derivation of “Bernoulli splitting uniform” trick is confusing and contains a lot of typos. For instance, the text before eqn. (14) implies that the distribution of u_i is U[-1, 1], which cannot be right and does not correspond to Algorithm 1. The statement that this trick does not lead to a relaxation is odd, since the function is being evaluated at non-discrete points. +5. There are generally many typos and some poor formatting in the math. For example, in eqn. (6) the coefficients are off by one: it should be c0 + c1 z1 + c2 z2^2 + … . The equations (10) and (11) are poorly formatted. The notation \partial_z1 f(u_1, u_2) in eqn. (14) is strange. In many places p^{i->½} is denoted as p^{1->½}. +5. 
I don’t think I understood the idea of representation scaling (Section 4.4). The eqn. (16) would suggest that the scaling should optimally be set to zero, which is just saying that the gradient is unbiased when the model does not use the latents. There is no other practical guidance on choosing this coefficient. Furthermore, one can always absorb the global scaling factor into the succeeding weights layer of the model, so this trick can probably be replaced by a modification of the weights initialization. +6. The experiments are missing a comparison to the Straight-Through Gumbel-Softmax estimator, introduced in the original Gumbel-Softmax paper. This is a popular biased estimator for Bernoulli latents, e.g. used in [1] [2]. Another interesting comparison would be [3] which proposes a lower-bias version of Gumbel-Softmax. +7. Figure 2 is missing the line for REBAR, even though this line is referred to on Page 8. Figure 2 and Figure 4 are both labeled as training ELBOs, despite the plots being different. + +[1] Andreas Veit, Serge Belongie “Convolutional Networks with Adaptive Inference Graphs” ECCV 2018 +[2] Patrick Chen, Si Si, Sanjiv Kumar, Yang Li, Cho-Jui Hsieh “Learning to Screen for Fast Softmax Inference on Large Vocabulary Neural Networks” ICLR 2019 https://openreview.net/forum?id=ByeMB3Act7 +[3] Evgeny Andriyash, Arash Vahdat, Bill Macready “Improved Gradient-Based Optimization Over Discrete Distributions” https://arxiv.org/abs/1810.00116",6,,ICLR2020 +BJxDojhXh7,1,rkxacs0qY7,rkxacs0qY7,Interesting and timely contributions hampered by rushed submission,"-- Paper Summary -- + +The primary contribution of this paper is the presentation of a novel ELBO objective for training BNNs which allows for more meaningful priors to be encoded in the model rather than the less informative weight priors featured in the literature. This is achieved by way of introducing a KL measure over stochastic processes which allows for priors to take the form of GP priors and other custom variations. Two approaches are given for training the model, one inspired by GANs, and a more practical sampling-based scheme. The performance of this training scheme is validated on a variety of synthetic and real examples, choosing Bayes by Backprop as the primary competitor. An experiment on contextual bandit exploration, and an illustrative Bayesian optimisation example provided in the supplementary material showcase the effectiveness of this method in applications where well-calibrated uncertainty is particularly pertinent. + +-- Critique -- + +This paper makes important strides towards giving more meaningful interpretations to priors in BNNs. To the best of my knowledge, the KL divergence between stochastic processes that gives rise to an alternate ELBO has not been featured elsewhere, making this a rather interesting contribution that is supplemented by suitable theorems both in the main text and supplementary material. The introductory commentary regarding issues faced with increasing the model capacity of BNNs is particularly interesting, and the associated motivating example showing how degeneracy is countered by fBNN is clear and effective. + +The GAN-inspired optimisation scheme is also well-motivated. Although the authors understandably do not pursue that scheme due to the longer computation time incurred (rendering its use impractical), it would have been interesting to see whether the optimum found using this technique is superior to the sampling based scheme used throughout the remainder of the paper. 
The experimental evaluation is also very solid, striking an adequate balance between synthetic and real-world examples, while also showcasing fBNNs’ effectiveness in scenarios relying on good uncertainty quantification. + +In spite of the paper’s indisputable selling points, I have several issues with some aspects of this submission. For clarity, I shall distinguish my concerns between points that I believe to be particularly important, and others which are less significant: + +- Monte Carlo dropout (Gal & Ghahramani, 2016), and its extensions (such as concrete dropout), are widely-regarded as being one of the most effective approaches for interpreting BNNs. Consequently, I would have expected this method to feature as a competitor in your evaluation, yet this method does not even get a cursory mention in the text. + +
- The commentary on GPs in the related work paints a dour picture of their scalability by mostly listing older papers. However, flexible models such as AutoGP (Krauth et al, 2017) have been shown to obtain very good results on large datasets without imposing restrictions on the choice of kernels. + + - The regression experiments all deal with a one-layer architecture, for which the proposed method is shown to consistently obtain better results. In order to properly assess the effectiveness of the method, I would also be interested in seeing how it compares against BBB for deeper architectures on this problem. Although the authors cite the results in Figure 1 as an indicator that BBB with more layers isn’t particularly effective, it would be nice to also see this illustrated in the cross-dataset comparison presented in Section 5.2. + + - Furthermore, given that all methods are run for a fixed number of iterations, it might be sensible to additionally report training time along with the results in the table. This should reflect the pre-processing time required to optimise GP hyperparameters when a GP prior is used. Carrying out Cholesky decompositions for 1000x1000 matrices 10k times (as described in Section 5.2.2) does not sound insignificant. + +- The observation regarding the potential instability of GP priors without introducing function noise should be moved to the main text; while those who have previously worked with GPs will be familiar with such issues, this paper is directed towards a wider audience and such clarifications would be helpful for those seeking to replicate the paper’s results. On a related note, I would be keen on learning more about other potential issues with the stability of the optimisation procedure, which does not seem to be discussed upfront in the paper but is key for encouraging the widespread use of such methods. + +- The paper contains more than just a handful of silly typos and grammatical errors - too many to list here. This single-handedly detracts from the overall quality of the work, and I highly advise the authors to diligently read through the paper in order to identify all such issues. + + - The references are in an absolute shambles, having inconsistent citation styles, arXiv papers cited instead of conference proceedings, etc. While this is obviously straightforward to set right, I’m nonetheless disappointed that this exercise was not carried out prior to the paper’s submission. + + - The theory presented in Appendix A of the supplementary material appears to be somewhat ‘dumped’ there. Given that this content is crucial for establishing the correctness of the proposed method, linking them more clearly to the main text would improve its readability and give it a greater sense of purpose. I found it hard to follow in its current state. + +** Minor ** + + - In the introduction there should some mention of deep Gaussian processes which are implicitly a direct competitor to BNNs, and can now also be scaled to millions and billions of observations (Cutajar et al. 2017; Salimbeni et al. 2017). The former is particularly relevant to this work since the architecture can be assimilated to a BNN with special structure for emulating certain kernels. + + - Experiment 5.1.1 is interesting, and the results in Figure 2 are convincing. I would also be interested in seeing how fBNN performs when the prior is misspecified however, which may be induced by using a less appropriate GP kernel. 
This would complement the already provided insight on using tanh vs ReLU activations. + + - The performance improvement for the experiment on large regression datasets is quite subdued, so it might be interesting to see how both methods compare against each other when deeper BNN architectures are considered.
 + +- With regards to Appendix C.2, which order arccosine kernel is being used here? One can easily draw similarities between the first order arccosine kernel and NN layers with ReLUs, so perhaps it would be useful to specify which order is being used in the experiment.

 + +- Given that the data used for experiments in Appendix C.3 effectively has grid structure, I would be interested in seeing how KISS-GP performs on this task. There should be easily accessible implementations in GPyTorch for testing this out. Given how GPs tend to not work very well on image completion tasks due to smoothness in the kernel, this comparison may also be in fBNNs favour. + +- Restating the basic architecture of the BNN being used for the contextual bandits experiment in the paper itself would be helpful in order to avoid having to separately check out Riquieme et al (2018) to find such details. + +- I wonder if the authors have already thought about the extendability of their proposal to more complex BNN architectures such as Bayesian ConvNets? + + +-- Recommendation -- + +Whereas several ICLR submissions tend heavily towards validation by way of empirical evaluation, I find that the theoretic contributions presented in this paper are by themselves interesting and well-developed, which is very commendable. However, there are multiple telling signs of this being a rushed submission, and I am less inclined to argue ardently for such a paper’s acceptance. Although the paper indeed has its strong points, both in terms of novelty and varied experimental evaluation, in view of this overall lack of finesse and other concerns listed above, I think that the paper is in dire need of a thorough clean-up before being published. + +Pros/Cons summary: + ++ Interesting concepts that extend beyond empirical fixes. ++ Defining more interpretable priors is a very pertinent topic in the study of BNNs. ++ The presented ideas could potentially have notable impact. ++ Illustrative experiments and benchmark tests are convincing. +- Not enough connection to MC dropout. +- Choice of experiments and description of stochastic processes overly similar to other recent widely-publicised papers. It feels on trend, but consequently also somewhat reductive. +- More than a few typos and grammatical errors. +- Presentation is quite rough around the edges. The references are in a particularly dire state.",6,4.0,ICLR2019 +f4VyLoRijaG,4,yUxUNaj2Sl,yUxUNaj2Sl,"interesting results and experiments, might need more comprehensive experiments","Summary of the paper: +This paper tries to study whether increasing shape-bias of a neural network trained with imagined will make it more robust to common corruptions such as gaussian noises. +The paper falsified this point by producing a data augmentation method which leads to more shape biased network yet less susceptive to common corruptios. +The paper further hypothesize that it’s the stylization augmentation that leads to increase robustness of the network by ablation studies. + +Strength: +The paper does provide a counter example to the common hypothesis that increase shape bias can lead to more robust network. This provides insight to future researches on understanding how shape-bias and texture-bias affects on neural network robustness. If the results of the counter example is reproducible and significance, then the claim is convincing and the paper did a good job verifying such hypothesis. + +Weakness: +1. The experiment results seemed to be limited in this dataset. In order to make such general claim, I would expect results for different datasets (e.g. CIFAR), large scale datasets (e.g. the whole imagined), and different network and training procedure (e.g. not just resent). +2. 
This might be some minor things, but it would be nice if there are statistic significance test for the results (or at least show the variance of couple runs). When I looked at the difference of the number, it would be nice that one can make sure such results is statistically significant with respect with difference runs. +3. Most of the paper is very empirical, and there is little insight or theory or principal ways that organize the results. + + +Justifications: +While I do like the paper that provide insight and show negative results falsifying some prevalent claim, but the paper provides rather limited evaluation without theoretical insights. Since I’m not an expert in this field, I will recommend borderline scores to hear about the authors’ response. + +------------------- +UPdate: the authors' reply address my concerns well, so I raise my rating to the acceptance side.",6,2.0,ICLR2021 +US1E9bhU1QJ,2,zv-typ1gPxA,zv-typ1gPxA,"This paper leverages similar codes to help generate code summarization, and an attention-based dynamic graph model is introduced to further capture the global graph information.","Summary: + +This paper leverages similar code-summary pairs from existing data to assist code summary generation. The model first retrieves a similar code snippet from the existing database. Then, the author applied GNN over the code property graphs (CPGs). A challenge is that CPGs are typically deep therefore it is difficult to capture long dependencies. The author proposed an attention mechanism to capture global information between nodes, and then a hybrid GNN layer encodes the retrieve-augmented graph. Finally, a generator takes both GNN's output and the retrieved text summary and predict outputs. Experimental results over a new C code indicates that the proposed method outperforms both IR and neural generation methods. + +######################################## + +Reason for score: + +Overall, I vote for accepting. Both the idea of leveraging existing code and also the adaptive layer to capture long dependencies are interesting and the experiments look solid. Although I would still like to see the results from previous existing datasets. + +######################################## +Some comments about the experiments: + +a. As an application study, it is still necessary to compare the model over previous benchmarks, even though there are some issues with those datasets. + +b. A pair of missing ablation studies are: a generator still takes the text summary of retrieved code, but not use the augmented graph; and vice versa, the generator only takes the graph information but not the retrieved text summary. This can further indicate which part of the retrieved information is more useful. ",7,3.0,ICLR2021 +S1e467GThm,2,BygmRoA9YQ,BygmRoA9YQ,Synthetic naive approach to handling distorted images by deep neural networks,"The paper presents a synthetic naive approach to analyzing distorted, especially noisy, images through deep neural networks. It uses an existing gating network to discriminate between clean and noisy images, averaging and denoising the latter, so as to somewhat improve the results obtained if no such separation was used. It deals with a well known problem using the deep neural network formulation. Results should be compared to other image analysis methodologies, avoiding smoothing when not required, that can be used for the same purpose. This should also be reflected in related work in section 2; the reason of including Table 1 in it seems unclear. 
+",4,5.0,ICLR2019 +SJRT6dgEl,2,H1wgawqxl,H1wgawqxl,"Good paper, room for more empirical study","Summary: + +The paper introduces a parametric class for non linearities used in neural networks. The paper suggests two stage optimization to learn the weights of the network, and the non linearity weights. + +significance: + +The paper introduces a nice idea, and present nice experimental results. however I find the theoretical analysis not very informative, and distractive from the main central idea of the paper. + +A more thorough experimentation with the idea using different basis and comparing it to wider networks (equivalent to the number of cosine basis used in the leaned one ) would help more supporting results in the paper. + + +Comments: + +- Are the weights of the non -linearity learned shared across all units in all layers ? or each unit has it is own non linearity? + +- If all weights are tied across units and layers. One question that would be interesting to study , if there is an optimal non linearity. + +- How different is the non linearity learned if the hidden units are normalized or un-normalized. In other words how does the non linearity change if you use or don't use batch normalization? + +- Does normalization affect the conclusion that polynomial basis fail? + + + + + +",6,4.0,ICLR2017 +KjKVwahcz-,2,F438zjb-XaM,F438zjb-XaM,"A ""Fon"" Project: How to Translate an African Low-resource Language","The authors investigate different tokenization methods for the translation between French and Fon (an African low-resource language). This means that they compare different ways to construct the input and output vocabularies of a neural machine translation (NMT) system. They further propose their own way to create those units, based on phrases, which is called WEB. + +The NMT system the authors use follows Bahdanau et al. (2015): it is a GRU sequence-to-sequence model with attention. The dataset they use has been created and cleaned by bilingual speakers and consists of roughly 25k examples (this is a really small dataset for NMT, so the authors are taking on a really hard task!). + +WEB works in the following way: after phrases have been found automatically, bilingual speakers analyze what the longest phrases which correspond to translated phrases in the other language are. Only the longest phrases for each example are kept for the final vocabulary. The authors show that WEB improves the performance in both translation directions by a lot on all metrics, clearly showing that the work they invest into creating the vocabulary pays out. Thus, I think this work is important to be able to provide speakers of Fon with a functioning translation system. + +However, I am unsure if this work is suitable for a machine learning conference. While the overall goal of this work is to create an NMT system, the main contribution is the manual cleaning of the dataset and semi-manual creation of the vocabularies. I would recommend to the authors to submit this paper to a conference with a stronger focus on NLP and NLP resources (maybe LREC)? I further want to emphasize that I think work like this paper is incredibly important and the authors shouldn't feel discouraged. Importantly, the manual labor needed for WEB has been a lot and it's obvious that it helps for NMT. I just don't think that this paper is a good fit for ICLR. + +Minor point: has the creation of WEB access to the test data? If so, the authors should change that (or collect new test data?) to ensure a fair evaluation. 
",3,3.0,ICLR2021 +Byev-9Zwcr,4,HyezBa4tPB,HyezBa4tPB,Official Blind Review #4,"Motivated by real-world challenges in applying pre-trained models, the authors propose a model for selective prediction (prediction with an option for abstention) that wraps an existing black-box classification model. The resulting model output is a Dirichlet distribution with mean equal to the categorical distribution produced by the black-box and concentration parameter specified by a separate auxiliary model. This additional model is trained to minimize negative log-likelihood of observations under categorical distributions sampled from the aforementioned Dirichlet along with an L1 regularization term on the concentration parameter. To infer the model’s level of uncertainty, the authors propose computing the entropy of the average of sampled categorical distributions. + +The authors evaluated this model on several pairs of sentiment-analysis NLP tasks and one pair of image datasets where a base model is trained on the source dataset, and the auxiliary model is trained on the second, target dataset. Using metrics proposed in (Condessa et al. 2015), the results show positive results at nearly all thresholds compared to a simple entropy baseline. + +The paper addresses an important practical challenge in machine learning, but a confusing problem-framing and lack of robust baselines make me skeptical that it is suitable for publication at ICLR 2020. + +A primary concern with this work is its framing of the problem as one of measuring aleatoric (irreducible) uncertainty, but the motivation in transfer learning and interdependence in production ML systems requires models that can characterize epistemic (reducible) uncertainty. A black box model that yields distributions over classes expresses aleatoric uncertainty via that distribution, and uncertainty due to a shifted data distribution is epistemic as additional data from the new domain would reduce it. + +More problematic is the study’s lack of robust baselines. The authors only present the predictive entropy baseline, but numerous methods exist for out-of-distribution detection and selective classification. Though the specific case of selective classification from a blackbox base model is perhaps more niche, other methods from related problems can either be adapted accordingly or used as upper / lower bounds on what we can expect for this problem. + +Some simple baselines I would expect to see include: +* Training a new classification model entirely on the new domain. +* Training an auxiliary classifier to predict if the base model will be correct (similar to SelectiveNet but without a shared network body). Ideally this model should also have access to the base-model’s prediction as input. +* “Confidence score” (i.e. the probability assigned to the base-model’s predicted class) -- this is a common baseline for OOD detection. + +Additional questions / concerns: +* Do I understand correctly that when drawing samples for the entropy calculation, $E[\hat{y}]$ will equal the black-box model’s prediction in the limit of sample size? If so, this looks like an inefficient and complicated proxy for measuring the concentration of categorical entropies produced by the Dirichlet. +* Since the paper is focused on scenarios where one is taking advantage of a pre-trained model, one might wonder if the wrapping scheme is indeed less expensive than a new model trained from scratch on the new domain (e.g. with comparable capacity to the proposed wrapping model). 
Authors should include experiments / baselines to assess this. +* In section 2, the paper asserts that assuming access to logits breaks the blackbox assumption, but these are computable from softmax values (up to constant factors). +* Is the beta regularization term is theoretically required to prevent unbounded growth or is it simply an empirically practical necessity? +* In the image transfer experiments, STL-10 has relatively very few labeled examples (hundreds per class), so why was this only used as the source domain? Realistic transfer learning generally entails an expensive model trained on a source domain with plentiful data transferred to a target domain with scarce data. +* Authors should mention that STL-10 images represent a distribution shift from CIFAR-10 as the two were not generated identically. The paper currently only highlights differences in dataset size. +",1,,ICLR2020 +1HtSNxGra9I,2,j6rILItz4yr,j6rILItz4yr,"Although the motivation of this study is clear, the proposed method is not appropriately designed along with the motivation.","--Paper summary-- + +The authors propose Adversarial Feature Augmentation (ALFA), which augment features at hidden layers by adding adversarial perturbations. Where and how strongly the augmentation is conducted is automatically optimized via training. Experimental results show that the proposed method consistently improves the performance of baselines over several datasets and network architectures. + +--Review summary-- + +Although the motivation of this study is clear, the proposed method is not appropriately designed along with the motivation. Moreover, its novely is merginal. I vote for rejection. + +--Details-- + +Strength + +- The motivation is clear and seems reasonable. Training with adversarial perturbations is known to be effective but computationally expensive. It can be problematic when the model or training data is large-scale. +- The proposed method consistently improves the performance of baselines over several datasets and network architectures. + +Weakness and concerns + +- Is the computational complexity of the proposed method really small? Since the adversarial perturbation is computed for every layer, its computational complexity should be almost same with that of standard adversarial training. +- The training objective shown in Eq. (3) is not reasonable. Since the norm of \delta is upper-bounded by a certain constant \epsilon, the effect of the adversarial perturbation can be reduced just by increasing the scale of features. Are features normalized ones? +- The optimization of \eta in L-ALFA is not reasonable. Since min_\eta comes after max_\delta, L-ALFA should choose the layer that corresponds to the smallest increase of loss by adding adversarial perturbation. Therefore, this design minimizes the effect of the augmentation, which is contradictive to the motivation of introducing \eta. Moreover, since \epsilon is common for all layers, the optimal \eta should be sensitive to the scale of features, which indicates that the performance of the proposed method would heavily depend on both how to initialize the model and whether any normalization is conducted in the model or not. +- Marginal novelty. An idea of adversarially augmented features has been already presented in [R1]. +[R1] ""Training Deep Neural Networks with Adversarially Augmented Features for Small-scale Training Datasets,"" IJCNN 2019. 
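To make the kind of feature-level adversarial augmentation discussed above concrete, here is a one-step FGSM-style sketch of the general idea (my simplification for illustration only; the paper's actual inner maximization, per-layer placement and strength scheduling differ):

```python
import torch

def perturbed_feature_loss(f_lower, f_upper, x, y, eps, loss_fn):
    # f_lower: layers up to the augmented hidden layer; f_upper: the remaining layers
    h = f_lower(x).detach().requires_grad_(True)
    loss = loss_fn(f_upper(h), y)
    grad, = torch.autograd.grad(loss, h)
    delta = eps * grad.sign()                          # one-step adversarial perturbation of the features
    return loss_fn(f_upper(f_lower(x) + delta), y)     # train on the augmented hidden features
```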
+",4,4.0,ICLR2021 +BJeblVM6tr,2,r1l7E1HFPH,r1l7E1HFPH,Official Blind Review #2,"The main contributions of this paper are k-PI-DQN and k-VI-DQN, which are model-free versions of dynamic programming (DP) methods k-PI and k-VI from another paper (Efroni et al., 2018). The deep architecture of the two algorithms follows that of DQN. Efroni et al. (2018b) already gave a stochastic online (model-free) version of k-PI in the tabular setting. Although this paper is going one step further extending from tabular to function approximation, I feel that the paper just combined known results, the shaped reward from Efroni et al (2018a) and DQN. The extension seems straightforward. Mentioning previous results from Efroni et al (2018a) and (2018b) does not justify the extension would possess the same property or behaviour. The experiments were only comparing their methods with different hyperparameters, with only a brief comparison to DQN. ",3,,ICLR2020 +rJxYbry4hm,1,rkGcYi09Km,rkGcYi09Km,Novelty and advantage of the proposed methods are limited,"The paper explores unsupervised deep learning model for extractive telegraphic summaries, which extracts text fragments (e.g., fragments of a sentence) as summaries. The paper is in general well structured and is easy to follow. However, I think the submission does not have enough content to be accepted to the conference. + +First, in term of methodology (as described in Section 3), the paper has little novelty. There has been intensive study using various deep learning models on summarization. The models described in the paper contain little novelty compared with previous work using autoencoder and LSTM for both extractive and abstractive summarization. + +Second, the paper claims contributions on using deep learning models on telegraphic summarization, but the advantage is not well demonstrated. For example, the advantage of the resulting summary is not compared with state-of-the-art sentence compression models with intrinsic evaluation or (probably better) with extrinsic evaluation. (By the way, it is interesting that the paper argues the advantage of using telegraphic summaries for fictional stories but actually gives an example which looks also very typical in news articles (the “earthquake Tokyo 12 dead” example).) + +Third, there has been much work on speech summarization that summarizes with the “telegraphic” style (this is natural, considering speech transcripts are often non-grammatical, and “telegraphic” style summaries focusing on choosing informative fragments actually result in usable summaries.) The author(s) may consider discussing such work and compare the proposed methods to it. +",4,4.0,ICLR2019 +wCjfEX7E1Kv,1,hr-3PMvDpil,hr-3PMvDpil,This paper presents a provable defense method called BAGCERT against patch attacks which uses an invariant of BagNet for certification. ,"This paper presents a provable defense method called BAGCERT against patch attacks which uses an invariant of BagNet for certification. By using the network with small receptive fields, this paper first analyzes the worst-case classification. The basic certification process is created by using a novel aggregation function. Finally, after using the same certification conditions as the Derandomized smoothing (Levine & Feizi, 2020), the certification could be evaluated within constant time. To further reduce the impact of the adversarial patch, the proposed method uses the certification condition as the objective loss to train the network. 
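For readers unfamiliar with that condition, the vote-counting form used in derandomized smoothing is roughly the check below (my paraphrase with simplified tie handling; delta denotes the number of spatial locations whose receptive field a worst-case patch can touch):

```python
import numpy as np

def patch_certified(location_preds, num_classes, delta):
    # location_preds: integer per-location class decisions from the small-receptive-field network
    counts = np.bincount(location_preds, minlength=num_classes)
    order = np.argsort(counts)
    top, runner_up = counts[order[-1]], counts[order[-2]]
    # a patch can remove at most delta votes from the top class and add delta to another class,
    # so the prediction is certified if the margin exceeds 2 * delta
    return (top - runner_up) > 2 * delta
```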
Empirical studies show the superiority of BAGCERT over other approaches. + +Advantages: +1.The paper gives good formal descriptions and rigorous proofs. +2.The certification section is well structured. The narrative is logically layered and well-organized. +3. The paper gives a SOTA certified accuracy with high clean accuracy on ImageNet. + +Concerns: +1. Overall, the paper uses essentially the same certification conditions as the Derandomized smoothing except that it replaces the network architecture with BagNet. In addition, the table that describes certification time shows that number of parameters of the proposed method is much larger than Derandomized smoothing. Is it fair to compare accuracy under networks with varying numbers of parameters that differ too much (38M:11M)? +2. Could you provide the training cost compared with Derandomized smoothing? In theory, the method is faster. +3. The description of R(l) in 3.1 is so unclear that it is difficult to understand the meaning of R(l). +4. The supplementary material should give some experimental results of the application of the proposed method to practical examples, for example, can you provide some results that uses some existing patch attack methods for evaluation? +",6,4.0,ICLR2021 +B1ew6SL9nX,2,rkeqCoA5tX,rkeqCoA5tX,Extension of AmbientGAN on denoising and demixing problems, but experiments are not sufficient,"This paper proposed two new GAN structures for learning a generative modeling using the superposition of two structured components. These two structures can be viewed as an extension of AmbientGAN. Experiments results on MNIST dataset are presented. Overall, the demixing-GAN structure is relatively novel. However, the potential application seems limited and the experiment result is not sufficient enough to support the idea. Detail comments are as following, + + +1. It seems there are no independent assumption imposed on the addition of two generators. It is possible that the possible model only will works on simple toy example, where the distributions of two structured components are drastic different. Or the performance will be affected by the initialization. It would be nice if the author test this on more realistic examples, such as the source separation problem in acoustic or the unmixing problem in hyper-spectral images. More detail information about the experiments setting, such as the methods used to initialize the two generators are need. +2. In the experiment part, it would be nice to have Quantitive results presented, for example PSNR for denoising. Simple comparison with several traditional methods could also help understanding the advantage of the model. +",5,4.0,ICLR2019 +pHTBGZTbs1t,1,zspml_qcldq,zspml_qcldq,The advantages and significance of the proposed method are unclear and confusing.,"In this paper, the authors present a method to use unstructured external knowledge sources to improve visual question answering and image-caption retrieval. The proposed method can achieve somewhat improvement for visual question answering, but drop the performance for image-caption retrieval with a more complex model. Some concerns are as follows: + +1. The authors claim that the proposed method achieved state-of-the-art performance on both COCO and Flickr30k image-caption retrieval. However, their retrieval scores are lower about 10 than the state-of-the-art counterparts, such as TERAN. The statement is not correct. +2. 
Although the authors stated the proposed method uses raw images as input, the adopted backbones (i.e., image/text encoders) should be frozen to extract the features for the following components in their pipeline, which is similar to the other feature-based methods (e.g., TERAN) that also can be seen as freezing their backbones (e.g., Faster R-CNN) during their training and inference stages. Thus, the inputs between the proposed method and other methods have no essential difference. What is the significance to design such a much more complex model for image-caption retrieval? What are the advantages of the proposed method comparing prior superior methods? I am confused that if it is worthy to adopt such a complex model with worse performance. +3. It is interesting to see that the proposed method could improve the performance of VQA. However, Table 3 does not give us a throughout comparison. There are many results missed in the table, such as different training types for Flickr30K, some results for Movie+MCAN, etc. From the results, we also could draw that the improvement of the proposed method is very limited for a good VQA method, i.e., Movie+MCAN with Vanilla. The experiments could not significantly demonstrate the significance and advantages of the proposed method.",5,5.0,ICLR2021 +E2ldJ2x3n_,4,Wis-_MNpr4,Wis-_MNpr4,Review of Reviewer 1," + +Results - The two main proposed benefits of the approach are increased speedups in inference as well as training over Slalom and just SGX. Looking at Figure 2 and Figure 4, the spped-ups do not appear to be consistently significant. I do appreciate that the authors show the results for MobileNetV2, for which the proposed algorithm is supposed to have less improvement in training time. Overall, I think, given that the main selling point of the time is faster training and inference, it fails to provide a reliable boost in either of them. + +Experimental Comparisons - I would have appreciated a comparison with other methods especially those that allow training while maintaining privacy. While the authors make a note of them in Table 1, their differences with the proposed approach and where one is supposed to be better than the other is discussed neither conceptually nor experimentally (except Slalom) + + +Novelty - The main algorithmic novelty of the paper is exporting the compute-heavy linear computes to a GPU outside the secure enclave by using blinding-unblinding techniques to protect the privacy. In my opinion, while it is a very nice and concise thing to do (along with the results on its privacy guarantee), it is not significantly impactful or interesting to provide enough novelty to this paper. If this would have led to a huge increase in performance margins, I would have said a simple solution that leads to dramatic increase in performance is amazing! However, that is not the case either. + +So, given these reasons I am inclined to reject at this point. However, I would encourage the authors to discuss the other techniques in more detail, compare with them and point out situations where this technique can have a larger impact. + +===== + +I have read the authors' response and my comments remain the same as above especially the paragraph regarding novelty. I am keen to see a revised version of this paper.",4,4.0,ICLR2021 +z5AUHpoTcgL,3,H5B3lmpO1g,H5B3lmpO1g,Weak technical contribution with no major insights from experiments,"This paper uses several different techniques in IL and RL to improve performance on 6D robot grasping. 
It uses an expert planner OMG to collect initial data for BC as well as for online IL via Dagger. The uses DDPG to further train as well as fine tune on new unlabeled objects. + +The topic is very relevant and of current interest as more real world applications will need more than just 2D grasping that bin-picking has addressed and related work is sufficiently discussed. + +The technical contribution seems weak as the paper mostly explores known methods and well-known 'trade-tricks' (goal conditioning or loss on goal) towards a grasping centric problem which is also heavily explored as part of various RL tasks in literature. The main weakness of the work however is the lack of clear motivation for why such a complicated procedure is necessary compared to the expert planner already being used - the experiments aren't designed to address this question. + +- Using a planner as an expert for IL is common practice and I don't think counts as a major contribution as presented at the end of the introduction. + +- The bulk of IL experiments focus on what input representation is helpful. While it is not surprising that 3D inputs like point clouds would be better for 6D grasping these are more suitable as ablations than main experiments investigating the proposed method itself, compared to other sota approaches learning based or otherwise. + +- In several places 'contact-rich' and 'different dynamics' is motivated without clear explanation early on until the experiments identified what the setup was. The former does not seem to be well explored in the experiments, '...especially in those contact-rich scenarios...'. Aren't all grasping problems contact rich (unless only reaching to a pre-grasp is being considered) or were there some new scenarios constructed to specifically study the relative effects of contact? + +- Results in table 1 and figure 4 present mean statistic from 3-5 runs. This seems small, variance bands should be shown in figure 4 to see if the small number of sample are sufficient to capture the full picture. + +- The problems studied could be addressed by planar grasping as well. Complex setups with clutter, etc would better motivate if the presented approach is able to scale to scenarios where 6D grasping is necessary. + +Other comments: + +- Does not including the BC loss for some samples in the batch (or between training iteration) cause any discontinuities or make learning unstable, as the loss landscape discontinuously changes? + +[Update] +Thank you for the responses and clarifications. I appreciate the additional experiments in the real world and comparisons with the open-loop policy. Novelty still remains a concern however; in using a planner as an IL expert, it isn't clear what was challenging to adopt this strategy for the grasping problem and qualifies as a significant contribution. Additionally, the experiments to study 'contact-rich' and 'different dynamics' problems is unclear; the experiments don't indicate what aspects of the proposed method address these challenges and are able to do so with vision/depth-only feedback (no tactile); also the evaluation in simulation alone is insufficient to study such scenarios. I have updated my score accordingly. +",5,4.0,ICLR2021 +SkemmQtecS,3,BkxSmlBFvr,BkxSmlBFvr,Official Blind Review #1,"Authors did an extensive experimental study over neural link prediction architectures that was never done before, in such a systematic way, by other works in this space. 
Their findings suggest that some hyperparameters, such as the loss being used, can provide substantial improvements to some models, and can be the reason of the significant improvements in neural link prediction accuracy the community observed in recent months. + +This is a really interesting paper, and can really shine some light on what was going on in neural link prediction over recent years. It also provides a great overview of the field -- in terms of architectures, loss functions, regularizers, sampling strategies, data augmentation strategies etc. -- that is really needed right now in the field. + +One concern I have is that the hyperparameter tuning strategy is not really described -- authors just say something along the lines of ""we use av.dev"", but for those unfamiliar with this specific hyperparameter optimiser this does not provide much information (e.g. what is a Sobol sequence? I had to look it up).",8,,ICLR2020 +SkgGSxr93m,3,r1l73iRqKm,r1l73iRqKm,Good work,"This work proposes a brand new dataset to fill in the vacancy of current conversational AI community, specifically the introduced dataset aims at providing a platform to perform large-scaled knowledge-grounded chit-chat. Overall, the dataset is well-motivated and well-designed, its existence will potentially benefit the community and inspire more effective methods to leverage external knowledge into dialog system. Besides, the paper also utilizes many trending models like Transformers, Memory Networks, etc to ensure the state-of-the-art performance. The clear structure and paragraphs also makes the paper easy to read and follow. + +Here are some questions I want to raise about the paper: + +1. First of all, the design of the conversation flow though looks reasonable, but it is pretty uncommon for a human to ground his/her every sentence on external knowledges. Therefore, it would probably be better to introduce some random ungrounded turns into the conversation to make it more humanlike. + +2. Secondly, the whole framework is based on many modules and every one of them are prone to error. I’m afraid that such cascaded errors will accumulate and lead to compromised performance in the end. Have you thought about using REINFORCE +algorithm to alleviate this issue? + +3. Finally, it would be better to introduce some noisy or adversarial apprentice to raise unrelated turns and see how the system react. Have you thought about how to deal with such cases?",7,4.0,ICLR2019 +rkxAK85sKB,1,S1l-C0NtwS,S1l-C0NtwS,Official Blind Review #3,"This paper compares to approaches to bilingual lexicon induction, one which relies on joint training and the other which relies on projecting two languages' representations into a shared space. It also shows which method performs better on which of three tasks (lexicon induction, NER and MT). The paper includes an impressive number of comparisons and tasks which illuminate these two popular methods, making this a useful contribution. It also compares contextualized to non-contextualized embeddings. + +I have some questions: +-can you goldilocks the amount of sharing in an alignment method? Put a different way, how much is performance affected if you perform alignment w/variously sized sub-partitions of the seed dictionary? Can you find a core lexicon (perhaps most common words shared cross-lingually) that will provide a good-enough alignment to joint-align with? 
If you were very ambitious, you could try artificially vary the amount of lexicon disjointness across languages (for camera-ready) and report how much your results are affected by incomplete overlap in translation variants. +-for Table 1, do you have any ideas about why certain languages do better on eng--> them and others do better on them-->eng? +-do you have any analysis exploring what is shared? wrt this sentence ""It also suggests detecting what to share is crucial to achieve better cross-lingual transfer."" + +Please address: +- I would like more intuitive motivation for (6). + +Small notes: +-Fig. 1 font is too small, you could also remove excess axes. There's also overlap between the labeled arrow and the y-axis label btw (a) and (b). +-""oversharing"" should be typographically delimited as a definition in ""We refer to this problem as oversharing"" +-typos: bottom of p4: ""Secition 2.3"" +-""e.g."" or ""i.e."" should be followed by a "","" +-list which languages you use on bottom of p5 +-Table 3 caption looks crazy, what happened to your spacing there?",8,,ICLR2020 +Sylw5GMBnm,1,r1xYr3C5t7,r1xYr3C5t7,Graph2Graph without any graph structured inputs or outputs.,"This paper proposes an encoder-decoder model based on the graph representation of inputs and outputs to solve the multi-label classification problem. The proposed model considers the output labels as a fully connected graph where the pair-wise interaction between labels can be modelled. + +Overall, although the proposed approach seems interesting, the representation of the paper needs to be improved. Below I listed some comments and suggestions about the paper. + +- The proposed model did not actually use any graph structure of input and output, which can potentially mislead the readers of the paper. For instance, the encoder is just a fully connected feed-forward network with an additional attention mechanism. In the same sense, the decoder is also just a fully connected feed-forward network. Furthermore, the inputs and outputs used throughout the paper do not have any graph structure or did not use any inferred graph structure from data. I recommend using any graph-structured data to show that the proposed model can actually work with the graph-structured data (with proper graph notations) or revise the manuscript without graph2graph representation. + +- I personally do not agree with the statement that the proposed model is interpretable because it can visualise the relation between labels through the attention. NN is hard to interpret because the weight structure cannot be intuitively interpretable. In the same sense, the proposed model cannot avoid the problem with the nature of black-box mechanism. Especially, multiple weight matrices are shared across the different layers, which makes it more difficult to interpret. Although the attention weights can be visualised, how can we visualise the decision process of the model from end-to-end? The question should be answered to claim that the model is interpretable. + +- 2.2.1, 2.2.2, 2.3 shares the similar network layer construction, which can be represented as a new layer of NN with different inputs (or at least 2.2.2 and 2.3 have the same layer structure). It would be better to encapsulate these explanations into a new NN module which can be reused multiple parts of the manuscript for a concise explanation. 
+ +- Although the network claims to model the interactions between labels, the final prediction of labels are conditionally independent to each other, whereas the energy based models such as SPEN models the structure of output directly. In that sense, the model does not take into account the structure of output when the prediction is made although the underlying structure seems to model the 'pair-wise' interaction between labels. + +- In Table1, if the bold-face is used to emphasise the best outcome, I found it is inconsistent with the result (see the output of delicious and tfbs datasets). + +- Is it more natural to explain the encoder first followed by the decoder?",5,2.0,ICLR2019 +S1gt14y3YS,1,B1xbTlBKwB,B1xbTlBKwB,Official Blind Review #1,"I have read the author response. Thank you for responding to my questions. + +This paper aims to predict typical “common sense” values of quantities using word embeddings. It includes the construction of a data set and some experiments with regression models. The general direction of this work is worthy of study, but the paper needs additional justification for its task, better discussion of recent related work, and more development of its regression models. + +The work starts by describing the construction of interesting crowdsourced data sets that include people’s estimates of typical quantities, what they would consider to be low or high values for a given object in given units e.g. the temperature of a hot spring or the height of a giraffe. Overall, the data sets are interesting but are not especially large (2300 total [low, high] pairs of 230 different quantities). Further, the particular task formulation here needs more justification. I think most of us would agree that common sense is a critical AI challenge, and that the question of whether embeddings reflect typical quantities is important. But, in a paper where the data set is considered a primary contribution, I would expect more justification for exactly how this task is formulated, and which objects and units were selected. As one example, why ask about “large” and “small” values rather than something with more precise semantics (like the 10th and 90th percentile, for example)? I also felt that the introduction could be improved to provide more convincing motivation. E.g., the first paragraph only says that humans apply different adjectives (like “hefty” and “cheap”) to different things depending on their numerical attributes (weight, cost), but does not argue why teaching AI systems to use those adjectives is a priority. + +Regarding related work, the paper is missing a discussion of several relevant papers that use embeddings to obtain relative comparisons or estimates of commonsense properties of objects, including: + +Forbes, Maxwell, and Yejin Choi. ""Verb physics: Relative physical knowledge of actions and objects."" ACL 2017 + +Yang, Yiben, et al. ""Extracting commonsense properties from embeddings with limited human guidance."" ACL 2018 + +Elazar, Yanai, et al. ""How Large Are Lions? Inducing Distributions over Quantitative Attributes."" ACL 2019 + +The paper then presents the performance of some regression models. These models are standard existing techniques, and given the relatively low performance I would have liked more development of the models and more analysis of the performance. For a conference like ICLR I would expect to see a more thorough exploration and analysis of possible models for the task. 
Looking at more powerful neural regressors (perhaps using contextual embeddings rather than just fixed word embeddings) might be one option. Offering an explanation for why ARD seems to work better than the other approaches would be helpful. + +Minor: In Table 3, the way that small and large are interleaved makes it hard to compare systems, I think presenting all the small results together, and large results together may help. In Figure 2, it would be helpful to see the histogram for size-large within the same plots here, so we could see how far apart they are. + +“Because Skip-gram has to handle more words to predict words, we assume Skip-gram will obtain more information about numerical values.” +-- I didn’t understand what you meant about skip-gram having to “handle more words to predict words.” Also, I did not understand how this entails that skip-gram would obtain more info about numerical values.",1,,ICLR2020 +CqyKfUdPHlO,1,rI3RMgDkZqJ,rI3RMgDkZqJ,good theory paper,"This paper considers a primal approach to the constrained RL problem where the constraints have a similar form as the total reward. The paper establishes global convergence in the tabular and NTK approximation cases. The problem in the aforementioned two cases is non-convex and therefore the global convergence results are very interesting and can contribute to the RL theory community. I did not check the entire proof but believe it is correct after checking some key points and go through the technical lemmas. The assumption on the one hidden layer neural network is standard, as it is used in a series of recent literature, although it is strong compared with practical algorithms. + +I have a question about Lemma 1, which is borrowed from another paper and I don't have enough time to check the specific lemma in that paper and its corresponding contextures and proof. The result says policy evaluation converges to the true Q function under a uniform measure (i.e. 2-norm), while the evaluation sample comes from the policies during iteration, which probably chooses sparse actions and visits sparse states. Do I miss any assumption on the exploration ability of the policies during iteration? Similar questions arise in applying Lemma 2 to Theorem 2. In the papers on unconstrained MDP (Agarwal 19 and Wang 19), they assume bounds on the ratio of visitation measures between policies to handle this technique problem. While I seem to can not find the corresponding parts and would like to ask about the intuition behind that.",7,4.0,ICLR2021 +JCu-sW8HPym,1,M_eaMB2DOxw,M_eaMB2DOxw,Interesting topic but missed significant previous works,"Post discussion: +I read the author's response and other reviews. I will stick to my rating and encourage the author to resubmit a revised version focusing on the antisymmetric case. + + + + + +Summary: + +The paper studies the approximation power and shows universality results for two recent neural network models that represent symmetric and antisymmetric functions. The main contributions of the paper are claimed to be (1) Universality of Fermi-Net (antisymmetric model) with a single Generalized Slater determinant (2) Universality of symmetric MLPs. In both cases, the authors emphasize the fact that the theorems deal with (i) vector inputs rather than scalar inputs, and that (ii) the approximation results are based on smooth polynomials rather than discontinuous functions as done in previous works. 
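
To make the two function classes concrete before turning to related work, here is a small illustrative sketch of my own (not taken from the paper): a permutation-invariant set function built from sum pooling, and an antisymmetric one built from a Slater-determinant-style construction. The weights, dimensions, and particle count are arbitrary toy assumptions.

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 16)), np.zeros(16)   # per-particle feature map (toy weights)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)    # readout applied after pooling

def features(x):                      # x: [n, 3] particle coordinates
    return np.tanh(x @ W1 + b1)       # [n, 16] per-particle features

def symmetric_net(x):
    pooled = features(x).sum(axis=0)  # sum pooling -> permutation invariant
    return float(np.tanh(pooled @ W2 + b2))

def antisymmetric_net(x):
    # Slater-determinant-style construction: row i holds the features of particle i
    orbitals = features(x)[:, :x.shape[0]]   # square matrix, assumes n <= feature width
    return float(np.linalg.det(orbitals))    # swapping two particles flips the sign

x = rng.normal(size=(4, 3))
x_swapped = x[[1, 0, 2, 3]]
print(symmetric_net(x) - symmetric_net(x_swapped))          # 0: invariant
print(antisymmetric_net(x) + antisymmetric_net(x_swapped))  # 0: sign flips

Swapping two input rows leaves the first output unchanged and flips the sign of the second, which is exactly the behavior the universality results are about.
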
+ +While the problem the paper targets is interesting and important, the paper missed several important previous works and does not do a good job explaining their novelty with respect to the previous work that was cited. See below. + +Contribution 1: I am not an expert on antisymmetric function approximation and didn’t know Ferminet, but contribution 1 seems novel to me, although the authors should make a better job explaining the difference between their results and the original FermiNet paper. It is not entirely clear what was done before and what is new + +Contribution 2 and points (i),(ii) were discussed before in several papers. First “Provably Powerful Graph Networks” (NeurIPS 2019) used the power sum multi symmetric polynomials for representing set functions in the same way they are used in this paper. Second, “On universal equivariant set networks” (ICLR 2020) proves a universal approximation theorem for equivariant set functions based on these polynomials. Given these two works, I am not sure the current paper has any additional contribution. + +Strong points: + +Understanding the approximation power of invariant/equivariant neural networks is an important goal. +The results on antisymmetric functions are nice + +Weak points: + +The work is not properly positioned with respect to previous work and pretty much misses all the work done on the approximation of invariant functions since DeepSets. Except for the discussion above, the following should be cited/discussed: + +-- “Janossy Pooling: Learning Deep Permutation-Invariant Functions for Variable-Size Inputs” ICLR 2019, which discusses permutation sampling strategies. + +-- “Universal approximations of invariant maps by neural networks” 2018 that discusses symmetrization and approximation of symmetric functions (and function invariant to many other compact groups). + +-- “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation” CVPR 2017, which also show universal approximation for symmetric functions. + +-- “On the Universality of Invariant Networks”, ""Universal Equivariant Multilayer Perceptrons"" ICML 2019/2020 might also be relevant. + +Recommendation: + +The paper studies an important problem but missed significant previous work. I believe that the results on antisymmetric function approximation are novel and suggest rewriting the paper focused on these results, with a clear discussion on the contribution with respect to previous works. It might also be good to spend more time on Ferminet while doing so since this model is less known to the machine learning community. In its current form, The paper is not ready for publication. + +Minor comments: + +The authors added a long version (22 pages) as an appendix. Not sure if this is OK. + +",4,4.0,ICLR2021 +mvNw91nauQS,1,bB2drc7DPuB,bB2drc7DPuB,Solid contribution to RL theory,"This paper studies the mean-field limit of the policy gradient method (with entropy regularized) and proves that any stationary point under this setting is a global minimizer. I am not able to verify the entire proof as it involves a lot of standard steps to bridging the finite parameter case and the mean-field limit. The result seems promising and well complements several theory results in RL in the past year, e.g. the optimality of policy gradient under NTK regime and the TD algorithm in the mean-field regime. 
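
For orientation, the finite-parameter object whose mean-field limit is being analyzed is an entropy-regularized policy gradient update; a toy version for a softmax policy on a bandit is sketched below. The rewards, temperature, and step size are my own arbitrary choices, and the sketch is only meant to fix notation, not to reflect the paper's setting.

import numpy as np

# Toy illustration: entropy-regularized policy gradient on a 3-armed bandit.
r = np.array([1.0, 0.5, 0.0])    # assumed per-action rewards
tau = 0.1                        # entropy temperature
theta = np.zeros(3)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for _ in range(500):
    pi = softmax(theta)
    # gradient of J(theta) = sum_a pi(a) r(a) + tau * H(pi) under the softmax parameterization
    q = r - tau * (np.log(pi) + 1.0)
    grad = pi * (q - np.dot(pi, q))
    theta += 0.5 * grad

print(softmax(theta))  # approaches softmax(r / tau), the unique global maximizer

In this regularized toy problem every stationary point of the objective is the unique global maximizer softmax(r / tau), which is loosely the kind of global-optimality property the reviewed paper studies in the mean-field, neural-policy setting.
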
+ +Although the paper does not provide the convergence guarantee of the mean-field density flow to a stationary point (please correct me if this is wrong), the characterization of the optimality is still a good contribution. It well explains why a neural network policy is globally optimal given (1)it is stationary under the training via the first-order method (2)its parameterization has strong expressive power, e.g. it has infinite parameters or it is essentially a nonparametric model. + +The convergence (section 4.2) to the many-particle limit. i.e. mean-field limit, seems standard, as the authors claim it is very similar to the case of supervised learning. I still would like to ask whether the authors found any key differences between the supervised learning case and RL objective, i.e., maximizing the total reward. In particular, does the absence of a strongly convex loss function cause any difficulty in the proof?",7,3.0,ICLR2021 +Hrds5skluOz,1,tu29GQT0JFy,tu29GQT0JFy,The way that this paper adapts deep latent variable models for missing not at random data is quite straightforword and lacks technical depth.,"This paper proposes an approach to training deep latent variable models on data that is missing not at random. To learn the parameters of deep latent variable models, the paper adopts importance-weighted variational inference techniques. Experiments on a variety of datasets show that the proposed approach is effective by explicitly modeling missing not at random data. + +P1: The related work is well done. This paper reviews the most related studies in several lines of research, including missing data concepts and theories in statistics, missing not at random data in various applications, deep latent variable models for missing data, etc. + +P2: The experimental results are quite extensive. This paper conducts experiments on a wide range of datasets from different domains: censoring on multi-variate datasets, clipping on image datasets, and bias on recommendation datasets. The paper also compares the proposed approach against a representative selection of state-of-the-art approaches. + +C1: The main concern for this paper is the lack of technical depth. Using variational distribution to derive a tractable lower bound of of a joint likelihood function, using Monte Carlo estimates to unbiasedly approximate the lower bound, and using a reparameterization trick to obtain unbiased estimates of gradients of the lower bound are well-established techniques. It is important for this paper to highlight what is the novelty of the proposed approach in terms of technical innovation. + +C2: This paper argues that the proposed approach allows for incorporating prior information about types of missingness. However, it is not clear to me what is the prior information and how does the proposed approach leverage the prior information. It is highly recommended for this paper to provide a formal formulation of the prior information in missing not at random data. It is also recommended for the paper to elaborate how the proposed approach uses the prior information and underlying motivation. + +C3: The last sentence in the 4th page of this paper states that it is possible to show that a sequence of objective functions converges to the joint likelihood function. 
To make the statement more convincing, it would be better if the paper could include a proof that the sequence of objective functions theoretically converges.",4,3.0,ICLR2021 +z5TPERrNIul,4,fpJX0O5bWKJ,fpJX0O5bWKJ,"Powerful tool, experimental results need some more work","Summary: + +The authors propose Variance of Gradients (VoG) as a quantifiable metric to identify examples that are difficult to classify. This is motivated by the intuition that examples that are easy to classify do not contribute much to the loss beyond early stages of training, hence don't contribute much to the gradient. The gradient $S_{it}$ w.r.t. every pixel $i$ is tracked across $t=1..K$ snapshots through the training process. The mean $\mu_i$ of $S_{it}$ is computed across snapshots. and finally the VOG of each example $j$ is computed as the average over all $N$ pixels and $K$ snapshots of the squared difference between the gradient $S_{it}$ and $\mu_i$. +Qualitative and quantitative proof of correlation of VoG with diffuculty of examples is provided: +Qualitative: Authors illustrate that examples with high VoG tend to have cluttered backgrounds or odd angles. +Quantitative: High correlation betwee number of erroneous examples in test set and associated VoG, especially on the harder datasets like ImageNet. +VoG is used to study phases of training a neural network. Interestingly, if we focus only on early epochs of training, the difficult examples have low VoG while easy examples have high VoG. As in early phases of training, models concentrate on learning the easy examples, and the error rate on these easy examples is still high. +VoG is also used to identify out of distribution examples that high capacity models tend to get right only by memorization. + +Strengths: +- A new metric, VoG, is proposed that is easy to calculate and shown to be associated with examples that are difficult to classify. Having such a powerful tool will be useful for a variety of purposes, from identifying atypical examples for human auditing, to aided interpretability of models. +- The metric is easy to compute as compared to competing methods. It can easily be adopted by practitioners as they using checkpoints that are often computed and saved anyway. +- Empirical results are convincing for the most part. There is a clear increase in error-rate as VoG increases. And this relationship is shown to hold across various network initializations. VoG is shown to vary throughout training. As the network begins to converge, the difficult examples show consistently high VoG. + +Weaknesses: +- Fig. 4 is not clear. For each value of decile on the x-axis, the error rate is computed on that 10% of data. And the error rate is shown to be higher for larger values of VoG. However, the maximum error rate even for the maximum value of VoG is 20-40%. Therefore, there are clearly upto 80% of examples with high VoG that are still correctly classified. Authors don't explain why this is the case. +- Fig. 7: It is not clear why the error rate associated with the difficult examples that have low VoG early in training is low. Shouldn't the errors associated with difficult examples remain the same throughout training i.e., the network never learns to classify these examples correctly. Why would the error rate degrade? Or are the authors reporting percentage of total errors on the y-axis. 
In which case while the asbolute number of errors associated with difficult examples remains the same, their relative ratio as compared to the overall number of errors increases as training proceeds. Please clarify. +- Fig. 8: Out of distribution examples e.g., deliberately shuffled labels are shown to be associated with slightly higher VoG score values. Can the authors include a significance test to show this is a material difference. There is a high variance in VoG values for shuffled examples. Why would these examples exhibit lower VoG? Can the authors provide some intuition behind this. + +Conclusion: +Overall, my decision is to accept the paper because this is a powerful proposal that deserves to be investigated further. However, I have some reservations about the empirical results as described above. If authors can explain/clarify these aspects it would be a much stronger submission. + +In addition to those listed above, the authors should address the questions below in future work: +- How come not all or even a majority of examples that are misclassified by a network have high VoG? +- Will results hold across various types of models? What is the relationship of VoG with capacity of models? +- Will results hold across domains?",6,4.0,ICLR2021 +7_YceSvrexx,1,kVZ6WBYazFq,kVZ6WBYazFq,"proposes simple metrics and solutions for defining locality in LIME, but lacking in clarity in examples and in relation to previous works","Summary: This paper proposes a new sampling method for LIME based on user-defined Boolean subspaces. They show that using these subspaces rather than the default sampling settings of LIME can lead to robustness against adversarial attacks and allow users to better undercover bugs and biases in subspaces relevant to the user's task. They additionally propose a new metric for measuring the quality of an explanation. + +Pros: The authors create a novel link between the explainability literature and Boolean Satisfiability. The proposed methodology is shown to be relevant in several different applications. They propose simple metrics for evaluating the quality of samples and the quality of an explanation. + +Cons: I don't have a lot of experience with ICLR, but it is not obvious to me that this paper is appropriate for this venue. + +None of the metrics used in the paper seem to be weighted by density/distance to the point being explained, as is done in LIME. Given that the point of LIME is that the function is unlikely to be globally approximable by a linear function, the lack of incorporation of a weighting function seems to make this framework inferior to that of LIME (and in fact, theoretically, the binary subspaces used in this paper are merely a specific instantiation of the flexible weighting function used in LIME, the paper, as opposed to LIME, the software package). I would expect that without such a weighting function, in many cases explanations with similar values of the rho metric may vary widely in their usefulness as local explanations. + +The paper suffers somewhat from relegating all of the examples to the supplementary material. Examples which are important to the points being made should be brought back into the main body of the paper. + +Certain relevant works seems to have been missed: Sokol et al. (https://arxiv.org/pdf/1910.13016.pdf) proposed user-specified local surrogate explainers, including allowing users to define their own sampling subspace (but do not propose algorithms for sampling). Zhang et al. 
(https://arxiv.org/pdf/1904.12991.pdf) show that LIME does not accurately reflect local invariance to globally important variables. + +In the first experiment, without some ground truth knowledge as to what the classifier is doing, it is not obviously useful to point out that the CLIME identifies race as a top feature for females in the recidivism dataset. The setup of Zhang et al. may be preferred, where the ground truth behavior of the classifier is known. + +In the adversarial attack experiment, it is not completely clear what is done: did you use CLIME to generate explanations of the adversarial classifier from Slack et al.? Was the adversarial classifier trained with access to CLIME perturbation functions, or was it trained assuming LIME perturbation functions? This doesn't seem to immediately show superiority over the LIME framework, as at larger Hamming distances the bias is still hidden. I think a more appropriate comparison would allow the adversarial classifier access to the relevant perturbation function and consider accuracy at a variety of neighborhood widths for LIME, as with the Hamming distance. + +###post rebuttal### +I have read the updated version of the paper and still feel that this paper may have errors regarding the flexibility and purpose of LIME. The idea is nice, but the paper and evaluations would benefit from more polishing before publication. I maintain my original score. + +Misrepresentation of LIME: + +Section 3: LIME does not assume that the sampling neighborhood is the same as the true distribution. It may assume something weaker, such as that the function being explained is fairly smooth in the sampling neighborhood. Note that this can be a feature of LIME and not necessarily a bug: if for example x1 and x2 are fully correlated in the data distribution but the classifier only uses x1, it would be impossible to tell this if sampling only within the data distribution. By sampling outside the data distribution it becomes apparent that the classifier is using x1 only. Also, LIME assumes black box access to the function, so I don't fully understand your statement that ""we generally do not have access to the true labels of instances generated through sampling"". It seems like you may be defining the ""correct"" explanation with respect to the true data distribution, rather than to the classifier. LIME is meant to explain a black-box classifier. If the classifier is wrong, LIME should reveal what the classifier does (that is, the explanation should also be ""wrong"" with respect to the true data). The ""framework capabilities"" is also simply not true: users can define the data point to be explained, as well as their own similarity kernel and/or kernel width. + +Evaluation: + +It's not entirely obvious to me how we can be sure that CLIME is producing the ""right"" explanation in C.1, C.2 without knowing the function f. If changing the training set changes the classifier f, then it is correct that the explanation should change. As mentioned above, evaluating whether or not an explanation is ""correct"" should be done with respect to the classifier, not the underlying data distribution. In ""Detecting Adversarial Attacks"", it's not clear from the text whether or not you retrain the adversarial attack with your perturbation function. Further, I suspect that LIME may also be able to identify the sensitive feature for sufficiently small neighborhood sizes when sampling in binary space Z'. 
It seems like a straw man argument to compare an optimized version of your sampling procedure to the default version of lime. + +Minor: Equations 1) and 2), if they are describing the usage in Ribeiro et al., should include a weighting function. + +Figure 2 seems not to be explained in the text and would benefit from more description. +",5,4.0,ICLR2021 +BJAg3e7ZM,3,HJrJpzZRZ,HJrJpzZRZ,Review," +1) Summary +This paper proposes a flow-based neural network architecture and adversarial training for multi-step video prediction. The neural network in charge of predicting the next frame in a video implicitly generates flow that is used to transform the previously observed frame into the next. Additionally, this paper proposes a new quantitative evaluation criteria based on the observed flow in the prediction in comparison to the groundtruth. Experiments are performed on a new robot arm dataset proposed in the paper where they outperform the used baselines. + + +2) Pros: ++ New quantitative evaluation criteria based on motion accuracy. ++ New dataset for robot arm pushing objects. + +3) Cons: +Overall architectural prediction network differences with baseline are unclear: +The differences between the proposed prediction network and [1] seem very minimal. In Figure 3, it is mentioned that the network uses a U-Net with recurrent connections. This seems like a very minimal change in the overall architecture proposed. Additionally, there is a paragraph of “architecture improvements” which also are minimal changes. Based on the title of section 3, it seems that there is a novelty on the “prediction with flow” part of this method. If this is a fact, there is no equation describing how this flow is computed. However, if this “flow” is computed the same way [1] does it, then the title is misleading. + + +Adversarial training objective alone is not new as claimed by the authors: +The adversarial objective used in this paper is not new. Works such as [2,3] have used this objective function for single step and multi-step frame prediction training, respectively. If the authors refer to the objective being new in the sense of using it with an action conditioned video prediction network, then this is again an extremely minimal contribution. Essentially, the authors just took the previously used objective function and used it with a different network. If the authors feel otherwise, please comment on why this is the case. + + +Incomplete experiments: +The authors only show experiments on videos containing objects that have already been seen, but no experiments with objects never seen before. The missing experiment concerns me in the sense that the network could just be memorizing previously seen objects. Additionally, the authors present evaluation based on PSNR and SSIM on the overall predicted video, but not in a per-step paradigm. However, the authors show this per-step evaluation in the Amazon Mechanical Turk, and predicted object position evaluations. + + +Unclear evaluation: +The way the Amazon Mechanical Turk experiments are performed are unclear and/or not suited for the task at hand. +Based on the explanation of how these experiments are performed, the authors show individual images to mechanical turkers. If we are evaluating the video prediction task for having real or fake looking videos, the turkers need to observe the full video and judge based on that. 
If we are just showing images, then they are evaluating image synthesis, which do not necessarily contain the desired properties in videos such as temporal coherence. + + +Additional comments: +The paper needs a considerable amount of polishing. + + +4) Conclusion: +This paper seems to contain very minimal changes in comparison to the baseline by [1]. The adversarial objective is not novel as mentioned by the authors and has been used in [2,3]. Evaluation is unclear and incomplete. + + +References: +[1] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016. +[2] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016. +[3] Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, Honglak Lee. Decomposing Motion and Content for Natural Video Sequence Prediction. In ICLR, 2017 +",3,5.0,ICLR2018 +r1eBR3LqhX,2,BJG__i0qF7,BJG__i0qF7,Learning to encode spatial relations from natural language,"The main contributions of the work are the new datasets and the overall integration of previous modeling tools in such a way that the final architecture is able to encode semantic spatial relations from textual descriptions. This is demonstrated by an implementation that, given textual descriptions, is able to render images from novel viewpoints. In terms of these two contributions, as I explain below, I believe there is space to improve the datasets and the paper needs further analysis/comments about the merits of the proposed approach. So my current overall rating is below acceptance level. + +In terms of data, authors provide 2 new datasets: i) a large datasets (10M) with synthetic examples (images and descriptions) and ii) a small dataset (6k) with human textual descriptions corresponding to synthetic images. As the main evaluation method of the paper, the author include direct human evaluation of the resulting renderings (3 level qualitative evaluation: perfect-match/partial-match/no-match). I agree that, for this application, human evaluation is more adequate than comparing a pixel-level output with respect to a gold image. In this sense, it is surprising that for the synthetic dataset the perfect match score of human evaluation for ground truth data is only 66%. It will be good to increase this number providing a cleaning dataset. + +Related to the previous comment, it will be good to provide a deeper analysis about the loss function used to train the model. + +In terms of the input data, it is not clear how the authors decide about the 10 views for each scene. + +In terms of the final model, if I understood correctly, the paper does not claim any contribution, they use a model presented in a previous work (actually information about the model is mostly included as a supplemental material). If there are relevant contributions in terms of model integration and/or training scheme, it will be good to stress this in the text. + +Writing is correct, however, authors incorporate important details about the dataset generation process as well as the underlying model in the supplemental material. Given that there is a page limit, I believe the relevant parts of the paper should be self-contain. ",5,4.0,ICLR2019 +SJeLsABqFH,1,ryefE1SYDr,ryefE1SYDr,Official Blind Review #1,"LIA: Latently Invertible Autoencoder Review + +This paper proposes a novel generative autoencoder, and a two-stage scheme for training it. 
A typical VAE is trained with a variational approximation: during training latents are sampled from mu(x) + sigma(x) * N(0,1), mu and sigma are regularized with KL div to match an isotropic normal, and the model minimizes a reconstruction loss. A LIA is instead first trained as a standard GAN, where an invertible model, Phi, (e.g. a normalizing flow) is hooked up to the Generator/Decoder, such that the output is G(Phi^{-1}(z)) with z~p(z), a simple distribution. In the second stage, the Generator/Decoder and Discriminator are frozen, and an Encoder is trained to minimize a reconstruction loss (in this paper, a pre-trained VGG network is used as a feature extractor to produce a perceptual loss) and maximize the “real” output of the frozen Discriminator on reconstructed samples. + +The key advantage of this method is that during the second stage no stochasticity is injected into the network and Phi is not involved.This means that the encoder does not need to be regularized to produce a specific parametric form of its output (no KL(p || q))), instead implicitly learning to match the latent distribution expected by the generator through the reconstruction losses. Additionally, because the latent space is not e.g. an isotropic Gaussian, it can be more expressive and flexible, only being constrained to an invertible transformation of the distribution p(z) chosen during the first training stage. + +The training procedure is evaluated using a StyleGAN architecture on several high-res, single-class datasets (faces and LSUN Cats/Cars/Bedrooms). The quality of the resulting reconstructions is compared against several methods which are also capable of inference (like ALI and post-training an encoder on a standard StyleGAN), and samples and interpolations are presented. There is also an experiment that compares gradients in the encoder when training using LIA against training using a more typical VAE setup. + +My take: The key idea behind this paper is quite promising, and I believe this paper has tremendous potential. I agree with the authors that the usefulness of implicit generative models is limited by their typical lack of an encoder network, and that existing approaches have several design drawbacks, and incorporating invertible networks and advances in flow-based modeling looks like a fruitful avenue of research. + +However, I have a litany of concerns with the paper itself, concerning its high similarity with a paper published in May, its motivation, its presentation, its empirical evaluation, and the analysis presented within. While my concerns and suggestions are extensive, this paper is perhaps unusual in that all of the issues I have are quite fixable; the core idea is good, but its realization and presentation in the paper need a substantial amount of revision. I am currently giving this paper a weak reject as I do not believe it is ready for publication, but I believe that with an overhaul this paper could be a more clear accept. + +Update, post rebuttal: + +Thanks to the authors for their response. While I appreciate their insistence that my issues with the paper likely stem from my simply not understanding it (or the underlying topics), I hope they can appreciate that such an appeal is unlikely to allay said concerns. Pointwise: + +1. 
While there is of course a difference between the LIA setup and the GLF setup, regardless of the two-stage training or the inclusion of an adversarial or perceptual loss or any other bells and whistles that get attached, the fact remains that the resulting architecture for both LIA and GLF is an autoencoder with an invertible model in the middle. Arguing that the LIA setup is somehow fundamentally different is akin to arguing that a VAE-GAN is an utterly different model from a VAE with a VGG perceptual loss. Yes, they're optimizing slightly different things (distribution-wise differences vs sample-wise differences) , they have different characteristics, etc., but at the end of the day they're still autoencoders with extra bits on the end. The same general principle holds here. As I stated in my original review, I consider the differences relatively minor and maintain my stance that comparison is warranted. + +1(a). While the authors may have completed the work over a year ago, the fact that the other work was made public multiple months before this work means that it does, in fact, count as prior work. There is plenty of precedent at ICLR for work which appeared on arXiv well before submission date to be considered as prior work. I understand that this can be frustrating if the authors have previously submitted to other conferences and wished to wait until acceptance before making the work public, but that is a personal choice that does not change the nature of the situation. + +2. If interfacing with image manipulation techniques is the motivation for improving reconstructions, this motivation should be clearly stated in the paper. After rebuttal there is still no mention of this motivation, which suggests to me that the authors expect all readers to consider ""reconstruction"" (which I again posit is not really a task) to matter intrinsically. + +3. I once again appreciate that the authors hold that this reviewer is incorrect about basic facts. The forward KL placed on the latent space of a VAE only encourages it to resemble a particular distribution (typically isotropic gaussian) but the information content passed through the bottleneck can indeed grow with the size of the latent space, as one can see experimentally by ablating the latent dimensionality. This general principle should also be reinforced when one considers that flow models with exact inference (i.e. perfect reconstructions) require dz==dx. + +4. This reviewer maintains that sampling a random input from a distribution during training involves stochasticity. + +5. This still does not address my concern that reconstruction is not a particularly relevant task. + +7-9: Thank you for modifying the caption in this figure, though I still hold that the y-axis should be correctly labeled. This still does not address my concern that this experiment does not actually show what the authors claim it does--the magnitude of the gradient noise is not by any measure a viable indication that the inclusion of phi is doing anything meaningful in place of a standard MLP as the comparison is instead made to an entirely different training setup. + +I maintain my stance of rejection. + +Original review: + +First off, a paper published in May on arXiv titled “Generative Latent Flow” (GLF) proposes an idea which is in essence identical to the one proposed in this paper. In GLF, a VAE is trained, but rather than using the reparameterization trick, a true normalizing flow is trained to model the distribution of the output of the encoder (i.e. 
with NLL using the change of variables formula common to flow-based models), such that the training of the actual autoencoder is truly deterministic (in the sense that at no point is an epsilon~p(z) sampled like in a normal VAE. The core difference between LIA and GLF is that GLF learns to model the distribution of the encoder outputs to enable sampling, while LIA incorporates an invertible model into a generator which explicitly through sampling, and then fits an encoder post-hoc. There are other differences in implementation and the choice of datasets, but those are (IMO) minor details relative to the core similarity. Given that GLF was published 4 months before the ICLR2020 deadline, this paper absolutely must be cited, compared against, and discussed. I am somewhat inclined to argue that given the similarity, LIA is merely incremental relative to GLF, but for now I think it is sufficient to point out the existence and similarity. + +Second, the stated motivation in this paper is, I think, misguided. The authors argue for the need of an inference network, but they explicitly make clear that their goal is to train this network to enable reconstruction of an x given a z, rather than e.g. to learn a “good representation” (bearing in mind that what constitutes a good representation is strongly subject to debate). The authors do not provide any motivation for why reconstruction matters. At no point is an application or downstream task presented or mentioned in which good reconstructions are relevant. One might argue that choosing reconstruction quality is as arbitrary as pursuing improved sample quality (as is in vogue in GAN papers) but there is substantial evidence that improved sample quality correlates with improved representation learning (mode dropping in GANs notwithstanding); the case is more complex for high-quality reconstructions. + +Reconstruction could perhaps be motivated from the point of view of compression, but this paper makes no attempt to examine compression: rate-distortion tradeoffs are not considered, nor are any empirical metrics of compression ratio or likelihood such as bits/dim presented. Given that one can produce a model which achieves arbitrarily high-quality reconstructions by simply increasing the dimensionality of the bottleneck, I do not find reconstruction to be a compelling problem. + +One might also argue that improved reconstruction capacity is indicative of better ability to fit the distribution (i.e. less mode dropping), but in the LIA setup the generator is trained as a standard StyleGAN with the only modification being the replacement of the MLP with Phi, so there’s no reason to believe that the implicit model defined by G has been meaningfully affected by the inclusion of the post-hoc trained encoder. + +If the authors wish to pursue “reconstruction” as the primary motivation for learning an encoder, I would suggest they spend more time discussing compression and the latent bottleneck, as well as performing more detailed empirical evaluations (explained below). Basically, *why* does reconstruction matter? Alternatively, the authors could demonstrate the usefulness of their learned encoders for downstream tasks to indicate that the representations they learn are of high quality and useful. + +Third, the presentation of this paper needs a lot of work. There are typos and confusing statements throughout, as well as several instances of overstatement. 
+ +The key insight of this paper appears to be that “having an invertible network at the input to the generator makes it more amenable to post-hoc learning an encoder.” If I understand correctly, the only difference between this method and Encoded StyleGAN is that this paper uses an invertible model in place of the StyleGAN MLP. If this is the case, then the paper needs to (a) make clear the minimality of this difference and (b) devote substantial exposition to exploring the difference and why this is important (see my comments in the experimental section). + +Phrases like “the two-stage training successfully handles the existing issues of generative models” suggests that this method has solved all of the problems in generative modeling, which the authors have by no means demonstrated to be the case. + +Calling the two stage training “Stochasticity free” is incorrect—if you’re training the model as a GAN, then (1) you’ll be sampling z’s in the first place so it already has a latent distribution defined and (2) the end result of training will be much more variable than, say, training with a likelihood metric. There is a *ton* of stochasticity in the first stage of training! + +The paper states several times that the “critical limitation” of adversarial models is their lack of an encoder. While implicit generative models do not generally require an encoder, there are plenty of methods (BiGAN by Donahue and ALI by DuMoulin, along with all the VAE-GAN hybrids) that jointly learn encoders, and much work on training an encoder post-hoc. These methods are acknowledged in the related work, but I think they should be taken into consideration when describing this “critical limitation.” While not having an encoder does indeed hinder or prevent the use of an implicit model for inference, I think stability, mode dropping, and mode collapse are more prominent issues with GANs. I think the authors might do better to say something to indicate that the challenge is to train a model which both has sharp, high-quality samples (as with GANs) which is still capable of inference or explicit likelihood evaluation (VAEs, etc). + +In general, I found the description of the model itself to be confusing, and needed several thorough read-throughs just to understand what was going on: what was being frozen when, the fact that the model is just a GAN with a post-hoc trained encoder--I felt that there was a lot of obfuscatory language obscuring the actual simplicity of the method (which might arguably be its strength). + +While I would generally like the paper’s exposition to be improved, I understand that saying “write it better” is unhelpful so please see my individual notes at the end of this review for additional specific points. + +Fourth, I found the empirical evaluations to be somewhat weak. To be clear, the results appear to be very good-the model retains the sample quality of StyleGAN (at least as far as can be seen from the presented samples) while achieving noticeably higher-quality reconstructions on all the tested datasets. However, the metrics used for evaluation are limited—while at least MSE is presented, I would again stress that reconstruction is an odd metric to use when other factors like compression rates are not considered. 
While it is interesting to note that in this exact setup (mostly dim_z=512) LIA outperforms the baselines wrt the chosen metrics, a more thorough evaluation would, for instance, sweep the choice of dim_z, and ideally present NLL results (which I think are possible to evaluate given that LIA has a flow model even if it’s not trained using NLL, but I’m not 100% sure of this and am open to counterarguments on this front).
+
+What’s more, the datasets chosen are all single-class datasets with a massive amount of data—as far as generative modeling is concerned, these are very easy datasets with a minimal amount of variation. This is critical because the LIA method relies on pre-training a GAN, meaning that it does nothing to deal with problems like mode dropping and mode collapse. While we may not see much mode dropping on these very easy datasets (where there are, essentially, very few modes), this is still a substantial problem in the general case, as can be seen by results on e.g. ImageNet. If your GAN collapses or drops modes then post-training the encoder is not likely to be able to recover them. This is also arguably a weakness of this paper relative to GLF, which incorporates the encoder into the training loop of the decoder and is likely to be better at covering modes.
+
+Accordingly, I have substantial concerns that this method will not work well on datasets outside of these highly-constrained, nearly-unimodal, single-object, very-high-data datasets. While I would of course prefer to see results on something massively multimodal like ImageNet (training on a 100-class subset @ 64x64 resolution would be about 100,000 images and should be even less hardware intensive than the already performed experiments), I am aware of how cliché it is for reviewers to ask for ImageNet results. Auxiliary experiments on CIFAR-100 or something with more than one class would go a long way towards allaying my concerns on this front.
+
+Next, no error bars are presented; this is simply inexcusable. Given that no hardware requirements are presented, it is difficult to judge if expecting multiple runs is unreasonable, but unless each run requires weeks of the authors’ full hardware capacity, there is no reason for the authors not to include error bars or expected variances on the numbers for as many of their experiments as possible.
+
+Further, I found the experiment in 5.3 to be confusing and the results irrelevant to the argument made by the authors. First of all, what does it mean that the “gradients” are plotted in the figures relating to this experiment? Are these gradient norms for a layer in the network, and if so, what type? Is the loss in Figure 5c the training loss or the test loss? I also disagree that the VAE “gradients” are “more unstable” than the LIA “gradients”; they are simply noisier. I do not see how the increased gradient noise relative to LIA is indicative of the superiority of the method; it is instead entirely expected given that noise is explicitly injected into a standard VAE—I would argue that the change in gradient noise is simply the result of removing the stochasticity, but it says nothing as to whether or not the LIA method is better than the VAE method. Again, I agree that using an invertible network in some capacity is preferable to using the reparameterization trick, but I found this specific experiment to be distracting.
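+
+To spell out why I read the noisier VAE curves as expected rather than as evidence for the proposed method (this is my own sketch of the comparison, in generic notation rather than the paper's): the per-step encoder gradient of a standard VAE is a single-sample Monte Carlo estimate,
+
+\[ g_{\mathrm{VAE}} \;=\; \nabla_{\phi}\, \ell\big(G(\mu_\phi(x) + \sigma_\phi(x) \odot \varepsilon)\big), \qquad \varepsilon \sim \mathcal{N}(0, I), \]
+
+with a fresh $\varepsilon$ drawn at every step, whereas a deterministic post-hoc encoder uses
+
+\[ g_{\mathrm{det}} \;=\; \nabla_{\phi}\, \ell\big(G(E_\phi(x))\big) \]
+
+with no sampled noise at all. The former is an unbiased but higher-variance estimator by construction, so plotting its fluctuations against the latter largely restates the definitions; it does not show that one training objective is better than the other.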
+ +I think the paper would do better to explore the importance of the invertible network relative to the exact same procedure but with the invertible network replaced with an arbitrary MLP of similar capacity. This appears to be what the encoded styleGAN model is, but I think it would do more to elucidate the key insights of this paper if the analysis was to focus more on this front. Why is it helpful to have an invertible phi in place of the StyleGAN MLP? What happens as the capacity of this portion of the model is increased or decreased? What is the form of the distribution output by Phi (maybe just some histogram visualizations along different dims?), and how does it compare to that of the typical MLP? What is the form of the distribution output by the encoder, and how does it differ from (a) the analytical latent distribution in the case of encoded styleGAN and (b) the empirical latent distribution of LIA? There’s quite a bit to explore there but this paper doesn’t dig very deep on this topic. + +I recognize that the amount of suggestions and changes I have listed are exceptionally large (more than I’ve personally ever written before, for sure), and I want to make it clear that I don’t expect the authors to address them all in the limited timespan of the rebuttal period. While this unfortunately may mean that there is simply not enough time for my concerns to be addressed, if this is the case then I hope these suggestions prove useful for publication in the next conference cycle, where this paper could be very strong. As it is, given the extent of my concerns, this paper is currently sitting at about a 4/10 in my mind. + +Minor notes: + +“In the parlance of probability,” page 2. I liked this alliteration a lot. This paragraph as a whole was quite clear and well written. + +“But it requires the dimension dx of the data space to be identical to the dimension dz of the latent space” Minor nitpick, but I would swap “dx of the data” with “dz of the latent space” in this sentence, to make it clear that the model’s latent dimensionality is constrained by the dimensionality of the data. As written it makes it sound like it’s the other way around. + +“The prior distribution can be exactly fitted from an unfolded feature space.” While flows have exact inference, saying that you can exactly fit the distribution of the encoder is arguably inaccurate unless you can show perfect generalization. Specifically, if you attain 0 training loss for the flow, do you also have 0 test loss (i.e. the NLL of the flow on the encoder’s output for test samples is also minimized). + +Furthermore, the phrasing “unfolded feature space” (used elsewhere in the paper) is confusing and not in common usage—does this mean the output of the encoder, or some sort of Taylor expansion? It’s not immediately clear, and I would recommend the authors find a different way to express what they mean. + +“Therefore the training is deterministic” Training is not deterministic if the first stage of training involves training a GAN. You are still sampling from a preselected prior in this stage. + +“As shown in Figure 1f, we symmetrically embed an invertible neural network in the latent space of VAE, following the diagram of mapping process as…” This sentence is confusing. The term “embed” has a specific meaning in the literature: you might use word embeddings, or embed a sample in a space, but to “embed a [model] in a latent space” doesn’t make sense to me. 
I think the authors would do well to use more standard terminology, and to reconsider their description of the model to be more concise and clear. + +“Our primal goal is to faithfully reconstruct real images from the latent code.” Primal should be primary. I would also like to see this motivated better—why do you care to exactly reconstruct real images? Is there a downstream task where this is relevant or an intrinsic reason why we should care about being able to attain exact reconstructions? + +“indispensable discriminator.” Indispensible means “something you can’t get rid of,” whereas it would appear the discriminator is not used after training (and is frozen after the first stage)—do the authors perhaps mean “dispensable” or “disposable”? + +“The interesting phenomenon is that the StyleGAN with encoder only does not succeed in recovering the target faces using the same training strategy as LIA, even though it is capable of generating photo-realistic faces in high quality due to the StyleGAN generator” This sentence is confusingly written and poorly supported. While I do agree that the LIA reconstructions are superior to the encoded styleGAN reconstructions, exactly what do the authors mean that LIA “recovers” the target faces while StyleGAN does not? The LIA reconstructions are not identity preserving—while most of the semantic features are the same, and the model does do a good job of picking up on unusual details such as hats, the facial identities are definitely not preserved (i.e. for every face in row 1 and row 2, I would say that the two faces belong to different people with similar features, but they are still definitely different people) . + +“This indicates that the invertible network plays the crucial role to make the LIA work” This statement is unsupported. There are a number of differences in training setup, and the authors, in this reviewer’s opinion, have not presented evidence to indicate that the use of the flow model is specifically responsible for this. Specifically, what would happen if during the decoder training stage, the invertible network was not employed? While I do believe that the inclusion of the invertible network is important, the authors should go to greater lengths to elucidate exactly what it does (see my comments above in the experimental section re. the shape of the distribution and how the encoder ends up matched to the decoder depending on what the actual latent distribution is from the POV of the generator). + +“To further evaluate LIA on the data with large variations” The choice of three single-category, single-subject datasets for evaluation is strictly at odds with this statement. These are highly constrained, clean datasets with tremendous amounts of data per class, which are substantially less difficult to model than e.g. ImageNet + +“They will be made them available for evaluation.” -> “These subsets will be made available for evaluation” + +“The experimental results on FFHQ and LSUN databases verify that the symmetric design of the invertible network and the two-stage training successfully handles the existing issues of generative models.” This statement is far too strong—saying a method “successfully handles the existing issues of generative models” suggests that this method is the end-all be-all and has solved the problem of generative modeling entirely. I would suggest the authors dial back the strength of this claim. + +“Table 2: FID accuracy of generative results.” What is FID accuracy? Do the authors just mean FID? 
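+
+One further minor note, to make the NLL suggestion in my comments on the empirical evaluation concrete (this is just the standard change-of-variables identity; I am assuming here that $\Phi$ denotes the invertible map from the prior variable $z$ to the intermediate latent $w$ and that its Jacobian determinant is tractable, which may not match the authors' exact parameterization):
+
+\[ \log p_W\big(E(x)\big) \;=\; \log p_Z\big(\Phi^{-1}(E(x))\big) \;+\; \log \big| \det J_{\Phi^{-1}}\big(E(x)\big) \big| \]
+
+Averaging this over held-out images (and converting to bits/dim) would quantify how well the post-hoc encoder's outputs are actually covered by the density the generator samples from, which is evidence that reconstruction error alone cannot provide.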
+ +Specify hardware used and training times, at least in the appendix. +",3,,ICLR2020 +Ubkqy_k8oe,2,WweBNiwWkZh,WweBNiwWkZh,Volumetric parametrization improves over surface parametrization. Insufficient novelty and experiments. ,"The authors derive a volumetric extension of the surface parameterization approach developed by Jin et.al. Towards this, they propose to use tetrahedral parameterization using well known techniques in computer graphics community. The kinematically deforming skinned mesh (KDSM) formulation for tetrahedral parameterization is borrowed from Lee at. al. + +The combination of the above techniques coupled with some heuristics to increase robustness to inversion suggest improvement over Jin et. al. This is a very niche topic and I am not confident that the general audience stands to benefit from this specific formulation for clothes. The ICLR community would benefit by demonstrating the approach on other deformations of solids/liquids and validating the generality of the approach compared to other representations beyond virtual cloth. Only comparing to Jin et. al significantly limits the scope of the paper. + +The computational complexity of the approach is completely ignored. As the gains over Jin et.al. seem to stem from extending the formulation to 3d domain, the compute should be compared. This becomes important for high dimensional solid/liquid simulation. ",4,3.0,ICLR2021 +us8ffD545_C,2,zDy_nQCXiIj,zDy_nQCXiIj,Paper review,"The authors propose two new techniques that extract interpretable directions from latent spaces of pretrained GAN generators. Both techniques are very efficient and are shown to work with the state-of-the-art BigGAN models. Furthermore, the authors describe additional details of the method, like determining the transformation end-points, which are important for usage in the practical visual editing. + +Strengths: + +1. The paper tackles the important problem, provides a thorough description of the field, demonstrates an in-depth understanding of the area. The important small details (determining the end-points, entangled transformations) are only slightly addressed in the existing literature, but are crucial for editing applications. This paper convincingly addresses this gap. + +2. The proposed techniques are both simple and efficient, much faster compared to existing methods. + +Weaknesses: + +1. What upset me most was a certain amount of overclaiming about the main contributions. + +""Second, it detects many more semantic directions than other methods."" - I cannot agree with this statement. + +a) For user-specified transformations, the method finds the same directions as Jahanian et al. (2020). + +b) For unsupervisedly discovered transformations, I did not find in the text any examples of directions, that are not covered by the: + +Voynov & Babenko, 2020 + +Ha ̈rko ̈nen et al. (2020), 2020, + +The Hessian Penalty: A Weak Prior for Unsupervised Disentanglement, ECCV 2020 (which is a missing related work, btw) + +In my opinion, the statement is misleading, which is not acceptable when listing contributions. + +2. I am not convinced by the significance of the search of user-specified geometric transformations, which take the largest part of the submission. Simple transformations, like zoom/shift/brightness, can be easily obtained automatically in editing applications, do we really need GANs for them? + +3. ""[Ha ̈rko ̈nen et al. (2020)] obtain a set of non-orthogonal latent-space directions that correspond to repeated effects. 
In contrast, our directions are orthogonal by construction, and therefore capture a super-set of the effects found by Ha ̈rko ̈nen et al. (2020)"" + +This sound like a bold claim. First, directions from Ha ̈rko ̈nen et al. (2020), while not being orthogonal by construction, de-facto still can be close to orthogonal. Second, from both the main text and appendix, I did not understand, why the authors' method captures a super-set of directions, no quantitative comparison is provided. + +To sum up, my current evaluation is (5). The proposed method is obviously more efficient, but I do not consider this advantage as a very important one, since transformation search is performed only once. Given the weaknesses listed above, I cannot recommend acceptance. + +AFTER REBUTTAL: + +The authors have toned down some of their claims and I am increasing my score accordingly. ",6,5.0,ICLR2021 +HJegbBKCFr,2,HyxY6JHKwr,HyxY6JHKwr,Official Blind Review #4,"The problem tackled by the paper is related to the sensitivity of deep learning models to hyperparameters. While most of the hyperparameters correspond to the choice of architecture and optimization scheme, some influence the loss function. This paper assumes that the loss function consists of multiple weighted terms and proposes the method of finding the optimal neural network for each set of parameters by only training it once. + +The proposed method consists of two aspects: the conditioning of the neural network and the sampling of the loss functions' weights. Feature-wise Linear Modulation is used for conditioning and log-uniform distribution -- for sampling. + +My decision is a weak accept. + +It is not clear to me if the choice of performance metrics is correct. In many practical scenarios, we would prefer a single network that performs best under a quality metric of choice (for example, perceptual image quality) to an ensemble of networks that all are good at minimizing their respective loss functions. Therefore, the main performance metric should be the following: how much computation is required to achieve the desired performance with respect to a chosen test metric. + +Moreover, it might be obvious that the proposed method would be the best w.r.t. this metric, compared to other hyperparameters optimization methods, since it only requires a neural network to be trained once with little computational overhead on top. But then its performance falls short of the ""fixed weight"" scenario, where a neural network is trained on a fixed loss function and requires to raise the complexity of the network to achieve similar performance. + +Therefore, obtaining a neural network that would match the desired performance in the test time and would have a similar computational complexity requires more than ""only training once"", with more components, such as distillation, required to be built on top of the proposed method. The title of the paper is, therefore, slightly misleading, considering its contents. + +Also, it is slightly disappointing that the practical implementation of the method does not allow a more fine-grained sampling of weights, with uniform weights sampling shown to be degrading the performance. This implies that the method would have to be either applied multiple times, each time searching for a more fine-grained approximation for the best hyperparameters, or achieve a suboptimal solution. 
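+
+So that the authors can correct me if I have misread the method, the training step I have been assuming throughout this review is roughly the following sketch. All module and variable names are mine, the individual loss terms are placeholders, the log-uniform range is invented purely for illustration, and feeding log-weights to the conditioning network is my own choice; this is not the authors' code.
+
+import torch
+import torch.nn as nn
+
+n_terms, feat_dim = 3, 64                    # number of weighted loss terms, width of modulated features
+backbone = nn.Linear(32, feat_dim)           # stand-in for the task network trunk
+head = nn.Linear(feat_dim, 1)                # stand-in for the task head
+film = nn.Linear(n_terms, 2 * feat_dim)      # maps sampled weights to per-channel (gamma, beta)
+params = list(backbone.parameters()) + list(head.parameters()) + list(film.parameters())
+opt = torch.optim.Adam(params, lr=1e-3)
+
+def sample_log_uniform(n, low=1e-3, high=1e3):
+    # log-uniform sampling of the loss weights, as I understood the paper to propose
+    lo, hi = torch.log(torch.tensor(low)), torch.log(torch.tensor(high))
+    return torch.exp(torch.rand(n) * (hi - lo) + lo)
+
+x, y = torch.randn(8, 32), torch.randn(8, 1)          # dummy batch
+lam = sample_log_uniform(n_terms)                     # one weight per loss term, resampled every step
+gamma, beta = film(torch.log(lam).unsqueeze(0)).chunk(2, dim=-1)
+h = gamma * torch.relu(backbone(x)) + beta            # FiLM-style modulation by the sampled weights
+pred = head(h)
+
+# placeholder loss terms; the real ones depend on the application
+losses = torch.stack([((pred - y) ** 2).mean(), pred.abs().mean(), h.pow(2).mean()])
+total = (lam * losses).sum()                          # the weighted objective the network is conditioned on
+opt.zero_grad(); total.backward(); opt.step()
+
+If this is an accurate reading, then my main concern above amounts to saying that what matters at test time is the quality of the conditioned network at one particular choice of weights, and that is precisely where the comparison against the fixed-weight baseline becomes unfavorable.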
+ +Below are other minor points to improve that did not affect the decision: +-- no ImageNet experiments for VAE +-- make plots more readable (maybe by using log-scale) +-- some images are missing from fig. 7 comparison",6,,ICLR2020 +i91vC_wKTNa,2,o2ko2D_uvXJ,o2ko2D_uvXJ,learning group features ,"The authors proposed a neural network architecture, Group-Connected Multilayer Perceptron (GMLP), which automatically groups the input features and extracts high level representations according to the groups. This paper focuses on classification problems. + +The architecture can be decomposed to three stages. The first stage is to automatically group the input features by multiplying soft-max of a routing matrix. At the second stage, a locally fully connected layer with the corresponding activation functions is used for each group, and a pooling layer merges two groups to a new group. At the final stage, all the groups would be concatenated and input to a fully connect layer to get the final output. + +The experiments showed that GMLP outperforms vanilla MLP, SNN, SET, FGR on seven real-world classification datasets in different domains. GMLP ensures higher accuracy with lower complexity compared to vanilla MLPs. + +The authors claimed that if we consider the groups as leaves, this method then becomes growing a binary tree from the leaf to the root. The extensive experiment results showed the effect of GMLP hyper-parameter choices, e.g. number of groups (number of nodes), width of each group (size of nodes) and type of pooling layer (way to built parent nodes). But in terms of a tree, it would be interesting to have some experiments to show the effect of the way combining feature groups. + +The experiments on the simulated Bayesian network dataset supported the claim that this architecture can utilize the fact that some of the features are not related and do not need to interact with each other. However, the architecture the authors used corresponds to the model that generated data, which is almost impossible in many real life problems. It can be helpful to have some results on simulated data with mismatched architectures from the model to help better understand the performance. + +One of the most important ideas in this paper is limiting the group-wise interactions. The size and number of groups the experiments chose would lead to many overlapped groups and many features chosen multiple times. It would be nice to have some analyses on the chosen groups and selected features, e.g. the existence of a set of features that always come into one group, and/or comparison of derived groups by GMLP and randomly chosen groups. + +A more detailed explanation of the dataset would be needed. For example, the authors used MIT-BIH dataset to compare the accuracy of GMLP and MLP with different sizes without introducing the dataset. + +If the complexity analysis of GMLP, equation (7), is only for inference, please also include the training complexity. Another concern is that the results suggest that the number of feature groups may need to be quite large (compared to the number of original features). In addition to figure 2, please provide the analysis of model size in addition to complexity analysis. + +GMLP selects features by using soft-max of a $km \times d$ matrix. The authors may want to investigate reparametrization tricks to solve similar problems, including concrete relaxation in the following references: + +C. J. Maddison, Andriy Mnih, and Yee Whye Teh. 
The concrete distribution: A continuous relaxation of discrete random variables. In ICLR, 2017. + +Muhammed Fatih Balin, Abubakar Abid, and James Y. Zou. Concrete autoencoders: Differentiable feature selection and reconstruction. In ICML, 2019. +",5,5.0,ICLR2021 +Np0-NxD_I4d,1,TCAmP8zKZ6k,TCAmP8zKZ6k,"INS-DS shows improvement over existing DSMADs, but clarifications are needed regarding experimental settings","In this manuscript, authors proposed a novel dialogue system for medical automatic diagnosis (DSMAD) called INS-DS. There are three components in the general DSMADs, which are NLU, DM and NLG. In this paper, NLU and NLG components are adopted from Xu et al., 2019. Authors focused on designing decision-making parts in the dialogue management (DM) component. INS-DS includes two modules: an inquiry module and an introspective module which is inspired by the introspective process when humans make decisions. Authors also introduced two metrics evaluating the reliability and robustness of general DSMAD agents. + +Authors demonstrated INS-DS achieved state-of-the-art performance compared with other DSMAD agents and performed better with respect to robustness and reliability. My concerns are mainly related to the experimental settings and the detailed comments are listed as follows. + +#### Major Comments: +1. (Table 4) Evaluating and improving the robustness and reliability of DSMAD agents is one of the major contributions in this manuscript. Authors have evaluated the robustness and reliability of DSMAD agents based on proposed metrics. One of the further experiments authors can conduct is to check the performance of DSMAD agents trained on MZ or DX and test on the other. This is a general and widely-accepted approach to demonstrate the agents' robustness and reliability. + + Both DX (527 conversations) and MZ (710 conversations) are relatively small datasets and there are only two diseases (I.D. and U.R.I) that appeared in both of the datasets. The performances regarding I.D. are pretty consistent but this is not true for U.R.I. The symptoms related to U.R.I. may be different across these two sets but there should be a reasonable overlap. Could you train the model on DX and test it on MZ focusing on I.D. and U.R.I? This would be more convincing to demonstrate the agent's robustness and reliability and this helps understand the inconsistency of diagnosis accuracies regarding U.R.I. + +2. (Table 1) For Basic DQN and KR-DS, their performances in Ext. (External trust) across MZ and DX are dramatically different. Is there any reason that can explain this difference especially considering that DX dataset was proposed along with KR-DS? + +3. (Table 2) It's better to include the baseline performances (test dataset without noise) for each of the agents. Otherwise it is hard to see whether agents are robust to noise or not under different settings (NS.1, NS.2 and NS.3 ). +4. (Table 3). If more related symptoms are appended, the tasks should be relatively easier for humans. But NS.3 is consistently worse than NS.2 across all agents. Could you elaborate on this observation? + +5. (Figure 3) How does the diagnosis validity correlate with the ground-truth diagnostics for the test set? The diagnosis validity score ranges from 2.8 to 3.2 which is far from the best score 5. Is this due to the inconsistency between the ground-truth diagnosis and the diagnosis made by students? + +6. 
(Page 7, robustness analysis) In the robustness analysis, authors made use of noise test sets to demonstrate the agents' robustness. For humans, the diagnosis should be invariant with respect to the orders of the explicit symptoms or implicit symptoms. Based on this assumption, it is natural to augment the train set by permuting/sampling symptoms and this should further improve the performances of all models. Have you applied this augmentation strategy in the training phase? + + +#### Minor Comments: +1. (Figure 3) It is helpful to include the error bars (or all three data points) to show the variance of scores assigned by students. + +2. (Table 2) Robusteness -> Robustness + + +",6,4.0,ICLR2021 +9OTZqSDDqdH,1,4IwieFS44l,4IwieFS44l,Interesting paper with a somewhat flawed presentation,"The paper presents a method to create neural networks that, due to floating-point error, lead to wrong robustness certifications on most input images by a so-called ""complete verifier"" for neural network robustness. The authors show how to make their networks look a bit less suspicious and they discuss a way to detect neural networks that have been manipulated in the way they suggest. + +To me, it was obvious a priori that any ""complete verifier"" for neural network robustness that treats floating-point arithmetic as a perfect representation of real arithmetic is unsound. +However, I think works like the current one are important to publish such as to practically demonstrate the limitations of the ""guarantees"" given by certain robustness certification systems and to motivate further research. Therefore, I expect the target audience of the paper to be informed outsiders who have not so far questioned the validity of robustness certification research that did not explicitly address floating-point semantics. In light of this, the paper has several weaknesses related to presentation: +- Terminology is often used in a confusing way. For example, the approach that is practically demonstrated to be unsound is called a ""complete verifier"" with the ""strongest guarantees"", wrongly implying that all other verifiers must be at least as unsound. +- The related work is incomplete. For example, unsoundness due to floating-point-error has been previously practically observed in Reluplex: https://arxiv.org/pdf/1804.10829.pdf (in this case, it produced wrong adversarial examples, without any special measures having been taken to fool the verifier). +- The related work is not properly discussed in relation to floating-point semantics. Some of the cited works are sound with respect to round-off, others are not. I would expect this to be the central theme of the related work section such as to properly inform the reader if and why certain approaches should be expected to be unsound with respect to round-off. The current wording that ""all the verifiers that work with a model of the network are potentially vulnerable"" is not fair to all authors of such systems; some have taken great care to ensure they properly capture round-off semantics. +- I did not find obfuscation and defense particularly well-motivated. What is the practical scenario in which they would become necessary? +- The paper sends a somewhat strange message: it (exclusively) suggests to combat floating-point unsoundness by employing heuristics to make it harder to find actual counterexamples. What about just employing verifiers with honest error bounds that explicitly take into account floating-point semantics? 
It may not be possible in the near-term to actually make correct ""complete verifiers"", but at least authors of incomplete verifiers will not have to succumb to pressure to make an unsound ""complete"" version in order to match precision, performance and/or ""guarantees"" of their competitors. + +The technical sections are written well enough to be understandable, and the main technical contribution is a pattern of neurons we can insert into a neural network in order to make it behave in an arbitrary way that is invisible to the considered verifier. This is interesting and disproves any claim of ""completeness"", but scenarios where this would be a way to attack a system seem a bit contrived. Ideally, there would be an approach that can exploit round-off within a non-manipulated verified neural network to arbitrarily change the classification of a given input without changing the network. The paper might benefit from a discussion of this possibility and an explanation why it was not attempted. + +--- +The new section 2.4 is appreciated, though it seems the paper still does not say that incomplete methods can deal with round-off error by sound overapproximation.",6,5.0,ICLR2021 +YjZ2xnF9BbR,2,RrSuwzJfMQN,RrSuwzJfMQN,ODE integrator error bounds intrinsically confer adversarial robustness to Neural ODEs,"This paper uses a high-order ODE solver to take an $h=1$ step of a neural +network layer. The mechanics of training with an ODE solver that uses +parameters to determine the dynamics of producing a layer output is previously +known. This paper notes that outputs produced using an ODE integrator, wrt +adversarial inputs, have established error bounds. They demonstrate the +additional adversarial robustness of operating in a regime better respecting +such bounds. + +I feel the paper is well written and clear. The central theoretical idea is not a huge leap, but is +novel in the context of robust machine learning, imho. While presenting a simple, basic idea is +always nice, the paper left me unclear about whether the demonstration was primarily intended +to demonstrate a not terribly surprising theoretical prediction, or whether the technique would +be useful in practice. + +They show many common networks with skip connections (resnet, etc.) correspond +to a forward ODE integration scheme with large step size $h=1$, whereas the +theoretical ODE adversarial bounds only hold as $h\rightarrow 0$. They first show +that using fixed h corresponding to some fixed learning rate either does not +satisfy $h\sim 0$ or has too slow convergence, with no gain in adversarial +robustness. So instead, they use a variable step size integrator instead, to +project forward to a larger $h$. + +The idea and theory are simple and fairly well presented, and the demonstration +on a simple dataset nicely shows the benefit of this approach compared to the +original neural network. However I felt a large number of significant things +were left out regarding choice of integrator. Most obviously, Table 1 lacks +dopri parameters. How does adversarial robustness depend on such parameter[s]? +What is the effect on execution time? + +They only use one variable step size integrator. They use a 4th and 5th order +variable step size integrator, to demonstrate the predicted natural ODE-based +robustness. Several times I felt the presentation hinting that the low-order +of integration schemes is to blame; however, for layers with discontinuities +(relu), I naively expect lower order (and more evaluations?) 
might work out +just as well. + +While the experiments constitute a simple demonstration, it still remained +difficult to judge the practical importance without some data comparing +execution time. And once one begins to consider efficiency, a question +of what styles of integrator work well in practice would be nice (ex. gear +vs. Richardson vs. their one chosen integrator). + +Even if the ODE method is expensive, compute time can be compared with another +ways to promote robustness, such as adversarial training. A comment about +feasibility of using ODE method in conjunction with other robustness methods +could be made. + +--- + +I have read the authors' comments. The addition of the boundary attack experiment +was an excellent step; however, it underscores a requirement for further analysis to +understand *why*, apparently, in some cases the additional theoretical bound *fails* to +confer significant robustness. The suggested ""natural robustness"" is only sometimes +present. Often clean accuracy is much reduced, so the method is not yet one I would consider +useful yet. For me, understanding when the method works well (or not so well) would +bring this work out of the realm of interesting theoretical bounds into one of more +general interest. +",5,4.0,ICLR2021 +qOM3wROZy57,2,hpH98mK5Puk,hpH98mK5Puk,Weak accept,"This work (InfoBERT) proposes additional objectives for transformer finetuning to obtain models more robust to adversarial inputs. The authors first propose a mutual information based information bottleneck objective, next the authors propose an adversarial loss inspired method for identifying robust features and a subsequent objective to emphasize the mutual information between global representations and these robust features. The experiments demonstrate that InfoBERT consistently outperforms other adversarial training approaches on a variety of adversarial evaluations. + +I largely follow the logic behind the derivation, however I find some of the details unclear. I would like to see proofs for the theorems as well as an explanation of the assumptions under which the theorems hold. The experimental results are convincing, however there are no ablation studies to disentangle the performance contributions of the two proposed objectives. For the first point, the questions I have are as follows: + +for equations 1-3, I find the integral notation to be a bit odd - isn't it common practice to put the dydt at the very end of the integral? Also you should consider leaving out punctuations from equations +for equation 5, why is there another 1/N inside the square brackets? +why is equation 7 true in the general case? Suppose n=1, is this essentially saying that any sample from an empirical distribution would provide a lower bound for the true distribution? +I'd like to see a proof for Theorems 3.1 and 3.2 +how do the authors define stability and robustness? The manuscript talk about them in vague terms and they do not seem to be precisely defined +how does equation 9 follow from 6? Can you put in the intermediate steps? Also in this case what is N and what is M? And what happened to the multiplier n from equation 6?",6,3.0,ICLR2021 +SkD9M_NZf,3,SJvu-GW0b,SJvu-GW0b,Unclear motivation and experiments,"The paper proposes GRAPH2SEQ that represents graphs as infinite time-series of vectors, one for +each vertex of the graph and in an invertible representation of a graph. 
By not having the restriction of representation to a fixed dimension, the authors claims their proposed method is much more scalable. They also define a formal computational model, called LOCAL-Gather that includes GRAPH2SEQ and other classes of GCNN representations, and show that GRAPH2SEQ is capable of computing certain graph functions that fixed-depth GCNNs cannot. They experiment on graphs of size at most 800 nodes to discover minimum vertex cover and show that their method perform much better than GCNNs but is comparable with greedy heuristics for minimum vertex cover. + +I find the experiments to be hugely disappointing. Claiming that this particular representation helps in scalability and then doing experiment on graphs of extremely small size does not reflect well. It would have been much more desirable if the authors had conducted experiments on large graphs and compare the results with greedy heuristics. Also, the authors need to consider other functions, not only minimum vertex cover. In general, lack of substantial experiments makes it difficult to appreciate the novelty of the work. I am not at all sure, if this representation is indeed useful for graph optimization problems practically. + + + + +",4,3.0,ICLR2018 +7IqYKzVLUBT,4,DILxQP08O3B,DILxQP08O3B,Novel approach for learning visual representations for navigation; but weak experiments and explanations.,"Summary +The paper proposes Visual Transformer Network which encodes the relationship between all detected object instances in a frame and uses it for navigation. The paper uses DETR for object detection and learn an association between local descriptors (from the object detector) with global descriptors (ResNet18) using the proposed VT model. They show that using VT improves performance on the object navigation task in AI2-THOR simulator compared to existing methods. + +Strengths +- The paper proposed a novel transformer architecture that learns an association between local object descriptors with global image region features so that actions can be grounded to visual regions in the image. + +- Different from prior work, the paper uses all the objects detected for a label instead of just the most confident detection. + + +Weaknesses + +- The paper doesn't fully address why DETR performs better than FasterRCNN features. Appearance features from FasterRCNN have been widely used for several downstream tasks in Vision and Language Navigation[1], Vision and Language tasks[2]. From the experiments, it's not clear why DETR is doing better than Faster-RCNN especially when the detection accuracy of DETR is also better than Faster RCNN. + +- Additionally, I didn't fully follow how authors obtain the appearance features from Faster RCNN based method. The authors mention that object appearance features are extracted from different layers of a backbone network. How is it different from the approach taken by Bottom-Up, Top-Down[3] paper in which 2048-dim appearance features are extracted for each visual region? + +- The experimental setup isn't fully reflective of the object goal navigation task. The experiments are conducted in AI2 thor scenes which only contain one room. It's not clear, how this method will perform when evaluated on significantly more complicated environments like Matterport / Gibson [4]. Specifically, I am interested in how will the proposed architecture perform when the goal object is not in the same room as the agent. + +- The navigation task is also made simpler by discretizing into a grid. 
Single room environments and discrete grids simplify a lot of navigation-related challenges and the authors don't discuss how the proposed architecture will generalize to more complex object navigation tasks. + +- The use of spatial embeddings as well as appearance embedding isn't all that surprising. Existing work including Du et al. uses bounding box coordinates to help learn spatial associations between objects. + +Other questions: +- Instead of pre-training without employing the navigation policy, did the authors try using shortest-path based demonstrations to help learn the navigation policy as well? In the first stage, the navigation policy learns using imitation learning and then finetuned with A3C? + +- What is the step size of the agent for the forward step? What are the turn angles for Turn-left, Turn-right actions? What are the tilt angles for look-up and look-down actions? + +- What's the reason for improvement over ORG (in absence of TPN). Is it superior visual representations (Faster RCNN vs DETR) or the fact ORG only chooses objects with the highest confidence while VT uses all the detected objects? + +- How does the agent learn long-term associations between objects across multiple frames. In my opinion, the proposed architecture puts all the burden of learning these long-term object relationships across multiple frames on the LSTM policy since the VT only learns association within a single frame. + +[1] Improving Vision-and-Language Navigation with Image-Text Pairs from the Web; Majumdar et al. + +[2] Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks; Li et al. + +[3] Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering; Anderson et al. +",6,4.0,ICLR2021 +tM4WYz1Sfov,1,MJAqnaC2vO1,MJAqnaC2vO1, ,"In this paper, the authors suggest using differentiable surrogate parameterized loss functions that more closely approximate some of the frequently used metrics for segmentation, including variations of accuracy, IoU, and F-score for the whole area and the boundaries, and use reinforcement learning to tune the parameters instantiating the surrogate loss functions. +Moderate improvement is shown through experimental evaluations compared to cross-entropy and other used loss functions. + +I think, overall, this is a good paper, the following elaborates it: + +Strengths: +* I like the general idea of bridging the gap between the often-non-differentiable evaluation metrics and the optimization loss functions so that the model is more closely optimizing for what the task mostly cares about. +* Despite the existence of several previous works on the taken general direction, there are still seemingly novel methodological contributions including the loss parameterization and its tuning process. The design of most of the components made intuitive sense. +* The experimental setup including comparisons to different losses, ablation, and transferability studies are thorough. A moderate but consistent improvement is observable when compared to the frequently used loss functions, over two commonly used semantic segmentation datasets, and across various network architectures. +* The paper is fairly well-written and easy to follow. 
+ +Weak points: +* Even though designing differentiable surrogate losses is conceptually simple and straight-forward, as shown in the experimental section, in practice it turns out that a relatively complicated set of tricks is needed to be leveraged before it shows improvements, including the parameterization and its two regularization schemes and the parameterization tuning using reinforcement learning. This in total adds significant complexity on top of a normal training scheme. +* Some of the other distance-based metrics including the Hausdorff and Chamfer distance used to evaluate semantic segmentation have not been covered in this work. + +Minor comments: +* In table 4, for R101-DeepLabv3+, R101-PSPNet and HRNetV2p-W48, I am missing comparisons to their ceiling, i.e. the values they could have obtained if the parameters were directly optimized for them. +* Algorithm 1: \mu_0, for t=1... and (\mu_t, ...) seem not consistent. Please fix. +* Equation (4) variable c is ambiguous in the left denominator as being the variable for both nested sigmas, please change the inner, e.g. to c'. +* Equation (2) is not computationally stable as the denominator can turn to 0, in case a class c is not present in the dataset. +* In Table 5, ""Fail"" under VOC mIoU is ambiguous; in case a very low performance is obtained, I think reporting the number would still help. +* In the text of section 3, IoU-based metrics, please specify that Min-Pooling and Max-Pooling are with stride 1. +* Typo: ""the our searched loss"" => ""that our searched loss""",7,4.0,ICLR2021 +i8LnvyHrJNz,3,oGq4d9TbyIA,oGq4d9TbyIA,"Good idea, but simple. Results comparison has flaws.","The authors propose neural channel expansion (NCE), a neural architecture search (NAS) and quantization method. Existing NAS+Q methods typically search for the architecture of the DNN along with the precision at each layer, maximizing accuracy while respecting some kind of hardware constraint. The result is a DNN with mixed-precision, which is challenging for most existing hardware (which only support one or a few precisions). NCE keeps precision the same in each layer, and instead uses the precision sensitivity signal in the NAS to adjust the width of the layer (expand or shrink). The result is uniform-precision, hardware-friendly DNN. + +NCE works by first training normally (with quantization) for 40 warmup epochs, followed by 110 epochs of search. At the end of each search epoch, NCE adjusts the channels search parameter in each layer. A few experiments show convincingly that a wider layer is indeed less sensitive to quantization. Thus using the sensitivity-to-quantization signal to adjust layer width is a good idea. + +Experiments on CIFAR-10 show NCE can boost quantized accuracy at 2w2a by up to 0.8%, and in the case of VGG16 trim unnecessary params. On ImageNet there is also some accuracy improvement, though only a little bit over LSQ. And for ResNet-50 NCE again can reduce param size. + +The paper is that the idea is fairly simple, and the results are not too impressive. LSQ seems to already do very well and it also uses uniform quantization. One major issue I have with the comparison in Tables 1 and 2 is that, on some of the smaller networks (ResNet-32 for CIFAR and ResNet-18 for ImageNet) the NCE result has more params than the other methods. This is potentially unfair as a larger network is almost always more accurate. 
I think you should uniformly increase channel widths in at least the ""w/o NCE"" baseline to see if the accuracy boost is from NCE learning the layer sensitivities or just from a bigger model. + +The results on larger networks (VGG and ResNet-50) is much more compelling, showing that NCE can trim unnecessary params while improving accuracy. More results like this would make the paper more convincing. I would also like to see exactly which layers were reduced in size on these networks. + +Another issue is that despite being a NAS work, there aren't NAS baselines in the comparison. I understand that NCE only requires one training run while the original NAS required many retrainings. But I believe HAQ (Wang et al 2019) and DARTS (Liu et al 2018) are both NAS techniques for mixed-precision quantization that require only one training run. The authors should include a comparison against such methods or discuss why it isn't needed. + +Minor issues: + - Section 4.3.2 Typo: ""results of the 2X case are inferior to the 2X case"" + - Table 2, ResNet-18, you highlighted your own result but LSQ seems to be better in accuracy and param size? + +EDIT: Raised score from 4 to 6 after the authors clarified some points and added additional experiments.",6,4.0,ICLR2021 +pQwfLiBsf-U,2,#NAME?,#NAME?,"Interesting approach with extensive experiments, some gaps remain","# Summary +This paper proposes uncertainty aware pseudo-labelling for semi-supervised learning, extending previously known methods by negative labels. + +# Score justification +Well written paper and with extensive experiments with some points of improvement. The method should be positioned against others using confidence filtering and the role of calibration needs to be studied further. + + +# Strong and weak points +## Pros +Extensive experiments on different datasets and domains. +Broadly applicable method, that aims to improve conventional pseudo-labelling. Method is independent of uncertainty estimate and data augmentation. +Combining uncertainty estimates and confidence estimates, as well as adding negative learning are interesting approaches. + +## Cons +Confidence filtering for pseudo-labels has already been suggested, see e.g. suggestions given in detailed comments. + +I am unfortunately not convinced by the role of calibration, details given below. The authors could enhance the ablations studies with and without calibration and -importantly- adjusted thresholds, to clarify that point. + + + +# Questions to the authors +The authors should consider e.g. https://arxiv.org/pdf/2002.02705 and https://www.microsoft.com/en-us/research/uploads/prod/2020/06/uncertainty_self_training_neurips_2020.pdf and position their work against those. + +""Learning with UPS"" have you treated pseudo-labels and original labels equally when training $f_{\theta,1}$, if so, how about fine-tuning on the original labels? + +Please elaborate on the effect of different thresholds and how they were chosen. Especially with regards to calibration. + +# Detailed comments +The authors should consider e.g. https://arxiv.org/pdf/2002.02705 and https://www.microsoft.com/en-us/research/uploads/prod/2020/06/uncertainty_self_training_neurips_2020.pdf + +Augmentations for other domains are only not effective, if they are not suitable for that domain. Augmentations should use domain specific invariances. From the paper it becomes clear, you did use data augmentation, not sure why you argue so strongly against it. + +I am not sure if calibration plays a role here. 
Calibration only moves the distribution in shape, by setting a confidence threshold $\tau$ the threshold would be changed, but a suitable threshold could be found before and after calibration. + +Section 3.1 small type: ""psuedo"" --> ""pseudo""",6,4.0,ICLR2021 +HyNnyzceG,3,ry8dvM-R-,ry8dvM-R-,Revised Review ,"The paper introduces a routing network for multi-task learning. The routing network consists of a router and a set of function blocks. Router makes a routing decision by either passing the input to a function block or back to the router. This network paradigm is tested on multi-task settings of MNIST, mini-imagenet and CIFAR-100 datasets. + +The paper is well-organized and the goal of the paper is valuable. However, I am not very clear about how this paper improves the previous work on multi-task learning by reading the Related Work and Results sections. + +The Related Work section includes many recent work, however, the comparison of this work and previous work is not clear. For example: +""Routing Networks share a common goal with techniques for automated selective transfer learning +using attention (Rajendran et al., 2017) and learning gating mechanisms between representations +(Stollenga et al., 2014), (Misra et al., 2016), (Ruder et al., 2017). However, these techniques have +not been shown to scale to large numbers of routing decisions and task."" Why couldn't these techniques scale to large numbers of routing decisions and task? How could the proposed network in this paper scale? + +The result section also has no comparison with the previously published work. Is it possible to set similar experiments with the previously published material on this topic and compare the results? + + +-- REVISED + +Thank you for adding the comparisons with other work and re-writing of the paper for clarity. +I increase my rating to 7. + +",7,3.0,ICLR2018 +rkxXYzgb9r,2,rkg98yBFDr,rkg98yBFDr,Official Blind Review #2,"This paper studies classification problems via a reject option. A reject option could be useful in prediction problems to handle Out-of-distribution examples. The classification procedure studied in this paper builds on three components 1. An auto-encoder that obtains a latent low-dimensional representation of the data point 2. A generative model that models the class-conditional probability model and 3. a margin based loss function that learns a classifier that provides a large probability mass to the class-conditional distribution corresponding to the correct class. The final decision procedure is to reject an input if the best class conditional probability is small and to use the class corresponding to the best class conditional probability otherwise. + +On the whole I like the paper and think that the problem tackles an important problem. I have a few comments +1. I would like to see what is the log-likelihood assigned by the proposed procedure on OOD samples and would like to see a comparison of the log-likelihood assigned by other procedures.",8,,ICLR2020 +HJeVITt6KS,3,HJgcw0Etwr,HJgcw0Etwr,Official Blind Review #3,"This paper studies the learning of over-parameterized neural networks in the student-teacher setting. More specifically, this paper assumes that there is a fixed teacher network providing the output for student network to learn, where the student network is typically over-parameterized (i.e., wider than teacher network). 
+ +This paper first investigates the properties of critical points of student networks in the ideal case, i.e., assuming we have infinite number of training examples. Then the results have been generalized to a practical case (the gradient is smaller than some small quantity). Moreover, this paper further studies the training dynamics via gradient flow, and proves some convergence results of GD. + +Overall, this paper is somewhat difficult to follow and understand. The notation system is kind of complicated and some assumptions seem to be unrealistic. Detailed comments are as follows: + +It is a little bit difficult to get insightful understandings towards the critical points of deep neural networks from the theorems provided in this paper. I would like to see clearer properties of the critical points learned by student network rather than some intermediate results. + +The title is not consistent with the content of the paper. From the title of this paper looks like a characterization on the student network trained by SGD. However, throughout the paper, the authors somehow investigate the critical points under a stronger condition, i.e., all stochastic gradient is zero, rather than the widely used one, the expectation of stochastic gradient is zero. I don’t think the critical points considered in this paper can be guaranteed to be found by SGD. Besides, when analyzing the training dynamics, as provided in Section 5, the authors resort to gradient descent, because in (5) the dynamics of $W_k$ rely on the expectation of stochastic gradients. + +Many statements should be elaborated in detail. For example, in the paragraph before Corollary 1, why $R_l$ is a convex polytope? In Theorem 2, what’s $\alpha_{kj}$? What’s the meaning of alignment? In the paragraph after Theorem 4, why Theorem 4 suggests a picture of bottom-up training? I believe the authors should provide a more detailed explanation. + +This paper studies the over-parameterized student network, is there any condition on its width? + +In Theorem 5, the assumption $\|g_1\|_\infty<\epsilon$ seems rather unrealistic, typically this bound can only hold in expectation or with high probability. Besides, why there is no condition on the sample size n in Theorem 5? It looks like Theorem 5 aims to tackle the case of finite number of training samples. + +----------------------------------- +Thanks for your response and revision. The current title is clearer and the definition of SGD critical points is more accurate. The observations regarding the alignment between teacher and student networks are indeed interesting. However, I still feel that this result is somehow difficult to parse, as I am not clear why this can be interpreted as the learning of the teacher network. Therefore I would like to keep my score. + +",3,,ICLR2020 +4BQoSUinQx_,2,BVSM0x3EDK6,BVSM0x3EDK6,Interesting use of Random convolutions for Data-Augmentation ,"This paper proposes a simple way to increase the robustness of the learned representations in a network perform a series of object recognition tasks by adding a random convolution layer as a pre-processing stage, thus “filtering the image” and preserving the global shape but altering the local `texture’ of the newly transformed image. Here, the hope is that -- analogous to Geirhos et al. 
2019 that induces a shape bias by transforming the image distribution into a new one with altered *global* textures that induce a shape bias and increases general robustness to o.o.d distortions -- the authors here go about doing something similar at the local level given the small size of the receptive field of the filter, thus preserving the shape and slightly altering “the texture”. + +Pros: +* While the innovation is simple and efficient, this data-augmentation scheme works, and I can see how other future works may use this as well as a data-augmentation technique for object recognition. I am not sure however if no one else has explored the effects of random convolutions for robustness. It sounds too good to be true, but then again -- there is always beauty in simplicity and it is possible that the authors have hit the nail on the head on finding a somewhat ‘contrived’ filtering process as a bonus rather than a limitation. Simple, yet counter-intuitive findings like these are relevant for ICLR. +* Authors provide lots of experiments that to some degree prove the success of their augmentation strategy (although see Cons). + +Cons: +* Biological Inspiration: What is the biological mechanism linked to the success of using random convolutions. One could argue that this point is ‘irrelevant’ to the authors and the readers, but as there is a plethora of different data-augmentation techniques to choose from, why should computer vision and machine learning practitioners choose this one? (See Missing Reference for a suggestion) +* Insufficient/Incomplete Baseline: The model is inspired loosely by Geirhos et al. 2019; but how does the model compete with Geirhos’ et al.’s Stylized ImageNet? I would have wanted to see a baseline between the authors proposed model and other texture-based augmentation strategies. This would elucidate the Global vs Local advantages of “texture”/style transfer on learned representations. I think this is where authors could capitalize more on. +* The word `texture’ in the paper is a mis-nomer. Here what is really done is 1st order filtering via a convolution operation with a filter that does not happen to have a Gabor-like shape. “Texture” in other contexts going back to vision science and even computer vision and image processing (style transfer included), is usually computed by a set of cross-correlations between *outputs* of a filtered image (analogous to the Gramian Matrix of Gatys et al. 2015), or the principled Portilla-Simoncelli texture model from 1999. + +Missing references: +* Excessive Invariance increases adversarial vulnerability by Jacobsen et al. ICLR 2019. The augmentation procedure proposed by the authors shows robustness to common distortions, but how about adversarial robustness? Is this relevant? Was this tried? I’d love to hear more about the authors thoughts on this to potentially raise my score. +* Emergent Properties of Foveated Perceptual Systems (link: https://openreview.net/forum?id=2_Z6MECjPEa): An interesting concurrent submission to this year's ICLR has shown that the biological mechanism of visual crowding (that resembles texture computation for humans in the visual periphery) is linked to some of the operations introduced in the paper by the authors. 
It would be great if the authors potentially cite similar (and/or the before-mentioned) works to provide a link to a biological mechanism that may support why their data-augmentation procedure works and/or should be used; otherwise it seems contrived and could be seen as “yet another data-augmentation procedure that increases robustness but we don’t know why”. +* Implementing a Primary Visual Cortex in the retina increases adversarial robustness by Dapello, Marques et al. 2020 (NeurIPS). This recently published paper in a way shows almost the opposite of what the authors are proposing here. Rather than using random convolutions, they actually mimic the gamut of spatial frequency tuning properties of Gabor filters in the first stages of convolution as done in human/monkey V1. The authors should discuss how their results fit with Dapello, Marques et al. 2020 and how they can reconcile their somewhat opposing views. + +Final Assessment: +I am on the fence of having this paper accepted at ICLR given the limitations expressed above, but I do like it’s simplicity that should not take away it’s merit -- thus my slight lean towards acceptance. I am willing to raise my score however if authors address some of the cons/limitations, and am also curious to see the opinion from other reviewers, it is possible that I may have missed a key reference regarding data-augmentation that may weaken my assessment. +",6,4.0,ICLR2021 +b13emmFhpZs,1,guEuB3FPcd,guEuB3FPcd,Interesting study of replacing the traditional real-valued algebra with other associative algebras,"The paper proposes an interesting kind of networks, AlgebraNets, which is a general paradigm of replacing the commonly used real-valued algebra with other associative algebras. This paper considers C, H, M2(R) (the set of 2 × 2 real-valued matrices), M2(C), M3(R), M4(R), dual numbers, and the R3 cross product, and investigates the sparsity within AlgebraNets. + +The work in the paper is interesting and this paper is generally written well. However, there are a few issues/comments with the work: + +1.The citation of the references in the main body of this paper is not easy to read. It will be better to replace the format “author(s) (year)” with the format “(author(s), year)” ; + +2.Some figures and tables do not appear near the discussion, for example, Figure 1 is shown on Page but it is discussed until page 5, which makes it difficult to read; + +3.In Figure 1, the subfigure in the second row and first column, it seems that the performance of model with H and whitening the best stable performance. The subfigure in the second row and second column, it can be seen that the model with H is not better than the baseline model; + +4.There are many inconsistencies in the format of the reference, for example, + +1)In some places the author's name is abbreviated, while in others it is not. References “C. J. Gaudet and A. S. Maida. Deep quaternion networks. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2018. ” and “Geoffrey E. Hinton, Sara Sabour, and Nicholas Frosst. Matrix capsules with em routing. In ICLR, 2018. ”; + +2)In some places the conference’s name is abbreviated with the link, while in others it is not. References “Siddhant M. Jayakumar, Wojciech M. Czarnecki, Jacob Menick, Jonathan Schwarz, Jack Rae, Simon Osindero, Yee Whye Teh, Tim Harley, and Razvan Pascanu. Multiplicative interactions and where to find them. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum? 
id=rylnK6VtDH.” and “Geoffrey E. Hinton, Sara Sabour, and Nicholas Frosst. Matrix capsules with em routing. In ICLR, 2018. ”. + +Please check carefully and correct the inconsistencies. + ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ + +The paper replaces the traditional real-valued algebra with other associative algebras and shows its parameter and FLOP efficiency. In the beginning, ""I think it is an interesting piece of work, and it may be helpful to develop the basic structural design of neural networks. "". However, after getting the response from the author(s), I more doubt the significance of the work in this paper: although many types of models have been proposed in this paper, the improvement over the baseline models is limited. I did not lower the grade on this paper since I thought it would be interesting and important (if effective) to extend the traditional real number field to more complex algebraic structures. + ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++",6,2.0,ICLR2021 +G5_5a3tA3Ax,4,jz7tDvX6XYR,jz7tDvX6XYR,Official Blind Review #3,"This work introduces a method to accelerate the training speed of deep learning models. The key idea is to start training with shared weights and unroll the shared weights in the middle of the training. The authors report that this strategy accelerates the convergence speed. The paper introduces heuristics on when to stop weight sharing and how many layers to share weights. It further provides an analysis via the view of deep linear models on why weight sharing helps improve the convergence speed. In the evaluation, the paper evaluates their approach against the training of BERT, and shows that their method can obtain comparable and sometimes even better accuracy on downstream tasks while with 50% faster training speed. + +Strengths: ++ The paper aims to address an important problem in large model training: slow training speed. ++ The paper proposes an easy-to-implement approach to accelerate the convergence speed of BERT-like models. + +Weakness: +- The technical contribution seems rather incremental. The main difference between this work and the prior work [1], which also train Transformers via shared weights, seems to be switching from sharing weights to unsharring weights in the middle of the training. +- Important references are missing, making it not clear the advantage of this work as compared with existing approaches that accelerate the training of Transformer networks. +The comparison with existing work is inadequate. Important ablation studies are needed. + +Comments: + +Prior work [1] uses weight sharing to train a smaller Transformer model to obtain similar accuracy. However, weight sharing does not improve training speed per batch, because the training still needs to perform the same amount of forward and backward computation for each batch. Training may actually be slowed down, since the model may need to train with more batches to reach the same accuracy. This paper aims to speed up the training process by switching from shared to unshared weights in the middle of the training, and it observes faster convergence -- achieving similar downstream accuracy with less number of pretraining samples. This is an interesting empirical observation and can potentially become useful in practice. However, there are some major concerns about the paper. 
+ +It is still unclear whether this faster convergence comes from switching from sharing to unsharring weights or is an effect of the model or hyperparameter changes. First, the stop condition (e.g., the switching threshold) cannot be known as a prior. Therefore, it is controlled with an additional hyperparameter. From the text, it is unclear how this hyperparameter has been chosen or will be chosen when training a new model. It would be better to test the sensitivity of the hyperparameter on another model such as GPT-2 to verify the effectiveness of the proposed method. + +Second, the paper adopts Pre-LN in its evaluation (as briefly mentioned in Section 5.1). However, from the text description, it seems it employs the original BERT as the baseline (""We first show experiment results English Wikipedia and BookCorpus for pretraining as in the original BERT paper""). As Pre-LN has been studied in several prior work [2,3] and has been demonstrated to also have the effect of accelerating the convergence speed, it is unclear whether the observed speedup in this paper is an effect of PreLN or weight sharing/unsharing. An ablation study with BERT-PreLN is needed if not already included. + +Third, the analysis on deep linear models appears to be over-simplified, where important characteristics of the DNNs such as non-linear activations and residual branches are not represented, making it difficult to connect it with the actual observations in practice. + +Finally, importance reference [4] is missing. [4] starts with a shallow BERT and progressively stacks Transformer blocks to accelerate training speed, which is in some sense similar to the proposed technique, which starts with shallow BERT with shared weights and switches to full-length BERT in the middle of the training. The paper might need to highlight the difference between this work and [4]. + +Several places in the paper are vague or inconsistent: + +1. The paper claims that it uses the same BERT implementation and training procedure as the one used by Devlin et. al.. However, the accuracy reported in Table 2 seems to be consistently lower than what was reported in the original BERT paper. For example, QQP in the original paper reaches 72.1, whereas this paper reports 71.4, QNLI was 92.7 in the BERT paper and 91.7 in this paper. If we use the original BERT reported results, the proposed technique seems to incur low accuracy on most tested datasets. Some clarification on the accuracy results is needed. + +2. The paper claims that ""it sounds natural for us to expect that ALBERT's performance will be improved if we stop its weight sharing at some point of training. The optimal models are supposed to not be far from weight sharing."" However, this explanation actually creates some confusion. First, the paper does not provide an analysis of how the weight distribution between the weights trained with and without weight sharing, so it is unclear what ""not be far"" means. Second, what does it mean by ""optimal model""? Does it refer to models trained not through weight sharing? If so, prior work [5] identified that model weights and gradients at different layers can exhibit very different characteristics, which seems to contradict the argument that ""The optimal models are supposed to not be far from weight sharing"". + +3. I find it challenging to claim the proposed technique generic for models with repeatedly layers, whereas only BERT is evaluated in the experiments. + +4. 
The paper says ""This means that the weight sharing stage should not be too long"", but it is unclear how long is considered as not too long. + +[1] Lan et. al. ""ALBERT: A Lite BERT for Self-supervised Learning of Language Representations"", https://arxiv.org/abs/1909.11942 + +[2] Xiong et. al. ""On Layer Normalization in the Transformer Architecture"", https://arxiv.org/abs/2002.04745 + +[3] Shoeybi et. al. ""Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism"", https://arxiv.org/abs/1909.08053 + +[4] Gong et. al. ""Efficient Training of BERT by Progressively Stacking"", http://proceedings.mlr.press/v97/gong19a/gong19a.pdf + +[5] You et. al. ""Large Batch Optimization for Deep Learning: Training BERT in 76 minutes"", https://arxiv.org/abs/1904.00962",6,4.0,ICLR2021 +Nzg6I9vupip,3,FZ1oTwcXchK,FZ1oTwcXchK,Interesting (second review),"## Edit on second review + +I apologize again for the tone of my first review, I sincerely tried to understand the paper but I could not when I first read it. A re-read the paper and finally understood it during the review. I left a comment to the authors in the discussion below and they appropriately addressed my new recommendations. With the new equation (1) the paper is hopefully more understandable now. + +I increase my grade from 3 to 5. The findings are quite interesting but I still believe that the paper is not well written: the equations are interesting but the explanations between the equations are often unclear. One has to understand each equation and be quite imaginative to finally identify the contributions of the paper (even for somebody only ""very slightly"" off from the research topic). + +## Summary + +The authors suggest a relationship between a leaky relu and a spiking integrate and firing neuron model. +This relationship suggests a mapping between the two models which is imperfect, a loss seems to be derived to reduce this mismatch along the network training. The method is tested on CIFAR-10 and CIFAR-100, and compared with some other methods for converting ANNs to SNNs. + +## Critical review + +This topic is potentially important since spiking neural network are gaining popularity. But this paper is clearly badly written and it is extremely hard to understand, both in the math and in the text. I don't think it would help the progress of the field to publish the article in the current form. + +I tried to read that carefully and got lost after equation (4), the transition to equation (5) and (6) are not clear at all. I do not understand what is an approximation, what is a definition and what is a derivation. + +Also (5) seems wrong in itself, the authors are trying to approximate a rectified linear network but it suggest that the activity will be equivalent to a linear network (at least when v(T) is small) ? And magically this changes in (6), and a clip non-linearity is introduced ? + +The Figure 1 seems very encouraging at first, because it suggests that there is a clear and easy mapping between accumulating the spikes and computing a relu. I did not understand where this is appearing in the math and I cannot check whether the intuition conveyed by the figure is correct or not. + +I was therefore hoping to see an empirical study of the difference between the SNN and the ANN: do the activity of the spiking neural network match the activity of artificial network? This is not shown. I do not even understand if it is necessary to re-train the network to go back from the SNN to ANN or vice versa. 
+ +Since I had not understood the basics of the paper, it was impossible for me to understand the later section about the conversion error. My only take is that it seems wrong at first sight: how minimizing the error in the loss would minimize the mismatch between the network activity? +",5,3.0,ICLR2021 +dYdf4kqyFfQ,2,4T489T4yav,4T489T4yav,Official Blind Review #2,"[summary] +The paper proposes a relaxed way to solve the segmentation of sequence that can directly leverage the deep learning architectures. The relaxed model allows each segmentation parameter to be a linear interpolations between two consecutive parameters depends on a continuous warping function. The paper then proposes to use mixture of TSP distribution to simulate a step-like warping function and perform a thorough empirical comparison results with different methods and different warping functions. It turns out that TSP-based methods consistently achieve good performance. + +[novelty] +To be honest I am not familiar with this area. It seems that this is a very novel method and may have impact across the area. + +[significance] +The algorithm is very easy to understand and simple to implement, which may be very easy to reproduce by the community. The segmentation problem itself, especially the COVID19 case, show the importance of this area and may attract further attention from even outside the community. + +[clarity] +I enjoy reading the paper and find it very easy to follow. The experimental results are clear and detailed. + +[some further questions] +I'm curious for the relaxed models in equation (4). I didn't see why (4) should be very general form of relaxation. Why is it important to only interpolate consecutive two parameters? Is is possible to rewrite (4) into a weighted average $\sum_k w_{k,t} \theta_k$ and we hope so that $w_{k,t}$ depend on a continuous function, similar to $\zeta_t$, and index $k$?",7,3.0,ICLR2021 +ZLwRbypCUzT,2,trYkgJMOXhy,trYkgJMOXhy,"Interesting direction, but more needs to be done","The paper proposes a teacher-student framework to ensure fairness by letting the teacher choose examples for the student from either the training data or from a counterfactual distribution. The main contributions are a counterfactual generative model and an algorithm for learning the teacher policy. + +Strengths: +======== +1. The idea seems interesting and the proposed teacher-student framework is novel in the area of fair learning. + +2. The authors have done a good job of modeling various aspects of their complex approach using neural networks. + +3. The fact that authors were able to make the complex optimization work is itself a good thing since the objective has a lot of moving parts. + +4. The presented evaluations also show some promise. + +Things to consider improving: +========================= +1. My basic question is regarding the motivation for such a framework. Why is this approach important? Is there any fundamentally new insight that can be obtained with a GFT framework that cannot be obtained using existing fair learning approaches. + +2. The argument that causal methods cannot benefit commonly used fairness metrics like demographic parity (DP) or equalized odds (EO) should be elaborated. Isn't it possible to create Structural causal models that subsume conditional independences like DP and EO? + +3. There are a number of issues with references: +- There are a number of additional pre-, in-, and post-processing methods that can be cited. 
Looking into some recent fairness papers may give the authors an idea of the works. +- Agarwal et al. 2018 is considered an in-processing approach and not post-processing or pre-processing since it imposes fairness criteria while training a fair model. +- Demographic parity is a really old concept and has been discussed in many papers before Madras et al. 2018. +- The authors should also consider citing and perhaps comparing with https://homes.cs.washington.edu/~suciu/sigmod-2019-fairness.pdf if it makes sense, since it also does ""database repair"" or pre-processing. +- Barring some exceptions, many early works in fairness are not cited. For example https://ieeexplore.ieee.org/document/6175897 is a classic work that discusses a lot of concepts in fairness that we use today (they call it discrimination). I recommend the authors to do a thorough literature survey and include at least the important works. + +5. Many crucial algorithmic details are missing. While the complex optimization is important to cover all the aspects of the framework for data generation, it should also be motivated and explained better. +- In Sec. 3.2.1, how valid is it to assume U to be independent of A? Technically A is a part of the data that is generated so it may be reasonable to surmise that it should depend on the latent. In general the authors should provide a DAG which encodes their assumptions and justify them. +- How would this architecture be modified for DP since the authors discuss both DP and EO in the beginning? +- What is L_att trying to optimize? +- What does it mean to have attribute labels as auxiliary tasks for D (in L_cls)? What are attribute labels? +- It may make sense to summarize in a few sentences what each term in the objective does. If space is less, the authors can move some contents to the appendix. +- In Sec. 3.3, could the authors provide some insights on why such a training works? +- Please also discuss what REINCFORCE is, and provide reference/more details. + +6. Generally in experiments, cross-validated results are needed. It is also crucial to provide sufficient pre-processing and hyper-parameter details to help reproduction of results. + +7. There are also a number of recent post-processing methods that can be compared with besides Hardt et al, 2016 (see http://auai.org/uai2019/proceedings/papers/315.pdf, http://proceedings.mlr.press/v108/wei20a.html) + +8. More details needed on how Agarwal et al., 2018 was run for example. In https://arxiv.org/pdf/1906.00066.pdf, pg. 30 Agarwal seems to be more competitive than what is shown here. Look at adult-gender-LR-EO and adult-gender-GBM-EO, Agarwal et al. 2018 (named ""red"""") at EO=0.04 has higher accuracies than shown here. Similar comments apply to COMPAS. + +9.There are also a few other pre-processing approaches that use GANs. +https://arxiv.org/pdf/1805.11202.pdf +https://arxiv.org/pdf/1805.09910.pdf + +10. The authors must identify the caveats of training a model on CelebA which has a ""western"" and ""celebrity"" bias. Models trained there may not transfer to other general settings. + +11. In page 8, the analysis of teacher model needs more details. How can we say original images dominate if only 7% image is chosen at max at any iteration? Why is the teacher behavior of choosing real samples in the beginning and synthesized samples later justified? + +In summary, the paper has identified an interesting direction but this needs to be taken forward a bit more. 
+ +Post-rebuttal +=========== +Thanks to the authors for their detailed response to my questions. Some of the answers are indeed satisfactory, but some questions remain - such as extensive comparisons to other methods (probably using more datasets), how the method would behave (practically) with a different fairness measure like DP, and more carefully situating the method in the fairness literature. I encourage the authors to keep pursuing this interesting direction.",5,4.0,ICLR2021 +OAsKl71G4Vk,2,B5VvQrI49Pa,B5VvQrI49Pa,"Solid work on modeling nonseparable Hamiltonian dynamics, but needs more clarification ","The paper extends the symplectic family of network architectures towards modeling nonseparable Hamiltonian dynamic systems. More specifically, the paper implements a symplectic integration schema (from Tao (2016)) for solving arbitrary nonseparable (and separable) Hamiltonian systems within a symplectic neural network architecture. The results from several modeling tasks show that the proposed Nonseparable Symplectic NNs (NSSNNs) are more robust and accurate than vanilla HNNs and NeuralODEs when applied to nonseparable Hamiltonian systems. + +Although the idea of modeling nonseparable Hamiltonian systems with Symplectic NNs was already briefly outlined in the SRNN paper (Chen et al 2020), this paper implements it and further analyses various properties of this approach. Overall, the paper is well structured and well written, however there are still some inconsistencies that need to be addressed and clarified. + +Namely, the related work discussion is somewhat handled poorly: For instance, the authors state in only one sentence that NSSNNs are closely related to SympNets (Jin et al 2020), without discussing any further details on how are they related and, more importantly, how they differ. Moreover, from that point on, SympNets are never considered (in the experiments) nor mentioned, even though SympNets are indeed able to model nonseparable Hamiltonian systems. In Table 1, that compares the properties of NSSNNs w.r.t some benchmarks, the authors discus ""TaylorNet"" and ""SSINN"" - these two are never introduced before. I assume the former refers to Thong et al. 2020, but I have no idea about the latter. + +Regarding the choice of \omega, the authors provide some evidence that the choice of \omega plays a role as a regularization, where larger values tend to restrain the system. The analyses given in Appendix B show that with \omega 10 the system already is stable (which also supports the experiments presented in Tao 2016). But then the \omega is set to 2000 in the experiments, which is orders of magnitudes larger than the analyses. How and why was this value chosen? + +Lines 206-207 state that from the results in Fig4, it is ""clear that"" NSSNNs can perform long-term predictions but HNNs and NeuralODEs (in the legend they are listed as ODE-nets, are these the same method?) fail. It is not clear how was this determined, since the results show that NSSNNs are more robust to noise than the other two, NeuralODEs are still able to perform long-term predictions (in a noiseless setting), and HNNs in a both scenarios w/o noise and w/ moderate amount of noise. + +Some typos and minor comments: + L1: Hamiltonian systems are not a ""special"" category of physical systems, but is a formalism for modeling certain physical systems (eg. a pendulum, besides within Hamiltonian mechanics, can be modeled within classical (Newtonian) mechanics and Lagrangian mechanics). + L42: ""e.g. see Tong et al. 
2020"" -> ""Tong et al. 2020"" + L56: ""degree of Freedoms"" -> ""degrees of freedom"" + L206: ""figure 4"" -> ""Figure 4"" + +#Update + +I thank the authors for addressing my questions and revising the manuscript, which clarified many of my concerns regarding this work. ",6,3.0,ICLR2021 +T5Fv2YXrq0P,3,PvVbsAmxdlZ,PvVbsAmxdlZ,Interesting framework with promising experimental results,"########################################################################## + +Summary: + +The paper presents a framework for deep reinforcement learning that is motivated by causal inference and with the central objective of being resilient to observational interferences. The key idea is to use interference labels in the training phase to learn a causal model including a hidden confounding state, and then use this model in the testing to make safer decisions and improve resilience. The authors also propose a new robustness measure, CLEVER-Q, which estimates a noise bound of an RL model below which the model's greedy decision would not be altered. The framework is tested extensively over multiple applications and under different types of observational interferences. The results show a clear advantage of the proposed framework over baseline RL methods in terms of resilience to interference. + +########################################################################## + +Reasons for Score: + +The paper addresses an important problem in AI relating to the robustness of the algorithms and paradigms to noisy interference. The proposed framework appears to be sound and the experimental results show superior performance in comparison to other baseline RL methods. That being said, I am not very familiar with the literature on RL and Deep learning, so my decision is more of an educated guess. However, I do have a concern about the causal inference component of the paper (explained below) and this is reflected by the score. + +########################################################################## + +Cons: + +The causal component of the proposed framework is not well-explained. More specifically, the causal graphical model in Figure 2 is introduced at the beginning of Subsection 3.3 very briefly, and the authors don't explain the intuition behind constructing this graph. +The authors say, ""We use z_t to denote the latent state which can be viewed as a confounder in causal inference"", but there is no explanation for why this makes sense. Is it just an assumption that happens to work? + +Moreover, the following phrase appears to be inaccurate ""knowing the interference labels it or not corresponds to different levels in Pearl’s causal hierarchy...: the intervention level with the interference labels and the association level without the information"". Knowing vs not knowing the interference labels does not correspond to interventional vs associational levels in the causal hierarchy, but simply switching a variable/node in the CGM between observed and latent/unobserved. Such knowledge of a variable in the CGM does not account for an intervention. + +########################################################################## + +Questions during the rebuttal period: + +It would help if the authors can clarify the issue raised in the ""Cons"" above regarding the clarity of the causal component and its central role, as claimed, in the proposed framework. 
+ +########################################################################## + +Typos: + +- p.1, ""the RL agent is asked to learn a binary causation label and *embedded* a latent state into its model"": embedded -> embed. +- p.3, ""design an end-to-end structure ... and *evaluated* by treatment effects on rewards"": The statement does not parse. +-p.3, ""where M is a *fix* number for the history"": fix -> fixed. +-p.3, ""We assume that interference labels i_t *follows* an i.i.d. Bernoulli process"": follows -> follow. + +########################################################################## + +Comments after Discussion: + +I appreciate the effort made by the authors to elaborate on the causal formulation behind CIQ. However, the additional discussions in the paper are still confusing and raise soundness concerns. Some of the issues are discussed below. + +1- Rubin's Causal Model: The authors reference RCM in Subsection 3.1 for the causal formulation yet the rest of the work does not seem to use the potential outcome notation. Instead, Subsection 3.3 uses graphical models and the do-operator which follows the causal framework by Pearl. Then, in Subsection 3.4, the authors go back to reference RCM. It is not clear why this alternation between the two approaches is employed. + +2- If $z_t$ is defined as a function of $x_t$ and $i_t$, shouldn't the CGM reflect that with an arrow from $i_t$ to $z_t$ instead of it being the other way around in Fig.2(a)? Despite the attempt by the authors to elaborate on the causal formulation, I'm unable to map the structural equations such as Eq. (1) and the function of $z_t$ to the given CGM in Fig.2(a). + +3- The discussion in Subsection 3.3 leading to Eq. (3) sounds flawed to me. Quoting the authors, ""the interference model of Eq. (1) can be viewed as the intervention logic with the interference label it being the treatment information"". This statement is elaborating on the formulation of $x_t'$ where $x_t$ is intervened on and replaced by an interfered state when $i_t=1$. Alternatively, $x_t$ is kept intact when $i_t=0$. This intervention on the mechanism of $x_t$ happens whether we obtain $i_t$ and train the DQN with it or not. In this sense, the intervention is not happening under the CIQ framework only, but also when we simply train based on $x_t'$. Accordingly, it is not clear to me how ""the learning problem is elevated to Level II of the causal hierarchy"" due to the presence of the interference labels. To be clear, I'm not questioning the significance of using the interference labels in the training, but rather the causal story and formulation behind CIQ.",4,3.0,ICLR2021 +_ckRsDkn29J,2,71zCSP_HuBN,71zCSP_HuBN,Individually Fair Rankings. Interesting paper that extends the individual fairness approach to learning to rank domain.,"The authors extended individual fairness approach to the domain of learning to rank domain. This paper proposes a method for training individually fair learning to rank models by making use of optimal transport-based regularizer. + +This work appears to be an extension to Yurochkin et. al. 2020 ICLR SenSR paper in which fair metric was learned from data. While that papers focused on training individually fair ML models by casting the problem as training supervised ML models that are robust to sensitive perturbations, this paper extended the idea to individual fair ranking that are invariant to sensitive perturbations of features of ranked items. + +############### +This paper is well written. 
The code and the datasets used for the experimentation have been provided. +I vote to accept this paper. I like the approach presented in this paper for solving this very important issue. + +############### +The importance of the problem of dealing with bias in search results cannot be understated. Example of the issue of fairness in search ranking: how to ensure that minority candidates (or candidates of opposite gender) would be fairly ranked together with other job candidates when having similar qualifications. This paper attempts to find a solution to this and similar type-problems. + + +############### +In this paper both theoretical and empirical results were presented. The authors showed that using their optimal transport-based regularizer leads to certified individual fair LTR models. +The results were demonstrated on several datasets: synthetic and two publicly available datasets. Results showed that authors’ method exhibited ranking stability with respect to sensitive perturbations when compared with a state-of-the-art fair LTR model. +Importantly, the authors showed empirically that individual fairness implied group fairness by not vice versa. + +############### +Recommendations: +On page 4 section 3 authors reference Theorem 2.3 but in the citation there is no Theorem 2.3. In the citation there is a relevant equation 2.3 and a note that states this equation comes from another paper. So I believe that this should probably needs to be cited differently (by citing Blanchet & Murthy original paper - assuming they were the first to prove this theorem) +Additionally, authors keep citing Yurochkin & Sun ArXiv pre-paper even through it has already appeared at ICLR 2020. + +Question: +My only question for the authors is how easy would it be to learn fair metric in a typical application for LTR models? ",7,3.0,ICLR2021 +laQgAlNGwuE,4,w5uur-ZwCXn,w5uur-ZwCXn,Good experiment results yet limited novelty,"This paper proposed a new data augmentation framework for low-resourse (and zero resource) cross-lingual task adaptation by combining several methods (entropy regluarized training, self-training). The authors conducted extensive experiments on three cross-lingual tasks, demonstrating the effectiveness of XLA. In addition, the authors compared different choices in the XLA distilation stage and claimed that the gain from XLA is beyond the model ensemble effect. + +According to the experiment results, the XLA framework has a remarkable gain over previous methods. However, it's not clear which component actually contributed to this gain. An thorough ablation study on different model component will help clarify this concern. + +The writing style is a bit redundant and confusing. For instance, the author use ""model distillation"" to define the label selection procedure, yet, this term is another technique in machine learning literature. + +Lastly from a methodology perspective, I think the algorithm seems like a combination of existing methods and its novelty is incremental to me. I hope the authors can edit the paper more carefully with a clarification on their novelty.",5,3.0,ICLR2021 +T42TXqleET,2,D9I3drBz4UC,D9I3drBz4UC,"In this paper, authors propose a stage-wise multi-expert system for long-tailed recognition problem.","The majority of feature extraction backbone is shared among different agents and the classifiers of experts are trained with both classification loss and proposed distribution-aware diversity loss. 
For the second stage, an expert assignment module is trained to re-weight the expert decisions. The whole paper is generally well-organized. However, there are some technical issues authors should further address: +1) In the introduction, the authors measure the mean accuracy, model bias and model variance by randomly sampling the Cifar100 a few times, and reported different methods in Fig1. The details of how such bias and variance is computed are not given. For example, it is stated the bias also measures the accuracy. +2) For the motivation of distribution-aware diversity loss, if the same input batch is observed by all the experts, why is it 'distribution aware'? Also, why requires the diversity among experts if setting the lower temperature to tail classes is to encourage the divergent prediction, how both diversity and divergent exists simultaneously? +3) Is the input batch randomly sampled from the entire training set? +4) For experiments on ImageNet-LT and iNaturalist, is the RIDE combined with the regular CE loss or other methods? +5) Can authors demonstrate some details regarding the performance of expert assignment module during the test, as this is the core module in the proposed framework? For example, for each split (many, low) how the different expert behaves? + +[Post Rebuttal Comments] Authors have done a good job for addressing my concerns, especially the additional ablation studies regarding the performance of the expert assignment module. I'm updating my score accordingly for recommending acceptance. +",7,4.0,ICLR2021 +9IejdRQSqEs,3,1IBgFQbj7y,1IBgFQbj7y,The current write up is unclear and needs improvement,"Summary: the paper proposes a new loss function, called MCCE to reduce the effect of overfitting to noisy examples. This involves calculating the Maximum Entropy (ME) of the input images as well as the filters (?). Experiments are conducted on standard datasets to validate the claims. + +Review: I find the current state of the paper very confusing and unclear. Specifically, it is unclear what the method is trying to optimize (other than adding some form of regularization term based on entropy). The only technical development of the algorithm is given in Alg. 1 and no justification is provided for the design choices (such as: how is mu = ME(w) used in the algorithm? What does convolutional reconstruction loss amount to? What is the purpose of the interpolation? ...). The general discussion up to Section 2 can be shortened significantly and devoted to the development of the method. Overall, the paper is poorly written on the technical side. + +Additionally, it is not clear why the authors attribute the bias in the predictions to noisy examples. For instance, a poorly trained model or a model which overfits to certain examples can produce biased predictions. A number of recent work also aim to reduce the effect of overfitting to noisy examples. For instance, (Amid et al. 2019a) generalizes the GCE loss (Zhang and Sabuncu 2018) by introducing two temperatures t1 and t2 which recovers GCE when t1 = q and t2 = 1. A more recent work, called the bi-tempered loss (Amid et al. 2019b) extends these methods by introducing a proper (unbiased) generalization of the CE loss and is shown to be extremely effective in reducing the effect of noisy examples. Also, (Yang and Guo 2020) proposes peer-loss (which can be combined with CE, bi-tempered, etc. loss) for handling noise. Please consider referencing/comparing to these SOTA methods. + +Additional references: + +(Amid et al. 
2019a) Amid et al. ""Two-temperature logistic regression based on the Tsallis divergence."" In The 22nd International Conference on Artificial Intelligence and Statistics, 2019. + +(Amid et al. 2019b) Amid et al. ""Robust bi-tempered logistic loss based on Bregman divergences."" In Advances in Neural Information Processing Systems, 2019. + +(Yang and Guo 2020) Yang and Guo. ""Peer Loss Functions: Learning from Noisy Labels without Knowing Noise Rates."" In International Conference on Machine Learning, 2020.",4,5.0,ICLR2021 +rklKqbep3X,3,SJe9rh0cFX,SJe9rh0cFX,Review for On the Universal Approximability and Complexity Bounds of Quantized ReLU Neural Networks ,"This paper studies the expressive power of quantized ReLU networks from a theoretical point of view. This is well-motivated by the recent success of using quantized neural networks as a compression technique. This paper considers both linear quantization and non-linear quantization, both function independent network structures and function dependent network structures. The obtained results show that the number of weights need by a quantized network is no more than polylog factors times that of a unquantized network. This justifies the use of quantized neural networks as a compression technique. + +Overall, this paper is well-written and sheds light on a well-motivated problem, makes important progress in understanding the full power of quantized neural networks as a compression technique. I didn’t check all details of the proof, but the structure of the proof and several key constructions seem correct to me. I would recommend acceptance. + +The presentation can be improved by having a formal definition of linear quantized networks and non-linear quantized networks, function-independent structure and function-dependent structure in Section 3 to make the discussion mathematically rigorous. Also, some of the ideas/constructions seem to follow (Yarotsky, 2017). It seems to be a good idea to have a paragraph in the introduction to have a more detailed comparison with (Yarotsky, 2017), highlighting the difference of the constructions, the difficulties that the authors overcame when deriving the bounds, etc. + +Minor Comment: First paragraph of page 2: extra space after ``to prove the universal approximability’’. +",7,3.0,ICLR2019 +4NfnkJ8xMh5,3,tGZu6DlbreV,tGZu6DlbreV,Review,"In this work, the authors illustrate an approach for learning logical rules starting from knowledge graphs. Learning logic rules is more interesting than simply performing link prediction because rules are human-readable and hence provide explainability. +The approach seems interesting and the tackled problem could interest a wide audience. It does not seem extremely novel, but it seems valid to me. +The paper is well-written and self-contained. Moreover, the experimental results show that the proposed approach has competitive performance comparing to other systems (even compared with systems that do not learn rules but perform only link prediction). + +For all these reasons I think the paper should be accepted for publication.",8,1.0,ICLR2021 +r1KVjuSlf,1,H1VGkIxRZ,H1VGkIxRZ,Review," +-----UPDATE------ + +The authors addressed my concerns satisfactorily. Given this and the other reviews I have bumped up my score from a 5 to a 6. 
+ +---------------------- + + +This paper introduces two modifications that allow neural networks to be better at distinguishing between in- and out- of distribution examples: (i) adding a high temperature to the softmax, and (ii) adding adversarial perturbations to the inputs. This is a novel use of existing methods. + +Some roughly chronological comments follow: + +In the abstract you don't mention that the result given is when CIFAR-10 is mixed with TinyImageNet. + +The paper is quite well written aside from some grammatical issues. In particular, articles are frequently missing from nouns. Some sentences need rewriting (e.g. in 4.1 ""which is as well used by Hendrycks..."", in 5.2 ""performance becomes unchanged""). + + It is perhaps slightly unnecessary to give a name to your approach (ODIN) but in a world where there are hundreds of different kinds of GANs you could be forgiven. + +I'm not convinced that the performance of the network for in-distribution images is unchanged, as this would require you to be able to isolate 100% of the in-distribution images. I'm curious as to what would happen to the overall accuracy if you ignored the results for in-distribution images that appear to be out-of-distribution (e.g. by simply counting them as incorrect classifications). Would there be a correlation between difficult-to-classify images, and those that don't appear to be in distribution? + +When you describe the method it relies on a threshold delta which does not appear to be explicitly mentioned again. + +In terms of experimentation it would be interesting to see the reciprocal of the results between two datasets. For instance, how would a network trained on TinyImageNet cope with out-of-distribution images from CIFAR 10? + +Section 4.5 felt out of place, as to me, the discussion section flowed more naturally from the experimental results. This may just be a matter of taste. + +I did like the observations in 5.1 about class deviation, although then, what would happen if the out-of-distribution dataset had a similar class distribution to the in-distribution one? (This is in part, addressed in the CIFAR80 20 experiments in the appendices). + +This appears to be a borderline paper, as I am concerned that the method isn't sufficiently novel (although it is a novel use of existing methods). + +Pros: +- Baseline performance is exceeded by a large margin +- Novel use of adversarial perturbation and temperature +- Interesting analysis + +Cons: +- Doesn't introduce and novel methods of its own +- Could do with additional experiments (as mentioned above) +- Minor grammatical errors +",6,4.0,ICLR2018 +MBsIQ8ykjII,3,m08OHhXxl-5,m08OHhXxl-5,The empirical results seem complete. I have several concerns about the technical part.,"This paper studies the problem of privacy-preserving calibration under the domain shift. The authors propose ''accuracy temperature scaling'' with privacy guarantees. + +The empirical results seem complete. I still have several concerns about the technical part. + +  +1. The proposed algorithm is not described very clearly in section 3 and section 4. After spending considerable time reading sections 3 and 4, I still feel hard to follow how they address the domain-shift issues. +  +2. The privacy part seems like a plug-and-play of the Laplace mechanism. Hence, the technical novelty might be limited. Note that the privacy computation in section 3 based on a naive composition --- ` each $M_i$ satisfies $\epsilon/k$, the total privacy cost follows $\epsilon$. 
I would suggest the authors use recently advanced composition [1,2] for better privacy and utility tradeoffs. Moreover, the calculation of sensitivity seems to be wrong. As the authors claim in Section C.2.1 in the appendix, the sensitivity is technically infinite. They set $\triangle_f=10$ based on empirical observation, which violates the privacy definition. + +[1] The Composition Theorem for Differential Privacy. +[2] Renyi Differential Privacy.",5,4.0,ICLR2021 +ryla3ZvP3X,2,B1MbDj0ctQ,B1MbDj0ctQ,The proposed approach is not clearly presented.,"This paper proposes a new model for switching linear dynamical systems. The standard model and the proposed model are presented. Together with the inference procedure associated to the new model. This inference procedure is based on variational auto-encoders, which model the transition and measurement posterior distributions, which is exactly the methodological contribution of the manuscript. Experiments on three different tasks are reported, and qualitative and quantitative results (comparing with different state-of-the-art methods) are reported. + +The standard model is very well described, formally and graphically, except for the dynamic model of the switching variable, and its dependence on z_t-1. The proposed model has a clear graphical representation, but its formal counterpart is a bit more difficult to grasp, we need to reach 4.2 (after the inference procedure is discussed) to understand the main difference (the switching variable does not influence the observation model). Still, the dependency of the dynamics of s_t on z_t is not discussed. + +In my opinion, another issue is the discussion of the variational inference procedure, mainly because it is unclear what additional assumptions are made. This is because the procedure does not seem to derive from the a posteriori distribution (at least it is not presented like this). Sometimes we do not know if the authors are assuming further hypothesis or if there are typos in the equations. + +For instance (7) is quite problematic. Indeed, the starting point of (7) is the approximation of the a posteriori distribution q_phi(z_t|z_t-1,x_1:T,u_1:T), that is split into two parts, a transition model and an inverse measurement model. First, this split is neither well motivated nor justified: does it come from smartly using the Bayes and other probability rules? In particular, I do not understand how come, given that q_phi is not conditioned on s_t, the past measurements and control inputs can be discarded. Second, do the authors impose that this a posteriori probability is a Gaussian? Third, the variable s_t seems to be in and out at the authors discretion, which is not correct from a mathematical point of view, and critical since the interesting part of the model is exactly the existence of a switching variable and its relationship with the other latent/observed variables. Finally, if the posterior q_phi is conditioned to s_t (and I am sure it must), then the measurement model also has to be conditioned on s_t, which poses perhaps another inference problem. + +Equation (10) has the same problem, in the sense that we do not understand where does it derive from, why is the chosen split justified and why the convex sum of the two distributions is the appropriate way to merge the information of the inverse measurements and the transition model. 
+ +Another difficulty is found in the generative model, when it is announced that the model uses M base matrices (but there are S possibilities for the switching variable). s_t(i) is not defined and the transition model for the switching variable is not defined. This part is difficult to understand and confusing. At the end, since we do not understand the basic assumptions of the model, it is very hard to grasp the contribution of the paper. In addition, the interpretation of the results is much harder, since we are missing an overall understanding of the proposed approach. + +The numerical and quantitative results demonstrate the ability of the approach to outperform the state-of-the-art (at least for the normal distribution and on the first two tasks). + +Due to the lack of discussion, motivation, justification and details of the proposed approach, I recommend this paper to be rejected and resubmitted when all these concerns will be addressed.",4,4.0,ICLR2019 +rk0LDknlz,3,SJDJNzWAZ,SJDJNzWAZ,Approach is good,"The paper proposes a set of methods for using temporal information in event sequence prediction. Two methods for time-dependent event representation are proposed. Also two methods for using next event duration are introduced. + +The motivation of the paper is interesting and I like the approach. The proposed methods seem valid. Only concern is that the proposed methods do not outperform others much with some level of significance. More advance models may be needed. + +",5,3.0,ICLR2018 +S1lt0AiWnQ,1,SJLhxnRqFQ,SJLhxnRqFQ,A GAN variant for joint discrete-continous latent model,The paper presents a generative model that can be used for unsupervised and semi-supervised data clustering. unlike most of previous method the latent variable is composed of both continuous and discrete variables. Unlike previous methods like ALI the conditional probability p(y|x) of the labels given the object is represented by a neural network and not simply drown from the data. The authors show a clustering error rate on the MNIST data that is better than previously proposed methods. ,6,2.0,ICLR2019 +SkeifEYIhX,1,Syx0Mh05YQ,Syx0Mh05YQ,The motivation for this work needs to be clarified," + +=Major Comments= +The prior work on grid cells and deep learning makes it clear that the goal of the work is to demonstrate that a simple learning system equipped with representation learning will produce spatial representations that are grid-like. Finding grid-like representations is important because these representations occur in the mammalian brain. + +Your paper would be improved by making a similar argument, where you would need to draw much more explicitly on the neuroscience literature. Namely, the validation of your proposed representations for position and velocity are mostly validated by the fact that they yield grid-like representations, not that they are useful for downstream tasks. + +Furthermore, you should better justify why your simple model is better than prior work? What does the simplicity provide? Interpretability? Ease if optimization? Sample complexity for training? + +This is important because otherwise it is unclear why you need to perform representation learning. The tasks you present (path integral and planning) could be easily performed in basic x-y coordinates. You wouldn’t need to introduce a latent v. Furthermore, this would mprove your argument for the importance of the block-diagonal M, since it would be more clear why interpretability matters. 
+ + +Finally, you definitely need to discuss the literature on randomized approximations to RBF kernels (random Fourier features). Given the way you pose the representation learning objective, I expect that these would be optimal. With this, it is clear why grid-like patterns would emerge. + +=Additional Comments= +What can you say about the quality of the path returned by (10)? Is it guaranteed to converge to a path that ends at y? Is it the globally optimal path? + +I don’t agree with your statement that your approach enables simple planning by steepest descent. First of all, are the plans that your method outputs high-quality? Second, if you had solved (10) directly in x-y coordinates, you could have done this easily since it is an optimization problem in just 2 variables. That could be approximately solved by grid search. + +I would remove section 5.4. The latent vector v is a high-dimensional encoding of low-dimensional data, so of-course it is robust to corruptions. The corruptions you consider don’t come from a meaningful noise process, however? I can imagine, for example, that the agent observes corrupted versions of (x,y), but why would v get corrupted? + +",7,4.0,ICLR2019 +Bylnso_aKH,1,Byg5flHFDr,Byg5flHFDr,Official Blind Review #1,"This paper presents a system for predicting evolution of graphs. It makes use of three different known components - (a) Graph Neural Networks (GNN); (b) Recurrent Neural Networks (RNN); (c) Graph Generator. A significant portion of the paper is spent in explaining these known concepts. The contribution of the paper seems to be a system of combining these to achieve graph evolution prediction. As stated, this system is effectively a recurrent auto-encoder of sorts. + +The main objection I have in this paper is that they have only used two real datasets (both of which are from the same domain). There are several only available datasets that have temporally annotated graph evolution. It is not possible to conclude the empirical superiority of a system based on such little evidence. ",3,,ICLR2020 +H1eabJol9B,3,S1g2skStPB,S1g2skStPB,Official Blind Review #2,"In this paper, the authors propose an RL-based structure searching method for causal discovery. The authors reformulate the score-based causal discovery problem into an RL-format, which includes the reward function re-design, hyper-parameter choose, and graph generation. To my knowledge, it’s the first time that the RL algorithm is applied to causal discovery area for structure searching. + +The authors’ contributions are: +(1) re-design the reword function which concludes the traditional score function and the acyclic constraint + +(2) Theoretically prove that the maximizing the reward function is equivalent to maximizing the original score function under some choices of the hyper-parameters. + +(3) Apply the reinforce gradient estimator to search the parameters related to adjacency matrix generation. + +(4) In the experiment, the authors conduct experiment on datasets which includes both linear/non-linear model with Gaussian/Non-gaussian noise. + +(5) The authors public their code for reproducibility. + +Overall, the idea of this paper is novel, and the experiment is comprehensive. I have the following concerns. + +(1) In page 4 Encoder paragraph, the authors mention that the self-attention scheme is capable of finding the causal relationships. Why? In my opinion, the attention scheme only reflects the correlation relationship. 
The authors should give more clarifications to convince me about their beliefs. + +(2) The authors first introduce the h(A) constraint in eqn. (4), and mentioned that only have that constraint would result in a large penalty weight. To solve this, the authors introduce the indicator function constraint. What if we only use the indicator function constraint? In this case, the equivalence is still satisfied, so I am confused about the motivation of imposing the h(A) constraint. + +(3) In the last paragraph of page 5, why the authors adjust the predefined scores to a certain range? + +(4) Whether the acyclic can be guaranteed after minimizing the negative reward function (the eqn.(6))? I.e., After the training process, whether the graph with the best reward can be theoretically guaranteed to be acyclic? + +(5) In section 5.3, the authors mention that the generated graph may contain spurious edges? Whether the edges that in the cyclic are spurious? Whether the last pruning step contains pruning the cyclic path? + + +(6) In the experiment, the authors adopt three metrics. For better comparison, the author should clarify that: the smaller the FDR/SHD is, the better the performance, and the larger the TPR is, the better the performance. + +(7) From the experimental results, the proposed method seems more superiors under the non-linear model case. Why? Could the authors give a few sentences about the guidance of the model selection in the real-world? i.e., when to select the proposed RL-based method? And under which case to choose RL-BIC, and which case to selection RL-BIC2? + +(8) What’s training time, and how many samples are needed in the training process? + + +Minor: +1. In the page 4 decoder section, the notation of enc_i and enc_j is not clarified. + +2. On page 5, the \Delta_1 and \Delta_2 are not explained. + +3. For better reading experience, in table 1,2,3,4, the authors should bold value that has the best performance. + +",8,,ICLR2020 +rylo1DbthX,1,HygjqjR9Km,HygjqjR9Km,Interesting idea but more evidence to show the significance of the work would be appreciated.,"The paper proposes a new discriminator loss for MMDGAN which encourages repulsion between points from the target distribution. The discriminator can then learn finer details of the target distribution unlike previous versions of MMDGAN. The paper also proposes an alternative to the RBF kernel to stabilize training and use spectral normalization to regularize the discriminator. The paper is clear and well written overall and the experiments show that the proposed method leads to improvements. The proposed idea is promising and a better theoretical understanding would make this work more significant. Indeed, it seems that MMD-rep can lead to instabilities during training while this is not the case for MMD-rep as shown in Appendix A. It would be good to better understand under which conditions MMD-rep leads to stable training. Figure 3 suggests that lambda should not be too big, but more theoretical evidence would be appreciated. +Regarding the experiments: +- The proposed repulsive loss seems to improve over the classical attractive loss according to table 1, however, some ablation studies might be needed: how much improvement is attributed to the use of SN alone? The Hinge loss uses 1 output dimension for the critic and still leads to good results, while MMD variants use 16 output dimensions. Have you tried to compare the methods using the same dimension? 
+-The generalized spectral normalization proposed in this work seems to depend on the dimensionality of the input which can be problematic for high dimensional inputs. On the other hand, Myato’s algorithm only depends on the dimensions of the filter. Moreover, I would expect the two spectral norms to be mathematically related [1]. It is unclear what advantages the proposed algorithm for computing SN has. +- Regarding the choice of the kernel, it doesn’t seem that the choice defined in eq 6 and 7 defines a positive semi-definite kernel because of the truncation and the fact that it depends on whether the input comes from the true or the fake distribution. In that case, the mmd loss loses all its interpretation as a distance. Besides, the issue of saturation of the Gaussian kernel was already addressed in a more general case in [2]. Is there any reason to think the proposed kernel has any particular advantage? + +Revision: + +After reading the author's response, I think most of the points were well addressed and that the repulsive loss has interesting properties that should be further investigated. Also, the authors show experimentally the benefit of using PICO ver PIM which is also an interesting finding. +I'm less convinced by the bounded RBF kernel, which seems a little hacky although it works well in practice. I think the saturation issues with RBF kernel is mainly due to discontinuity under the weak topology of the optimized MMD [2] and can be fixed by controlling the Lipschitz constant of the critic. +Overall I feel that this paper has two interesting contributions (Repulsive loss + highlighting the difference between PICO and PIM) and I would recommend acceptance. + + + + + + +[1]: Sedghi, Hanie, Vineet Gupta, and Philip M. Long. “The Singular Values of Convolutional Layers.” CoRR +[2]: M. Arbel, D. J. Sutherland, M. Binkowski, and A. Gretton. On gradient regularizers for MMD GANs. + + + +",7,5.0,ICLR2019 +SkxapRNs3X,1,rkxkHnA5tX,rkxkHnA5tX,"many things unclear, experiments not convincing enough, writing needs improvement.","The paper makes its intent plainly clear, it wants to remove the assumption that demonstrations are optimal. Thus it should show that in a case that some demonstrations are bad, it outperforms other methods which assume they are all good. The method proposed, while interesting, well-conceived and potentially novel, is not convincingly tested to this end. + +The paper should also show that the method can detect the bad demonstrations, and select the good demonstrations. + +The experiments are on toy tasks and not existing tasks in the literature. Why not use an existing dataset/domain and simply noise up the demonstrations? + +Furthermore, many crucial details are omitted, such as the nature of the heuristic function K, and how precisely the weighting $c_i$ is adapted (section 4.4). Is it done by gradient descent? We would have to know what K is, and if it is differentiable to know this. + +Also the writing itself needs a thorough revision. + +I think there may well be promise in the method, but it does not appear ready for publication. +",4,3.0,ICLR2019 +rJejGJP_qH,3,rylXBkrYDS,rylXBkrYDS,Official Blind Review #1,"This paper introduces a transductive learning baseline for few-shot image classification. The proposed approach includes a standard cross-entropy loss on the labeled support samples and a Shannon entropy loss on the unlabeled query samples. 
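For concreteness, my understanding is that the objective being proposed boils down to something like the following sketch (PyTorch-style; the weighting coefficient lam and the function names are my own placeholders, not taken from the paper):

import torch
import torch.nn.functional as F

def transductive_loss(support_logits, support_labels, query_logits, lam=1.0):
    # Standard cross-entropy on the labeled support samples.
    ce = F.cross_entropy(support_logits, support_labels)
    # Shannon entropy of the predictions on the unlabeled query samples;
    # minimizing it pushes the classifier towards confident query predictions.
    q = F.softmax(query_logits, dim=1)
    ent = -(q * torch.log(q + 1e-12)).sum(dim=1).mean()
    return ce + lam * ent

If that is accurate, it would help to state explicitly how the trade-off between the two terms is chosen.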
Despite its simplicity, the experimental results show that it can consistently outperform the state-of-the-art on four public few-shot datasets. In addition, they introduce a large-scale few-shot benchmark with 21K classes of ImageNet21K. Finally, they point out that accuracies from different episodes have high variance and develop another few-shot performance metric based on the hardness of each episode.

Positive comments:
1. The proposed transductive loss that minimizes the entropy of query samples is novel in few-shot learning. Given limited labeled samples, fine-tuning with unlabeled query samples via a proper loss is a good idea for tackling few-shot learning.
2. The evaluation is thorough. A significant number of few-shot methods are compared on 4 existing few-shot benchmarks. An additional large-scale benchmark is also introduced to facilitate few-shot learning research.
3. A novel evaluation metric is proposed to evaluate few-shot learning methods under different difficulty levels. Although I am not fully convinced by the importance of such a metric, it is an interesting supplement to the averaged accuracy because it tells how the methods work on easy and difficult classes.

Negative comments:
1. The following important reference on Shannon-entropy minimization over unlabeled data is missing. In fact, I suggest the authors extend Section 3.2 a bit more because this is the main technical contribution.
Semi-supervised Learning by Entropy Minimization. Grandvalet et al., NIPS 2004.
2. I am not convinced by the necessity of the proposed hardness metric. The main argument of this paper is that accuracies over episodes have high variance. But isn't it expected that different episodes can include samples of different difficulty, leading to high variance of accuracies? I do not think it is realistic to have one algorithm that achieves similar accuracies on both easy and difficult tasks. The authors also fail to evaluate different methods with the proposed metric and show whether this metric changes the ranking of algorithms. Moreover, I find Figure 3 hard to interpret because there is too much information in it, including different colors and a lot of markers and lines. Why is the range of hardness 1-3 for some datasets and 1-5 for other datasets? I believe the writing of Section 4.4 could be further improved.

Overall, I think this paper makes a significant contribution by proposing a novel few-shot baseline that establishes a new state of the art, and I would recommend weak accept.

",6,,ICLR2020
+HJ9LXfvlz,1,rJ33wwxRb,rJ33wwxRb,Interesting result for generalisation guarantees of overparametrised 1-hidden layer network with fixed output layer.,"Paper studies an interesting phenomenon of overparameterised models being able to learn well-generalising solutions. It focuses on a setting with three crucial simplifications:
- data is linearly separable
- model is a 1-hidden-layer feed-forward network with homogeneous activations
- **only input-hidden layer weights** are trained, while the hidden-output layer's weights are fixed to be (v, v, v, ..., v, -v, -v, -v, ..., -v) (in particular -- (1,1,...,1,-1,-1,...,-1))
While the last assumption does not limit the expressiveness of the model in any way, since homogeneous activations have the property f(ax)=af(x) (for positive a), for any unconstrained model in the second layer we can ""propagate"" its weights back into the first layer and obtain a functionally equivalent network.
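To make the reduction explicit (my own notation, not the authors'): for a positively homogeneous activation $g$ and any output weight $v_i > 0$ we have $v_i \, g(w_i^\top x + b_i) = g(v_i w_i^\top x + v_i b_i)$, so each unconstrained second-layer weight can be absorbed into the corresponding hidden unit's parameters, with negative weights handled by assigning the unit to the negatively-signed group; the resulting network with fixed output weights computes exactly the same function.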
However, the learning dynamics of a model of the form z(x) = SUM( g(Wx+b) ) - SUM( g(Vx+c) ) + d and of a ""standard"" neural model z(x) = V g(Wx+b) + c can be completely different.
Consequently, while the results are very interesting, claiming their applicability to deep models is (at this point) far-fetched. In particular, the abstract suggests no simplifications are being made, which does not correspond to the actual result in the paper. The results themselves are interesting, but due to the above restriction it is not clear whether they shed any light on neural nets, or simply describe the behaviour of a very specific, non-standard shallow model.

I am happy to revisit my current rating provided the authors rephrase the paper so that the simplifications being made are clear both in the abstract and in the text, and show (at least empirically) that they do not affect learning in practice. In other words, all the experiments in the paper follow the assumption made; if the authors' claim is that the introduced restriction does not matter and is only there to keep the proofs from becoming too technical, then at least the experimental section should show this. If the claims do not hold empirically without the assumptions made, then the assumptions are not realistic and cannot be used for explaining the behaviour of models we are interested in.

Pros:
- tackling a hard problem of overparametrised models, without introducing the common unrealistic assumption of activation independence
- very nice result of a ""phase change"" dependent on the size of the hidden layer in Section 7

Cons:
- the simplification with a non-trainable second layer is currently not well studied in the paper; and while it does not affect expressive power, it is something that can change learning dynamics completely

# After the update

Authors addressed my concerns by:
- making the simplification assumption clearer in the text
- adding empirical evaluation without the assumption
- weakening the assumptions

I find these modifications satisfactory and the rating has been updated accordingly.
",7,3.0,ICLR2018
+#NAME?,1,rsogjAnYs4z,rsogjAnYs4z,"Interesting theoretical and empirical study of the interplay of batch size on the required number of optimisation steps at different pruning levels","1) Summary
The manuscript studies the effect of batch size at different sparsity levels (achieved by applying connection sensitivity pruning) on the required number of optimisation steps to reach a certain accuracy. The goal is to understand the interplay between those fundamental parameters. The empirical evaluation is performed for different triples of dataset, network architecture and optimisation scheme. The theoretical analysis is based on established bounds on the expected gradient norm.

2) Strengths
+ The paper is well written; the figures and the structure of the text are clear and the experimental setup is concise.
+ The empirical evaluation is very exhaustive and conclusive and allows precise conclusions to be drawn.
+ Code will be released to reproduce the results. Data sets are public domain.
+ The theoretical analysis is enlightening.

3) Concerns
- While the results are very interesting, the authors could have been more explicit about how the results of this work could help the ordinary neural-network user set parameters in practice, e.g. by suggesting a rule of thumb.

4) Remarks/Questions
a) I found the last sentence of the Introduction a little bold.
b) ""Data parallelism"" is a synonym for ""batch size"" and ""sparsity"" is equivalent to ""pruning"".
This could be made more clear in the abstract already.",7,3.0,ICLR2021 +YX4OyPWmnO,1,JHx9ZDCQEA,JHx9ZDCQEA,Review,"Summary +- This paper aims at solving an important yet challenging task, the polymer retrosynthesis problem. +- This paper formulates the condensation polymerization as a constrained optimization with prior structural knowledge. +- This paper proposes a framework to solve the problem and demonstrates its effectiveness in a few-shot benchmark. + +--- + +Strengths +- Although I'm not an expert in chemistry, I already know that retrosynthesis is a very important problem in academia and industry. Thus, polymer retrosynthesis also seems to be practically important. +- This paper well explains how to apply a single-step retrosynthesis model (trained in a reaction benchmark) into the polymer retrosynthesis problem. All sub-steps (except domain adaptation) are reasonable and easy to understand. + +--- + +Concern (or Weakness) \#1: Domain adaptation +- The domain adaptation is emphasized throughout this paper. However, I think this adaptation is hard to provide a meaningful gain in general, (a) fundamentally in the few-shot setting and (b) practically as shown in Figure 4 (right). Details of (a) and (b) are described below. + - (a) In the few-shot setting, the prior empirical distribution $\hat{p}(T)$ is often meaningless. This is because the reaction templates are very diverse, e.g., 40k reactions in USPTO-50k has 10k different templates, so the distribution is often very sparse, especially in the few-shot setting. In this case, optimizing Equation (6) cannot improve $p(u|r)$. In my opinion, the authors' choice, using atom- and bond-counting features, might alleviate this issue, however, I'm not sure what is the (chemical) meaning of this simplified distribution, and it can generalize to other polymer benchmarks. + - (b) As illustrated in Figure 4 (right), the adaptation does not provide any gain in the monomer prediction even though it depends on $p(u|r)$ as described in the caption of Figure 4. It means that the stability constraint will filter out some wrong predictions in $p(u|r)$ from PolyRetro-USPTO. It is then hard to say that the domain adaptation is helpful for polymer retrosynthetic optimization (Definition 4). +- Since tempalte-based models are used and $u$ is determined when $T$ is given, the PolyRetro model with domain adaptation can be written as $p_\text{PolyRetro}(T|r)=\lambda p_\text{USPTO}(T|b-A-A-c) + (1-\lambda)\hat{p}(T)$ where $p_\text{USPTO}(T|b-A-A-c)$ is the pre-trained single-step model. Am I correct? The notations are somewhat confused because all density functions are denoted by $p$. +- Hyperparameter settings: + - How to decide $\lambda$ without the validation set? I think one fold should be used for validation. + - What $h$ is used for experiments? Is $h$ in the $\hat{p}(T)$ the same as used in Definition 4? + +Concern (or Weakness) \#2: Methodology Novelty +- All sub-steps (except domain adaptation) are just filtering processes using a pre-trained single-step model. While the authors well-formulate the polymer retrosynthesis problem and well-explain the sub-steps, they seem somewhat straightforward. I am willing to give credit for this work because this is the first one (to my knowledge), but I am concerned about novelty in terms of methodology. + +Concern (or Weakness) \#3: No qualitative study +- In retrosynthesis, top-k accuracy cannot reflect practical scenarios since multiple solutions can exist. 
Thus many papers often provide their success/failure cases and allow readers to evaluate them. In my opinion, such studies are critical in this field because it can provide additional insight into a chemist even though a prediction is wrong. + +Concern (or Weakness) \#4: Baselines +- What is the support of $p(u|r)$? I'm curious about how RandomProposal works. +- I cannot be sure that RandomProposal and Transformer are good baselines. As I mentioned in the second concern, designing Poly Retro-USPTO is somewhat straightforward given the constraints of the proposed problem. Thus I think PolyRetro-USPTO is a more proper and stronger baseline. +- In the Appendix, there are stronger baselines than RandomProposal and Transformer. Why are they in the Appendix? I think they should be presented in the main paper. +- I want to know I correctly understand how mlp-retro and seq2seq-retro work. As far as I understand, they follow three sub-steps: (a) Given $r$, apply a pre-trained single-step retrosynthesis model (template-based mlp or template-free seq2seq) into $H-r-H$; (b) If the model obtains $H-B-b+c-C-H\rightarrow H-r-H$; (c) Then, the final monomers are $b-B-b$ and $c-C-c$ and the unit polymer is $b-B-C-c$. If I'm right, I wonder why mlp-retro is failed (because it is similar to PolyRetro-USPTO), and seq2seq-retro is not working for additional polymers as described in Appendix (because it just uses a pre-trained model). If I'm wrong, that can be a stronger baseline than RandomProposal/Transformer. +- Is Transformer trained using only one training fold from scratch? + +--- + +Questions +- What is $\mathcal{R}_{m1,m2}$? I think its definition should be provided. +- Are the end-groups ($b$, $c$ in papers' definitions) of polymers often unknown? Or, are they not important? I wonder why only the repeat unit $r$ is given in the polymer retrosynthesis problem. +- Why the monomers should be symmetric? I think $b-B-b' + c'-C-c \rightarrow b-A-c + b'+c'$ can be another available candidate under the stability constraint (Definition 2). + +--- + +Conclusion: While this paper focuses on an important problem, I have several concerns mentioned above, so I think this paper is on the borderline. +",5,3.0,ICLR2021 +S1bAZxcxf,2,H1srNebAZ,H1srNebAZ,Experimental study on how units of CNNs behave as binary classifiers,"This paper presents an experimental study on the behavior of the units of neural networks. In particular, authors aim to show that units behave as binary classifiers during training and testing. + +I found the paper unnecessarily longer than the suggested 8 pages. The focus of the paper is confusing: while the introduction discusses about works on CNN model interpretability, the rest of the paper is focused on showing that each unit behaves consistently as a binary classifier, without analyzing anything in relation to interpretability. I think some formal formulation and specific examples on the relevance of the partial derivative of the loss with respect to the activation of a unit will help to understand better the main idea of the paper. Also, quantitative figures would be useful to get the big picture. For example in Figures 1 and 2 the authors show the behavior of some specific units as examples, but it would be nice to see a graph showing quantitatively the behavior of all the units at each layer. It would be also useful to see a comparison of different CNNs and see how the observation holds more or less depending on the performance of the network. 
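To illustrate the kind of quantitative, layer-wise summary I am asking for, one possibility (a rough sketch under my own assumptions about how the per-unit derivative of the loss is recorded; none of this is from the paper) would be a per-unit sign-consistency score:

import torch

def sign_consistency(grads):
    # grads: tensor of shape (steps, units) holding dLoss/dActivation for every
    # unit of one layer, recorded at a number of training or test steps.
    signs = torch.sign(grads)
    majority = torch.sign(signs.sum(dim=0, keepdim=True))
    # Fraction of steps on which each unit agrees with its majority sign;
    # a unit acting as a fixed binary classifier should score close to 1.
    return (signs == majority).float().mean(dim=0)

A histogram of this score per layer, over all units, would support the claim far better than hand-picked examples.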
+",5,4.0,ICLR2018 +r1lY9byg9H,2,B1eQcCEtDB,B1eQcCEtDB,Official Blind Review #2,"This paper presents a calibration-based approach to measure long range discrepancies between a model distribution and the true distribution in terms of the difference between entropy rate and cross entropy, which is exactly the forward KL divergence. It propose a simple one parameter estimation to improve the model and provides experiments to show the effectiveness. + +This paper in fact provides theoretical justification to the so-called temperature sweep method that is hotly debated in the area of text generation. Several issued should be clearly addressed before the acceptance for publication. + +1. The authors should read the Language GANs falling short paper, and conduct the experiments in the same way in that paper and compare their approach with temperature sweep method. The temperature sweep method is only used at inference stage. The proposed approach, Algorithm 2, is used at training stage. + +2. The paper provides theoretical results in terms of population distribution, the true but known distribution, however in practice, empirical distribution is used instead, for example the cross entropy in section 3 used in training, are these theoretical results still valid for empirical distribution? If yes, please state in the paper, if not please state why?  + +3. On page 6, first line, it is stated ""Since this holds for any bounded function, we can obtain the error amplification of the entropy rateof \hat{Pr} simply by choosing f = − log \hat{Pr}."" The log function is unbounded, so please be careful. Fortunately \hat{Pr}^{\epsilon} is bounded, so Corollary 4.2 is correct. For the proof of Corollary 4.2, I don't know how to get the inequality in (2) and the first claim, so please provides more steps or explanations. + +4. How the entropy rate of each language model in Table 1 is obtained?  + +5. In section 3, capital letter is used for random variable, but to define H, CR, EntRate, KL, small letter is used, which is not consistent. Also some is used as subscript, some is under E + +6. Many unconditional language model papers are not cited, for example, ELMO, BERT, XLNet, Albert et la. and many language GANs paper. On the other hand, many papers for conditional language models are cited,  these papers are not appropriate to cite since the paper targets on unconditional language model. + +7. In the first paragraph of page 1, there is a statement of ""Capturing long-termdependencies has especially been a major focus, with approaches ranging from explicit memorybased neural networks (Grave et al., 2016; Ke et al., 2018) to optimization improvements to stabilizelearning (Le et al., 2015; Trinh et al., 2018)."" + +In the second paragraph of page 1, there is a statement of ""Capturing long-term dependencies has especially been a major focus, with approaches ranging from explicit memory-based neural networks (Grave et al., 2016; Ke et al., 2018) to optimizationimprovements aimed at stabilizing training (Le et al., 2015; Trinh et al., 2018).""  + +This is redundant. + +8. This paper focuses on the forward KL divergence, which is related to the quality of language model, but doesn't anything about diversity of language model, which is related to the reverse KL divergence? Can it be extended to the reverse KL divergence? + +Missing references: + +M. Caccia, L. Caccia, W. Fedus, H. Larochelle, J. Pineau, and L. Charlin. Language GANs falling short. 
In Neural Information Processing Systems Workshop on Critiquing and Correcting Trends in Machine Learning, 2018.

William Fedus, Ian J. Goodfellow, and Andrew M. Dai. MaskGAN: Better text generation via filling in the ______. ICLR, 2018.

Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. Long text generation via adversarial training with leaked information. AAAI, 2018.

Ferenc Huszar. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv preprint arXiv:1511.05101, 2015.

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. AAAI, 2017.

Zhongliang Li, Tian Xia, Xingyu Lou, Kaihe Xu, Shaojun Wang, and Jing Xiao. Adversarial discrete sequence generation without explicit neural networks as discriminators. AISTATS, 2019.

Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, and Ming-Ting Sun. Adversarial ranking for language generation. NIPS, 2017.

Ehsan Montahaei, Danial Alihosseini, and Mahdieh Soleymani Baghshah. Jointly measuring diversity and quality in text generation models. NAACL-HLT, 2019.

A Quality-Diversity Controllable GAN for Text Generation
",3,,ICLR2020
+rJg7BzyAYS,1,rkecJ6VFvr,rkecJ6VFvr,Official Blind Review #1,"This paper is an extension of the Transformer algorithm used to solve sequential problems such as those in NLP and games (e.g., the BoxWorld environment from Zambaldi et al.), stating that the Transformer algorithm is an inductive bias for learning structural representations.
The authors argue that the use of simplicial inductive biases over the usual relational inductive bias (provided by the Transformer algorithm) may be a better way to achieve abstract learning; in fact they argue that the “relational inductive bias” used in the normal Transformer is just a 1-simplicial Transformer, so the more complex topological space provided by the 2-simplicial Transformer should generate better results than the 1-simplicial Transformer.
While technically sound, it’s not obvious whether this implementation is indeed better than the original.

First of all, the significance of the results is heavily limited by the fact that the authors only report results of the proposed algorithm against the “normal Transformer”, using a modified version of BoxWorld. Moreover, the paper states that they tested the new algorithm on the original BoxWorld environment but omitted it, as the proposed algorithm excelled over the old one in such environments. Also, the relational inductive-bias implementation used in this paper does not seem to be the same as the one used by Zambaldi et al.

Second, I’m concerned about the explanation used to justify the bridge-BoxWorld, which is argued to provide a more complex logical structure because its linear-logic formulas use more connectives. In other words, the question arises as to whether the authors thought that a more complex logical structure would imply a more complex environment, thus justifying the use of the bridge-BoxWorld.

Finally, it is worth reiterating that, although the use of simplices and simplicial complexes is a very interesting idea, especially for providing a mathematical explanation of how attention works in the Transformer, it’s not entirely obvious whether the implementation of the 2-simplicial Transformer really presents an improvement over the original Transformer in the learning of abstract representations.
",3,,ICLR2020 +ZUHv4pFDotG,3,TSRTzJnuEBS,TSRTzJnuEBS,"Straightforward paper, good enough but not super exciting or surprising","#### Summary + +For me the paper doesn't have any really major negatives. It is straightforward and makes sense, but it also didn't have anything that jumped out to me as super exciting, surprising or interesting, hence the marginal accept. + +#### Review +The paper studies the task of generative modelling with what is referred to as 'anytime sampling', that is sampling with graceful early stopping, where computation time can be traded off with sample quality. The approach taken is to adapt the VQ-VAE, imposing an ordering on latent dimensions, using a kind of dropout which effectively weights the latent dimensions differently in the objective function. The latent dimensions have a fixed ordering, and those which are early in the sequence are forced to have more explanatory power than later dimensions. Experiments are presented which demonstrate that the method basically works for images and audio, in that sample quality degrades gracefully with computational budget. + +The most serious criticism I have is that I felt that little attention was given to motivating the problem. The paper seemed to assume a priori that anytime sampling is interesting and/or relevant to real-world applications. If that is so then I'm surprised there hasn't been more work addressing it. Could the authors add some more detail on their motivations for working on this, perhaps including specific examples of use-cases? Compression was mentioned very briefly near the end of the paper when discussing related work. This seems like a natural application to me, and indeed what's referred to as 'progressive decoding' in the compression community is a common feature of modern codecs. The authors don't seem to be interested in the compression use case, so why are they working on this, and why should I care? + +I'm by no means an expert on the niche topic of anytime sampling, but there are a few papers which I think are sufficiently closely related that they should be mentioned. Firstly https://arxiv.org/abs/2007.06731 would fit neatly in the last paragraph of Section 6. Denoising diffusion probabilistic models (https://arxiv.org/abs/2006.11239 and https://openreview.net/forum?id=-NEXDKk8gZ) also allow trading off computational budget with sample quality, although obviously the approach is very different and this task/trade-off is not the focus of those papers. + +#### Typos and other nits + - Second paragraph of page 2, autoregressive is mis-spelled in the first line. + - Latent vectors are transposed, why? + - First paragraph of Section 4, last sentence says '...with small change of the original VQ-VAE...', that should either be 'with a small change to the original VQ-VAE' or 'with small changes to the original VQ-VAE'. + - Under 'Remark' in Section 5.2.1, reference is made to 'the PCA distribution'. I don't think this is a well defined concept - PCA is not a probabilistic method and does not correspond to a probability distribution (see e.g. 
https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/1467-9868.00196 for more detail).",6,4.0,ICLR2021 +DkPgsit_NKM,2,X6YPReSv5CX,X6YPReSv5CX,"The paper introduces different step return targets to bring more heterogeneity to the bootstrapped DQN, but the improvement is not significant enough.","This paper utilizes different step return targets for different heads in bootstrapped DQN, such that a more heterogeneous posterior estimation of policy value may be obtained. However, my main concern is the novelty. The step size can be viewed as a tunable hyperparameter, and the posterior computation is mainly credited to the usage of different randomized DQNs. Similarly, other tunable hyperparameters may also be diversified to introduce more heterogeneity such as learning rates, etc. + +Re- simulations, I appreciate the authors' efforts in conducting different experiments to understand their method, but I do have a few comments/thoughts as below: +- what is the performance of all-2-step bootstrapped DQN? It seems more natural to me to include the comparison with this baseline, since the performance of mixed-1-3 may be more close to all-2. +- I'm not sure if Figure 4 does do you a favor, since again we can always tune the hyperparameter of step size, and it does not need to be the same for different environments. This paper's method seems to not peak either of them. +- the reason behind the good performance of MB-DQN explained in section 4.2 is not obvious to me. I'm not sure about the point of comparing an agent trained with its own data versus one trained with other data, since we are just learning the policy value but not the policy itself, and there is no issue like off-policy correction, etc. Actually such techniques of data-sharing have been used a lot if we have multiple agents to collect data. +- in section 4.4, when comparing different setups of MB-DQN, it would be fairer to include more baselines used in part of the MB-DQN such as all-2, all-3. + ",4,4.0,ICLR2021 +BJgg7RijtS,1,BygWRaVYwH,BygWRaVYwH,Official Blind Review #3,"The authors propose the general formulation of recent meta-learning methods and propose a good library to use. + +Pros: +1. The general formulation of recent meta-learning methods is reasonable. +2. The proposed library is easy to use. + +Cons: + +The paper lacks technical novelty. I understand the goal of this paper is to build a library. However, the paper only describes a general formulation for recent meta-learning methods (e.g., MAML) and implement the formulation. It is better to clarify and some key engineering challenges and do the corresponding experiments. + +In addition, in the experiment parts, the authors only compare the results with MAML++. It will be more convincing if the authors can analyze other popular meta-learning methods (e.g.. Prototypical network [1], meta-LSTM [2]). + +Another suggestion is that the authors can give some examples to connect current meta-learning models with the proposed general formulation. For example, the meaning of \phi_i^opt, \phi_i^loss in MAML, Prototype, Reptile, etc. + +It is better to explain the meaning of different colors in Figure 3. + +[1] Snell, Jake, Kevin Swersky, and Richard Zemel. ""Prototypical networks for few-shot learning."" Advances in Neural Information Processing Systems. 2017. +[2] Ravi, Sachin, and Hugo Larochelle. ""Optimization as a model for few-shot learning."" ICLR (2016). + + + +Decision after rebuttal: I have read the authors' responses. 
Like review 2, I also think the ""generalization"" is overclaimed, it only provides a general formulation. Thus, I finally decide to keep my score.",3,,ICLR2020 +SyeHB4cKnm,3,rkxfjjA5Km,rkxfjjA5Km,The presentation of this paper can be improved. The notation is not very clear. ,"This paper proposes a new approach to construct model-X knockoffs based on VAE, which can be used for controlling the false discovery rate. Both numerical simulations and real-data experiments are provided to corroborate the proposed method. + +Although the problem of generating knockoffs based on VAE is novel, the paper presentation is not easy to follow and the notation seems confusing. Moreover, the main idea of this paper seems not entirely novel. The proposed method is based on combining the analysis in ''Robust inference with knockoffs'' by Barber et. al. and the VAE. + +Detailed comments: + +1. The presentation of the main results is a bit short. Section 2, the proposed method, only takes 2 pages. It would be better to present the main results with more details. + +2. The method works under the assumption that there exists a random variable $Z$ such that $X_j$'s are mutually independent conditioning on $Z$. Is this a strong assumption? It seems better to illustrate when this assumption holds and fails. + +3. The notation of this paper seems confusing. For example, the authors did not introduce what $(X_j, X_{-j}, \tilde X_j, \tilde X_{-j} )$ means. Moreover, in Algorithm 1, what is $\hat \theta$ and $\hat f$. + +4. I think there might be a typo in the proof of Theorem 2.1. In the main equation, why $\tilde Z$ and $\tilde X$ did not appear? They should show up somewhere in the probabilities. + +5. In Theorem 2.2, how strong is the assumption that $\sup_{z,x} | log (density ratio)| $ is smaller than $\alpha_n$? Usually, we might only achieve nonparametric rate for estimating the likelihood ratios. But here you also take a supremum, which might sacrifice the rate. The paper suggested that $\alpha_n$ can be o( (n \log p)^{-1/2}). When can we achieve such a rate? + +6. Novelty. Theorem 2.2 seems to be an application of the result in Barber et. al. Compared with that work, this paper seems to use VAE to construct the distribution $ P_{\tilde X| X}$ and its analysis seems hinges on the assumptions in Theorem 2.2 that might be stringent. + +7. In Figure 1 and 2, what is the $x$-axis? + +8. A typo: Page 2, last paragraph. ""In this paper, we relaxes the ...""",3,4.0,ICLR2019 +rymXy2ggz,1,HkwrqtlR-,HkwrqtlR-,"Some interesting things, but not enough","The main take-away messages of this paper seem to be: + +1. GANs don't really match the target distribution. Some previous theory supports this, and some experiments are provided here demonstrating that the failure seems to be largely in under-sampling the tails, and sometimes perhaps in introducing spurious modes. + +2. Even if GANs don't exactly match the target distribution, their outputs might still be useful for some tasks. + +(I wouldn't be surprised if you disagree with what the main takeaways are; I found the flow of the paper somewhat disjointed, and had something of a hard time identifying what the ""point"" was.) + +Mode-dropping being a primary failure mode of GANs is already a fairly accepted hypothesis in the community (see, e.g. Mode Regularized GANs, Che et al ICLR 2017, among others), though some extra empirical evidence is provided here. 
+ +The second point is, in my opinion, simultaneously (i) an important point that more GAN research should take to heart, (ii) relatively obvious, and (iii) barely explored in this paper. The only example in the paper of using a GAN for something other than directly matching the target distribution is PassGAN, and even that is barely explored beyond saying that some of the spurious modes seem like reasonable-ish passwords. + +Thus though this paper has some interesting aspects to it, I do not think its contributions rise to the level required for an ICLR paper. + +Some more specifics: + +Section 2.1 discusses four previous theoretical results about the convergence of GANs to the true density. This overview is mostly reasonable, and the discussion of Arora et al. (2017) and Liu et al. (2017) do at least vaguely support the conclusion in the last section of this paragraph. But this section is glaringly missing an important paper in this area: Arjovsky and Bottou (2017), cited here only in passing in the introduction, who proved that typical GAN architectures *cannot* exactly match the data distribution. Thus the question of metrics for convergence is of central importance, which it seems should be important to the topic of the present paper. (Figure 3 of Danihelka et al. https://arxiv.org/abs/1705.05263 gives a particularly vivid example of how optimizing different metrics can lead to very different results.) Presumably different metrics lead to models that are useful for different final tasks. + +Also, although they do not quite fit into the framing of this section, Nowozin et al.'s local convergence proof and especially the convergence to a Nash equilibrium argument of Heusel et al. (NIPS 2017, https://arxiv.org/abs/1706.08500) should probably be mentioned here. + +The two sample testing section of this paper, discussed in Section 2.2 and then implemented in Section 3.1.1, seems to be essentially a special case of what was previously done by Sutherland et al. (2017), except that it was run on CIFAR-10 as well. However, the bottom half of Table 1 demonstrates that something is seriously wrong with the implementation of your tests: using 1000 bootstrap samples, you should reject H_0 at approximately the nominal rate of 5%, not about 50%! To double-check, I ran a median-heuristic RBF kernel MMD myself on the MNIST test set with N_test = 100, repeating 1000 times, and rejected the null 4.8% of the time. My code is available at https://gist.github.com/anonymous/2993a16fbc28a424a0e79b1c8ff31d24 if you want to use it to help find the difference from what you did. Although Table 1 does indicate that the GAN distribution is more different from the test set than the test set is from itself, the apparent serious flaw in your procedure makes those results questionable. (Also, it seems that your entry labeled ""MMD"" in the table is probably n * MMD_b^2, which is what is computed by the code linked to in footnote 2.) + +The appendix gives a further study of what went wrong with the MNIST GAN model, arguing based on nearest-neighbors that the GAN model is over-representing modes and under-representing the tails. This is fairly interesting; certainly more interesting than the rehash of running MMD tests on GAN outputs, in my opinion. + +Minor: + +In 3.1.1, you say ""ideally the null hypothesis H0 should never be rejected"" – it should be rejected at most an alpha portion of the time. 
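For reference, the sanity check I describe above amounts to roughly the following (a condensed sketch, not the exact script in the gist; the biased quadratic-time MMD estimator with a median-heuristic RBF kernel and a permutation null are all standard choices):

import numpy as np

def _sqdists(A, B):
    # Pairwise squared Euclidean distances between the rows of A and B.
    return (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T

def mmd2_biased(X, Y, sigma):
    # Biased quadratic-time estimate of MMD^2 with an RBF kernel of bandwidth sigma.
    k = lambda A, B: np.exp(-_sqdists(A, B) / (2.0 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

def mmd_test_rejects(X, Y, n_perm=1000, alpha=0.05):
    Z = np.concatenate([X, Y], axis=0)
    # Median heuristic for the bandwidth, computed on the pooled sample.
    d2 = _sqdists(Z, Z)
    sigma = np.median(np.sqrt(d2[d2 > 1e-12]))
    stat = mmd2_biased(X, Y, sigma)
    null = [mmd2_biased(Z[p[:len(X)]], Z[p[len(X):]], sigma)
            for p in (np.random.permutation(len(Z)) for _ in range(n_perm))]
    # Under H0 (both samples drawn from the same distribution) this should
    # reject only about an alpha fraction of the time.
    return float(np.mean(np.array(null) >= stat)) < alpha

Under H0 the rejection rate of this procedure sits near the nominal alpha, which is why the numbers in the bottom half of Table 1 look like an implementation issue.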
+ +In the description of section 3.2, you should clarify whether the train-test split was done such that unique passwords were assigned to a single fold or not: did 123456 appear in both folds? (It is not entirely clear whether it should or not; both schemes have possible advantages for evaluation.)",3,5.0,ICLR2018 +WMJ_hYScKhM,2,2NU7a9AHo-6,2NU7a9AHo-6,new metric for model evaluation but lacking on methodology and potential impact," +The paper argues that AUL is a better metric than AUC under the PU (positive and unlabeled data) learning setup in the sense that it leads to an unbiased estimator in this setting, which is not the case for the commonly used and known metric - AUC. It is also argued that it leads to better performance than those methods which directly optimize an AUC based metric and computationally efficient to evaluate than methods which attempt to estimate the unknown parameters (\alpha, \beta in the paper). The appropriateness of this setting in the PU learning setting is demonstrated on the UCI datasets. + +The main contribution of the paper is to bring in the AUL metric in the context of PU learning, which seems rather new from the ML perspective. However, there are following concerns regarding the methodology and experimentation : + +- The paper repeatedly makes the claim that existing work, including recent ones, which work on estimation of \frac{1-\alpha}{1-\alpha\beta} do not work well on their setup, without discussing why that happens. It is not discussed if there is something wrong in these papers. In my opinion, one needs to give reasonable arguments why these methods do not work, instead of simply saying these did not work. + +- The proof of Theorem 1 is not quite clear, and in particular, how equality defining the t_{x_i} holds. Is it related to the SCAR assumption made in the paper. The proof needs to be more details, and it should be clarified where the SCAR assumption is used. + +- The SCAR assumption is vaguely defined in words. In the context of the paper, it should be formally defined in terms of quantities already metioned such as \alpha and \beta. Also, is the assumption not too strong, and do other papers make this assumption. How to verify that this assumption holds in practice. + +- The paper argues about the computational advantage of their method compared to other methods. However, it is evaluated on small scale UCI datasets, + +- There seem to be incorrect usage of PU/PN learning in various places : (i) When defining \alpha in section 2, the sentence says ""In PU learning, we ..."", should it not be ""In PN learning, we ..."", (ii) In the statement of Theorem 1, ""a PN dataset \mathbb{D} with the proportion of labeled samples in positive samples \beta= ..."" - does it make sense to talk about labeled and unlabeled data when talking about PN dataset. + +- Comparison with a recent related work is missing - Class Prior Estimation with Biased Positives and Unlabeled Examples - AAAI 2020",5,3.0,ICLR2021 +KckNCqCCjBJ,2,SeFiP8YAJy,SeFiP8YAJy,Review for Better Together: Resnet-50 accuracy with 13x fewer parameters and at 3x speed,"This paper introduces a training policy for jointly learning parameters of both supernets (or big teacher models) and subnets (or small student models). The student models have the similar architectures with teacher models, but have different numbers of convolution filters. For jointly training teacher models and student models, an adjoint loss consisted of cross-entropy loss and KL-div is defined. 
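If I read the description correctly, the adjoint objective is roughly of the following form (my own sketch; the temperature T and the weight alpha are placeholders, not values from the paper):

import torch.nn.functional as F

def adjoint_loss(teacher_logits, student_logits, labels, T=4.0, alpha=1.0):
    # Supervised cross-entropy for both the teacher (supernet) and the student (subnet).
    ce = F.cross_entropy(teacher_logits, labels) + F.cross_entropy(student_logits, labels)
    # KL divergence pulling the student's softened predictions towards the teacher's.
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction='batchmean') * (T * T)
    return ce + alpha * kl

Stating the exact form (and whether gradients from the KL term reach the teacher) would make the method easier to reproduce.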
The experiments are conducted on several image benchmarks using ResNet-18 and ResNet-50. + +Strengths: ++: The experimental results show the proposed method can significantly reduce model complexity with slight performance loss. ++: The proposed method seems simple and easy to implement. + + +Weaknesses: +-: Lacking more clarification on significances of this work. +(1) No state-of-the-art model compression method is compared for verifying the effectiveness of the proposed method. For filter pruning, many recently proposed methods [r1, r2, r3, r4] are presented to compress different sizes of CNN models (e.g., VGG, ResNet18, ResNet34, ResNet50, ResNet101, ResNet152) on various image benchmarks (e.g., CIFAR-10, CIFAR-100 and ImageNet). The corresponding discussion and comparison are missing in this work. +[r1] Discrimination-aware Channel Pruning for Deep Neural Networks. NIPS, 2018. +[r2] Gate Decorator: Global Filter Pruning Method for Accelerating Deep Convolutional Neural Networks. NIPS, 2019. +[r3] Channel Pruning via Automatic Structure Search. IJCAI, 2020. +[r4] HRank: Filter Pruning using High-Rank Feature Map. CVPR, 2020. + +(2) The idea of training subnets from supernets shares similar philosophy with one-shot NAS. Particularly, one-shot NAS based on knowledge distillation has been studied [r5], which supernets and subnets are jointly trained. Meanwhile, [r5] also shows the searched small student models can achieve comparable or better performance than big teacher models. Therefore, what are advantages of the proposed method over [r5]? +[r5] Neural Architecture Search by Block-wisely Distilling Architecture Knowledge. CVPR, 2020. + +(3) I wonder that why the results (73.41%) of ResNet-50 on ImageNet in this paper is inferior to those of other works (~75.1%). Besides, why the authors did not report the results of ResNet-18 on ImageNet? + +(4) For fully verifying the effectiveness of the proposed method, the authors would better report the results using more CNN models, rather than those on various benchmarks. In particular, Imagewoof shares similar philosophy with ImageNet. + +-: The writing is too colloquial, and there are too many typos. +(1) $13.7x$ -> $13.7 \times$ +(2) kl -> KL +(3) the the smaller -> the smaller +(4) dropouts -> Dropout + +-: I wonder that the meaning of Table 3. For the common settings, ResNet does not adopt Dropout. Therefore, is it fair or necessary for comparing ResNet with Dropout? + +-: How about the transfer (generalization) ability of the proposed adjoined networks? +",4,4.0,ICLR2021 +SkxwFxnu2Q,1,S1eEdj0cK7,S1eEdj0cK7,Interesting interpretability work cast as bilingual word alignment.,"This paper sets out to build good bilingual word alignments from the information in an NMT system (both Transformer and RNN), where the goal is to match human-generated word-alignments as measured by AER. At least that’s how it starts. They contribute two aligners: one supervised aligner that uses NMT source and target representations as features and is trained on silver data generated by FastAlign, and one interpretability-based aligner that scores the affinity of a source-target word-pair by deleting the source word (replacing its embedding with a 0-vector) and measuring the impact on the probability of the target word. These are both shown to outperform directly extracting alignments from attention matrices by large margins. 
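To check my understanding of the interpretability-based scorer, it appears to compute something like the following (my own paraphrase in code; the model interface and argument names are invented for illustration):

def deletion_score(model, src_tokens, tgt_tokens, i, j):
    # Log-probability of target word j given the full source sentence.
    full = model.target_log_prob(src_tokens, tgt_tokens, position=j)
    # The same quantity after replacing the embedding of source word i with a zero vector.
    ablated = model.target_log_prob(src_tokens, tgt_tokens, position=j,
                                    zero_source_positions=[i])
    # A large drop means target word j depends heavily on source word i.
    return full - ablated

Presumably an alignment is then read off by taking, for each target position, the source position(s) with the largest drop; confirming this in the text would help.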
Despite the supervised aligner getting better AER, the authors proceed to quickly discard it as they dive deep on the interpretability approach, applying it also to target-target word pairs, and drawing somewhat interesting conclusions about two classes of target words: those that depend most of source context and those that depend most on target context. + +Ultimately, this paper’s main contribution is its subtraction-based method for doing model interpretation. Its secondary contributions are the idea of evaluating this interpretation method empirically using human-aligned sentence pairs, and the idea of using the subtraction method on target-target pairs. The conclusion does a good job of emphasizing these contributions, but the abstract and front-matter do not. Much of the rest of the paper feels like a distraction. Overall, I believe the contributions listed above are valuable, novel and worth publishing. I can imagine using this paper’s techniques and ideas in my own research. + +Specific concerns: + +The front-matter mentions ‘multiple attention layers’. It would probably be a good idea to define this term carefully, as there are lots of things that could fit: multiple decoder layers with distinct attentions, multi-headed attention, etc. + +In contrast to what is said in the introduction, GNMT as described in the Wu et al. 2016 paper only calculates attention once, based on the top encoder layer and the bottom decoder layer, so it doesn’t fit any definition of multiple attention layers. + +Equation (1) and the following text use the variable L without defining it. + +‘dominative’ -> ‘dominant’ + +Is there any way to generate a null alignment with Equation 3? That is, a target word that has no aligned source words? If not, that is a major advantage for FastAlign. + +Similarly, what exactly are you evaluating when you evaluate FastAlign? Are you doing the standard tricks from the phrase-based days, and generating source->target and target->source models, and combining their alignments with grow-diag-final? If so, you could apply the same tricks to the NMT system to help even the playing field. Maybe this isn’t that important since the paper didn’t win up being about how to build the best possible word aligner from NMT (which I think is for the best). + +I found Equations (7) and (8) to be confusing and distracting. I understand that you were inspired by Zintgraf’s method, but the subtraction-based method you landed on doesn’t seem to have much to do with the original Zintgraf et al. approach (and your method is much easier to the understand in the context of NMT than theirs). Likewise, I do not understand why you state, “we take the uniform distribution as P(x) regarding equation 8 for simplicity” - equation 9 completely redefines the LHS of equation 8, with no sum over x and no uniform distribution in sight. + +The Data section of 4.1 never describes the NIST 2005 hand-aligned dataset. + +The conclusions drawn at the end of 4.4 based on ‘translation recall’ are too strong. What we see is that the Transformer outperforms Moses by 2.8 onCFS, and by 3.7 on CFT. This hardly seems to support a claim that CFT words are the reason why Transformer yields better translation. + +4.5 paragraph 1: there is no way to sample 12000 datasets without replacement from NIST 2005 and have the samples be the same size as NIST 2005. 
You must mean “with replacement”?",6,4.0,ICLR2019 +QzNeo0wQJ6y,3,yOkSW62hqq2,yOkSW62hqq2,"Interesting and novel work knowledge distillation, with some unanswered concerns","Summary: + +The paper proposes new KD framework, i.e., Explicit Connection Distillation (ECD), which unlike existing methods, designs teacher network that is well aligned with the student architecture and trains both the networks simultaneously using explicit dense feature connections. The proposed method is evaluated on CIFAR-100 and ImageNet datasets. + +Strengths: + +- The proposed method neither requires any explicit pre-trained teacher network, nor any distillation loss. So, the method overcomes the problem of selecting teacher network or alternatives of distillation losses for the task at hand. +- By design, the generated teacher network has features aligned with the student network at every layer. + +Concerns: + +- Though existing works involves complex optimization in terms of losses but the hyperparameters involved in distillation like the weight on distillation loss or the temperature value is not so sensitive like learning rate. Even without careful tuning, decent distillation performance can be achieved with moderate temperature, high weight on distillation loss and low weight on cross entropy loss. So, this is not a major limitation in existing methods. +- In the proposed ECD framework, both the teacher and student networks are trained simultaneously, so number of trainable parameters (teacher parameters + student parameters) would be large. So, the method may not work well in case of limited amount of training samples. +- Selecting an optimal value of kernel number ‘n’ is a concern. +- The gain in performance in Table 1 for WRN-40-2 is marginal. So, it seems the proposed method may not be effective on some architectures like wide ResNets where the network channels are widened. +- In Table 5, marginal improvement using ECD over stage wise feature supervision. +- Table 4 shows shallow layers migrate more information than higher layers and dense connections are preferred on shallow layers only to get optimal performance. But identifying the layer from which high level semantics would be captured is non-trivial. + +Queries for authors: + +- Any restriction or range of values that alpha can take? +- All the experiments are done with n=16, how the performance changes by varying ‘n’? +- Is the performance of the teacher reported in Table 1, obtained through auxiliary teacher involving feature connections with the student network? +- While training using the proposed ECD, how to decide number of epochs for training (based on either teacher or student performance on validation data)? +- Details about ECD* and how learnable ensemble is applied is not mentioned in detail even in Appendix. + +General Remarks: + +While the creation of auxiliary teacher directly from the student network removes its dependencies from pre-trained teacher but dependency on several design choices like the number of dynamic additive convolutions for the first module and appropriate places for adding connection paths in the second module for explicit flow of layer to layer gradients remain.",7,4.0,ICLR2021 +S1gXpfnhFB,1,HJlvCR4KDS,HJlvCR4KDS,Official Blind Review #2," +The paper proposes a novel method for explaining VQA systems. Different from most previous work, the proposed approach generates textual explanations based on three steps. First, it extracts keywords from the question. 
Then an explanation sentence is decoded (based on an RNN image captioner) through the proposed Variable-Constrained Beam Search (VCBS) algorithm to satisfy the keyword constraints. Finally, 3) checking through linguistic inference whether the explanation sentence can be used as a premise to infer the question and answer. + +I would recommend for acceptance. The paper proposes an alternative approach to VQA explanations, together with a few supporting algorithms such as VCBS. It is potentially helpful to future work on textual explanations and explainable AI in general. + +At a high level, it is ambiguous to decide what is a reasonable explanation for many “no” answers. For example, one usually cannot provide stronger justification than “there is indeed no one” or “I don’t see anyone” to the question “Is there anyone in the room” with an answer “no.” The paper frames this explanation generation task as a linguistic inference task and checks entailment between the explanation and the question-answer pair. While it is debatable whether this is optimal, the proposed approach provides valuable insights on what constitutes a good explanation. + +However, the proposed approach also has noticeable weaknesses. + +It relies on external models or tools for natural language inference, and such inference does not take into account the visual context of the image. Also, the explanations generated from the proposed model only justify the answer but are not introspective, and they do not reflect the decision process of the target VQA model.",6,,ICLR2020 +Sr9oDkkTJP8,4,wQRlSUZ5V7B,wQRlSUZ5V7B,The relationships to some existing methods are not sufficiently discussed.,"This paper focuses on latent representations learning when some labels are provided. The authors propose a method called the characteristic capturing VAE (CCVAE). This method learns real-valued auxiliary variables that capture the label information. The proposed method is tested on a medical image and a face image dataset. + +One of the major motivations of the proposed method is that it can be used to conditionally generate images based on desired characteristics. However, this paper proposes a VAE model, which usually generates lower-quality images compared to Generative Adversarial Networks (GAN). It is not clear to me why this paper focuses on a VAE model rather than a GAN model. There have been GAN-based methods [1, 2] also based on an auto-encoding framework. Note that if we remove the adversarial loss from the objective functions of these methods, and add the KL divergence penalty, these methods become VAE. Since these methods also introduce real-value variables, I believe the authors should compare to the VAE-version of these methods. + +In addition, in the VAE literature, [3] proposes a semi-supervised method. Since [3] also involves a real-value latent variable, it looks like the difference between the proposed method and [3] is that the author extends [3] to multiple labels. Is this true? I suggest the authors better clarify the novelty of the proposed method compared to [3]. + +I do not suggest accepting this paper because the relationships to some existing methods are not sufficiently discussed. + +Minor: +In the experiments, the paper reports the quantitative measures for classification and disentanglement. However, no quantitative measures for image quality are reported. I suggest the authors report some measures such as reconstruction error and FID score. + +References +[1] Xiao, Taihong, Jiapeng Hong, and Jinwen Ma. 
""DNA-GAN: Learning disentangled representations from multi-attribute images."" International Conference on Learning Representation Workshop. 2018. + +[2]Xiao, Taihong, Jiapeng Hong, and Jinwen Ma. ""Elegant: Exchanging latent encodings with gan for transferring multiple face attributes."" Proceedings of the European conference on computer vision (ECCV). 2018. + +[3] Li, Yang, et al. ""Disentangled variational auto-encoder for semi-supervised learning."" Information Sciences 482 (2019): 73-85.",6,4.0,ICLR2021 +bUN-Pzesp6,1,GXJPLbB5P-y,GXJPLbB5P-y,new interesting framework to leverage unlabelled output data,"The paper introduces a “predict-and-denoise” model for structured prediction, specifically for tasks where the output has to adhere to some constraints e.g. natural language, code etc. This framework allows leveraging of unlabelled output data to train the denoiser, which consequently allows the base predictor to be of low complexity that can potentially generalize with relatively fewer labelled data. The authors theoretically back their arguments basing their theory on a 2 layer ReLU model. The paper demonstrates the performance of this model on two tasks - font image generation, and pseudocode-to-code translation and shows improvement in performance over previous works. + ++ves : + +- The paper is very well written and easy to follow. The motivation and the contributions are very clear, and the experimental section is also well detailed and organized. +- To the best of my knowledge the framework of predict-and-denoise learned in a composed manner and using this framework to leverage unlabelled output data is a novel contribution of the paper. +- The authors argue that this framework allows reduced complexity of the base predictor, backed theoretically for a 2 layer ReLU network. The authors have provided a detailed proof of their argument in the supplementary material, although I have not completely verified its correctness. + +Concerns : + +- I believe that the experimental section currently lacks fair comparisons, especially in the task of pseudocode-to-code translation. The authors compare their method with other methods for leveraging unlabelled data such as pre-training and back-translation. The authors show that predict-and-denoise framework can be applied on top of these existing approaches, and yields consistent improvement. + + However, when comparing such combinations such as “pre-training/back-translation + composed” against pre-training/back-translation, the resulting performance is not compared with an accordingly scaled base pre-training/back-translation model to have comparable number of parameters. With a very different number of parameters in the models being compared, it's hard to say where the performance benefit is coming from. + +- While the proposed method is complementary to approaches like pre-training and back translation, it will be helpful to also include comparisons such as “composed vs pre-training”, or “composed vs back-translation”. This will give an interesting comparison among these different ways of leveraging unlabelled output data. Again proper care needs to be taken about a comparable number of parameters. + +While I find the framework ""predict-and-denoise"" very interesting, I am not entirely convinced with its empirical performance reported in the current form and I have given my score accordingly. If the authors agree with my concerns, and can try to incorporate these changes during rebuttal, I will consider updating my score. 
+",6,3.0,ICLR2021 +BJDe0lzNl,3,HyFkG45gl,HyFkG45gl,Rich data generation procedure but system specific and not well motivated,"The authors describe a system for solving physics word problems. The system consists of two neural networks: a labeler and a classifier, followed by a numerical integrator. On the dataset that the authors synthesize, the full system attains near full performance. Outside of the pipeline, the authors also provide some network activation visualizations. + +The paper is clear, and the data generation procedure/grammar is rich and interesting. However, overall the system is not well motivated. Why did they consider this particular problem domain, and what challenges did they specifically hope to address? Is it the ability to label sequences using LSTM networks, or the ability to classify what is being asked for in the question? This has already been illustrated, for example, by work on POS tagging and by memory networks for the babi tasks. A couple of standard architectural modifications, i.e. bi-directionality and a content-based attention mechanism, were also not considered.",4,4.0,ICLR2017 +pn_JmBAuKG1,2,0MjC3uMthAb,0MjC3uMthAb,A simple method for well-motivated problem,"Summary: +This paper proposes to use a shot-conditioned model that specializes in pre-trained few-shot learning model to a wide spectrum of shots. The proposed approach is simple but effective, it trains neural networks that conditioned on a support set of a different number of shots K during the episodic fine-tuning stage, with FiLM as the conditioning mechanism. + + +Pros: +1. This paper is clearly written, well-motivated, and thoroughly evaluated. It is an enjoyable reading experience . +2. The proposed model and training algorithm is simple but effective +3. Experiments are well designed, which first verifies that shot-conditioned few-shot learners can achieve relatively good performances on different Ks and then perform the large-scale evaluation. + +Cons: +1. Smoothing the shot distribution section is overly simplified. It is very hard to understand algorithm 1 given the limited details presented in the method section. I would suggest explaining in detail the smooth-shot procedure in the paper. +2. Lack of ablation studies. There are two key designs of the few-shot learner presented in this paper, that does not have detailed ablation study results. The first is whether using the convex combination of FiLM parameters to obtain shot distribution s. The second is whether to use smoothing in shot distribution. How much are these two designs contribute to the model's final performances? +3. In figure 2, it seems that the method trained with 5-shot has very good performances in all three scenarios presented? Would it be possible that such a good K-shot can be always found and therefore we do not necessarily need this K-conditioned few-shot learner? For instance, just always train with 5-shot and evaluate different shots. +4. Instead of looking at the results of some instances of different K, it would be nice to have a comprehensive evaluation over different Ks. For instance, evaluating over all Ks from 1 to 40 and compute the average accuracy of each different model. To reduce the computation overhead, we can also try to evaluate Ks in {1, 10, 20, 30, 40}. +5. Comparison of state-of-the-art methods for meta-dataset experiments. +6. What is UMAP projection? The content of the experiments is not self-contained. How shall we read Figure 3? +7. Figure 4 is also not straightforward to understand. 
What is the value on the y-axis? What does it mean? + + +Minor: +1. Inconsistent formats for average performances in Table 1. It seems that the results of the first two columns is badly formatted?",6,3.0,ICLR2021 +BkX4GPilz,3,ByeqORgAW,ByeqORgAW,A good paper but which should compare with BFGS techniques,"Summary: + +Using a penalty formulation of backpropagation introduced in a paper of Carreira-Perpinan and Wang (2014), the current submission proposes to minimize this formulation using explicit step for the update of the variables corresponding to the backward pass, but implicit steps for the update of the parameters of the network. The implicit steps have the advantage that the choice of step-size is replaced by a choice of a proximity coefficient, which the advantage that while too large step-size can increase the objective, any value of the proximity coefficient yields a proximal mapping guaranteed to decrease the objective. +The implicit are potentially one order of magnitude more costly than an explicit step since they require +to solve a linear system, but can be solved (exactly or partially) using conjugate gradient steps. The experiments demonstrate that the proposed algorithm are competitive with standard backpropagation and potentially faster if code is optimized further. The experiments show also that in on of the considered case the generalization accuracy is better for the proposed method. + +Summary of the review: + +The paper is well written, clear, tackles an interesting problem. +But, given that the method is solving a formulation that leverages second order information, it would seem reasonable to compare with existing techniques that leverage second order information to learn neural networks, namely BFGS, which has been studied for deep learning (see the references to Li and Fukushima (2001) and Ngiam et al (2011) below). + +Review: + +Using an implicit step leads to a descent step in a direction which is different than the gradient step. +Based on the experiment, the step in the implicit direction seems to decrease faster the objective, but the paper does not make an attempt to explain why. The authors must nonetheless have some intuition about this. Is it because the method can be understood as some form of block-coordinate Newton with momentum? It would be nice to have an even informal explanation. + +Since a sequence of similar linear systems have to be solved could a preconditioner be gradually be solved and updated from previous iterations, using for example a BFGS approximation of the Hessian or other similar technique. This could be a way to decrease the number of CG iterations that must done at each step. Or can this replaced by a single BFGS style step? + +The proposed scheme is applicable to the batch setting when most deep network are learned using stochastic gradient type methods. What is the relevance/applicability of the method given this context? + +In fact given that the proposed scheme applies in the batch case, it seems that other contenders that are very natural are applicable, including BFGS variants for the non-convex case ( + +see e.g. Li, D. H., & Fukushima, M. (2001). On the global convergence of the BFGS method for nonconvex unconstrained optimization problems. SIAM Journal on Optimization, 11(4), 1054-1064. + +and + +J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, Q. V. Le, and A. Y. Ng, +“On optimization methods for deep learning,” in Proceedings of the 28th +International Conference on Machine Learning, 2011, pp. 265–272. 
+ +) + +or even a variant of BFGS which makes a block-diagonal approximation to the Hessian with one block per layer. To apply BFGS, one might have to replace the RELU function by a smooth counterpart.. + +How should one choose tau_theta? + +In the experiments the authors compare with classical backpropagation, but they do not compare with +the explicit step of Carreira-Perpinan and Wang? This might be a relevant comparison to add to establish more clearly that it is the implicit step that yields the improvement. + + + + + +Typos or question related to notations, details etc: + +In the description of algorithm 2: the pseudo-code does not specify that the implicit step is done with regularization coefficient tau_theta + +In equation (10) is z_l=z_l^k or z_l^(k+1/2) (I assume the former). + +6th line of 5.1 theta_l is initialised uniformly in an interval -> could you explain why and/or provide a reference motivating this ? + +8th line of 5.1 you mention Nesterov momentum method -> a precise reference and precise equation to lift ambiguities might be helpful. + +In section 5.2 the reference to Table 5.2 should be Table 1. +",6,4.0,ICLR2018 +Bkg0Rq9Vam,3,rklwwo05Ym,rklwwo05Ym,"A special interpretation of Dropout. Unfortunately, not convincing.","Different from an existing variational dropout method which used variational inference to explain Dropout, this paper proposes to interpret Dropout from the MAP perspective. More specifically, the authors utilize the Jensen inequality to develop a lower bound for log-posterior, which is used as training objective for dropout. They then exploit the power mean to develop the conditional power mean model family, which provide additional flexibility for evaluation during validation. +Even though the way how the proposed method is analyzed/generalized is interesting, the proposed method is not convincing, but I am not absolutely sure. Besides the paper is hard to follow, some other concerns are listed below. +(1) “…the original/usual dropout objective” and “the dropout rate” are not defined in the paper, even though they appear many times in the paper. +(2) In the last paragraph of Sec. 2, the authors argue that utilizing their MAP objective “sidestep any questions about whether variational inference makes sense.” However, the presented MAP lower bound has its own problem, since it is derived using the Jensen inequality. For example, as shown in Appendix C, the equality becomes true only when p(w|\Theta) is a delta function. +(3) How to tune the hyperparameters (alpha, lambda) of the extended dropout family in practice? +(4) The current experiments might be weak. Additional experiments on popular image datasets are recommended. + +Minors: +(1) In Eq. (3), is p(w_r|\Theta) of the second formula identical to p(w|\Theta) of the third formula? +(2) In the second row below Eq. (6), E_w p(w|\Theta) p(y|x,w) is a typo. +",5,3.0,ICLR2019 +cYTAWjKYttx,1,oFp8Mx_V5FL,oFp8Mx_V5FL,Misleading results [updated],"Summary +------------- +This paper extends neural compression approaches by fine-tuning the decoder on individual instances and including (an update to) the decoder in the bit-stream for each image/video. The proposed approach is evaluated on the UVG dataset and the authors find a 1db improvement (PSNR) relative to their own baseline. + + +Quality (5/10) +---------- +The proposed approach is sound and it would have been interesting to see the gains which can be achieved by fine-tuning the decoder of common neural compression approaches. 
Unfortunately, the few results provided in the paper are not just failing to answer this question but are misleading. By choosing a weak baseline, the reader is led to believe that fine-tuning a decoder will lead to large gains when realistic models are likely to benefit significantly less. + +The authors motivate their simple baseline by noting that their approach is ""model-agnostic"". However, while the approach is model-agnostic, the results and conclusions are not. And it is mostly the empirical results which will be of interest to the reader. (A reader familiar with compression will be very well aware that a neural decoder _could_ be included in the bit-stream, making the conceptual contributions less interesting.) + +The evaluated model encodes each frame of a 600 frame video sequence _independently_. A more realistic decoder would be conditioned on information in previously encoded frames, changing its behavior. It is reasonable to expect that similar change in behavior is encoded in the model updates. That is, the proposed approach is likely less effective in a more realistic setting. + +If model complexity was a concern, the authors could have evaluated their approach on images instead of videos. The results would have looked less impressive but would have been more useful. Alternatively, they could have chosen a different video compression architecture of low complexity but one which is still practically relevant. E.g., one motivated by computational constraints. + + +Significance (4/10) +---------------- +Neural compression is of interest to many people in the the ICLR community and exploring the fine-tuning of decoders would be a useful contribution to this field. The significance of this contribution is only limited by the lack of a meaningful results. + + +Originality (4/10) +-------------- +Including model information in the bit-stream is an old idea in compression and not limited to neural compression. For example, Netflix is optimizing their classical video codecs at a ""shot"" level. Even JPEG (1992) allows us to fine-tune the Huffman table for an individual image (""optimized JPEG""). + +It is also common for compression challenges to require the model to be included in the bit-stream (e.g., the Hutter prize or the P-frame challenge of CLIC 2020). + +Many papers have been written on the related topic of _model compression_ (e.g., Han et al., 2016), which should at least be acknowledged. Compressed model updates are also used in parallelized implementations of SGD (e.g., Alistarh et al., 2017). + + +Clarity (8/10) +--------- +The paper is well written and clear.",6,4.0,ICLR2021 +eSqDDe41bJx,1,qHXkE-8c1sQ,qHXkE-8c1sQ,Defining information distance by relaxing Kolmogorov complexity ,"Summary: + +The authors provide a practical distance measure among different neural networks. They extend the classical information distance by replacing the uncomputable Kolmogorov complexity in terms of code length of prequential coding. Empirically, they show several practical advantages of the proposed distances. + +Pros: The proposed distance may be easy to estimate. + +Cons: + +1. For the proposed information distance definition and invariance property, are there any concrete examples or analytical formulas to demonstrate its effectiveness? E.g., if a simple linear function gives the neural network, what is the concrete value of this distance. + +2. The distance is data-dependent. 
Suppose the data is given by a particularly known distribution, even Gaussian, +is there a mathematical way to demonstrate the defined quantities? + +Some sentences need to be revised: + +`""` Information distance dp is based on information distance defined with Komolgorov complexity K."" + +Some questions: + +1. In literature, there are already many works in studying the distance and geometry associated with the neural networks, such as Fisher information geometry (S. Amari) and Wasserstein information geometry (W. Li). What is the relation between this new distance based on prequential coding's codelength and two distances mentioned above? Especially the role that KL-divergence plays in this framework? + +Li, Zhao, Wasserstein information matrix. + +Amari, Matsuda, Wasserstein statistics in one-dimensional location-scale model. + + +2. Fisher information metric is known to be invariant under parameterization and is even characterized as the unique metric on probability space to have this property. In your paper, you mention that this new practical distance is also invariant under parameterization. Does this have something to do with the classical Fisher metric? + +3. To obtain the data-dependency definition of information distance, do you have to calculate an empirical version of it? If this is the case, what is the convergence properties, such as the convergence rate of this empirical information distance?",5,4.0,ICLR2021 +SkgOb3BRKH,2,rJlqoTEtDB,rJlqoTEtDB,Official Blind Review #2,"This paper proposes PowerSGD for improving SGD to train deep neural networks. The main idea is to raise the stochastic gradient to a certain power. Convergence analysis and experimental results on CIFAR-10/CIFAR-100/Imagenet and classical CNN architectures are given. + +Overall, this is a clearly-written paper with comprehensive experiments. My major concern is whether the results are significant enough to deserve acceptance. The proposed method PowerSGD is an extension of the method in Yuan et al. (extended to handle stochastic gradient and momentum). I am not sure how novel the convergence analysis for PowerSGD is, and it would be nice if the authors could discuss technical challenges they overcome in the introduction. ",3,,ICLR2020 +BJEN13JNR8K,3,6puUoArESGp,6puUoArESGp,Concerns about the model,"The focus of the work is on model interpretability using concept-based explanation. The authors consider the issue of concepts being correlated with confounding information in the features. They propose a causal graph for representing the system and use instrumental variable methods to remove the impact of unobserved confounders. The proposed method is evaluated on synthetic and real data. + +I have the following comments: + +(1) The authors provide a graphical modelings for the setup of the problem. They assume that: 1. There is an ""unconfounded concept"" generated only from the label, 2. There is no confounding effect on the label. + +Unfortunately, no arguments is provided for justifying the conditional independence assumptions in the model. Any edge that is removed from a graphical model implies conditional independency assumptions and they should be carefully justified, specially since the topic of the work is interpretability. This includes the conditional independencies above as well as the way it is assumed that variable c is generated. 
+ +(2) The model in this work is different from an IV model: One of the main requirements of the IV framework is exclusion restriction which requires that the effect of the instrument variable on the outcome should be only through the treatment variable. In the proposed model, variable y is also directly connected to variable x. Also, d is not confounded or observed, which again make the model different from IV model. Therefore, the model in this work does not represent the IV model, and y is not a valid IV, although it seems that that was actually not used in the approach. Only an independence assumption is actually used in the approach, which as mentioned above, is not justified. + +(3) It is not quite clear what exactly the variable d represents compared to variable c, and what is its interpretation. + +(4) In the synthetic simulations, the value of the variance chosen, specially for noises is very small. It is important to see the performance for larger values for the variance of the noises.",5,4.0,ICLR2021 +GYliPA4drwL,4,ipUPfYxWZvM,ipUPfYxWZvM,Does this increase the training flops by a factor of n_reorder?,"I think the general idea behind this paper is very exciting: architectures mutating in an instance-based way. + +I was hoping to see that the authors had achieved this whilst adding only a small overhead of parameters and training FLOPs such that we could be sure any gains aren't due to augmenting these two quantities (i.e. instance-based re-ordering is the key ingredient) and to be sure this could be applied in the large-model setting and theoretically improve production translation systems. + +However the proposed architecture is essentially training an ensemble of three, or n_reorder, weight-shared models and then at inference time, using a hard-threshold. The paper really tries to brush over this fact, stating there is negligible additional cost in the abstract and in several other parts of the text. There is only this one line to acknowledge the fact that we are training 3x (or even 4x) models: ""One may concern the training cost is increased through our training. As we present in Appendix B.1, the cost is actually acceptable with a fast convergence."" and then this section in the appendix continues to mostly discuss inference cost, but finally admits the n_reorder flops increase stating that it is acceptable because the model converges faster. It doesn't seem likely the model will converge 3x faster in general, especially if it learns to mostly select only one re-ordering after some training. + +My general sense is the proposed architecture sits in a very tenuous space. If the researcher has 3-4x space and flops to train a model 3-4x larger, then they should clearly do so. This will get much better performance. If the researcher has 3-4x space to train a larger model but wants faster inference, perhaps this approach could be useful but I would suspect training large and then pruning or distilling to be much more effective. If the architecture trained with hard re-ordering decisions and thus used the same number of flops, but still outperformed the baseline it would be a clear win. + +Suggestions: +* Compare training flops and eval performance to the approach of training a 3x larger model that is then pruned or distilled. +* Be up-front about the training wall-clock time in the main text, instead of stating everything has negligible cost and only reporting inference speed. +* Consider using RL or another hard-decision approach. 
+* Consider comparing to a model that has the same number of parameters but 3x the compute (e.g. a transformer with 3x depth but shared weights for each of the three layers).",5,4.0,ICLR2021 +rkHQ7cZNl,2,HyQJ-mclg,HyQJ-mclg,Reasonable idea,"The idea of this paper is reasonable - gradually go from original weights to compressed weights by compressing a part of them and fine-tuning the rest. Everything seems fine, results look good, and my questions have been addressed. + +To improve the paper: + +1) It would be good to incorporate some of the answers into the paper, mainly the results with pruning + this method as that can be compared fairly to Han et al. and outperforms it. + +2) It would be good to better explain the encoding method (my question 4) as it is not that clear from the paper (e.g. made me make a mistake in question 5 for the computation of n2). The ""5 bits"" is misleading as in fact what is used is variable length encoding (which is on average close to 5 bits) where: +- 0 is represented with 1 bit, e.g. 0 +- other values are represented with 5 bits, where the first bit is needed to distinguish from 0, and the remaining 4 bits represent the 16 different values for the powers of 2. +",7,4.0,ICLR2017 +S4dxT47QqT,4,FUtMxDTJ_h,FUtMxDTJ_h,an application of NN to learning symmetries in physics,"This paper presents the results of a NN trained to learn symmetries in physics, specifically, to learn and preserve quantities that are preserved (e.g., energy, angular momentum). The input is a sequence generated from a Hamiltonian dynamics. Results of experiments on 2 and 3 body problems and a harmonic oscillator are presented. The training networks are small, shallow feedforward networks. There is some customization of the training networks to incorporate ""cyclic"" coordinates. Results indicated empirical conservation up to small error of physically conserved quantities. The paper is fairly easy to read, with much relevant background provided. + +In the early days of NNs, this might have been a very interesting paper. With today's advances, and NN finding success in almost every area with data, it is not clear what the contribution of this paper is. Perhaps the main innovation is the design of the networks. Unfortunately, there is little explanation provided of the experimental results. + +-- Why is this result interesting? Given that the output is a simple function of the input, why is this result surprising in the least? + +-- Given that the model for data is explicit and the training model is simple, can you say anthing rigorous to explain the results obtained empirically? + +-- the outputs are close but do not perfectly periodic. What parameter/model changes could possibly explain this? Would a different representation do better? +",4,4.0,ICLR2021 +S1guRoZwhQ,1,HkgTkhRcKQ,HkgTkhRcKQ,Analyses and fixes one problem of ADAM that could be specific or general,"This manuscript contributes a new online gradient descent algorithm with adaptation to local curvature, in the style of the Adam optimizer, ie with a diagonal reweighting of the gradient that serves as an adaptive step size. First the authors identify a limitation of Adam: the adaptive step size decreases with the gradient magnitude. The paper is well written. + +The strengths of the paper are a interesting theoretical analysis of convergence difficulties in ADAM, a proposal for an improvement, and nice empirical results that shows good benefits. 
In my eyes, the limitations of the paper are that the example studied is a bit contrived and as a results, I am not sure how general the improvements. + +# Specific comments and suggestions + +Under the ambitious term ""theorem"", the results of theorem 2 and 3 limited to the example of failure given in eq 6. I would have been more humble, and called such analyses ""lemma"". Similarly, theorem 4 is an extension of this example to stochastic online settings. More generally, I am worried that the theoretical results and the intuitions backing the improvements are built only on one pathological example. Are there arguments to claim that this example is a prototype for a more general behavior? + + +Ali Rahimi presented a very simple example of poor perform of the Adam optimizer in his test-of-time award speech at NIPS this year (https://www.youtube.com/watch?v=Qi1Yry33TQE): a very ill-conditioned factorized linear model (product of two matrices that correspond to two different layers) with a square loss. It seems like an excellent test for any optimizer that tries to be robust to ill-conditioning (as with Adam), though I suspect that the problem solved here is a different one than the problem raised by Rahimi's example. + + +With regards to the solution proposed, temporal decorrelation, I wonder how it interacts with mini-batch side. With only a light understanding of the problem, it seems to me that large mini-batches will decrease the variance of the gradient estimates and hence increase the correlation of successive samples, breaking the assumptions of the method. + + +Using a shared scalar across the multiple dimensions implies that the direction of the step is now the same as that of the gradient. This is a strong departure compared to ADAM. It would be interesting to illustrate the two behaviors to optimize an ill-conditioned quadratic function, for which the gradient direction is not a very good choice. + + +The performance gain compared to ADAM seems consistent. It would have been interesting to see Nadam in the comparisons. + + + +I would like to congratulate the authors for sharing code. + +There is a typo on the y label of figure 4 right. +",9,4.0,ICLR2019 +BJgKdk3vh7,1,By41BjA9YQ,By41BjA9YQ,Some concerns on experiments and written style,"The paper proposes to use a simple tri-diagonal matrix to reduce the variance of stochastic gradient and provide a better generalization property. Such a variant is shown to be equivalent to applying GD on smoothed objective function. Theoretical results show a convergence rate and variance reduction. Various experiments are done in different settings. I have following comments: + +1) In section 2, it is stated that ""This viscosity solution u(w, t) makes f(w) more convex by bringing down the local maxima while retaining the wide minima."" Besides illustrating such a point on some nicely constructed function f, is there any theory or analysis supporting this statement? Or is there any intuition behind it? In the abstract and Section 1, how to define a function is ""more convex""? This is one of the fountains of the paper, it worths to spend one or two paragraphs to explain it, or at least introduce some references here. The current statement is not formed in a rigorous way. + +2) The main advantages of proposed method that the paper claims are, reduce the variance and improve the generalization accuracy. However, there are few comparisons with other existed methods, besides numerical section. 
Such comparisons or analysis could help readers understand the difference and novelty. + +3) The proof seems fine. Propositions 1-4 try to analyze the convergence rate, which are common techniques in other variance reduction papers on SGD. Propositions 5-9 rely on some nice properties of matrix A_\sigma and show it can help to reduce the variance. Typos: +Page 11, ""Proof of Proposition 1"", there is a missing ""-"" in \nabla_w u(w, t), also in the next equation. +Page 13, ""Proof of Proposition 6, d = A_\sigma g"". + +4) The proposed method strongly relies on the choice of \sigma, but discussion on how to choose the value for \sigma is rare. From Proposition 8, the upper bound on reduced variance is a quadratic function on \sigma, so it is better to discuss more on it or have some experiments on sensitivity analysis. In Section 4, \sigma varies (1.0, 3.0, etc) in different experiments, but again there are no explanations. + +5) Numerical results in Section 4.3 is not strong enough to support the advantage of the proposed method. It is hard to observe ""visibly less noisy"" in both Figure 8 and 9. Better ways of illustration might be considered. + +6) The paper is not nicely written thus cannot be easily read. It seems to be cut and pasted from another version in a short time. Some titles of subsections is missing. The font size is not fixed in the whole paper. + +The above concerns prevents me to give a higher rating at this time. + +Summary +quality: ok +clarity: good +originality: nice +significance: good",5,4.0,ICLR2019 +S1xeMeCk5B,3,HJlU-AVtvS,HJlU-AVtvS,Official Blind Review #1,"This paper examined the spectrum of NNGP and NTK kernels and answer several questions about deep networks using both analytical results and experimental evidence: +* Are randomly initialized and trained deep networks biased to simple functions? +* How does this change with depth, activation function, and initialization? + +All studies are conducted on a space of inputs that is a boolean cube. The input distribution is assumed to be uniform. Though it is argued in Section 3 that the results also generalize to uniform distributions on spheres and isotropic Gaussian distributions. Although this boolean cube setting is followed from previous works on the same topic, it does limit the scope of the paper. Discussions on how this assumption relates to practical problems are missing from the paper. + +Putting aside the limitations of restricting the input distributions on boolean cubes (and other similar choices), I really like the paper, which demonstrates the powerfulness of spectral analysis. I also found that many analytical results (e.g., computing eigenvalues of a kernel operator with respect to uniform distributions on a boolean cube) in the paper are highly nontrivial to derive, which adds to the value of the paper. These results might seem restricted in terms of deep network theory because of the assumptions on input distributions, but I do believe the methods used can be of interest to a wider audience. + +Some questions: +* In Figure 1, the 10^4 boolean function samples are sorted according to frequency (rank). What precisely is the frequency (rank) here? It shouldn't be the frequency that corresponds to the eigendecomposition because each function sample could always have multiple components with different frequencies. +* In Figure 1b, the y-axis is described as normalized eigenvalues, which seems different from degree k fractional variance defined in the next section. 
The degree k fractional variance is the sum of all normalized eigenvalues for degree k eigenfunctions. Is this difference intended or it is a mistake? +* Is the ground truth degree k polynomial used in experiments defined somewhere in the paper? + +On writing and clarity. Overall I find this paper well-written and a pleasure to read. Some minor issues are +* The definition of ""neural kernels"" seems unnecessary and a bit sudden. It would be helpful to include the definition of Phi just after Eq. (2) for CK and NTK. +* For introducing boolean analysis and Fourier series, it might be better to include the formula that explicit shows the expansion f(x) = \sum_{S} f^p(S) X_S(x) before introducing Theorem 3.1. +",6,,ICLR2020 +SJxNpokU27,2,rkgd0iA9FQ,rkgd0iA9FQ,There may exist an error on the proof of Stochastic RMSProp.,"There may exist an error on the proof of Theorem 3.1 in appendix. For the first equation in page 13, the authors want to estimate lower-bound of the term $E<\nabla{f}(xt),V_t^{-0/5}*gt>$. The second inequality $>$ may be wrong. Please check it carefully. (Hints: both the index sets { i | \nabla{f}(xt))_{i}*gt_{i} <0 } and { i | \nabla{f}(xt))_{i}*gt_{i} >0 } depend on the random variable $gt$. Hence, the expectation and summation cannot be exchanged in the second inequality.)",4,5.0,ICLR2019 +rJeuRmuJ9H,2,Hyl9ahVFwH,Hyl9ahVFwH,Official Blind Review #3,"This is a well-written paper which looks into options for learning similarity metrics on data derived from PDE models common in the sciences. In comparison to other metric learning settings, here a type of ground-truth distance information is available (rather than, say, triplets), and it is possible to attempt to directly target an objective function which aims to match the learned distance to the ground-truth distance. The model architecture follows a fairly standard siamese-network setup. + +Quite a bit of space is devoted to ensuring that the learned metric actually satisfies pseudo-metric axioms. This is all very clearly presented, with justifications for different modeling choices and how they preserve the axioms; my only criticism here is that many aspects of this are fairly obvious (i.e. an architecture which shares weights in computing the embeddings of both data points, followed by computing a squared L2 distance, will quite clearly get us in the ballpark of a pseudometric), but in my opinion ""excess"" clarity is much better than the opposite. + +I think the more important contribution of the paper is in sections 4 and 5, which outlines a specific data generation process, including means of injecting noise, and compares options for loss functions. I would have expected pearson correlation to work quite well in this context, and it is interesting to note that performance notably improved by also adding an MSE term. I am curious about the ""distance"" prediction, as described just before equation 5, where it is stated that d \in R^n — is this really R^n? The target distance c is in [0,1]^n, and it seems like a simple modification to the distance prediction network would be capable of ensuring that the predicted values d also fall in this range. Such normalization could reduce the need for the MSE loss term, which presumably helps keep the overall relative scales of the two distances in check. 
+ +The empirical testing is also thorough, and I particularly appreciate the use of the random-weight networks as a baseline — I think it is good to note that these are actually fairly competitive on many of the test data sets (in fact, I believe it should be in bold for ""TID"" in table 1). + +I think the main weakness of this paper is that it falls slightly short of actually presenting the real use cases and needs for a similarity metric on PDE outputs — in my opinion, this comes to play when matching the output of a PDE-based model with real data. It would be nice to see a discussion of how this could be useful for parameter inference in PDE models. If there are other important applications of a distance learned in this way, I think the paper could benefit *greatly* by pointing them out. Otherwise, this risks being perceived as adding little value, since for individual PDE runs with known parameters, there is a ground-truth distance available — in which case, why bother using deep learning to estimate the distance, if the parameters are known? I think relevance to applications should to be clearly addressed. + +The supplemental material is long, but complete and clearly presented.",8,,ICLR2020 +ryg3s8BzTX,3,r1NJqsRctX,r1NJqsRctX,A very interesting idea for combining MCMC and VI.,"This paper proposes a clever and sensible approach to using the structure learned by the auxiliary variational method to accelerate random-walk MCMC. The idea is to learn a low-dimensional latent space that explains much of the variation in the original parameter space, then do random-walk sampling in that space (while also updating a state variable in the original state, which is necessary to ensure correctness). + +I like this idea and think the paper merits acceptance, although there are some important unanswered questions. For example: +- How does the method work on higher-dimensional target distributions? I would think it would be hard for a low-dimensional auxiliary space to have high mutual information with a much higher-dimensional space. In principle neural networks can do all sorts of crazy things, but phenomena like VAEs with low-dimensional latent spaces generating blurry samples make me suspect that auxiliary dimension should be important. +- How does the method work with hierarchical models, heavy-tailed models, etc.? Rings, MoGs, and flat logistic regressions are already pretty easy targets. +- Is it really so valuable to not need gradients? High-quality automatic differentiation systems are widely available, and variational inference on discrete parameters with neural nets remains a pretty hard problem in general. + +Some other comments: + +* It’s probably worth citing Ranganath et al. (2015; “Hierarchical Variational Models”), who combine the auxiliary variational method with modern stochastic VI. Also, I wonder if there are connections to approximate Bayesian computation (ABC). + +* I think you could prove the validity of the procedure in section 2.1 more succinctly by interpreting it as alternating a Gibbs sampling update for “a” with a Metropolis-Hastings update for “x”. If we treat “a” as an auxiliary variable such that +p(a | x) = \tilde q(a | x) +p(x | a) \propto p(x) \tilde q(a | x) +then the equation (2) is the correct M-H acceptance probability for the proposal +\tilde q(a’, x’) = δ(a’-a) \tilde q(x’ | a). +Alternating between this proposal and a Gibbs update for “a” yields the mixture proposal in section 2.1. 
+ +* It’s also possibly worth noting that this procedure will have a strictly lower acceptance rate than the ideal procedure of using the marginal +\tilde q(x’|x) +as a M-H proposal directly. Unfortunately that marginal density usually can’t be computed, which makes this ideal procedure impractical. It might be interesting to try to say something about how large this gap is for the proposed method. + +* ""We choose not to investigate burn-in since AVS is initialized by the variational distribution and therefore has negligible if any burn-in time.” This claim seems unjustified to me. It’s only true insofar as the variational distribution is an excellent approximation to the posterior (in which case why use MCMC at all?). It’s easy to find examples where an MCMC chain initialized with a sample from a variational distribution takes quite a while to burn in.",7,5.0,ICLR2019 +SkJBNo-Ve,1,Bk8N0RLxx,Bk8N0RLxx,Useful tricks for faster decoding and training of NMT,"This paper evaluates several strategies to reduce output vocabulary size in order to speed up NMT decoding and training. It could be quite useful to practitioners, although the main contributions of the paper seem somewhat orthogonal to representation learning and neural networks, and I am not sure ICLR is the ideal venue for this work. + +- Do the reported decoding times take into account the vocabulary reduction step? +- Aside from machine translation, might there be applications to other settings such as language modeling, where large vocabulary is also a scalability challenge? +- The proposed methods are helpful because of the difficulties induced by using a word-level model. But (at least in my opinion) starting from a character or even lower-level abstraction seems to be the obvious solution to the huge vocabulary problem. +",5,3.0,ICLR2017 +zbOhuMKpDye,4,kW_zpEmMLdP,kW_zpEmMLdP,Official review,"This paper presents Neural Event ODEs, a method to extend Neural ODEs for modeling discontinuous dynamics in a continuous-time system. Neural Event ODEs allow to learn termination criteria dependent on the system's state while being fully differentiable. Experiments on time series and temporal point processes validate the benefits of Neural Event ODEs on discountinuous dynamics. + +The paper is well written and relatively easy to follow. +The benefits that Neural Event ODEs provide for modeling discontinuous dynamics already becomes apparent from the formulation of its ODE solver. The simplicity of the approach is another advantage, and I can see many possible applications/use cases that can benefit from such an ODE solver. + +Nevertheless, there were a few questions and remarks I had when reading the paper: + +- In experiment 1, the setup of the Neural ODE is not clear. In particular: + - How many switch variables have you defined? In case of 3, have you tested what happens if you specify more (e.g. 4-6) to see how sensitive the model is to this parameter? + - In the result paragraph, a classifier is mentioned that probably should refer to a classifier over $w$. Is this classifier trained by having a weighted average in $f$, i.e. $\frac{d z(t)}{dt}=\sum_{w=1}^{M} p(w)\left[A^{(w)}z+b^{(w)}\right]$, or how is the exact setup? + +- Table 2 evaluates the bouncing ball collision experiment on the mean squared error of the predicted trajectories. Based on those scores, the Neural Event ODE performs slightly worse than Neural ODE which is not clearly discussed in the section. 
Is the conclusion of Neural Event ODE being able to generalize better in this experiment based on qualitative results? If so, wouldn't have an adversarial metric reflected this result better? Besides the region close to the ground, it is hard to tell which model \textit{generalizes} better. + +- In section 6, the issue of minibatching is mentioned. How does this effect the results of Neural Event ODEs shown in section 4 and 5.1? Have those models been trained with batch size 1 in contrast to the other methods, or using methods like gradient accumulation of multiple single-batches (hence only time of training increased)? + +- Most experimental setups in the paper clearly require the modeling of discountinuous dynamics. Have you tested Neural Event ODEs on tasks where the system to model does not have obvious discontinuities, for instance flow-based generative modeling? Would you see any potential advantages of your method there? + +- Experimental details such as the concrete parameterization of $f$ and hyperparameter values is missing for reproducibility of the results. No supplementary materials have been submitted in which this information could have been outlined. + +Overall, I think that the ideas presented in the paper are a valuable contribution to the research community and therefore, I would recommend the paper for acceptance, especially if the points mentioned above can be clarified. + +Additional comments: Page 4, third line from the bottom has a double punctuation.",7,4.0,ICLR2021 +jLH1lhK0VWz,4,3-a23gHXQmr,3-a23gHXQmr,AnonReviewer1,"* **Summary**: + +This paper propose to improve parametric density estimation for unseen environment using uncertainty-aware neural networks, and proposed a practical, two-step algorithm based on DeepEnsemble. Given training data $(y, x)$ and a known density family $p(y|\theta)$, the goal is to learn $\theta$ in an unseen environment $\{x', y'\}$. + +The basic method proceeds as follows: (step 1) at training time $(y, x)$, train a deep ensemble model $\{ f_1, .., f_M \}$ to learn a mapping $f: x \rightarrow y$, (step 2) at testing time, estimate parameters of interest $\theta$ by MLE w.r.t. $p(y'|\theta)$, where $y'$ comes from ensemble predictions $f_1(x')), .., f_M(x'))$. Based on this basic recipe, authors proposed two augmentations: (a) select only the top performing ensemble members at test time. (b) re-weight the training objective using a function of the predictive variance $w_{nm} \propto \sigma_{nm}^{-\lambda}$, where $\lambda$ is an application specific parameter. + +* **Strength and Weakness** + * (Strength) A novel method for density parameter estimation in physics problems that account for uncertainty. + * (Strength) An interesting application to X-ray polarization, showing clear advantage over existing approaches. + * (Weakness) There exists some theoretical concerns on the soundness of the approach, which should be addressed by adding additional discussion, please see Major Comments. + * (Weakness) Insufficient ablation study for the proposed modifications (ensemble member selection and re-weighting), as a result it is unclear the relative contribution of each components. + +* **Recommendation**: I recommend reject the manuscript in its current form. 
While acknowledging the novelty and significant of the application, I find the method a relatively straightforward combination of existing techniques (deep uncertainty and sample re-weighting), without sufficient in-depth analysis (either theoretical or empirical) on the merit the combination for the intended application. There are also some potential theoretical concerns that needs to be addressed. I'm open to adjust my recommendation, assuming these concerns are sufficiently discussed in the paper and additional ablation is conducted. + +* **Major Comments**: + * Why two-stage approach / use Gaussian likelihood: If I understand correctly, at test time, authors trained a deep model to generate uncertainty-aware predictions $y_{test}$ using Gaussian likelihood, and then conduct parameter estimation by performing MLE over a weighted likelihood constructed using the deep ensemble prediction. I have two concerns over this approach: (1) In the case that distribution of y is not Gaussian (e.g., Equation 4), how to justify learning y using a Gaussian likelihood? Would that lead to issues in uncertainty quantification, since model likelihood is mis-specified? (2) Even if it is admissible to learn y using Gaussian likelihood, are we risking under-estimating uncertainty by using MLE to estimate parameters in the second stage? In comparison, why can't we estimate the model parameters $\theta=(\phi, \Pi)$ jointly with $y$ (e.g., jointly learn $y=f_y(x)$ and $\theta=f_\theta(x)$ using deep ensemble by minimizing the correct likelihood $p(y|\theta)$)? Because by doing so you are learning with respect to the correct likelihood, and the uncertainties can be quantified end-to-end via deep ensemble. It would be good if author can provide discussion clarifying (1), and discuss / compare the method outlined in (2) as an intuitive baseline. + + * Uncertainty under model selection: At test time, author used only the best-performing ensemble members to construct the model likelihood. There might be an concern regarding uncertainty quantification under model selection: since the model selection is not uniform, the predictive uncertainty from the select model is no longer a representative sample of the original deep ensemble. Would this cause issue in terms of uncertainty quantification? It might be good for author to justify this at least empirically by comparing against a baseline with no model selection. + + +* **Minor Comments**: + * Notation: This is very minor: author used $k$ for total number of parameters, and $K$ for testing data points. This can be a bit confusing. It might be good to use consistently use lower case for index, and upper case for total number of parameter / samples. So it might be good to use $N_{train}$, $N_{test}$ to indicate sample size, and $K$ for the total number of parameters to estimate. +",5,3.0,ICLR2021 +H1eLF2upFH,2,B1eBoJStwr,B1eBoJStwr,Official Blind Review #2,"This paper provided first provided analysis for the problem of semantic segmentation. Through a few simple example, the authors suggested that the cluster assumption doesn’t hold for semantic segmentation. The paper also illustrated how to perturb the training examples so that consistency regularization still works for semantic segmentation. +The paper also introduce a perturbation method that can achieve high dimensional perturbation, which achieve solid experimental results. + +The analysis part seems interesting and innovative to me. 
But it is very qualitative and I'm not fully convinced that the analysis on 2d example can actually carry over to high dimensional spaces for images. I also don't quite see the connection between the toy example and the proposed perturbation method. For example, why the proposed perturbation method has the property of ""the probability of a perturbation crossing the true class boundary must be very small compared to the amount of exploration in other dimensions""? + +The proposed algorithm is an extension of the existing cutout and cut mix. The way to generate new mask is a very smart design to me. This should be the most important contribution of the paper. + +The writing of the paper is very clear and easy to follow. The experimental results look very convincing overall and proposed algorithm does show very promising results. + +To sum up, the paper is an ok paper from the practical perspective, but the analysis in the paper wasn't strong enough to me.",3,,ICLR2020 +S1gzKHFunm,1,Hk4fpoA5Km,Hk4fpoA5Km,Interesting paper on the challenges of GAIL,"This paper investigates two issues regarding Adversarial Imitation Learning. They identify a bias in commonly used reward functions and provide a solution to this. Furthermore they suggest to improve sample efficiency by introducing a off-policy algorithm dubbed ""Discriminator-Actor-Critic"". They key point here being that they propose a replay buffer to sample transitions from. + +It is well written and easy to follow. The authors are able to position their work well into the existing literature and pointing the differences out. + +Pros: + * Well written + * Motivation is clear + * Example on biased reward functions + * Experiments are carefully designed and thorough +Cons: + * The analysis of the results in section 5.1 is a bit short + +Questions: + * You provide a pseudo code of you method in the appendix where you give the loss function. I assume this corresponds to Eq. 2. Did you omit the entropy penalty or did you not use that termin during learning? + + * What's the point of plotting the reward of a random policy? It seems your using it as a lower bound making it zero. I think it would benefit the plots if you just mention it instead of plotting the line and having an extra legend + + * In Fig. 4 you show results for DAC, TRPO, and PPO for the HalfCheetah environment in 25M steps. Could you also provide this for the remaining environments? + + * Is it possible to show results of the effect of absorbing states on the Mujoco environments? + +Minor suggestions: +In Eq. (1) it is not clear what is meant by pi_E. From context we can assume that E stands for expert policy. Maybe add that. Figures 1 and 2 are not referenced in the text and their respective caption is very short. Please reference them accordingly and maybe add a bit of information. In section 4.1.1 you reference figure 4.1 but i think your talking about figure 3.",7,3.0,ICLR2019 +HJgpVslatS,2,HklmoRVYvr,HklmoRVYvr,Official Blind Review #3,"The paper proposes a type of recurrent neural network module called Long History Short-Term Memory (LH-STM) for longer-term video generation. This module can be used to replace ConvLSTMs in previously published video prediction models. It expands ConvLSTMs by adding a ""previous history"" term to the ConvLSTM equations that compute the IFO gates and the candidate new state. This history term corresponds to a linear combination of previous hidden states selected through a soft-attention mechanism. 
As such, it is not clear if there are significant differences between LH-STMs and previously proposed LSTMs with attention on previous hidden states. The authors propose recurrent units that include one or two History Selection (soft-attention) steps, called single LH-STM and double LH-STM respectively. The exact formulation of the double LH-STM is not clear from the paper. The authors then propose to use models with LH-STM units for longer term video generation. They claim that LH-STM can better reduce error propagation and better model the complex dynamics of videos. To support the claims, they conduct empirical experiments where they show that the proposed model outperforms previous video prediction models on KTH (up to 80 frames) and the BAIR Push dataset (up to 25 frames). + +Overall I believe there are serious flaws with the paper that prevent acceptance in its current form. + +First, I believe the paper starts from the wrong assumption, namely that current video prediction models are limited by their capacity to limit the propagation of errors and to capture complex dynamics. Instead, it is well known that the main difficulty for longer term video prediction is to manage the increasing uncertainty in future outcomes. Stochastic models such as SVG-LP or SAVP are currently the state-of-the-art in video generation, with deterministic models not being able to generate more than a few non-blurry frames of video. While the authors mention that they do not focus on future uncertainty here, it is not clear how the proposed model helps to generate better longer-term videos when it does not deal with what actually makes long-term video generation difficult. In addition, it's misleading to claim that current models produce high quality generations for ""only one or less than ten frames"", especially without defining high quality. Models such as SVG [1] or SAVP[2] can produce non-blurry videos for 30-100 frames for the BAIR dataset, for example. + +The experiments are missing 1) SVG as a baseline, 2) metrics that correlate with human perception such as LPIPS or FVD [3] and 3) qualitative samples that compare to stochastic models. Deterministic models can achieve very high PSNR/MSE/SSIM scores but produce very bad samples, as these scores are maximized by blurry predictions that conflate all possible future outcomes. This is highly apparent when looking at samples, and metrics that correlate better with human perception are usually better to compare video prediction methods. Comparisons to SAVP are found in Table 1 and 3 but there are no figures comparing samples from this model to the proposed model. The samples from the proposed model on the BAIR Push dataset for example (found in the appendix) are of significant lower quality than those reported from SAVP or SVG, and at the same time they are not longer-term than the predictions from these models. Consequently, the experimental section does not correctly assess how this model can generate better longer-term prediction than current models and it also does not give an accurate assessment of the model with respect to the current state-of-the-art. + +To sum up, the paper does not adequately address how the proposed model allows for longer-term video generation. It is missing critical qualitative comparisons to state-of-the-art models such as SVG and it is unclear how the proposed model is different from a ConvLSTM with attention on previous hidden states. + +[1] Stochastic Video Generation with a Learned Prior. E.Denton and R. Fergus. 
ICML 2018 +[2] Stochastic Adversarial Video Prediction. Lee et al. Arxiv 2018 +[3] Towards Accurate Generative Models of Video: A New Metric & Challenges. Unterthiner et al. Arxiv 2018 + + +--- Post-discussion update --- +The authors have addressed a number of points raised by the reviewers and I'm raising my score to a weak reject from a reject. There are important remaining issues with the experimental section and the conclusions reached from their results, and therefore I still think the paper is below the acceptance bar.",3,,ICLR2020 +BylIovgxcr,1,Syxp-1HtvB,Syxp-1HtvB,Official Blind Review #1,"Updates after author response: +I'd like to thank the authors for their detailed responses. Some of my primary concerns were regarding the presentation, and I feel they have been mostly addressed with the changes to the introduction and abstract (I'd still recommend using 'layerwise latent code' instead of 'layerwise representation' everywhere in the text). The additional qualitative results showing the benefits of manipulating 'z' vs y_l were also helpful. Finally, I agree that given the popularity of StyleGAN like models, the investigation methodology proposed, and the insights presented might be useful to a broad audience. Overall, I am inclined to update my rating to lean towards acceptance. + +--------------------------- +This paper investigates the aspects encoded by the latent variables input to different layers in StyleGAN (Karras et. al.), and demonstrates that these correspond to encoding different aspects of the scene across layers e.g. initial ones correspond to layout, final ones to lighting. + +The ’StyleGAN’ work first-generates a per-layer latent code y_l (from a global latent variable w), and uses these in a generative model. This paper investigates which layer’s latent codes best explain certain variations in scenes. To formalize the notion of how a latent vector is causally related to a scene property, the approach here is to use an off-the-shelf classifier for the property, and a) find a linear decision boundary in the latent space, and b) quantifying whether changing the latent code indeed affects the predicted score. + +Positives: +1. The analysis presented in the work is thorough and results interesting. The paper analyzes the relation of various scene properties w.r.t the latent variables across layers, and does convincingly show that aspects like layout, category, attribute etc, are related to different layers. + +2. The visual results depicting manipulation of specific properties of scenes by changing specific variables in the latent space, and the ones in Sec 3.2 studying transitions across scene types, are also impressive and interesting. + +3. The proposed way of measuring the ‘manipulability’ of an aspect of a scene w.r.t a latent variable is simple and elegant, thought I have some concerns regarding its general applicability (see below). + +Despite these positives, I am not sure about accepting the paper because I feel the investigation methods and the results are both very specific to a particular sort of GAN, and the writing (introduction, abstract, related work etc.) pitch the paper as being more general than it is, and claim the insights to be more applicable. More specifically: + +1) The text claims the approach ‘probes the layer-wise representations’. However, what is actually investigated is the layer-wise latent code (NOT ‘representation’ which is typically defined to mean the responses of filters/outputs of each layer). 
In fact, I do not think this work is directly applicable to probing ‘representations’ as the term is normally used because it may be too high-dimensional to infer meaningful linear decision boundaries, or directly manipulate it. + +2) All the initial text in the paper’s abstract, introduction etc. leads the reader to believe that the findings here are generally applicable e.g. the sentence “the generative representations learned by GAN are specialized to synthesize different hierarchical semantics” should actually be something like “the per-layer latent variables for StyleGAN affect different levels of scene semantics“. Independent of any other concerns, I would be hesitant to accept the paper with the current writing given the very general nature of assertions made despite experiments in far more specific settings. + +3) In Sec 4, this paper only shows some sample results other models e.g. BIGGAN, but no ’semantic hierarchy in deep generative representation’ is shown (not surprising given only a global latent code). As the discussion also alludes to, I do not think this approach would yield any insights if a GAN does not have a multi-layered latent code. + +4) Finally, while the results obtained for StyleGAN do convincingly show the causal relations claimed, these results are essentially backing up the insights that led to the design of StyleGAN i.e. having a single-level latent variable capture all source of variation is sub-optimal. + +5) This is not a really weakness, but perhaps an ablation that may help. The results showing scene property manipulation e.g. in Fig 4 are obtained by varying a certain y_l, and it’d help to also show the results if the initial latent code w was modified directly (therefore affecting all layers!). It would be interesting to know if this adversely affects constancy of some aspects e.g. maybe objects also change in addition to layout. + +Overall, while the results are interesting, they are only in context of a specific GAN, and using an approach that is applicable to generative models having a multi-layer code. I feel the paper should also be written better to be more precise regarding the claims. While the rating here only allows me to give a ‘3’ as a weak reject, I am perhaps a bit more towards borderline (though leaning towards reject) than that indicates.",6,,ICLR2020 +rkxeKKs_h7,1,ryx3_iAcY7,ryx3_iAcY7,The improvement seems not enough,"The paper proposes contextual role representation which is an interesting point. +The writing is clear and the idea is original. +Even with the interesting point, however, the performance improvement seems not enough compared to the baseline. The baseline might be carefully tuned as the authors said, but the proposed representation is supposed to improve the performance on top of the baseline. +The interpretation of the role representation is pros of the proposed model. However, it is somehow arguable, since it is subjective. + +- minor issues: +There are typos in the notations right before Eq. (8). +",4,4.0,ICLR2019 +CPR3Qwe21Xq,4,qYda4oLEc1,qYda4oLEc1,Embedding for multi-task learning problems with disjoint inputs,"**Summary.** Authors present a methodology for performing multi-task learning from data with disjoint and heterogeneous input domains. Particularly, they introduce an embedding of the inputs, in order to project each pair of input-output observations in a common continuous manifold where the exploration is significantly easier. 
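(To paraphrase my reading of the construction, and only as a sketch: each task keeps its own encoder from its disjoint input space into a shared latent space, and a single model is then trained on that space. The module names and sizes below are my own assumptions, and it was not fully clear to me whether the target is also part of the embedded observation, so the sketch only embeds the task input:

```python
import torch
import torch.nn as nn

class DisjointInputModel(nn.Module):
    # one encoder per task maps that task's own input space into a shared latent manifold;
    # a single head then operates on the shared space for all tasks
    def __init__(self, task_input_dims, latent_dim=8, out_dim=1):
        super().__init__()
        self.encoders = nn.ModuleList(nn.Linear(d, latent_dim) for d in task_input_dims)
        self.head = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, out_dim))

    def forward(self, x, task_id):
        z = torch.tanh(self.encoders[task_id](x))  # task-specific embedding into the common manifold
        return self.head(z)
```
)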
Results show that the approach is valid with both synthetic and real-world data and they also demonstrate that the model is flexible when increasing/decreasing the dimensionality of the latent manifold. + +**Strengths.** The explanation of the multi-task learning scenario with disjoint input domains is particularly well-written. This description makes easier to understand the reasons behind the introduction of the embedding between every single input and latent vectors z. Additionally, authors did an effort for explaining point-by-point the structure of the deep NN transformation behind the embedding. This is valuable. I appreciated the design of experiments and (author-blind) video on youtube was impressive. + +**Weaknesses, Questions & Recommendations.** +The main weaknesses (to me) in the paper are: +[W1]. There is likely a lack of references and analysis about similar works on multi-task learning with the particular problem of disjoint inputs. This makes the reader doubt about the potential novelty of the model, in particular about the embedding. +[W2]. The notation based on subsets V_t is a bit confusing, (I think that keeping the (x,y,z) notation all along the paper would be better). Particularly in the pp.3, this notation is a difficult to follow before the introduction of the TOM embedding. +[W3]. The TOM implementation may be better placed before the experiments, being a bit better connected with the main section of the manuscript, but this is just an opinion. +[W4]. More analysis on the dimensionality D of the manifold could be of interest for the reader. In the last experiment, this dimensionality is pretty high. [Q] Why is this? What is the principal consequence? +[W5]. Error metrics in the experiments do not include confidence intervals or variance values from several runs. +[W6]. Typically, one chooses Discussion or Conclusion. The content of the Conclusion is similar to the thing said in the previous section. + +Recommendations: +[Rec1]. Motivating even better the disjoint input problem from the very beginning would make the paper stronger. +[Rec2]. An input-output notation all along the paper and some diagram explaining the projection into a continuous manifold would help as well. +[Rec3]. Details about the implementation could be better placed in the appendix, or at least integrated with the model and the flow of explanations. +[Rec4]. Confidence intervals in the tables of error metrics as well as a bit more of motivation for the circle experiment would improve the presentation of experiments. + +**Reasons for score.** I understood the idea that authors presented and the problem of disjoint input domains. However, I feel that the presentation of the model is a bit weak as well as the experiments could be improved with a few details. The last pp. of the manuscript with the duplicity Discussion+Conclusion is also a bit odd. For this reason, I cannot recommend an acceptance score for this venue. + +**Post-rebuttal comments.** Thanks to the authors for their response. The updated version of the manuscript addressed my main concerns and recommendations. Now, it is clearly improved, figures and metrics updated and the proposed methodology is better presented. Authors even did major changes on the structure of the paper, what I recognize as an important revision. Having said this, I raised my score.",6,3.0,ICLR2021 +NOLh7XxvuHA,1,0F_OC_oROWb,0F_OC_oROWb,"I enjoyed this paper, and believe it is the seed of something very important. 
The main result of a viable alternative to backprop may very well be one of the things that pushes AI into the next stage. My detailed comments are meant to strengthen the paper. However, the authors answered nearly all of my questions I had while reading.","This paper discusses a possible method for training a deep neural network without using backpropagation. Backpropagation has been very successful in minimizing output errors for both supervised and unsupervised methods and has stood up to challenges from other methods. The motivation for finding suitable replacements or approximations is to reduce the computational complexity of training neural networks. This can take the form of reducing the size of the NN required to accomplish a task or simply train an NN with a smaller number of operations. I believe this is a very important new topic to find viable alternatives to backprop. These kinds of methods have advantages on better-utilizing memory bandwidth, making cheaper hardware more relevant to the training side of NNs. + +The authors do a good job of giving background by citing node perturbation methods, lottery ticket hypothesis, and genetic methods. They all appear to be pointing to an underlying concept that random initializations in overparameterized networks already have embedded, sparse representations. + +The main result of the paper is that a small number of sequential weight updates using the authors' proposed algorithm rivals the performance of an established method like backpropagation. The proposed algorithm is simply to perturb weights from a randomly initialized neural network and keep the perturbation if it reduces the loss on a minibatch. This relies on an assumption that a randomly initialized network is close to a final solution. + +I really enjoyed this paper. Nearly every question I asked myself while reading it was answered in a subsequent section. As pointed out, this is the first step at a new concept. As with any good paper, this paper begets a lot more questions than it completely answers. + +Suggestions: +Section 3: what is the motivation for using a Gaussian distribution to initialize weights? Not that I see anything wrong with that, but is there some reason this might be better or worse than other initializations? +Section 3: “We first update the weights of the layer closest…”. This could be an area of additional research as to where to update first. If we look at neuroscience, we see that layers closer to inputs seems to learn first, so might be good to include some discussion on that here. +Section 4: These are good networks to start with, but I would like to see larger networks that are more relevant to problems today….transformers being trained to useful levels using this method could be a huge and important step. + +Section 4.1: It could strengthen the paper to include some analysis on the number of MAC operations required and the number of reads/writes to memory for SGD vs RSO. This could be useful in this paper, or a subsequent one. + +Section 4.2: Some theory likely needs to be developed here. It would good to add some discussion about the tradeoffs between these options. I believe this is more for future work. + +Section 4.5: If the RSO algorithm is more amenable to parallelism, that could be an important advantage. 
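For reference, the update loop I have in mind when thinking about parallelism is roughly the following. This is a schematic sketch of my reading of the algorithm, not the authors' code; the Gaussian sampling scale and the exact layer ordering are my assumptions:

```python
import numpy as np

def rso_step(weights, loss_fn, batch, sigma=0.01, rng=np.random.default_rng(0)):
    # weights: list of numpy weight tensors, ordered as in Section 3
    # (starting from the layer the authors update first)
    best = loss_fn(weights, batch)
    for W in weights:
        for idx in np.ndindex(W.shape):      # strictly sequential, one weight at a time
            delta = rng.normal(0.0, sigma)
            W[idx] += delta                  # try a random perturbation of a single weight
            trial = loss_fn(weights, batch)
            if trial < best:
                best = trial                 # keep the change only if the minibatch loss improves
            else:
                W[idx] -= delta              # otherwise revert
    return best
```

How much of that inner loop can be evaluated concurrently (e.g. batching many candidate perturbations per forward pass) seems to be the crux of the hardware argument.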
Some discussion of that vs SGD could also build a stronger case.ZS + +",8,5.0,ICLR2021 +8zqE28JCDo,4,XOjv2HxIF6i,XOjv2HxIF6i,Interesting work for unsupervised meta-learning,"This paper considers the problem of unsupervised meta-learning, where the goal is to generate tasks for meta-training without supervision. Whereas previous work generated training and test sets from the unlabeled set for meta-training via augmentations (UMTRA) or unsupervised clustering of embeddings (CACTUs), this paper considers doing this using interpolation of the latent space representations produced by generative models. Specifically, the idea is to first train a generative model on the unlabeled set and then produce training and test sets for meta-training by decoding the interpolation of latent space representations of multiple examples from the original unlabeled set. The authors discuss 3 specific ways to produce examples for the train and test sets for meta-training in this way. They first select an anchor example from the unlabeled set that will be representative of one class in the dataset. Then, the 3 methods involve: +1. Adding noise: adding noise to the latent space representation of the anchor example to produce examples that make up the training and test set for a single class. +2. Random out-of-class sample: selecting another example from the unlabeled set and finding new examples for the class by interpolating between the anchor's and this example's representations. +3. With Other Classes' samples: same as (2) but instead of picking another random example, the example considered is another anchor example that was used to represent a different class. + +The authors evaluate their method by considering 3 few-shot learning benchmarks: (1) Omniglot; (2) CelebA few-shot identity recognition; and (3) CelebA attribute prediction. On these 3 benchmarks, they show that their method performs favorably compared to UMTRA and CACTUs. + +Pros +* This paper proposes a simple yet very interesting idea for performing unsupervised meta-learning that is unique compared to previous work. +* The big benefit of this method compared to previous work is that it seems to require less tweaking per dataset. Whereas previous methods required tuning per dataset (for example, in UMTRA, selecting which augmentations to use for a specific dataset), this method requires training a generative model on the unlabeled set and using the learned latent space interpolation (where the generative model can directly learn properties of the specific dataset that can be used during interpolation). However, there are still some choices to make in terms of hyperparameters for latent space interpolation). + +Cons +* I have minor concerns about the Mini-ImageNet experiments. Firstly, why are Mini-ImageNet experiments not discussed in the main paper but in the supplementary material? The difficulties of using Mini-ImageNet are mentioned, in that it is difficult to train a generative model on this more complex dataset using the limited examples in Mini-ImageNet. Thus, I believe it is a good idea to use whole ImageNet dataset as the unlabeled set, as the authors did, and the results for the method seem favorable compared to previous work. So, I think it's useful to include these results to show how this method extends to more complex images? 
I think a note just needs to be added that the unlabeled set this method uses is much larger than the ones used in previous work for the Mini-ImageNet comparison but I don't view this as a big negative because the data required for training is still unlabeled. +* Details of the CelebA few-shot identity recognition benchmark seem to be lacking? I don't see this benchmark mentioned in previous work so I was curious how the metrics for other methods (such as CACTUs) were generated given that this benchmark wasn't discussed in those papers? I think more details about this benchmark would be useful in general. Additionally, some citations to previous results on this benchmark would also be helpful.",7,4.0,ICLR2021 +B1gpWORE2m,1,SJgEl3A5tm,SJgEl3A5tm,Adversarial attacks for vehicles in simulators,"Adversarial attacks and defences are of growing popularity now a days. As AI starts to be present everywhere, more and more people can start to try to attack those systems. Critical systems such as security systems are the ones that can suffer more from those attacks. In this paper the case of vehicles that attack an object detection system by trying to not be detected are tackled. + +The proposed system is trained and evaluated in a simulation environment. A set of possible camouflage patterns are proposed and the system learns how to setup those in the cars to reduce the performance of the detection system. Two methods are proposed. Those methods are based on Expectation over transformation method. This method requires the simulator to be differentiable which is not the case with Unity/Unreal environments. The methods proposed skip the need of the simulator to be differentiable by approximating it with a neural network. + +The obtained results reduce the effectivity of the detection system. The methods are compared with two trivial baselines. Isn't there any other state of the art methods to compare with? + +The paper is well written, the results are ok, the related work is comprehensive and the formulation is correct. The method is simply but effective. Some minor comments: + - Is the simulator used CARLA? Or is a new one? Where are the 3D assets extracted from? + - Two methods are proposed but I only find results for one",7,3.0,ICLR2019 +H1xf8Ek2FB,1,B1l6nnEtwr,B1l6nnEtwr,Official Blind Review #1,"In this paper, the authors propose the Homotopy Training Algorithm (HTA) for neural network optimization problems. They claim that HTA starts with several simplified problems and tracks the solution to the original problem via a continuous homotopy path. They give the theoretical analysis and conduct experiments on the synthetic data and the CIFAR-10 dataset. +My major concerns are as follows. +1. The authors may want to give more detailed explanations of HTA. For example, they may want to give the pseudocode for HTA and explain its advantages compared to other optimization methods. +2. The theoretical analysis is trivial. The proof of Theorem 3.1 is to verify Assumptions 4.1 and 4.3 in [1]. Moreover, the proof of Theorem 3.2 is similar to the analysis for the convergence of SGD for convex problems in [2]. +3. The experiments do not show the efficiency of HTA, as the original quasi-newton method is faster than the quasi-newton method with the homotopy setup. +4. The authors make a mistake in the proof of Theorem 3.1. The claim that “{\theta_k} is contained in an open set which is bounded. Since that g is continuous, g is bounded.” is incorrect. We can find a counterexample g(x) = \frac{1}{x}, x\in (0,1). 
+ +[1] L. Bottou, F. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018. +[2] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.",3,,ICLR2020 +Hk-DIMdez,1,SkERSm-0-,SkERSm-0-,"An important problem has been tackled, but not in a satisfactory direction. ","This paper studies the importance of the noise modelling in Gaussian VAE. The original Gaussian VAE proposes to use the inference network for the noise that takes latent variables as inputs and outputs the variances, but most of the existing works on Gaussian VAE just use fixed noise probably because the inference network is hard to train. In this paper, instead of using the fixed noise or inference network for the noise, the authors proposed to train the noise using Empirical-Bayes like fashion. The algorithm to train noise level for the single Gaussian decoder and mixture of Gaussian decoder is presented, and the experiments show that fitting the noise actually improves the ELBO and enhances the ability to disentangle latent factors. + +I appreciate the importance of noise modeling, but not sure if the presented algorithm is a right way to do it. The proposed algorithm assumes the Gaussian likelihood with homoscedastic noise, but this is not the case for many real-world data (MNIST and Color images are usually modelled with Bernoulli likelihood). The update equations for noises rely on the simple model structure, and this may not hold for the arbitrary complex likelihood (or implicit likelihood case). In my personal opinion, making the inference network for the noise to be trainable would be more principled way of solving the problem. + +The paper is too long (30 pages) and dense, so it is very hard to read and understand the whole stuff. Remember that the ‘recommended’ page limit is 8 pages. The proposed algorithm was not compared to the generative models other than the basic VAE or beta-VAE.",5,4.0,ICLR2018 +sPEOWJ789pe,1,Hw2Za4N5hy0,Hw2Za4N5hy0,Interesting ideas but many issues remain,"In this work, the authors propose new ways of averaging updates received at the server from a subset of clients in a federated scenario. Specifically the authors aim to address issues arising from the non-iid nature of the data that arises in FL and propose to treat BatchNorm parameters differently from other NN parameters. + +Introduction +Reading this paper, I am confused about terminology used by the authors. Specifically, the authors discuss 'data bias' and 'parameter bias'. +In in the introduction, the authors claim that 'Conventional approaches average gradients uniformly from the clients, which could cause great bias to the real data distribution'. Assuming the authors understand FedAvg to be a conventional approach, the averaging with $|x_k|/|x|$ (Eq. 1) is not uniform but weighted by the local dataset size. Further, the meaning of 'causing bias to the real data distribution' is not clear to me. Unfortunately, the further explanation and Figure 1 don't help me in understanding what is meant. The 'GroundTruth' distribution of labels for cifar10 is approximately uniform. From the context, I understand that the authors try to describe some consequence of a non-i.i.d sampled distribution of data according to labels, but I cannot understand the point they try to make. +Next, the authors discuss 'parameter bias'. 
The authors distinguish between BN parameters and other NN parameters. They term the BN parameters 'statistical parameters such as mean and variance'. Generally speaking, BN contains the 'scale and shift' parameters $\gamma$ and $\beta$ (https://arxiv.org/pdf/1502.03167.pdf), which I assume the authors are referring to. Again, the authors make use of the term 'bias' to say: '[...] bias on the BN layer parameters'. I am not familiar with the notion of 'bias on a parameter' and would like the authors to clarify. Based on Figure 2 I assume they aim to convey that BN parameters in an FL setting converge to different values compared to a centrally trained model.
In Section 3.1 the authors further make the distinction explicit between 'gradient parameters', by which they mean weights and biases, as opposed to the 'statistical parameters' of BN. Since the scale-and-shift parameters of BN are also updated by gradient descent, I am wondering if the authors mean the mean and variance estimates across data points for feature maps, which also play a central part in (federated) BatchNormalization. The authors make no mention in their experimental section of how they form these global estimates for mean and variance in BN, so they omit that crucial detail there.

The authors specifically focus on label skew as the source of non-i.i.d.-ness in this work, but never make this limitation explicit. Since the non-i.i.d. challenges in FL are not limited to label skew, I believe the authors should make this explicit.

Related work:
The issues with BN in the federated setting have been described, for example, in Section 5 of https://arxiv.org/pdf/1910.00189.pdf. There, the authors propose to replace BN with GroupNorm, an approach that has been adopted in several follow-up and recent works in FL with models that originally contain BN. I would encourage the authors to compare their work against this approach, both in the related-work section and in the experimental section.

Method Section.
Notation-wise, I encourage the authors not to use $x$ or $x_k$ to denote a (labeled) dataset, since $x$ is usually reserved for a single data point with associated label $y$. In Eq. (1) the loss formulation of FL is a bit sloppy, since the parameters to optimise for, $W$, do not appear on the RHS of the equation. In FedAvg, we explicitly optimise $\min_W \sum_k |x_k|/|x| \, L_k(W, x_k)$, where the local parameter estimates $W_k$ appear as intermediate quantities as a consequence of multiple local optimisation steps.
The authors claim that 'data points available locally could be biased from the overall distribution'. Again, I believe the intended meaning is the non-iid issue, but I encourage the authors to make their understanding of 'bias' more concrete.

In FedAvg, the individual clients do not transmit gradients $\nabla L_k()$ to the server (Section 3.2.1). That is the approach of conventional distributed SGD as employed in a high-speed-connected data centre for speeding up centralised training, and in that centralised setting the non-iid problem does not exist. Instead, in FL, clients transmit parameters that have been updated through a series of gradient-descent steps. More recent work (https://arxiv.org/abs/2003.00295) makes explicit the interpretation of these transmitted parameters as a gradient. This distinction is important.

The derivation in Appendix A.2 for showcasing the unbiasedness of the variance parameter averaging seems wrong.
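For reference, the only ingredient an unbiasedness claim of this form needs is linearity of expectation applied to the aggregate; in my notation, with $\hat{\sigma}_k^2$ the local estimate of client $k$ and $w_k$ its aggregation weight,
$$\mathbb{E}\Big[\sum_k w_k \, \hat{\sigma}_k^2\Big] \;=\; \sum_k w_k \, \mathbb{E}\big[\hat{\sigma}_k^2\big],$$
and nothing stronger.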
Going from the third to the fourth equation makes a mistake, and moreover, if the original expectation were equal to the sum of weighted expectations, then simply averaging would actually be the unbiased estimator. The authors here are falsifying their own argument through a derivation mistake. The last equation should only pull the expectation into the sum in the right-most term, and you are done.

Section 3.2.3
I like the notion of modelling the datasets as a GMM and inferring responsibilities at averaging time. The explanation of the approach is confusing to me, however.
If I understand it correctly, Eq. 9 describes a VAE setup per client $k$. There is no sharing of parameters or latent space between clients. $s_k = [\mu_k, \sigma_k]$ are the mean and standard deviation across the whole local dataset at client $k$. The authors propose to encode this single vector $s_k$ into a latent space $z_k$ of dimension $C$. The authors do not explicitly specify the prior $p(z_k)$, but given the constraints on $z_k$, I assume it is meant to be a Dirichlet distribution. From context I could imagine that it has something to do with the per-client label distribution.
Since each ELBO is per-client and each client has just one data point $s_k$, I do not understand the need for auto-encoding; one could simply infer $z_k$ given a decoder model for the single data point. The authors mention the use of neural networks, but they do not detail their architecture choices anywhere. It is also unclear to me how $\pi_k$ falls out of Equation 10. I imagine it corresponds to some sort of posterior across all local models' encodings $z_k$. Since the latent spaces across clients $k$ do not share any meaning, I don't see how that can be sensible. Altogether, this section is not readable to me.
I can see the appeal of reweighting updates through a specific formulation of $\pi_k$, but since the updates are label-independent (Eqs. 3, 4, 5), how does that play into this?

Section 4
My understanding of proofs of this form is somewhat limited. From what I can gather, this proof shows that FedAvg converges if the per-client loss function is pre-multiplied by a constant factor $\pi_k$. As such, the convergence proof should be analogous to what is presented in e.g. https://arxiv.org/pdf/1907.02189.pdf, with the exception of the update in Eq. 5. Maybe the other reviewers can comment further on this.

Section 5.
In the federated setting, 20 clients should generally not be considered enough for experimental validation. I appreciate the breadth of experiments in terms of models, algorithms and related algorithms.
Unfortunately, the authors chose to define their own non-iid split of the datasets. In general, I would like to see comparisons with the existing dataset splits used in related work on non-iid data, such as, for example, https://arxiv.org/abs/2003.00295. This would help avoid bifurcation of the literature.
The reference to q-FedSGD seems to be wrong.

After reading the experiment section, some serious questions arise:
What is the fraction of selected clients per communication round? What is the number of local epochs per client? Some of the related works seem to suggest performing FedSGD (the first paragraph of Section 3.2.1 also suggests this). If there is a single gradient per device and equal-size datasets per client, then I don't see how a non-iid data distribution across clients is an issue, as this approaches a global mini-batch step, where each mini-batch consists of smaller mini-batches, one from each client.
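To make the contrast I am drawing explicit, the two regimes look schematically as follows (my own sketch, with equal client weights for brevity; this is not the authors' algorithm):

```python
import numpy as np

def fedsgd_round(w, clients, grad_fn, lr=0.1):
    # one gradient per client on its local batch; averaging these equals a gradient on the
    # pooled batch, so label skew across clients matters comparatively little in this regime
    g = np.mean([grad_fn(w, c) for c in clients], axis=0)
    return w - lr * g

def fedavg_round(w, clients, grad_fn, lr=0.1, local_steps=50):
    # many local optimisation steps per client, followed by averaging in parameter space
    local_models = []
    for c in clients:
        w_k = w.copy()
        for _ in range(local_steps):
            w_k = w_k - lr * grad_fn(w_k, c)
        local_models.append(w_k)
    return np.mean(local_models, axis=0)
```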
The non-i.i.d issue in FL stems from the fact that each client optimises on its own for a sufficiently long time that the resulting progress is destroyed by averaging in parameter space. The authors need to specify their setup here. Concretely, I would want to at least see Cifar10 split into 100 clients, 10 of which are selected at every round. Each client needs to optimise locally for a full epoch on its own dataset of 45000/100 = 450 data-points. + +The authors propose two things: A new averaging approach by computation of $\pi_k$, as well as a new approach to estimating the gradient for the scale-parameter of BN using pooled averaging. These two things need to be studied separately by setting $\pi_k = |x_k|/|x|$ in one scenario. At the moment, my trust into the computation of $\pi_k$ is very low, since the corresponding section is not understandable to me. The pooled averaging approach seems sensible and I am curious to see if it solves the issue of BN in FL. Additionally, the authors need to clarify how the BN-statistics are computed at test time. I would also like to see a comparison of the proposed models with BN and with the state-of-the-art method which is replacing them with GroupNormalization. + +Finally, I would like to thank the authors for the interesting approach to training BN-equiped Neural Networks in a federated setting. This idea seems promising to me. The approach for computing pi_k is not clear to me and I would like the authors to revise it. Please specify the role that knowledge of the label-distribution plays and if the method is applicable when non-iid-ness stems from other sources than label skew. (I propose looking at the FEMNIST dataset for example). +Many issues with this paper remain and I encourage the authors to overhaul their work. + +I see no issues with the Code of Ethics",3,3.0,ICLR2021 +r40eqCHw2lJ,2,#NAME?,#NAME?,Hierarchical disentangled VAE for adversarial robustness. ,"The paper considers the regularization of latent space toward achieving adversarial robustness against latent space attack. The paper demonstrates the applicability of disentanglement promoting VAEs for achieving adversarial robustness and further enhancing such VAEs by considering their hierarchical counterparts. The paper demonstrates their results in the benchmark datasets considered in the disentanglement and computer vision literature. The overall research direction pursued by this paper is exciting. However, I have some concerns, which include: + +1. The paper attempts to establish the connection between disentanglement and robustness. The linkage, however, is not clear. In section 3, the paper argues for the smoothness of the encoder mapping and the decoder mapping. Toward this, the paper postulates additional regularization to enforce ""simplicity"" or ""noiseness"". First of all, it is unclear how disentangled latent space helps achieve ""simplicity"" in the ""encode-decode process"". Secondly, regarding ""noiseness"", it is not explained what extra would disentangled version of VAEs (e.g., TCVAE) provide compared to the standard setup. + +2. In section 3.2, the paper empirically demonstrates the connection between disentanglement and adversarial robustness. However, the evaluation carried out are not explicit. Firstly, to demonstrate the connection, the paper uses the attacker's achieved loss \delta (from Eqn 1) as the metric. Although the \delta is shown across different \beta values, it is still unclear if disentanglement is directly related to robustness. 
Can the authors point out some disentanglement metrics (e.g., MIG) for each beta and compare MIG vs. \delta? Also, the curves are combined for all the d_z. What is the motivation behind doing that? Because it has been known that disentanglement behavior is related to the dimension of latent space. Also, authors could consider decomposing the first term of \delta for all the latent space dimensions and analyze if the disentangled dimensions are robust compared to the entangled dimension. This could be more helpful to establish the linkage. Secondly, authors have picked TCVAE considering ""reconstruction quality"" compared to \beta-VAE, but in Fig 2 (right), ELBO is compared. Can the authors compare the reconstruction error? Also, for fig 3, I think it is natural to see the comparison with \beta-VAE. Why is such a comparison not included? + +3. In section 4, for the motivation for the hierarchical TC-penalised VAEs, the paper states that ""TC-penalisation in single layer VAEs comes at the expense of model reconstruction quality"". However, this directly contradicts the use of TC-VAE in the previous section. Although the results presented afterward support the authors' statement, the motivation must be clear and well written. The same comments for section 3.2 apply here too. + +4. The experimental results demonstrating protection against downstream tasks is performed using a simple 2-lear MLPs. This is different from the regular CNN network commonly considered for these datasets. Although this was meant to demonstrate the proposed model's efficacy, it would be more clear if the experimental setup is consistent with the current literature setting. Also, can the authors point out the initial results for the models before the attack? + + +Minor comments: + +- There are a lot of grammatical errors and hard-to-follow sentences. Some examples: + - "".. are not only even more .."" + - "".. attack the models using methods outlined .."". But Eq (1) refers to only one method, right? + - ""… then \delta too is small .."" + +(Update): The score has been updated after a rebuttal from the authors. ",6,5.0,ICLR2021 +BJJjahzEe,3,HJStZKqel,HJStZKqel,Review and review update," +I think the paper is a bit more solid now and I still stand by my positive review. I do however agree with other reviewers that the tasks are very simple. While NPI is trained with stronger supervision, it is able to learn quicksort perfectly as shown by Dawn Song and colleagues in this conference. Reed et al had already demonstrated it for bubblesort. If the programs are much shorter, it becomes easy to marginalise over latent variables (pointers) and solve the task end to end. The failure to attack much longer combinatorial problems is my main complaint about this paper, because it makes one feel that it is over-claiming. + +In relation to the comments concerning NPI, Reed et al freeze the weights of the core LSTM to then show that an LSTM with fixed weights can continue learning new programs that re-use the existing programs (ie the trained model can create new programs). + +However, despite this criticism, I still think this is an excellent paper, illustrating the power of combining traditional programming with neural networks. It is very promising and I would love to see it appear at ICLR. + +=========== +This paper makes a valuable contribution to the emerging research area of learning programs from data. 
+ +The authors mix their TerpreT framework, which enables them to compile programs with finite integer variables to a (differentiable) TensorFlow graph, and neural networks for perceiving simple images. This is made possible through the use of simple tapes and arithmetic tasks. In these arithmetic tasks, two networks are re-used, one for digits and one for arithmetic operations. This clean setup enables the authors to demonstrate not only the avoidance of catastrophic interference, but in fact some reverse transfer. + +Overall, this is a very elegant and potentially very useful way to combine symbolic programming with neural networks. As a full-fledged tool, it could become very useful. Thus far it has only been demonstrated on very simple examples. It would be nice for instance to see it demonstrated in all the tasks introduced in other approaches to neural programming and induction: sorting, image manipulation, semantic parsing, question answering. Hopefully, the authors will release neural TerpreT to further advance research in this domain. + + ",8,4.0,ICLR2017 +mpxa544gMyn,3,8_7yhptEWD,8_7yhptEWD,"Recommendation to reject on ""On the Neural Tangent Kernel of Equilibrium Models""","The paper shows the deep equilibrium model has non-degenerate neural tangent kernel in the infinite depth setting. The neural tangent kernel can be computed by a similar root-finding problem as that in the deep equilibrium problem itself. Some experiments have been performed to compare the performance of deep equilibrium neural tangent kernel with that of finite depth neural tangent kernel. + +Overall I vote for rejecting. My concerns are as follows: + +The paper lacks related literature. First, the motivation of considering deep equilibrium models is unclear to me. The authors should provide some further literature review. The advantage of using such a model in practice should be explained. Second, related proof techniques in the existing literature needs to be discussed. + +The result is expectable and the proof techniques are not novel. The main theorem (Theorem 1) is the simple extension of the existing results on neural tangent kernel. The following theorem (Theorem 2) is the consequence of the main theorem under some specified initialization. + +The theorems in the paper are lack of explanation. More discussion is needed to explain and extend the results in the paper. + +The experiment part is not well-organized. More description is needed to improve the results. + +The paper has some grammar mistakes and misuse of words. The paper needs to be revised carefully. To name a few: +Abstract: DEQ model....DEQ models have... +Section 3, 1st paragraph: we simplify fully-connected DEQs as DEQs. +Section 3, 1st paragraph: In section 3.1, we show the NTK of the approximated DEQ using finite depth iteration... + +",3,4.0,ICLR2021 +rkg7S_D93X,2,SyxYEoA5FX,SyxYEoA5FX,not entirely novel with few concerns but includes results leading to interesting insights,"The paper has two distinct parts. In the first part (section 2) it studies the volume of preimage of a ReLU network’s activation at a certain layer as being singular, finite, or infinite. This part is an extension of the work in the study of (Carlsson et al. 2017). The second part (section 3) builds on the piecewise linearity of a ReLU network’s forward function. As a result, each point in the input space is in a polytope where the model acts linearly. In that respect, it studies the stability of the linearized model at a point in the input space. 
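(For concreteness: on the activation polytope containing an input x, a ReLU network acts as an affine map z -> A_x z + b_x, and A_x is simply the network's Jacobian at x, so the relevant spectrum can be computed directly. A small sketch of what I mean, assuming a PyTorch model and a single flattened input vector:

```python
import torch

def local_singular_values(net, x):
    # within the linear region containing x the network is exactly z -> A_x z + b_x,
    # so the Jacobian at x recovers A_x and its singular values quantify local stability
    J = torch.autograd.functional.jacobian(net, x.unsqueeze(0))
    A_x = J.reshape(-1, x.numel())
    return torch.linalg.svdvals(A_x)
```
)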
The study involves looking at the singular values of the linear mapping. + +The findings of the paper are non-trivial and the implications potentially interesting. However, I have some concerns about the study. + +There is a key concern about the feasibility of the numerical analysis for the first part. That is, a layer-by-layer study can have a computational problem where the preimage is finite at each layer but can become infinite by the mapping of the preceding layers. In that regard, I would like the authors to comment on the worst-case computational complexity of the numerical analysis for determining the volume of a preimage through multiple layers. + +As for the second part, the authors mention the increase in the dimensionality of the latent space in the current deep networks. However, this observation views convolutional networks as MLPs. However, there is more structure in a convolutional layer’s mapping function. The structure is obtained by the shared and sparse rows of matrix A. I would like the authors to comment on how the studies will be affected by this property of the common networks. + +All in all, while there are some concerns and the contributions are not entirely novel, the reviewer believes the findings of the paper is generally non-trivial and shed more light on the inner workings of the ReLU networks and is thus a valuable contribution to the field.",6,4.0,ICLR2019 +2MXhCez_-Sb,3,gDHCPUvKRP,gDHCPUvKRP,Interesting paper," +The paper provides an interesting and novel use of butterfly factoziations in encoder-decoder networks. Specifically, the paper proposes replacing the +encoder with a truncated butterfly network followed by a dense linear layer. The parameters are chosen so as to keep the number of weights in the (replaced) encoder near linear in the input dimension. The authors provide a theoretical result related to auto-encoder optimization. + +###### + +I vote for accepting the paper. The main reason is for my vote is that the main idea introduced is novel and the algorithmic contribution is substantial. + +###### + +pros ++ The proposed truncated butterfly network is novel. Aside from the algorithmic contribution, the theorem in the paper raise important questions about +the optimization landscape of butterfly networks. ++ The paper is clearly written and well justified ++ Exhaustive literature survey and background on relevant work in both the matrix factorization and neural networks front + +###### + +cons +- The beginning of section 7 needs to be expanded with more details on the original Indyk et al 2019 paper -- the authors mention that they use the same setting as +the 2019 paper but the details on how Indyk et al train their network are lacking without looking up the original paper. + +- I think that low matrix approximation experimentation (Sec 7) can be more thorough. Why are only three datasets (in Table 3) used? Additionally and more importantly, Table 4 shows the approximation results for low values of k (max of 30) -- what happens when k=(min(n,d))? Also, the error is measured with respect to the best rank k approximation of X (the eq. following eq. 20) While this way of measuring the error is often used in the low rank matrix approximation literature it is not very informative of the actual approximation quality of \tilde X. 
It could well be the case that \tilde X = X_k but the error ||X-\tilde X||_F/||X||_F might be poor hence this way of measuring the approximation performance is a better indicator of how well X is actually approximated in practice. + +###### + +questions: see the above section ",7,5.0,ICLR2021 +BmQha9LMPcS,2,Oecm1tBcguW,Oecm1tBcguW,Review of the PACOH-NN algorithm for learning BNN priors ,"### Summary: + +One of the main issues that BNNs have to face is the choice of good informative priors in order to provide precise information about the uncertainty of predictions. The present work connects BNNs and PAC theory to formulate a new system to obtain a general-purpose approach for obtaining significative priors for Bayesian NNs. This is done by employing the closed-form solution for the PAC-Bayesian meta-learning problem. The meta learner here learns a hyper-posterior over the priors of the parameters of the BNN by using the closed-form expression PACOH (PAC Optimal Hyper-Posterior). This is applied in the context of NNs, where the priors are to be defined over the NN parameters. Extensive experiments are carried out to show the performance both in regression and classification datasets, as well as its scalability. In all of these regards, the system seems to be competitive, improving on the results of previous state-of-the-art methods and producing promising results in real-world problems. + +##### Pros: + +* The method does not employ nested optimization problems, therefore avoiding all the issues related to these approaches altogether. +* The usage of Bayesian seems to point in the right direction since the Bayesian framework allows for an easy formulation of the different levels needed here to formulate the system. +* The construction of the PACOH closed-form expression seems innovative and relevant +* The final system improves on the previous state-of-the-art in most of the cases here shown, and in some experiments, the improvement is very clear. Thanks to the fact that it is agnostic to the inference method in use, it presents itself as a very general-purposed approach, able to improve the predictive qualities of previous methods. +* The article is very complete and detailed, both in the main body and in the appendix. The appendix is particularly extensive, providing insight on many of the main points of the main text. +* The experiments conducted are exhaustive and complete, providing a very wide scope of the capabilities of the method. Detailed results can be found for every experiment and task. +* The text is well written and comprehensible + +##### Cons: + +* The choice in section 5 of using Gaussian priors with diagonal covariance matrices is motivated by the convenience in the computations. Moreover, the hyper-prior is also modeled by a zero-centered spherical Gaussian. How does choosing these distributions affect the final results? Are there been any experiments on which the parametric distributions chosen here are different from the ones presented? Please, describe how do you think these choices may affect the results and bias the final distributions obtained. +* The calibration error is all the information provided to quantify the quality of the predictive intervals for regression. It would be helpful to include some other quantification of this, such as the usage of CRPS or other strictly proper scoring rules. In the same line, using metrics such as the Brier score for classification tasks would help getting a more complete picture. 
+* How does the complexity of the BNNs employed affect the final results? Does using more layers and/or more units improve the results? There have been recent works (e.g. Functional Variational BNNs) regarding artifacts that arise when using large BNNs which make the system not able to properly learn the data. Could these problems be solved as well with this approach since the final prior is constructed using the data? + +###### Other comments: + +* As an optional suggestion, the article is well written but can be difficult to comprehend at some points. To that end, I would suggest trying to provide qualitative descriptions of some of the quantities in the expressions that later prove to be of importance. As an example, providing some intuition on $\psi(\sqrt{m})$ would help to understand the expression (1). In general, I think the article would benefit from extending a bit the explanations of some parts, especially section 4. +* Typo: section 5, paragraph 2 - ""categorigal"" +* To have a clearer explanation of figure 1 I would try to include the description of the sinusoidal functions that is already present in the appendix section C.1, since it seems relevant to the text in section 6.1 as well.",7,2.0,ICLR2021 +MpCSLT84Yk,4,cbdp6RLk2r7,cbdp6RLk2r7,"Review of ""Addressing the Topological Defects of Disentanglement""","Summary: The authors proposed a new way to disentangle affine transformations without topological defects. This paper made several theoretical contributions including a new definition of disentanglement and demonstration of the topological defects in existing disentanglement methods. Experimentally, this paper showed how their proposed shift operator model is powerful when dealing with topological defects. + +Disentanglement is a relatively challenging task due to the lack of clear definition and the lack of a robust evaluation method. The authors did a good job providing new theoretical definitions and providing empirical and qualitative results to support their claims. The main weakness of the paper is the lack of quantitative metrics to evaluate their approach and compare with others. In addition, the model doesn’t appear to be very flexible as it requires that the transformation is known in advance. + +Strengths: ++ Overall, the paper is well written and contains a good review of advances in the theory of disentanglement. ++ The idea of addressing topological defects for disentanglement appears novel. ++ Using operators on the entire latent space is a new direction for the study of disentanglement. The authors’ viewpoint that “isolating factors of variation” is different from “mapping these factors into distinct subspaces”, and how they propose a new definition based on this viewpoint is interesting. + + +Weaknesses: +- Lack of quantitative evaluation metrics. The MSE in the appendix is not enough for quantifying disentanglement. +- Since this paper focuses on disentanglement, at least Factor-VAE, one of the other representative disentanglement VAE models should be considered when doing the model evaluation. +- Baseline models should be optimized in a more comprehensive manner (e.g., currently the selection of beta is {4, 10, 100, 1000} and latent dimension is {10, 30}). It’s unclear whether these models have been well optimized, or what measures are used to optimize the models for this task. +- Because the method requires that the transformation is known in advance, this limits the flexibility of the approach. 
+- How different transformations impact each other is not shown experimentally - there is only an example on Fig 3E showing some visual results, but this should be elaborated on further given the goal of the paper. + +Minor points: +- The complex version of the shift operator is used. It would be interesting to show another version and their differences. +- Latent traversals results appear to be rather sparse. It would be interesting to show how the variation exists inside the model via dense traversals and the computing of generated images variation with different latent traversals. +- Rotations may be more challenging to learn. 2000 examples may be insufficient for the model to learn this transformation correctly. +",6,3.0,ICLR2021 +S1xR2dmTFr,2,B1gn-pEKwH,B1gn-pEKwH,Official Blind Review #3,"This paper proposed a model for continuous-time, discrete events prediction and entropy rate estimation by combining unifilar hidden semi-Markov model and neural networks where the dwell time distribution is represented by a shallow neural network. + +Comments: +The literature review on previous work for continuous-time, discrete events prediction is not thorough enough. For this problem, there are continuous-time Markov networks [El-Hay et al. 2006], continuous-time Bayesian networks [Nodelman et al. 2002] and its counterpart in relational learning domain, i.e. relational continuous-time Bayesian networks [Yang et al. 2016]. The authors should have learned their work and addressed the difference between the proposed model and these work in the related work section. + +Tal El-Hay, Nir Friedman, Daphne Koller, and Raz Kupferman. Continuous Time Markov Networks. In UAI, 2006. +Nodelman, U.; Shelton, C.; and Koller, D. Continuous Time Bayesian Networks. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI), pages 378–387, 2002. +Shuo Yang, Tushar Khot, Kristian Kersting, and Sriraam Natarajan. Learning continuous-time Bayesian networks in relational domains: A non-parametric approach. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 2265–2271, 2016. + +The approximations made to get equation (3) and equation (4) lacks theoretical proof for the up-bound of its influence on the model performance. If these assumptions have been made before, please cite the references; if not, please illustrate why it makes sense and how the approximation will influence the value of the objective function. + +Please explicitly state the meaning of all the symbols used in the paper even it can be inferred by the readers. E.g. ‘n’, which first appears in Equation (5) and is never being explained. + +The experiments are rather simple both in terms of the model used to generate the data and the number of different data sets being used. Hence, the experimental results are not strong enough to support the claims made by this paper. +Specifically, it claims “With very little data, a two-state model shown in Fig. 3 is deemed most likely; but as the amount of data increases, the correct four-state model eventually takes precedence”, but in Figure 3, the plot with the highest BIC score when the training sample is less is the green curve which stands for the six-state model according to the legend. +“The corresponding mean-squared errors for the three methods are shown in Fig. 3(bottom) for two different dataset sizes.” I could not find it in Figure 3. 
+",3,,ICLR2020 +rkZd9y9xz,2,rkEfPeZRb,rkEfPeZRb,Ok but not good enough,"The paper proposes a novel way of compressing gradient updates for distributed SGD, in order to speed up overall execution. While the technique is novel as far as I know (eq. (1) in particular), many details in the paper are poorly explained (I am unable to understand) and experimental results do not demonstrate that the problem targeted is actually alleviated. + +More detailed remarks: +1: Motivating with ImageNet taking over a week to train seems misplaced when we have papers claiming to train ImageNet in 1 hour, 24 mins, 15 mins... +4.1: Lemma 4.1 seems like you want B > 1, or clarify definition of V_B. +4.2: This section is not fully comprehensible to me. +- It seems you are confusingly overloading the term gradient and words derived (also in other parts or the paper). What is ""maximum value of gradients in a matrix""? Make sure to use something else, when talking about individual elements of a vector (which is constructed as an average of gradients), etc. +- Rounding: do you use deterministic or random rounding? Do you then again store the inaccuracy? +- I don't understand definition of d. It seems you subtract logarithm of a gradient from a scalar. +- In total, I really don't know what is the object that actually gets communicated, and consequently when you remark that this can be combined with QSGD and the more below it, I don't understand it. This section has to be thoroughly explained, perhaps with some illustrative examples. +4.3: allgatherv remark: does that mean that this approach would not scale well to higher number of workers? +4.4: Remarks about quantization and mantissa manipulation are not clear to me again, or what is the point in doing so. Possible because the problems above. +5: I think this section is not too useful unless you can accompany it with actual efficient implementation and contrast the practical performance. +6: Given that I don't understand how you compress the information being communicated, it is hard to believe the utility of the method. The objective was to speed up training time because communication is bottleneck. If you provide 12,000x compression, is it any more practically useful than providing 120x compression? What would be the difference in runtime? Such questions are never discussed. Further, if in the implementation you discuss masking mantissa, I have serious concern about whether the compression protocol is feasible to implement efficiently, without writing some extremely low-level code. I think the soundness of work addressing this particular problem is damaged if not implemented properly (compared to other kinds of works in current ML related research). Therefore I highly recommend including proper time comparison with a baseline in the future. +Further, I don't understand 2 things about the Tables. a) how do you combine the proposed method with Momentum in SGD? This is not discussed as far as I can see. b) What is ""QSGD, 2bit"" If I remember QSGD protocol correctly, there's no natural mapping of 2bit to its parameters.",4,4.0,ICLR2018 +rkx9Nl1zcH,2,BJgza6VtPB,BJgza6VtPB,Official Blind Review #2,"This paper concerns the limitation of the quality-only evaluation metric for text generation models. Instead, a desirable evaluation metric should not only measure the sample quality, but also the sample diversity, to prevent the mode collapse problem in gan-based models generation. 
The author presents an interesting, but not too surprising finding that, tuning the temperature beam search sampling consistently outperform all other GAN/RL-based training method for text generation models. The idea of sweeping temperature during beam search decoding is not new in the NLP community, which limits the novelty of this paper. What’s more, some parts of the experiment results is also somehow not new, in the sense that the SBLEU vs Negative BLEU tradeoff curve is also shown in [1,2,3,4]. + +[1] Jointly measuring diversity and quality in text generation models, 2019 +[2] Training language gans from scratch, 2019 +[3] On accurate evaluation of gans for language generation, 2018 +[4] Towards Text Generation with Adversarially Learned Neural Outlines, 2018 + +I would love to increase my score if the author could address the following comments: +(1) Are the comparing methods, say MLE models and other GAN-based models, have the similar number of model parameters? It is not clear from the paper. Otherwise, one can use a 12/24 layer Transformer-XL to have dominative performance? +(2) Since this is an empirical study paper. It would be great if this paper can also present more SoTA models trained by MLE such as Transformer-XL on more challenging datasets, such as Wikitext-2 or Wikitext-103. In this kind of large vocabulary datasets, I think the RL/GAN-based training methods would easily breakdown, and far worse than MLE-based training. +(3) To make the empirical study more comprehensive, the author could perhaps evaluate with the n-gram and FED metric. +",6,,ICLR2020 +Syerno7kqH,3,SJeq9JBFvH,SJeq9JBFvH,Official Blind Review #1,"This paper introduces a novel DPS(Deep Probabilistic Subsampling) framework for the task-adaptive subsampling case, which attempts to resolve the issue of end-to-end optimization of an optimal subset of signal with jointly learning a sub-Nyquist sampling scheme and a predictive model for downstream tasks. The parameterization is used to simplify the subsampling distribution and ensure an expressive yet tractable distribution. The new approach contribution is applied to both reconstruction and classification tasks and demonstrated with a suite of experiments in a toy dataset, MINIST, and COFAR10. + + +Overall, the paper requires significant improvement. + +1. The approach is not well justified either by theory or practice. There is no experiment clearly shows convincing evidence of the correctness of the proposed approach or its utility compared to existing approaches (Xie & Ermon (2019); Kool et al. (2019); Plšotz & Roth (2018) ). + +2. The paper never clearly demonstrates the problem they are trying to solve (nor well differentiates it from the compressed sensing problem or sample selection problem) + + The method is difficult to understand, missing many details and essential explanation, and generally does not support a significant contribution. + +3. The paper is not nicely written or rather easy to follow. The model is not well motivated and the optimization algorithm is also not well described. + +4. A theoretical analysis of the convergence of the optimization algorithm could be needed. + +5. The paper is imprecise and unpolished and the presentation needs improvement. + +**There are so many missing details or questions to answer** + +1. What is the Gumbel-max trick? +2. How to tune the parameters discussed in training details in the experiments? +3. Why to use experience replay for the linear experiments? +4. 
Are there evaluations on the utility of proposed compared to existing approaches? +5. Does the proposed approach work in real-world problems? +6. Was there any concrete theoretical guarantee to ensure the convergence of the algorithm. + +[Post Review after discussion]: The uploaded version has significantly improved over the first submission. It is now acceptable. +",6,,ICLR2020 +GA26ylAC_rc,1,#NAME?,#NAME?,Official review #2,"Summary: This work builds on the vulnerability of VAEs to adversarial attacks to propose investigate how training with alternative losses may alleviate this problem, with a specific focus on disentanglement. In particular it is found that disentanglement constraints may improve the robustness to adversarial attacks, to the detriment of the performance. In order to get the best of both, the author(s) propose a more flexible (hierarchical) model, trained with the beta-TC penalization on the ELBO. The algorithm, named Seatbelt-VAE, shows improvement over the beta-TC VAE in terms of reconstruction, as well as in term of adversarial robustness for several datasets (Chairs, 3D Faces, dSprites). + +Comments: +1. The paper is well-written and the underlying reasoning is easy to follow. +2. The experiments are sound and well documented (results are reported across latent space dimensions, and adversarial attack parameters) + +Questions: +1. I am wondering how the bias of estimating the TC term on Z^L in Eq (8) scales with L and the minibatch size, compared to the more simple TC estimator from Chen et al. (2018) and if the author(s) had any evidence from the experiments that it might be problematic? Does the algorithm require even larger batch sizes? +2. Should this approach be compared as well to weight decay or other simple regularization on the weights? + +Minor questions: +3. I wish the paper would make a stronger connection between disentanglement and robustness. The beta-TC VAE is only one choice among other possible to constrain the variational network. Did the authors ever try anything else? +4. Is it possible that the TC-VAE is effective at providing a defense against adversarial attacks in this manuscript because of the nature of the attack used in this manuscript (Eq. (1))? If the attack was not based on the Kullback-Leibler divergence, but based on another geometry, maybe another disentanglement constraints would be more performant?",7,2.0,ICLR2021 +ACNtmeSDiea,2,OodqmQT3fir,OodqmQT3fir,"Well written paper, approach proposed is simple and easy to understand","This paper proposed a novel policy prediction model that combines self-supervised contrastive learning, graph representation learning and neural algorithm execution to generalize the Value Iteration Networks to MDPs. The method described in the paper is a combination of existing works in the literature but seems to work well in practice. The experiments evaluate multiple aspects of the proposed model (E.g. number of executor layers, etc.) and show significant performance improvement over the existing approaches. + +The latent representation used for policy prediction implicitly incorporates two-step aggregation 1) from initial representation extraction through encoding network and 2) from message passing, which seems helpful for the generalization. + +As I am not actively tracking RL literature, I am not sure if there is a similar approach has already been proposed or not. 
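
To be explicit about the pattern I have in mind when talking about representation learning on top of individually encoded states, a minimal sketch follows; it is my own illustration in PyTorch with arbitrary sizes and a single round of mean-neighbour message passing, not the authors' architecture:

```python
import torch
import torch.nn as nn

class EncodeProcessPolicy(nn.Module):
    # Per-state encoder, one round of graph message passing, then a policy head.
    def __init__(self, obs_dim, hidden_dim, num_actions):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.message = nn.Linear(2 * hidden_dim, hidden_dim)
        self.policy_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs, adjacency):
        # obs: [num_states, obs_dim], adjacency: [num_states, num_states] with 0/1 entries.
        h = self.encoder(obs)
        deg = adjacency.sum(dim=1, keepdim=True).clamp(min=1.0)
        neighbour_mean = adjacency @ h / deg          # aggregate neighbour encodings
        h = torch.relu(self.message(torch.cat([h, neighbour_mean], dim=1)))
        return self.policy_head(h)                    # action logits per state

if __name__ == '__main__':
    num_states, obs_dim, num_actions = 5, 8, 4
    adjacency = (torch.rand(num_states, num_states) > 0.5).float()
    model = EncodeProcessPolicy(obs_dim, hidden_dim=16, num_actions=num_actions)
    logits = model(torch.randn(num_states, obs_dim), adjacency)
    print(logits.shape)  # torch.Size([5, 4])
```
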
My comments are based on the assumption that no existing work using GNN to do further representation learning on top of the individual encoded information. It would be better if the authors clearly state this novelty at the end of related works. + +Pros +=== +1. The paper is very well written and provided sufficient background knowledge to let the reader follow the description. +2. While it appears to be a simple combination of existing techniques, the proposed model shows the benefit of obtaining better latent state representations for the policy prediction task. + +Cons +=== +1. The novelty of the proposed model is a bit weak in terms of a lack of specialization for this particular task. +2. Figure 1 in the paper is not quite meaningful. A better demonstrative figure would help improve this paper.",6,2.0,ICLR2021 +S1lF9obI9r,2,SJloA0EYDr,SJloA0EYDr,Official Blind Review #1,"This paper proposes A*MCTS, which combines A* and MCTS with policy and value networks to prioritize the next state to be explored. It further establishes the sample complexity to determine optimal actions. Experimental results validate the theoretical analysis and demonstrate the effectiveness of A*MCTS over benchmark MCTS algorithms with value and policy networks. + +Pros: +This paper presents the first study of tree search for optimal actions in the presence of pretrained value and policy networks. And it combines A* search with MCTS to improve the performance over the traditional MCTS approaches based on UCT or PUCT tree policies. Experimental results show that the proposed algorithm outperform the MCTS algorithms. + +Cons: +However, there are several issues that should be addressed including the presentation of the paper: +• The algorithm seeks to combine A* search with MCTS (combined with policy and value networks), and is shown to outperform the baseline MCTS method. However, it does not clearly explain the key insights of why it could perform better. For example, what kind of additional benefit will it bring when integrating the priority queue into the MCTS algorithms? How could it improve over the traditional tree policy (e.g., UCT) for the selection step in MCTS? These discussions are critical to understand the merit of the proposed algorithms. In addition, more experimental analysis should also be presented to support why such a combination is the key contribution to the performance gain. +• Many design choices for the algorithms are not clearly explained. For example, in line 8 of Algorithm 2, why only the top 3 child nodes are added to the queue? +• The complexity bound in Theorem 1 is hard to understand. It does not give the explicit relations of the sample complexity with respect to different quantities in the algorithms. In particular, the probability in the second term of Theorem 1 is hard to parse. The authors need to give more discussion and explanation about it. This is also the case for Theorems 2-4. The authors give some concrete examples in Section 6.2 for these bounds. However, it would be better to have some discussion earlier right after these theorems are presented. +• The experimental results are carried out under the very simplified settings for both the proposed algorithm and the baseline MCTS. In fact, it is performed under the exact assumption where the theoretical analysis is done for the A*MCTS. This may bring some advantage for the proposed algorithm. It is not clear whether such assumptions hold for practical problems. 
More convincing experimental comparison should be done under real environment such as Atari games (by using the simulator as the environment model as shown in [Guo et al 2014] “Deep learning for real-time atari game play using offline monte-carlo tree search planning”). + +Other comments: +• It is assumed that the noise of value and policy network is zero at the leaf node. In practice, this is not true because even at the leaf node the value could still be estimated by an inaccurate value network (e.g., AlphaGo or AlphaZero). How would this affect the results? +• In fact, the proof of the theorems could be moved to appendices. +• In the first paragraph of Section 6.2, there is a typo: V*=V_{l*}=\eta should be V*-V_{l*}=\eta ?",3,,ICLR2020 +H1QljSQxz,1,HyfHgI6aW,HyfHgI6aW,What about strong motion-planning baselines?,"Summary: + +A method is proposed for robot navigation in partially observable scenarios. E.g. 2D navigation in a grid world from start to goal but the robot can only sense obstacles in a certain radius around it. A learning-based method is proposed here which takes the currently discovered partial map as input to convolutional layers and then passes through K-iterations of a VIN module to a final controller. The controller takes as input both the convolutional features, the VIN module and has access to a differential memory module. A linear layer takes inputs from both the controller and memory and predicts the next step of the robot. This architecture is termed as MACN. + +In experiments on 2D randomly generated grid worlds, general graphs and a simulated ground robot with a lidar, it is shown that memory is important for navigating partially observable environments and that the VIN module is important to the architecture since a CNN replacement doesn't perform as well. Also larger start-goal distances can be better handled by increasing the memory available. + +Comments: + +- My main concern is that there are no non-learning based obvious baselines like A*, D*, D*-Lite and related motion planners which have been used for this exact task very successfully and run on real-world robots like the Mars rover. In comparison to the size of problems that can be handled by such planners the experiments shown here are much smaller and crucially the network can output actions which collide with obstacles while the search-based planners by definition will always produce feasible paths and require no training data. I would like to see in the experimental tables, comparison to path lengths produced by MACN vs. those produced by D*-Lite or Multi-Heuristic A*. While it is true that motion-planning will keep the entire discovered map in memory for the problem sizes shown here (2D maps: 16x16, 32x32, 64x64 bitmaps, general graphs: 9, 16, 25, 36 nodes) that is on the order of a few kB memory only. For the 3D simulated robot which is actually still treated as a 2D task due to the line lidar scanner MxN bitmap is not specified but even a few Mb is easily handled by modern day embedded systems. I can see that perhaps when map sizes exceed say tens of Gbs then perhaps MACN's memory will be smaller to obtain similar performance since it may learn better map compression to better utilize the smaller budget available to it. But experiments at that scale have not been shown currently. + +- Figure 1: There is no sensor (lidar or camera or kinect or radar) which can produce the kind of sensor observations shown in 1(b) since they can't look beyond occlusions. So such observations are pretty unrealistic. 
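
To illustrate the occlusion point, here is a small self-contained toy, my own example and unrelated to the authors' code, of what a range-limited line-of-sight sensor on a 2D occupancy grid can actually observe; every cell behind the first obstacle hit along a ray stays unknown:

```python
import math

# 0 = free space, 1 = obstacle (toy occupancy grid).
GRID = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
ROWS, COLS = len(GRID), len(GRID[0])

def sense(robot, max_range=4.0, num_rays=180):
    # Cast rays outward; cells behind the first obstacle on a ray stay unknown.
    observed = [['?'] * COLS for _ in range(ROWS)]
    r0, c0 = robot
    observed[r0][c0] = 'R'
    for k in range(num_rays):
        angle = 2 * math.pi * k / num_rays
        dist = 0.0
        while dist <= max_range:
            dist += 0.2
            r = int(round(r0 + dist * math.sin(angle)))
            c = int(round(c0 + dist * math.cos(angle)))
            if not (0 <= r < ROWS and 0 <= c < COLS):
                break
            if GRID[r][c] == 1:
                observed[r][c] = '#'   # obstacle surface is visible
                break                  # everything further along this ray is occluded
            if observed[r][c] == '?':
                observed[r][c] = '.'   # visible free space
    return observed

for row in sense(robot=(2, 1)):
    print(' '.join(row))
```
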
+ +- ""The parts of the map that lie within the range of the laser scanner are converted to obstacle-free ..."": How are occluded regions marked?",4,5.0,ICLR2018 +HJgqjEXpFH,1,ryghPCVYvH,ryghPCVYvH,Official Blind Review #1,"This paper presents a model and training framework for generating samples based on restricted kernel machines. It is extended to multi-view generation and uncorrelated feature representation learning. + +- The paper is well-written and well-organized. Notations and claims are clear. + +- The idea of a multi-view generation model based on restricted kernel machines is interesting. However, the paper seems to be limited to model definition and algorithm overview without a performance evaluating analysis. + + +- The experimental evaluations are not satisfactory. Although it is claimed in the paper that the model is able to generate high quality images, it is very hard to be confirmed with these experiments. There is no concrete attempt at comparing the performance of the model to the other used methodologies. Generating high quality images with multiple views is an interesting problem, and there are good works in the field addressing the issues. To name a few: +Zhu, Z., Luo, P., Wang, X., Tang, X.: Multi-view perceptron: a deep model for learning face identity and view representations. In: Advances in Neural Information Processing Systems (NIPS). pp. 217–225, (2014) +Kan, M., Shan, S., Chen, X.: Multi-view deep network for cross-view classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4847–4855, (2016) +Yin, X., Liu, X.: Multi-task convolutional neural network for pose-invariant face recognition. IEEE Transactions on Image Processing (2017) +Yim, J., Jung, H., Yoo, B., Choi, C., Park, D., Kim, J.: Rotating your face using multitask deep neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 676–684, (2015) +Yu Tian, Xi Peng , Long Zhao, Shaoting Zhang , ,Dimitris N. Metaxas , CR-GAN: Learning Complete Representations for Multi-view Generation, arXiv: 1806.11191, 2018. + +There might be differences between these works and the paper, it is common to evaluate the quality of the generation to other models in terms of accuracy or in the classification tasks. Unfortunately, there is no such quantitative analysis in the paper. So the advantages of the proposed model is not very clear since there is not enough quantitative performance analysis. It would be interesting to see complexity analysis to evaluate the computational costs. + +Overall, I do not recommend this paper for publication. The experimental results are not satisfactory, and the paper needs improvements in that regard. + +** update: +I would like to thank the authors for their comments. However, I still see major issues in the paper unresolved and my review remains the same. ",3,,ICLR2020 +SygUVxWc3X,2,ryG8UsR5t7,ryG8UsR5t7,"Interesting paper for the deep learning community, but the experimental section is not convincing enough","This works presents an overview of different techniques to obtain uncertainty estimates for regression algorithms, as well as metrics to assess the quality of these uncertainty estimates. +It then introduces MeRCI, a novel metric that is more suitable for deep learning applications. + +Being able to build algorithms that are good not only at making predictions, but also at reliably assessing the confidence of these predictions is fundamental in any application. 
While this is often a focus in many communities, in the deep learning community however this is not the case, so I really like that the authors of this paper want to raise awareness on these techniques. The paper is well written and I enjoyed reading it. +I feel that to be more readable for a broader audience it would be relevant to introduce more in depth key concepts such as sharpness and calibration, an not just in a few lines as done in the end of page 2. + +While I found the theoretical explanation interesting, I feel that the experimental part does not support strongly enough the claims made in the paper. First of all, for this type of paper I would have expected more real-life experiments, and not just the monocular depth estimation one. This is in fact the only way to assess if the findings of the paper generalize. +Then, I am not convinced that keeping the networks predictions fixed in all experiments is correct. The different predictive uncertainty methods return both a mean and a variance of the prediction, but it seems that you disregard the information on the mean in you tests. If I understood correctly, I would expect the absolute errors to change for each of the methods, so the comparisons in Figure 4 can be very misleading. +With which method did you obtain the predictions in Figure 4.c? + +Typos: +- ""implies"" -> ""imply"" in first line of page 3 +- ""0. 2"" -> ""0.2"" in pag 6, also you should clarify if 0.2 refers to the fraction of units that are dropped or that are kept + +",5,3.0,ICLR2019 +BJgm18Gaqr,2,SJg4Y3VFPS,SJg4Y3VFPS,Official Blind Review #3,"This paper addresses the problem of learning expressive feature combinations in order to improve learning for domains where there is no known structure between features. These settings would normally lead to the use of fully-connected MLP networks, which unfortunately have problems with efficient training and generalization after a few layers of depth. The main idea is to use grouping at first, in combination with smaller fully-connected layers for each group, as well as pooling pairs of groups together as the layers go on. Results are shown as comparisons on 5 real-world datasets, and intuitive visualizations on two other datasets. Related work covered MLPs, regularization techniques, sparse networks, random forest models, and other feature grouping. The paper is well written and easy to read. This work did a good job with giving implementation details as well as performing hyperparameter searches and giving the baselines a good effort. + +My current decision is a weak reject, for a well-written paper, but some concerns as follows: +-The results do not show much improvement (i.e., < 0.3% improvement for 3 of the datasets, and < 1% for another one), aside from CIFAR-10. Considering that the premise of the paper is that MLP’s are not good enough when dealing with data in which the relationships between features are unknown, it seems like these are definitely not good datasets on which to demonstrate this notion of “there has been little progress in deep reinforcement learning for domains without a known structure between features.” +-The MNIST visualization of group-select felt informative, but the XOR example for grouping visualizations seemed too easy. It would’ve been good to see visualizations or intuitions regarding grouping for harder datasets, in order to be convinced of the need for more expressive feature representations than standard MLP’s. 
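
As one concrete reference point for that question, here is a minimal sketch, my own illustration rather than the authors' implementation, of Cholesky-based batch whitening applied to a plain feature matrix; this is the kind of drop-in alternative to batch normalization that could be benchmarked in a purely discriminative setting:

```python
import numpy as np

def cholesky_whiten(features, eps=1e-5):
    # features: [batch, dim]; returns whitened features with (approximately) identity covariance.
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (features.shape[0] - 1) + eps * np.eye(features.shape[1])
    L = np.linalg.cholesky(cov)            # cov = L @ L.T
    # Solve L @ z = centered.T instead of forming an explicit inverse.
    whitened = np.linalg.solve(L, centered.T).T
    return whitened

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8)) @ rng.normal(size=(8, 8))   # correlated features
w = cholesky_whiten(x)
print(np.round(np.cov(w, rowvar=False), 2))               # close to the identity matrix
```

Checking the covariance of the output is enough to confirm the decorrelation, and wrapping the same operation as a layer would allow a direct comparison of BN versus whitening outside the GAN setting.
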
+-I’m not an expert on causality, but it seems like citations from that area are required for this problem statement of dealing with features where the connections between them are unknown but potentially very important. + +Less major: +-It would have been nice to include related work on other ways to encourage inter-feature interactions, such as perhaps taking the outer product of the input with itself. +-It seems like different sizes per group would be a more realistic expectation, and that perhaps this should be worked into the algorithm. Similarly, pooling only 2 groups together (from pre-specified positions) seems like it would be limiting as well. It also seems like the algorithm should account for being able to use a high-level feature from one layer as part of multiple groups in the future (i.e. reuse). Even if any of these options don’t make a difference, it would be good to check/evaluate. + +Minor: +-Equation 8 did not fully make sense to me. +-Why were “random horizontal flips” used as preprocessing for the permutation-invariant CIFAR-10 dataset? This shouldn’t make a difference at all if the pixels become randomly shuffled anyway.",3,,ICLR2020 +R625utgPjP6,2,GXJPLbB5P-y,GXJPLbB5P-y,Interesting idea,"The authors propose a more data-efficient way to train generative models with constraints on the output; specifically they evaluate on image generation and pseudocode-to-code (SPoC) tasks. They train two separate models, a “predictor” and a “denoiser”, which they then compose: the output from the “predictor” is further processed by the “denoiser”. For the SPoC task they show an improvement of 3-5% over a simple transformer baseline. + +The authors suggest a simple idea to make use of unlabelled data, should it be available. They use it to perturbate the unlabelled data and use the (perturbed example, example) pairs to train a denoising model. They argue that this should theoretically simplify the task of the predictor, and show improvements on several tasks. I believe that this is an interesting idea, and practically useful in the cases where data is sparse. + +However, the results that they demonstrate do not seem very strong, and I would have liked to see this technique demonstrated on more competitive tasks to better gauge how well it works. The improvement of 3-5% they state seems like a low gain over a simple baseline, that may also be achievable with other techniques. + +Clearly state your recommendation (accept or reject) with one or two key reasons for this choice. + +I do recommend this paper to be accepted, because it clearly presents an interesting idea. + +The recommendation is a “weak accept” though, because the experimental evidence for the technique is not convincing enough to me. I would have expected significant gains on a well understood task, clearly attributable to the technique. +",6,4.0,ICLR2021 +Bl4chsMDNl,1,4dXmpCDGNp7,4dXmpCDGNp7,An effective explanation method based on robustness,"The paper proposes an effective explanation method based on a notion of robustness defined by the authors. +The paper is well presented and easy to follow. It compares against state-of-the-art methods and provides valid and statistically significant experimentation. The explanations returned are different from the competitors and useful. + +However, some points that should be addressed before publication. +First, I suggest to add the evaluation measures for robustness presented in +Hara, S., Ikeno, K., Soma, T., & Maehara, T. (2018). 
Maximally invariant data perturbation as explanation. arXiv preprint arXiv:1806.07004. +The evaluation of Hara is similar to the one adopted w.r.t Insertion/Deletion. +In addition, an evaluation measure for robustness different from everything presented in the paper but quite important for having a different validation is the one proposed in +Melis, D. A., & Jaakkola, T. (2018). Towards robust interpretability with self-explaining neural networks. In Advances in Neural Information Processing Systems (pp. 7775-7784). +based on the local Lipschitz estimation. + +Second, an explanation similar to the one reported in the paper is returned by the method presented in +Guidotti, R., Monreale, A., Matwin, S., & Pedreschi, D. (2019, September). Black Box Explanation by Learning Image Exemplars in the Latent Feature Space. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 189-205). Springer, Cham. +Another method that should be considered for comparison as it is based on robust anchors is the one presented on: +Ribeiro, M. T., Singh, S., & Guestrin, C. (2018, February). Anchors: High-Precision Model-Agnostic Explanations. In AAAI (Vol. 18, pp. 1527-1535). +It would be interesting to compare the Greedy proposals with these two methods both quantitatively and qualitatively. + +Third, an important aspect that is missing for the model proposed is either an analysis of the complexity or even better, the running time for returning explanations. In this way, the user can understand which is the best compromise between robustness and speed. + +Minor issues: +- The results on textual data are not as convincing as those on images. The authors should treat this part of the presentation carefully and with more attention. +- The authors do not specify in the main paper (or it is not easy to find) which are the classification models explained. + +",7,4.0,ICLR2021 +UKEnrXcu-U,2,p3_z68kKrus,p3_z68kKrus,"Paper makes a worthwhile contribution and is to the point. The authors might, however, have overlooked something crucial which leads to the paper having more complex results than necessary.","UPDATE: +As the authors were already aware of the zero loss case and analyzed this previously, I am confident that the authors can address this to the point in an updated version. With this I think this is a good paper that should be accepted. + +########################## + +Summary: +The papers addresses the setting of overparametrized models that interpolate the training data, and the related double descent observation in a kernel setting. The overparametrized case of interpolating models is not yet that well understood, but of importance as the success of neural networks is closely related to that setting. This paper shows that the minimum norm interpolating solution is optimal (among all interpolating solutions) with respect to a derived bound on the expected leave-one-out stability, and thus also optimal in the same sense with respect to the excess risk. + +########################## + +########################## + +Pros: +The paper is well written, to the point, and technically (mostly! see cons) sound. +To the best of my knowledge this particular stability analysis is novel, and thus warrants a publication, in particular as the overparametrized case is not that well understood as of yet. + +########################## + +########################## + +Cons: +I actually don't have many cons, I enjoyed the paper. 
There is however one thing that the authors seemed to have missed: +The term V(hat(f_S),z_i) is zero, as hat(f_S) interpolates the training data and z_i is part of it. That doesn't mean that any of the theory is wrong, but that creates two problems in my opinion: +1. The story about leave-one-out stability does not make sense anymore. In fact the expected leave-one-out stability is just the expected risk for interpolating solutions. +2. I would imagine that most of the results can be simplified because of that. I could imagine that all the results hold with essentially all terms regarding hat(f_s) being removed. I think the qualitative conclusions would remain the same though. + +My suggestion would be to leave the paper as is (the results as far as I see also hold for not interpolating solutions), and then discuss the interpolating solutions as an extra case. + +########################## + +########################## + +Scoring: +For now I will have to vote for a rejection, as I am not sure if the problem that I mention can be addressed in one rebuttal phase. But I am happily convinced otherwise, or convinced that I am wrong in any other way. + +########################## + +########################## + +Additional feedback: +- When I first read the title I thought that you wanted to show minimal norm solutions are NOT stable, as it has 'minimal' stability. I understand now that minimal refers to the numerical value of the stability definition, but still the wording was somewhat confusing. (Just to consider, no need to change for me if you think it is correct like that) + +- Equation (2) and also bit later. You use a comma to separate an index ""S,i"", fairly unusual I would say. + +- Remark 4, rework first sentence. + +- Equation (7), in the very basic property of RKHS the kappa would depend on x, for you that seems to follow with one of your assumptions + cauchy-schwarz. I would not consider that a basic property. + +##########################",8,4.0,ICLR2021 +HyywCCnef,3,By3v9k-RZ,By3v9k-RZ,"An interesting, but weird framework for bAbI QA","The paper presents an interesting framework for bAbI QA. Essentially, the argument is that when given a very long paragraph, the existing approaches for end-to-end learning becomes very inefficient (linear to the number of the sentences). The proposed alternative is to encode the knowledge of each sentence symbolically as n-grams, which is thus easy to index. While the argument makes sense, it is not clear to me why one cannot simply index the original text. The additional encode/decode mechanism seems to introduce unnecessary noise. The framework does include several components and techniques from latest recent work, which look pretty sophisticated. However, as the dataset is generated by simulation, with a very small set of vocabulary, the value of the proposed framework in practice remains largely unproven. + +Pros: + 1. An interesting framework for bAbI QA by encoding sentence to n-grams + +Cons: + 1. The overall justification is somewhat unclear + 2. The approach could be over-engineered for a special, lengthy version of bAbI and it lacks evaluation using real-world data +",4,3.0,ICLR2018 +wgwuqqkJ_nZ,1,Oc-Aedbjq0,Oc-Aedbjq0,review,"## Summary +This paper proposes hyper-structure network for model compression (channel pruning). The idea is to have a hyper-network generate the *architecture* of the network to be pruned. 
To do so, the proposed approach use Gumbel softmax together with STE to get around the non-differentiability issue of such design. Additionally, the proposed approach dynamically adjust the gradient for each layer so that earlier doesn't get over-regularized due to the FLOPs regularizer. Empirical results are presented showing better performance compared to alternatives with ablation study on the proposed components (hyper-network and layer-wise scaling). + +## Reasons for score +I'm leaning toward a rejection. I like the idea of both layer-wise scaling and using GRU to model inter-layer dependencies. However, I find them under-studied as the novel components of the paper. While this paper provides seemingly good performance compared to prior methods, they are not really apple-to-apple comparisons, which makes them relative weak signals. Moreover, some of the recent papers that are closely related to this submission are missing. I'm willing to raise my score if the weaknesses I listed below are properly addressed during the rebuttal period. + +## Strengths + +- The paper proposed a novel formulation towards the channel pruning problem. Specifically, the novelty lies in using GRU with the proposed layer-wise scaling. +- The results seem good compared to prior literature (with a caveat of having longer training iterations) +- Ablation of the proposed method ($\lambda$ and LWS) + +## Weaknesses + +- The novel aspect of the work, namely the layer-wise scaling, can be discussed in more detail. For example, why do we expect modifying the scaling of layer-wise gradient to make a better gradient than analytical gradient? In my understanding, analytical gradient gives you the steepest descent direction. In this case, why do we expect we can do better by tuning $\alpha_i$? It seems this is suggesting that we need layer-wise learning rate and such learning rates can be optimized via gradient descent by meta-gradient. It is not clear to me why such a formulation is specific to pruning. Can it benefit training vanilla network by setting $\lambda=0$? Overall, it is a bit mysterious to me why such a formulation *accelerates* training for equation (4), which is empirically shown in Figure 3 (c)(d). A more in-depth theoretical analysis would be very helpful. Without theoretical analysis, I'd be curious to see more settings empirically. If we sweep multiple learning rates, can LWS stop being better? If we use a different optimizer (say SGD), is LWS still better? Most importantly, if we use LARS optimizer [1], is LWS still better? + +- Missing AutoML-based related methods that have strong performances [2-5]. Discussion with these prior methods are needed to better position the proposed method. Comparing with these methods, the proposed method is only comparable. Specifically, DMCP [4] has 47% reduction with top-1 of 76.2 for ResNet-50 and 30% reduction with top-1 of 72.2 for MobileNetV2. Putting these methods into the table and discussion is necessary. + +- Comparison with AMC is limited to numbers from the previous paper. From the formulation of this paper, AutoML-based methods are highly relevant. It would be better if AMC is compared with this paper by using the same empirical setup. This is in fact doable as AMC has open source code. Without such a fair comparison, it is hard to understand what are the benefits of the proposed approach over AMC that solves the exact same problem. The paper has argued in the related work that policy gradient is noisy without really providing the quantitative evidences. 
The numbers from the paper is a really weak signal as AMC only fine-tunes for 30 epochs while this paper fine-tunes for 100 epochs for ImageNet. + +- The other novel aspect of the paper is using GRU for designing the network. However, it is not clear if GRU is necessary and why it is a good design choice. The argument for GRU is cross-layer dependences. I'm wondering what the results would look like if we simply use FC for each layer independently. This can better motivate the so-called cross layer dependencies and better motivate the adoption of GRU. + + +[1] You, Yang, Igor Gitman, and Boris Ginsburg. ""Large batch training of convolutional networks."" arXiv preprint arXiv:1708.03888 (2017). + +[2] Yu, Jiahui, and Thomas Huang. ""AutoSlim: Towards One-Shot Architecture Search for Channel Numbers."" arXiv preprint arXiv:1903.11728 (2019). + +[3] Berman, Maxim, et al. ""AOWS: Adaptive and optimal network width search with latency constraints."" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. + +[4] Guo, Shaopeng, et al. ""DMCP: Differentiable Markov Channel Pruning for Neural Networks."" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. + +[5] Chin, Ting-Wu, et al. ""Towards Efficient Model Compression via Learned Global Ranking."" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. + + +--------- Post rebuttal --------------- + +I've read the rebuttal and I appreciate the additional efforts by the authors. + +Specifically, the authors have addressed my concerns comparing LWS (the proposed method) and LARS. Additionally, the authors have addressed my concerns regarding LWS by conducting more experiments. With another detailed read, I figured LWS updates $\alpha$ in a lookahead fashion. Specifically, $\frac{\partial \mathcal{J}}{\partial \mu}$ in Eq. 6 essentially requires one to compute the loss after the gradient update is being made, which gives it the potential to outperform the analytical gradient. + +Moreover, the authors have run additional experiments to demonstrate the usefulness of GRU, which makes the proposed method more convincing. + +While the authors argued that it is not fair to compare to AutoSlim, AOWS, and DMCP, I disagree. They are all relevant and strong channel pruning methods and the authors should have cited them and discuss the main differences (can be used to prune a pre-trained model or not) in the related work as opposed to omit them entirely. + +Overall, I find the paper interesting and it provides several novel aspects: GRU and LWS. Both are empirically verified to be useful in the channel pruning setting. However, the related work section can be further improved. As a result, I raised my score to weak accept. +",6,5.0,ICLR2021 +SJgZe8sAYH,1,HJeRveHKDH,HJeRveHKDH,Official Blind Review #3,"This paper proposes a method for generating hard puzzles with a trainable puzzle solver. This is an interesting and important problem which sits at the intersection of symbolic and deep learning based AI. The approach is largely GAN inspired, where the neural solver takes the role of a discriminator, and the generator is trained with REINFORCE instead of plain gradient descend. + +Although I'm not an expert in this area, I have found this paper well written and easy to follow. The problem is well motivated, and the approach is sensible. 
As this is a novel problem, the paper also defines their own metric, namely the average time taken to solve the puzzle by given solvers, and the diversity of generated puzzles. It is nice to see that the generator indeed learns to generate puzzles that are significantly harder than random counterparts, while maintaining reasonable diversity. Although I think these are convincing results, my question to the authors is: have you tried or considered other ways of evaluating the generated puzzles? E.g., if you train the guided search solver on the generated puzzles and evaluate it on a random set of puzzles, would you see an improvement? I think this would be interesting to see, which can serve as an alternative evaluation metric. + +My other comments are regarding the experiment section: +1. It would be useful to provide references to the solvers used, both in the adversarial training phase and the evaluation phase, if there is any. +2. More details of the training process would also be valuable. E.g., the training time and stability, common failure modes if any. + +Minors: +1. Figure f3 should be s.count(""A"")==1000 and s.count(""AA"")==0 +2. First sentence under Fig 1, one is give -> one is given +3. Figure 5, f2: 2**(x**2)) == 16 -> 2**(x**2) == 16",8,,ICLR2020 +Hkgdo5Da2m,2,S1lDV3RcKm,S1lDV3RcKm,"Resolving a major challenge in AmbientGAN, by focusing on a very specific application. ","Building upon the success of AmbientGAN by Bora, Price, and Dimakis, this paper studies one of the major issues that is not resolved in AmbientGAN: the distribution of the data corruption is typically unknown. In general this is an ill-defined problem to solve, as the data corruption distribution is not identifiable from the corrupted data. The major insight of this paper is to identify a plausible setting where such identifiability issues are not present. Namely, the corruption itself is identifiable from the corrupted data. The brilliance of this paper is in identifying this niche application of data imputation/missing data/incomplete data. + +Once the goal is set to train a GAN on incomplete data, the solution somewhat follows in a straightforward manner from AmbientGAN. Pass the generated output through a masking operator, which is also trained. Train the masking operator on the masking pattern of the real (corrupted) data. Imputation generator and discriminator also follows in a straightforward manner. + +A major shortcoming of this paper is that the performance of the proposed approach is not fully supported by extensive experiments. For example, a major application of such imputation solution will be predicting missing data in real world applications, such as recommendation systems, or biological experimental data. A experimental setting in ""GAIN: Missing Data Imputation using Generative Adversarial Nets"" provides an excellent benchmark dataset, and imputation approaches should be compared against GAIN in those scenarios. + + +",6,5.0,ICLR2019 +SyxB2-Bd3m,1,S1MQ6jCcK7,S1MQ6jCcK7,"ok papers but lacking of related works, important baselines and well-motivated storyline.","This paper formulates a new deep learning method called ChoiceNet for noisy data. Their main idea is to estimate the densities of data distributions using a set of correlated mean functions. They argue that ChoiceNet can robustly infer the target distribution on corrupted data. + +Pros: + +1. The authors find a new angle for learning with noisy labels. 
For example, the keypoint of ChoiceNet is to design the mixture of correlated density network block. + +2. The authors perform numerical experiments to demonstrate the effectiveness of their framework in both regression tasks and classification tasks. And their experimental result support their previous claims. + +Cons: + +We have three questions in the following. + +1. Related works: In deep learning with noisy labels, there are three main directions, including small-loss trick [1-3], estimating noise transition matrix [4-6], and explicit and implicit regularization [7-9]. I would appreciate if the authors can survey and compare more baselines in their paper instead of listing some basic ones. + +2. Experiment: +2.1 Baselines: For noisy labels, the authors should add MentorNet [1] as a baseline https://github.com/google/mentornet From my own experience, this baseline is very strong. At the same time, they should compare with VAT [7]. + +2.2 Datasets: For datasets, I think the author should first compare their methods on symmetric and aysmmetric noisy data [4]. Besides, the current paper only verifies on vision datasets. The authors are encouraged to conduct 1 NLP dataset. + +3. Motivation: The authors are encouraged to re-write their paper with more motivated storyline. The current version is okay but not very exciting for idea selling. + +References: + +[1] L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, 2018. + +[2] M. Ren, W. Zeng, B. Yang, and R. Urtasun. Learning to reweight examples for robust deep learning. In ICML, 2018. + +[3] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, M. Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NIPS, 2018. + +[4] G. Patrini, A. Rozza, A. Menon, R. Nock, and L. Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017. + +[5] J. Goldberger and E. Ben-Reuven. Training deep neural-networks using a noise adaptation layer. In ICLR, 2017. + +[6] S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev, and R. Fergus. Training convolutional networks with noisy labels. In ICLR workshop, 2015. + +[7] T. Miyato, S. Maeda, M. Koyama, and S. Ishii. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. ICLR, 2016. + +[8] A. Tarvainen and H. Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS, 2017. + +[9] S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. In ICLR, 2017.",5,5.0,ICLR2019 +lfZzHF5AlB,1,G70Z8ds32C9,G70Z8ds32C9,"MCR^2 principle makes sense, details of ReduNet are confusing","The paper formulates an iterative process of deriving encoding of data into feature space as a deep model, called ReduNet, where each layer corresponds to iteration of the optimisation process the feature space according to the MCR^2 principle. The MCR^2 optimisation maps points of different classes into separate subspaces, with volume of each subspace being minimized while the volume of the entire space is maximized. It is analogous to pushing like things together and unlike things apart. The novelty of the paper is in that formulation of the feature optimisation is baked-in into a deep architecture. + +MCR^2 principle seems like a sensible approach to learning, especially given that embedding algorithms (such as face encoding) use it already. 
It’s nice to see some rigorous mathematical treatment on this. However, I get confused pretty early on by the notations. If f(x,\theta)\in \mathcal{R}^k and z=f(x)\in \mathcal{R}^n….then since y=h(z), then f(x,\theta)=h(f(x))…and so f(x,\delta) and f(x) are two different functions. Yet later in the text z=f(x,\theta). And from then on, including equation 11, f(x,\delta)=\psi^L(z_L+\etag(z_{L-1},\theta_{L-1})…which would make it seem f(x,\theta)\in \mathca{R}^n. And what is g(z_l,\theta_1)? Equation 8 tells us what g(z_l,\theta_1) must approximate…but what is it exactly? Is that a neural network, or some model, with parameters \theta_l? Or is Equation 8 a definition of g(z_l,\theta_l)…in which case what is \theta_l? I don’t think the math is necessarily wrong…just notation is confusing and definitions changing/not consistent. + +I have also questions about equation 11, where number of layers is equivalent to iterations while maximizing MCR^2 and the width of each layer corresponds to m, the training points in the dataset. So, in order to do a mapping of an input x, we need to perform L iterative steps using the entire m points every time? Isn’t that equivalent to doing a massive learning process, using the entire dataset, for each mapping? How computationally costly is that? I also don’t quite understand how \psi^l(z_1,\theata_1) works - how does \theta_l change over iterations? Experimental section is not helping me with this, since it’s stated that E, C^j are computed for each layer…but there are no details on what \theta_l is and how g(z_1,\theta_1) is evaluated. And if f(x,\theta)=z^L…then how do we get classification from that? Is it just based on definition of \hat{\pi}^j(z_l) from page 4? + +Finally, I am not sure if the result of obtaining a convnet architecture in ReduNet when translation invariance constraint is added the embedding is all that surprising. Isn't it somewhat obvious that if each layer of ReduNet is invariant in some way, then the entire network is invariant? It feels like that what we are learning here is not that in order to have translation-invariant mapping we must have a convent...but rather that we can obtain a translation invariant deep architecture with translation invariant layers. ",6,3.0,ICLR2021 +BklY-9xwnX,1,ryGgSsAcFQ,ryGgSsAcFQ,An interesting proof on approximation capabilities of deep skinny neural networks,"This paper shows that deep ""narrow"" neural networks (i.e. all hidden layers have maximum width at most the input dimension) with a variety of activation functions, including ReLU and sigmoid, can only learn functions with unbounded level set components, and thus cannot be a universal approximator. This complements previous work, such as Nguyen et. al 2018 which study connectivity of decision regions and Lu et. al 2017 on ReLU networks in different ways. + +Overall the paper is clearly written and technically sound. The result itself may not be super novel as noted in the related work but it's still a strict improvement over previous results which is often constrained to ReLU activation function. Moreover, the proofs of this paper are really nice and elegant. Compared to other work on approximation capability of neural networks, it can tell us in a more intuitive way and explicitly which class of functions/problems cannot be learned by neural networks if none of their layers have more neurons than the input dimension, which might be helpful in practice. 
Given the fact that there are not many previous work that take a similar approach in this direction, I'm happy to vote for accepting this paper. + +Minor comments: +The proof of Lemma 3 should be given for completeness. I guess this can be done more easily by setting delta=epsilon, A_0=A and A_{i+1}=epsilon-neighborhood of f_i(A_i)? +page7: the square brackets in ""...g(x'')=[y-epsilon,y+epsilon]..."" should be open brackets. +page7:""By Lemma 4, every function in N_n has bounded level components..."" -> ""..unbounded...""",7,4.0,ICLR2019 +#NAME?,1,M71R_ivbTQP,M71R_ivbTQP,"Interesting but a rather ad-hoc method, experiments not sufficiently convincing","The paper proposes a method called NeuroChains for extracting a sub-network from a deep neural network (DNN) that can accurately match the outputs of the full network for inputs in a small region of the input space. The goal is to be able to explain the important steps that a DNN takes to get from inputs in a small region of the input space, to its predictions for those inputs. NeuroChains initialises the sub-network as the original DNN except each weight/filter is multiplied with a real-valued score. The L1 norm of these scores is minimised while penalising any deviation of the output of the sub-network from the output of the original network. After the minimisation, weights/filters with a score below some threshold value are removed, leaving a sub-network. In the paper there are experiments that aim to verify three claims: (1) NeuroChains can find sub-networks containing less than 5% of the filters in a deep convolutional neural network while preserving its outputs in some small region of the input space; (2) every filter selected by NeuroChains is important for preserving the outputs and removing one of them leads to considerable drop in performance; (3) the sub-networks extracted for small regions of the input space can be generalized to unseen samples in nearby regions. + +Strengths: +1. The experimental results suggest that NeuroChains can, as claimed, extract sparse sub-networks that match the outputs of the original network for some small region of the input space. These sub-networks are easier to analyse than the full network, which should help with interpretability. + +2. The descriptions of the method and the experiments were very clear. As a result, the experimental results could be reproduced based only on the information in the paper. + +3. The distinction between NeuroChains and related methods was made clear. In particular, comparisons were made to pruning methods and work on interpretable machine learning. + +Suggestions and questions: +1. In the paper it says that scores are not limited to [0,1] due to possible redundancy among filters in the original DNN. If there are redundant filters, their scores could be set to 0 so that these redundant filters are ignored. Why does the possibility of redundant filters motivate unconstrained scores? + +2. The KL divergence between the full network output distribution and the sub-network output distribution was used to penalise the scores. Why was KL divergence used over cross-entropy? The KL divergence is equal to the cross-entropy of the sub-network output distribution relative to the full network output distribution minus the entropy of the full network output distribution. The entropy of the full network output distribution is independent of the scores. Therefore, cross-entropy and KL divergence have the same minima and gradient with respect to the scores. 
Since the cross-entropy is a little cheaper to compute than the KL divergence, wouldn’t it be better to minimise the cross-entropy instead of the KL divergence? + +3. In section 4, paragraph 1, the paper states ""(2) every filter selected by NeuroChains is important to preserving the outputs since removing one will leads to considerable drop in performance"". However, in Figure 3 (right) there are many filters with scores greater than the threshold tau = 0.1 that appear to cause almost 0 change in the predicted probabilities when removed. Therefore, the claim that all filters found by NeuroChains are important is not well supported by the experimental results. + +4. In Figure 3 (right), the magenta line appears to be a line of best fit, which does not by itself imply correlation. However, in the paper it says ""magenta line implies strong correlation between the two”. If the goal is to demonstrate strong correlation, wouldn’t it be better to report the Pearson correlation coefficient or Spearman correlation coefficient instead of the line of best fit? + +5. On page 8, paragraph 1, the paper references Figure 4 and states that sub-networks extracted for local regions of the input space can generalise to nearby regions because their test fidelity (accuracy with which the output of the full network can be reproduced by the sub-network) remains high when the number of nearest images in the validation set is below 100. However, in my opinion, Figure 4 seems to suggest that as the number of nearest images increases, the test fidelity starts to decrease immediately and at a roughly constant rate until 180 nearest images is reached. Therefore, I don't think that Figure 4 provides strong evidence that the sub-networks extracted for local regions of the input space generalise well to nearby regions. + +Overall, the NeuroChains method appears to do a good job at extracting sub-networks from large DNNs and these sub-networks make predictions of the original DNN significantly easier to interpret. However, I don’t fully understand some of the decisions made in the design of the algorithm (questions 1. and 2.) and I am not totally convinced by some of the conclusions drawn from the experimental results (questions 3., 4. and 5.).",5,2.0,ICLR2021 +n2PiTrWYmb9,4,083vV3utxpC,083vV3utxpC,"Easy to follow, and want to see more analysis and details","In this paper, the authors have proposed a new approach to determine the optimized subset of weights instead of simply conduct full weights updating. In order to better update the weights, they measure each weight's contribution to the analytical upper bound on the loss reduction from two sides (global and locally). After evaluation, a weight will be updated only if it has a large contribution to the loss reduction given the newly collected data samples. The experimental results show that their method can achieve a high inference accuracy while updating a rather small number of weights. + +Strength: +The idea is easy to follow and seems applicable to be adopted. +Paper is well structured and written in general. + +Weakness: +1. Lack of explanations: + (1) from reward measurement side (motivation side): +In the introduction, the authors did not explain why they pick the loss as the weight measurement criteria instead of others (e.g., accuracy). While they report the accuracy in the evaluation part as one evaluation results. 
+ (2) from the update algorithm side: +The paper did mention their weights updating method is determined via both global and local contributions, and they talked in 3.1 'It turns out experimentally, that a simple sum of both contributions leads to sufficiently good and robust final results'. however, it is not convincing that those two facts can have the equal impacts on the final prediction. + (3) from the updating setting side: +It seems that the defined updating ratio is one important factor as discussed in section2, not enough contents are provided in the paper to describe how to calculate this ratio. + (4) re-initialize mechanism: +Re-initialize is also another important factor in the weight updating as discussed in section 3.2 'trained from the last round for a long sequence of rounds. Thus, we propose to re-initialize the weights after a certain number of rounds', however, the computation of how many rounds the network needs to be re-initialized seems not plausible. +2. Evaluation: + (1) lack of comparison: It would be good if authors can apply their method on some recent works (or models), which can also show others how flexible their method can be adopted or applied + (2) there is no contents in the paper showing how authors decide their experiment settings, for example, why authors always select k (weight changing ratio) as very small 0.01, 0.05, 0.1, 0.2 instead of 0.5 + (3) in Fig2, it is curious why authors apply different settings on different datasets when comparing their methods + (4) for section 4.2, it would be good if the authors can also try other initialization ways, for example using the average weights in each round window instead of directly using the latest round weights + (5) in Table 1, it seems full updating still can beat the combined method, however, in Fig2, authors did not explain why DPU has better performance than other settings even compare with the full update + (6) in Fig3, while DPU with re-init can achieve best performance than others, there is no explain about why it did not perform well in the first few rounds + (7) the authors did not mentioned how many runs which they have conduct their experiments to provide the results +3. Some parts need to be further improved for example + (1) Fig3, it would be good if authors can add some texts information for {1000, 5000}; + (2) Section3 is a little bit hard to follow need to be reorganized + (3) Related work can be further improved to better cover most recent works",6,3.0,ICLR2021 +SJGBLcYxG,1,Hk6WhagRW,Hk6WhagRW,Review 'Emergent Communication through Negotiation',"The authors describe a variant of the negotiation game in which agents of different type, selfish or prosocial, and with different preferences. The central feature is the consideration of a secondary communication (linguistic) channel for the purpose of cheap talk, i.e. talk whose semantics are not laid out a priori. + +The essential findings include that prosociality is a prerequisite for effective communication (i.e. formation of meaningful communication on the linguistic channel), and furthermore, that the secondary channel helps improve the negotiation outcomes. + +The paper is well-structured and incrementally introduces the added features and includes staged evaluations for the individual additions, starting with the differentiation of agent characteristics, explored with combination of linguistic and proposal channel. Finally, agent societies are represented by injecting individuals' ID into the input representation. 
+ +The positive: +- The authors attack the challenging task of given agents a means to develop communication patterns without apriori knowledge. +- The paper presents the problem in a well-structured manner and sufficient clarity to retrace the essential contribution (minor points for improvement). +- The quality of the text is very high and error-free. +- The background and results are well-contextualised with relevant related work. + +The problematic: +- By the very nature of the employed learning mechanisms, the provided solution provides little insight into what the emerging communication is really about. In my view, the lack of interpretable semantics hardly warrants a reference to 'cheap talk'. As such the expectations set by the well-developed introduction and background sections are moderated over the course of the paper. +- The goal of providing agents with richer communicative ability without providing prior grounding is challenging, since agents need to learn about communication partners at runtime. But it appears as of the main contribution of the paper can be reduced to the decomposition of the learnable feature space into two communication channels. The implicit relationship of linguistic channel on proposal channel input based on the time information (Page 4, top) provides agents with extended inputs, thus enabling a more nuanced learning based on the relationship of proposal and linguistic channel. As such the well-defined semantics of the proposal channel effectively act as the grounding for the linguistic channel. This, then, could have been equally achieved by providing agents with a richer input structure mediated by a single channel. From this perspective, the solution offers limited surprises. The improvement of accuracy in the context of agent societies based on provided ID follows the same pattern of extending the input features. +- One of the motivating factors of using cheap talk is the exploitation of lying on the part of the agents. However, apart from this initial statement, this feature is not explicitly picked up. In combination with the previous point, the necessity/value of the additional communication channel is unclear. + +Concrete suggestions for improvement: + +- Providing exemplified communication traces would help the reader appreciate the complexity of the problem addressed by the paper. +- Figure 3 is really hard to read/interpret. The same applies to Figure 4 (although less critical in this case). +- Input parameters could have been made explicit in order to facilitate a more comprehensive understanding of technicalities (e.g. in appendix). +- Emergent communication is effectively unidirectional, with one agent as listener. Have you observed other outcomes in your evaluation? + +In summary, the paper presents an interesting approach to combine unsupervised learning with multiple communication channels to improve learning of preferences in a well-established negotiation game. The problem is addressed systematically and well-presented, but can leave the reader with the impression that the secondary channel, apart from decomposing the model, does not provide conceptual benefit over introducing a richer feature space that can be exploited by the learning mechanisms. Combined with the lack of specific cheap talk features, the use of actual cheap talk is rather abstract. 
Those aspects warrant justification.",6,3.0,ICLR2018 +EIqg7wKvdeU,3,ioXEbG_Sf-a,ioXEbG_Sf-a,Learnable re-weighting of samples for actor-critic algorithms,"The paper proposes a generally applicable modification to experience sampling in the context of actor-critic algorithms using a Q function as a critic. The modification is called ""Likelihood-free Importance Weights"" (LFIW). The authors describe the approach in Appendix A in the form of pseudocode. Comparing to a generic actor-critic algorithm, the changes include the keeping of two replay buffers (""fast"" and ""slow"") and inclusion of an additional re-weighting function w which in turn is used in the update of the Q function. The paper includes a thorough performance comparison on MuJoCo and DM Control Suite. + +The results are good, but the authors seem to use a weak implementation of SAC. For comparison, I am referring to SAC as implemented in https://github.com/tensorflow/agents/blob/v0.6.0/tf_agents/agents/sac/sac_agent.py#L62-L634 Using this implementation I am getting, e.g. for the Humanoid-v2 environment, at 500K steps results above 4000, compared to 3189 (SAC+LFIW) and 2033 (SAC) reported in the paper. Hence it is hard for me to assess whether LFIW offers a real improvement of the SOTA or perhaps fixes some problems of the underlying implementation of SAC. + +The mathematical analysis contained in Theorem 1 is interesting, but in my opinion, it is written confusingly. It would be better to decompose it into two separate statements: +- the first statement stating the inequality with a simple proof based on the convexity of the square function, +- the second statement proposing a counter-example in the form of Q+epsilon for appropriately small epsilon. + +Also, the statement of the theorem is slightly weaker than I would like: can we just prove, that the counterexample exists regardless of gamma? The current statement says that the mapping is not a gamma-contraction, but one can imagine, that relaxing gamma would still lead to a contractive mapping. + +The example presented in Figure 1 is interesting, though in my opinion, somewhat detached from the main focus of the paper which in my opinion is the analysis of Algorithm 1. Also, the three-state MDP may seem too simple to conclude about the performance of the sampling method.",5,4.0,ICLR2021 +Bkxhog2O2Q,3,S1x2Fj0qKQ,S1x2Fj0qKQ,"Interesting idea, Convince Results","This paper tends to address the instability problem in GAN training by replacing batch normalization(BN) with whitening and coloring transform(WC) to provide a full-feature decorrelation. This paper consider both uncondition and condition cases. +In general, the idea of replacing BN with WC is interesting and well motivated. + +The proposed method looks novel to me. Compared with ZCA whitening in Huang et al. 2018, the Cholesky decomposition is much faster and performs better. The experiments show the promising results and demonstrate the proposed method is easily to integrate with other advanced technic. The experimental results also illustrate the role of each components and well supports the motivation of proposed method. + +My only concern is that the proposed WC algorithm seems to have capability of applying to many tasks including discriminative scenario. This paper seems to have potential to be a more general paper about the WC method. Why just consider GAN? What is the performance of WC compared with BN/ZCA whiten in other tasks. 
It would be better if the authors can elaborate the motivation of choosing GAN as the application. ",7,4.0,ICLR2019 +SyxDSP-CtH,3,S1g490VKvB,S1g490VKvB,Official Blind Review #3,"The authors propose a mean-field analysis of recurrent networks in this paper. I have a few concerns about this paper: + +(1) The most serious concern about their analysis comes from their assumption. They assume the weight W is independent on the state s_t (Page 4, Lines 5-6). The recursive structure is the most complicated part of the recurrent networks, and also its major difference from feedforward networks. In current networks, the hidden states become (or even heavily) dependent on the weight due to recursion. + +When making such an assumption, the recurrent networks just become similar to feedforward networks. The authors' claim that ""the untied weights assumption actually has long history of yielding correct prediction"" is not ungrounded and questionable. + +(2) The paper is not well written. Some assumptions are not explicitly stated. They are placed in the text without any highlight. Some theoretical statements are claimed without any rigorous proof. A few approximations are applied without clearly explaining about the resulting approximation errors. This is not acceptable, especially when the authors claim they are developing a ""THEORY"". + +(3) The experiments only consider the MNIST and CIFAR10 datasets. These datasets are mainly used for evaluating feedforward-type convolutional neural networks. Even though the authors might like their experiments, for the sake of the main stream users of recurrent networks. They should at least include experiments in conventional sequential prediction problems, e.g., speech, time series, machine translations. + +(4) Compared with other state of the art methods, their experimental results on CIFAR10 is too weak. I cannot believe such weak results can be used to make meaningful justifications. + + + + + + + + +",1,,ICLR2020 +7RHeTJnpnB3,4,H38f_9b90BO,H38f_9b90BO,The paper is clear and in good quality,"This paper presents the one technique using label propagation with meta learning. The label smoothness is used to pseudo label the nodes in the graph. The experimental results look promising in two conditions of label noises. The paper is presented clearly and easy to read. Overall, the quality is good. + +The paper present the idea and experiments clearly. + +I checked a few literature and believe this work is original. + +Pros: +1. clear presentation +2. method is simple but seem very effective +3. the experimental results outperformed the state-of-the-arts in both synthetic label noise and real noise scenarios. + +Cons: +1. the contributions can be challenged. If check the GNN with label noise, I do can find some literature published on CVPR and ECCV. And the third one is evaluation result, cannot be categorized as a contribution. +",7,4.0,ICLR2021 +BJe77yJc27,3,HJMsiiRctX,HJMsiiRctX,Unconvincing Results,"The authors present an algorithm that incorporates deep learning and physics simulation, and apply this algorithm to the game Flappy Bird. The algorithm uses a convolutional network trained on agent play to predict the agent’s own actions given a sequence of frames. 
Using this action estimator output as a prior over an action distribution (parameterized by a Dirichlet process), the algorithm iteratively updates the action by rolling out a ground-truth physics simulator of the environment, observing whether this ground-truth simulation yields negative reward, and updating the action accordingly. + +While I find the authors' introductory philosophy largely compelling (it draws inspiration from developmental psychology, learning to model the physical world, and the synthesis of model-based and model-free learning), I have concerns with most other aspects of the paper. Specifically, here are a few points: + +1) The authors only apply their algorithm to a single game (Flappy Bird), a game that has no previously established benchmarks. In fact, while there is no prior work in the literature on this game (perhaps because it is considered very easy), some unofficial results suggest that it is solvable by a straightforward application of existing methods (see this report: http://cs229.stanford.edu/proj2015/362_report.pdf). The authors do apply one baseline (out-of-the-box DQN) to this game, but the reported scores are suspiciously low, particularly in light of the report linked above. No training curves or additional baselines are shown, and no prior work on this game in the literature exists to compare against. + +2) The authors’ algorithm uses privileged information which eliminates the possibility for a fair comparison to baselines. Specifically, their algorithm uses ground-truth state (not just image input), and a ground-truth physics simulator (which should be an enormous advantage). Their one baseline (DQN) does not have either of these sources of privileged information, hence cannot be a fair comparison. + +3) The authors’ algorithm is not general-purpose. Because the algorithm itself uses a ground-truth environment-specific state, a ground-truth environment-specific simulator, and relies on a “crash boolean” (whether the bird hit a tree) specific to this game, it cannot be applied out-of-the-box on a different environment. + +4) The authors make some claims that are too strong in light of the reported results. For example, they claim that “the performance of the model outperforms all model-free and model-based approaches” (section 1), while they do not even compare against any model-based baselines (and only a single model-free baseline, DQN, which is not state-of-the-art anymore). + +Overall, I would recommend the authors choose a game or set of games that has/have established baselines in the literature, come up with a general-purpose algorithm which doesn’t rely on a ground-truth physics simulator, and more rigorously compare to existing methods. +",3,4.0,ICLR2019 +o8FAdXc9w27,4,MbM_gvIB3Y4,MbM_gvIB3Y4,Conditions in which mutual information objectives are sufficient for reinforcement learning,"This paper studies which commonly-used mutual information objectives for learning state representations are sufficient for reinforcement learning. In particular, they provide counterexamples to show that state-only and inverse MI objectives are not Q*-sufficient, while proving that forward MI is Q*-sufficient. They validate their findings empirically with experiments in a simple RL domain. + +There has been a lot of work recently in reinforcement learning that uses mutual information objectives resulting in performance gains, so it’s very fascinating to see a finding that these objectives may be theoretically insufficient, despite their empirical success. 
The counterexamples shown are simple and the authors do a good job of explaining the intuition. I think this paper will be of great interest to the ICLR community. + +One question: while J_state and J_inv are not sufficient, are there conditions in which they can be? If these conditions are limited to just the reward, could this somehow give insight on how to design better reward functions? + +Minor: there’s a reference in the last paragraph on page 6 to Figure 5 which I think should be to Figure 4. +",7,3.0,ICLR2021 +BypzQJLNg,3,Hy8X3aKee,Hy8X3aKee,Review,"The paper compare three representation learning algorithms over symbolized sequences. Experiments are executed on several prediction tasks. The approach is potentially very important but the proposed algorithm is rather trivial. Besides detailed analysis on hyper parameters are not described. +",4,3.0,ICLR2017 +_rN8KFzJMmb,3,DC1Im3MkGG,DC1Im3MkGG,Interesting connection but could be supported with more theoretical guarantees,"The main contribution of the paper is to highlight the similarity between two active areas in ML namely ""domain generalization"" and ""fairness"". Further, the paper proposes an approach inspired by recent developments in the fairness literature for domain generalization. The high-level idea is that similarly to the way that fair algorithm are able to improve the worst-case accuracy of predictors across different groups without knowing the sensitive attributes, perhaps we can use these ideas to domain generalization when environment partitions are not known to the algorithm. In some sense, in both of these research areas the goal is to design robust algorithms. Similarly, the paper uses the idea from domain generalization to design fair algorithms w.r.t. a notion called ""group sufficiency"". The idea is to somehow infer the ""worst-case"" subgroup (i.e., the one that our algorithm has the worst accuracy on it) and then using a round of auditing improve the performance of the algorithm across all subgroups. + +The authors have supported their approach with empirical evaluations. In particular, I find the result on CMNIST quite interesting where the new algorithm as opposed to the standard approach like ERM will not be fooled by the spurious feature and can infer the useful environment. + +While the paper has introduced (to best of my knowledge) a new concept, it seems that are many interesting questions that could show the applicability of the connection better are not yet answered (e.g., bi-level optimization EIIL). This could also help the paper to be supported with more provable guarantees. In general the paper is exploring a new connection between two areas and has shown its efficacy in practice and I believe it can lead to further works on this topic. + + +Minor comments: +- define the notion of ""group sufficiency"" explicitly in the paper. I could not find the definition of the notion in words till in the caption of Figure 2 on page 8 and is formally defined on page 12! +-page 5: poorly. . Consider -> poorly. 
Consider +-page 6: generalizattion -> generalization +-page 7: graysacle ->grayscale +-page 14: exagerated -> exaggerated +-page 14: orginal -> original +-page 17: implicily -> implicitly",6,2.0,ICLR2021 +r1lg5FuhoQ,1,r1GbfhRqF7,r1GbfhRqF7,Not convinced that improvements are from better power,"The manuscript entitled ""Kernel Change-Point Detection with Auxiliary Deep Generative Models"" describes a novel approach to optimising the choice of kernel towards increased testing power in this challenging machine learning problem. The proposed method is shown to offer improvements over alternatives on a set of real data problems and the minimax objective identified is well motivated, however, I am not entirely convinced that (a) the performance improvements arise for the hypothesised reasons, and (b) that the test setting is of wide applicability. + +A fundamental distinction between parametric and non-parametric tests for CPD in timeseries data is that the adoption of parametric assumptions allows for an easier introduction of strict but meaningful relationships in the temporal structure---e.g. a first order autoregressive model introduces a simple Markov structure---whereas non-parametric kernel tests typically imagine samples to be iid (before and after the change-point). For this reason, the non-parametric tests may lack robustness to certain realistic types of temporal distributional changes: e.g. in the parameter of an autoregressive timeseries. On the other hand, it may be prohibitively difficult to design parametric models to well characterise high dimensional data, whereas non-parametric models can typically do well in high dimension when the available data volumes are large. In the present application it seems that the setting imagined is for low dimensional data of limited size in which there is likely to be non-iid temporal structure (i.e., outside the easy relative advantage of non-parametric methods). For this reason it seems to me the key advantage offered by the proposed approach with its use of a distributional autoregressive process for the surrogate model may well be to introduce robustness against Type 1 errors due to otherwise unrepresented temporal structure in the base distribution (P). In summarising the performance results by AUC it is unclear whether it is indeed the desired improvement in test power that offers the advantages or whether it is in fact a decrease in Type 1 errors. + +Another side of my concern here is that I disagree with the statement: ""As no prior knowledge of Q ... intuitiviely, we have to make G as close to P as possible"" interpretted as a way to maximise test power; as a way to minimise Type 1 errors, yes. + +Across change-point detection methods it is also important to distinguish key aspects of the problem formulation. One particular specification here is that we have already some labelled instances of data known to come from the P distribution, and perhaps also some fewer instances of data labelled from Q. This is distinct from fully automated change point detection methods for time series such as automatic scene selection in video data. Another dissimilarity to that archetypal scenario is that here we suppose the P and Q distributions may have subtle differences that we're interested in; and it would also seem that we assume there is only one change-point to detect. Or at least the algorithm does not seem to be designed to be applied in a recursive sense as it would be for scene selection. 
+ +Finally there is no discussion here of computational complexity and cost?",7,4.0,ICLR2019 +WtsEdbF_YS8,4,uUAuBTcIIwq,uUAuBTcIIwq,Experimental section needs more work,"Response to rebuttal: the authors have drastically improved the quality of the submission with the new experiments and clarifications, I have therefore increased the score to a weak accept. + +------------------------------------------- + +This paper introduces a non-iid VAE architecture that uses a mixture of gaussian latent space and a global latent variables shared among all the elements of a mini batch to capture global information in correlated datapoints in an unsupervised way. + +Overall the paper in well written, and I believe in focuses on two important research directions, namely unsupervised learning of disentangled representations and domain alignment. The model itself is novel and well explained, but I feel the technical explanation is missing intuition on how the model can learn disentanglement in beta from purely random batches, which is not obvious to me. + +My biggest concern is in the experimental section, that I did not find convincing enough for a number of reasons: +1. I find it hard to understand if the improvements come from the introduction of the d or the beta latent variables, or a combination of both. How does the model perform in ablation studies in which you remove just one of this components while leaving the others unchanged?  +2. In the single-datasets experiment in section 4.1 how do you define what constitutes a local vs global factors?  Currently some of the chosen factors in Figure 3 seem quite arbitrary. Why is light a local factor but contrast a global one? Why is hair local but beard global?  +3. The quality of the images is not great to be honest (beta-VAE paper has more convincing ones, just to name a single work), and it is not easy to understand whether the low quality results are due to the fact that as you say you have not validated in depth the networks used or because of flaws in the methodology +4. How would a beta-VAE perform with the same setup of the experiment in 4.1? I would not be surprised if it could capture the same features as your model. It is true as you claim that your method does not require the tuning of the beta hyperparameter in the ELBO, but the UG-VAE needs tuning of the dimensionality of d and beta, and is a more complex architecture than a beta-VAE so it is harder to implement and will take longer to train. +5. It is not clear to me in Figure 4.1. why you are traversing z space in this way, but perhaps I misunderstood what you are doing. How are you guaranteed that you will follow the data manifold? The ML-VAE results might be off just because of this.  +6. I believe the more exciting application of this model would be for domain alignment. Why haven't you focused on more multi-datasets experiments? +7. How would a gm-vae baseline with 2 clusters perform with the same setup of the experiment in section 4.2?  + +In its current state I believe this paper is not ready for acceptance, but I hope the authors will be able to clarify some of my concerns in which case I will increase the score. + +Minor comment: +* The second paragraph of the introduction is giving a lot of details on related work. I would recommend to move this discussion to the related work section, and leave the introduction for higher level discussions that only aim at giving intuition to the reader. 
+",6,4.0,ICLR2021 +SJVkJ0Axf,3,r1Zi2Mb0-,r1Zi2Mb0-,Computational power,"This paper experiments the application of NAS to some natural language processing tasks : machine translation and question answering. + +My main concern about this paper is its contribution. The difference with the paper of Zoph 2017 is really slight in terms of methodology. Moving from a language modeling task to machine translation is not very impressive neither really discussed. It could be interesting to change the NAS approach by taking into account this application shift. + +On the experimental part, the paper is not really convincing. The results on WMT are not state of the art. The best system of this year was a standard phrase based and has achieved 29.3 BLEU score (for BLEU cased, otherwise it's one point more). Therefore the results on mt tasks are difficult to interpret. + +At the end , the reader can be sure these experiments required a significant computational power. Beyond that it is difficult to really draw meaningful conclusions. ",3,4.0,ICLR2018 +DjT0Q-VIoet,2,MmcywoW7PbJ,MmcywoW7PbJ,Review,"Summary +-- +This paper proposes an unsupervised learning objective for learning perceptual goal-conditioned policies. The goal is to enable unsupervised discovery of high-level behaviors in tandem with a perceptual-goal conditioned policy that can achieve these behaviors. The learning proceeds by training one policy to exhibit diverse behaviors; the states induced by these behaviors are then rendered and used as target goal states for a separate goal-conditioned policy. + +The learning objective is to maximize a lower-bound on sum of two terms -- (1) the mutual information between the behavior variable and states with a behavior-conditioned policy; states from this behavior-conditioned policy are transformed into perceptual goals that serve as input to a separate policy that forms (2) the mutual information between perceptual goals and the states it induces. + +The learning algorithm operates via alternating optimization, in which the skill-conditioned exploration policy is learned jointly with a skill 'discriminator' (inference network), and then the perceptual-goal conditioned policy is learned using the discriminator as a reward signal, which essentially estimates the extent to which a robot is achieving a skill given a perceptual goal along the skill-policy trajectory. + +A glut of experiments are used to investigate whether the method learns a meaningful skill reward function, whether it can achieve goals in various environments, how the method compares to related methods, and whether the specific 'disentanglement' inductive bias for constructing the policy is useful. The experiments demonstrate favorable performance over existing methods for these tasks. + +Quality +-- +The goal, method, and experiments are likely high quality, given my understanding. However, there are significant gaps in clarity that reduce my certainty in this assessment, and relatedly, there is some important missing discussion on the specific differences between the proposed method and prior work. + +Clarity +-- +- The learning algorithm is ambiguous: which goal(s), specifically, are used in the second learning stage? The current algorithm reused the 't' time index, which makes this unclear. I suspect it's just the goal corresponding to the last timestep from the first stage of the algorithm, but I'm not sure. +- The state representation for the archery task is unclear +- The 'fast imitation' procedure is unclear. 
The paper says the goals are the rendered states induced by the abstract policy, so how do the expert demonstrations get factored in? Is the learning algorithm different (is the abstract policy trained with the expert demonstrations somehow)? This is a significant ambiguity that makes it difficult to interpret the results of the imitation experiments. It's hard to guess at, because the conceptual difference between the standard imitation learning problem and a prototypical unsupervised ""RL"" algorithm is quite large. +- The presence of two environments in Figure 1 is quite confusing. It's hard to tell from the context of the figure alone whether the policies are operating simultaneously in the same environment, simultaneously in separate equivalent environments, or at separate times in equivalent environments. +- The \tilde notation and relationship between the \tilde an not-\tilde variables needs to be discussed in 3.1 + +Originality +-- +The learning objective seems to be quite similar to the DIAYN and DISCERN objectives, and the task of learning to condition on general perceptual goal in unsupervised setting is shared by RIG and DISCERN. Discussion of specific algorithmic and assumption differences between the proposed method and these approaches is quite necessary, but unfortunately missing. + +Significance +-- +In absence of knowing more about the specific differences between the proposed work and related works and the imitation learning experiments, the only thing that is clear is that the method seems relatively performant on an existing well-motivated task across a large range of settings, and compares favorably to existing methods on this task. + +Other points +-- +Derivation is needed (e.g. in appendix) to show that (3) is a lower bound of the last line of (2). I don't think this aspect of the derivation follows from Jensen's inequality, as stated in the paper. +- In the intro (para 3) ""an unsupervised method in RL"" is unclear to an RL reader unfamiliar with ""unsupervised RL"". I think this paper should explicitly define ""unsupervised RL"" to mean ""RL with an intrinsic reward function"" or something equivalent, and ""intrinsic reward function"" to mean a learning objective that can be applied in place of an alternative reward function across different MDPs. +- Intro para 3 ""maximize an information theoretic objective"" is vague, because all learning objectives for probabilistic models are technically ""information theoretic"" -- all probabilistic models have a relationship to information.",7,4.0,ICLR2021 +xiPEuhEThyq,4,vcopnwZ7bC,vcopnwZ7bC,"Clear, easy to follow, missing comparisons. ","Summary: +Paper introduces Ordered Memory Policy Network (OMPN) with an objective to learn sub-task decomposition and hierarchy from demonstrations in the context of Imitation Learning under unsupervised and weakly supervised settings. Authors approach the problem of uncovering the task substructure from the architecture design lens where the goal is to design architectures that have the right inductive biases leading to the discovery of task sub-structures. The proposed solution views sub-tasks as finite state machines represented as memory banks that are updated via top-down and bottom-up recurrences.  + +Strengths: +Paper is well polished and easy to follow. OMPN is shown to be effective in two domains. OMPN advantage in partially observable environments is also effectively demonstrated.  + +Discussion: +My reservations are with the evaluations.  
+- The choice of baselines also seems a bit narrow. Comparisons to related methods dealing with sub-task decomposition and organization are missing. For example, a comparison to the Relay Policy Learning[1] (which the paper claims to be the most related work) is missing. Another closely related method Learning Latent Plans from Play[2] (https://learning-from-play.github.io/) is also missing. +- Selected tasks have a very shallow task hierarchy and sparse task structure. The ceiling of the approach and limitations aren't clearly outlined and are less evident. This is also evident from the tasks considered in [1] and [2]. These methods have shown to be effective with play demonstrations in rich scenes such as kitchen and study-table scenes where are underlying task-structure is much more convoluted to uncover. Without similar comparisons, it's hard to evaluate the strength of OMPN. I'd strongly advise on including experiments under similar settings which will make the submission really strong.",6,4.0,ICLR2021 +BygNsRkAKH,3,S1elRa4twS,S1elRa4twS,Official Blind Review #1,"The paper focuses on a very interesting problem, that of pre-training deep RL solutions from observational data. The specific angle selected for tackling this problem is through meta-learning, where a set of Q-functions / policies are learned during pre-training, and during testing the network identifies the training set MDP matching the data to extract a transferable solution. + +The main strength of the paper is to draw attention to the issue of pre-training in RL, which is much less studied than in supervised learning, where it has been shown to have tremendous impact. The paper also provides reasonable coverage of a large amount of related work. + +Unfortunately I really struggled (despite careful reading) to understand several aspects of the proposed methods. The training of function f() is not clearly explained; is this done as per Sec.2.4? What is the loss function for this? Is it done end-to-end as per the pipeline in Fig.1 (right), so using a gradient propagated back from Q? What is the purpose of Proposition 1? The more interesting point seems to be that solutions can “converge to a degenerate solution”, but this is not formally defined (i.e. how do you assess degeneracy, and how is that information used?) Furthermore Proposition 1 seems limited to discrete state/action spaces. Is this the case for TIME in general? The results on Mujoco suggest not. Regarding the second phase of the pipeline, it is briefly mentioned that “P has low capacity” (bottom of p.4), but this is not explained further. Is this due to a generalization issue, or a computational issue? Why would P be low capacity but not E? How does this actually impact the implementation? + +As a higher-level comment: is it really necessary (preferable) to infer the identity of a specific train MDP (using the function f)? This is used as a premise in this work, but I am not convinced this is desirable (for good generalization) or scalable (in the case of several observed meta-train MDPs). What is the advantage of proceeding in this way? Much of the work on pre-training in supervised learning just exposes the learner to large amounts of observational data to pre-condition the solution. + +Finally, I have some concerns with the results as presented in the paper. There are some details lacking, for example how specifically are the meta-test MDPs chosen for the Mujoco experiments? How similar/different from the meta-train MDPs? 
This is an issue because in Sec.5.1 the meta-test MDPs are chosen to coincide with meta-train MDPs. So I am definitely interested in seeing how well the method actually generalizes to unseen MDPs, so need more detail on this part of the experiment. I would also like to see a few additional naïve baselines. First, what is the result if you do the pre-training as specified, and then at test time you randomly sample one of the pre-trained MDPs (rather than use the identification function). Second, what is the result if you put all the meta-train data into a single batch, train a solution with SAC, then use this as a pre-trained solution (rather than the current “SAC trained from scratch”), allowing more training at test time. Both these are useful sanity checks to verify the effectiveness of the proposed approach. + +============ +Post-rebuttal comments: + +1. My question, as stated in the review is: "" how specifically are the meta-test MDPs chosen for the Mujoco experiments"". I read that they are tested on unseen MDPs. I want to know how those unseen MDPs are selected / specified, and again as per my review: ""How similar/different from the meta-train MDPs?"" + +2. You should entertain the possibility that perhaps Sec.3 is not as clear as you think it is. For example the MDP (S,A,\hat{T}, \hat{R}, \gama) is not defined as ""the wrong MDP"" - which I assume is what you mean by degenerate solution? + +3. I would like to know what are the results from the 2 naive baselines I described, as good sanity check. + +In general, the rebuttal is intended to be a conversation to clarify understanding of the work. It is insulting to the reviewer to say that they did not read the paper carefully when they indicate they did. It is much more productive to assume that many of your other (future) readers might have the same need for clarification, and you should be thankful for help provided by the reviewers to achieve this.",1,,ICLR2020 +URIYF9QPf7c,2,Py4VjN6V2JX,Py4VjN6V2JX,fair idea with strong experiments,"Summary: This paper proposes an audio-visual self-supervised learning approach based on two cross-modal contrastive loss that learns audio-visual representations that can generalize to both the tasks which require global semantic information and localized spatio-temporal information. Extensive experiments on 4 task demonstrate the usefulness of the learned representation. + +Strengths: ++ The paper is nicely written and well motivated. Existing works tend to learn either global representations or local representations, while this work aims at learning versatile representations that generalize well to both scenarios. ++ Good observation and analysis to construct triplets from different spatial and temporal span to capture either global information or local information. A tailored global-local contrastive learning network is desgined to realize the proposed idea. ++ Extensive experiments are performed on 4 task across a number of datasets to demonstrate the generality of the learned representations. + +Weakness: +- Among the tasks evaluated on, only Table 3 compares to prior self-supervised learning approaches. This demontrates the proposed method learns better global representations, but no results are shown to prove that the prior ssl methods fail to learn local representations. In Table 1 and Table 2, the method only compares to prior state-of-the-art methods of the corresponding task. It would be more convincing to compare to the representations of some of the methods in Table 3. 
+- For sound source localization, only some qualitative results are shown. Why not showing the quantitative results (IOU)? Only a figure of selected qualitative results is not convincing to this reviewer. +- For Table 1 and Table 2, the prior methods all use different network settings from the proposed network. How to tell whether the gain is from a better network architecture or the proposed global-local representation learning method? Without comparing in an apples-to-apples manner with other self-supervised contrastive learning methods (e.g., GDT, AVID, etc.), it would be unconvincing that the proposed global-local audio-visual contrastive learning method indeed learns better representations that have the suggested properties. +- The related work is somewhat short and unorganized. It would be much better to re-organize the related work section (contrastive self-supervised representation learning, audio-visual learing, etc. ) and highlight the differences to each group. + +Justification of rating: +The paper proposes a decent idea to learin global-local reprentations and evaluate on various tasks/datasets. However, some necessary comparisons are missing in order to demonstrate the effectivenss of the proposed method. This reviewer is happy to raise the score if the rebuttal can clarify the concerns. + +Post-rebuttal: +Thanks for the clarifications in the rebuttal. It addresses some of my concerns. However, I am still concerned about the unfair comparisons of baselines using different settings, and the newly added ssl baselines all outform the proposed method on LRS by a large margin. Therefore, I would like to keep my original rating.",5,4.0,ICLR2021 +8Dx8xypI6k,1,#NAME?,#NAME?,"The results are promising, but many parts of the work remained to be explored/reported.","** Summary ** + +(1) The authors proposed a translation system with an external memory. Given a sentence $x$ to be translated, they first retrieve a $(tx, ty)$ sentence pair from the training set through ``SEGMENT-BASED TM RETRIEVAL’’ defined in Section 4.1, where $tx$ and $ty$ are from source and target languages respectively. Then the $(x,tx,ty)$ are fused together to get the eventual translation, where authors design several ways to achieve that. + +(2) Specifically, in the encoder side, the M-BERT model is leveraged to jointly encode the $x,tx,ty$. + +(3) The improvement in Table 1 is significant. + + +** Clarity ** +1. Section 4.1 is unclear to me. In the 2nd paragraph of Section 4.1, ``For each s_k in s, we try to find a matched example (tx, ty) from D what tx contains the s_k’’: (1) There should be many sentence pairs (tx,ty) that tx contains s_k. Which ones should be kept? (2) What’is more, if tx contains s_k, can we say that the selected $tx$ is similar to $x$? (3) in experiments, the $n$ in n-gram is set as? And what if we choose different $n$? +2. Which script did you choose to evaluate BLEU score? + +** Significance ** +1. The idea itself is not novel. Compared to the related work, the novel part of this work is: (i) a new retrieval way, which is not quite clear and convincing to me. (ii) a new way to aggregate multiple inputs (using M-BERT) and several different decoding methods. In experiments, there is no comparison with previous retrieval based methods. Similar idea also exists in [R3], which is missing from this paper. The differences with [R3] should be discussed. +2. The authors did not provide what the retrieved sentences are like. 
Given a validation corpus (X,Y), and the corresponding retrieved (tX, tY), the authors should at least show the similarity between (X,tX), (Y,tY), which measures the retrieval quality. +3. Please report the total training and total inference time, and make a comparison with standard Transformer model. Specifically, for Table 1, the inference time of each algorithm should be reported (retrieval time included). +4. Why do you choose case-insensitive BLEU score for En->Fr, which is not commonly used in previous baselines. +5. Considering the BERT is leveraged, you should discuss the relation with BERT + NMT [R1,R2]. +6. The authors should conduct experiments beyond English-to-French. More languages pairs should be verified. + +Typos: +1. compared with Enhanced baseline… -> Comparison with enhanced baseline + + +** Refereces ** + +R1: Zhu, Jinhua, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tie-Yan Liu. ""Incorporating bert into neural machine translation."" ICLR’20, https://arxiv.org/pdf/2002.06823.pdf + +R2: Yang, Jiacheng, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Yong Yu, Weinan Zhang, and Lei Li. ""Towards making the most of bert in neural machine translation."" AAAI’20, https://arxiv.org/pdf/1908.05672.pdf + +R3. Eriguchi, A., Rarrick, S. and Matsushita, H., 2019, November. Combining Translation Memory with Neural Machine Translation. In Proceedings of the 6th Workshop on Asian Translation (pp. 123-130). + + +",4,5.0,ICLR2021 +HyW8p6vxM,2,r1CE9GWR-,r1CE9GWR-,"This work does not explain GANs, it merely revisits minimum-distance density estimation","First of all, let me state this upfront: despite the sexy acronym ""GAN"" in the title, this paper does not provide any genuine understanding of GANs. Conceptually, GANs are an algorithmic instantiation of a classic idea in statistics, mamely minimum-distance estimation, originally introduced by Jacob Wolfowitz in 1957 (*). This provides the 'min' part. The 'max' part comes from considering distances that can be expressed as a supremum over a class of test functions. Again, this is not new -- for instance, empirical risk minimization, in both supervised and unsupervised learning, can be phrased as precisely such a minimax problem by casting the convergence analysis in terms of suprema of suitable empirical processes (see, e.g., ""Empirical Processes in M-Estimation"" by Sara Van De Geer). Moreover, even the minimax (and, more broadly, game-theoretic) criteria go back all the way to the foundational papers of Abraham Wald. + +Now, the conceptual innovation of GANs is that this minimax formulation can be turned into a zero-sum game played by two algorithmic architectures, the generator and the discriminator. The generator proposes a model (which is assumed to be easy to sample from) and generates a sample starting from a fixed instrumental distribution; the discriminator evaluates the current proposal against a class of test functions, which, again, are assumed to be easily computable, e.g., by a neural net. One can also argue that the essence of GANs is precisely the architectural constraints on both the generator and the discriminator that make their respective problems amenable to 'differentiable' approaches, e.g., gradient descent/ascent with backpropagation. Without such a constraint, the saddle point is either trivial or reduces to finding a worst-case Bayes estimate, as classical statistical theory would predict. 
+ +This paper essentially strips away the essence of GANs and considers a stylized minimum-distance estimation problem, where both the target and the instrumental distributions are Gaussian, and the 'distance' between statistical models is the quadratic Wasserstein distance induced by the Euclidean norm. This, essentially, stacks the deck in favor of linear strategies, and it is not surprising at all that PCA emerges as the solution. It is very hard to see how any of this helps our understanding of either strengths or shortcomings of GANs (such as mode collapse or stability issues). Moreover, the discussion of supervised and unsupervised paradigms is utterly unconvincing, especially in light of the above comment on minimum-distance estimation underlying both of these paradigms. In either setting, a learning algorithm is obtained from the population version of the problem by substituting the empirical distribution of the observed data for the unknown population law. + +Additional minor comments on proper attribution and novelty of results: + +1) Lemma 3 (structural result for optimal transport with L_2 Wasserstein cost) is not due to Chernozhukov et al., it is a classic result in the theory of optimal transportation, in various forms due to Brenier, McCann, and others -- cf., e.g., Chapters 2 and 3 of C. Villani, ""Topics in Optimal Transportation."" + +2) The rate-distortion formulation with fixed input and output marginal in Appendix A, while interesting, is also not new. Precise characterizations in terms of optimal transport are available, see, e.g., N. Saldi, T. Linder, and S. Yuksel, ""Randomized Quantization and Source Coding With Constrained Output Distribution,"" IEEE Transactions on Information Theory, vol. 61, no. 1., pp. 91-106, January 2015. + +(*) The method of Wolfowitz is not restricted to distance functions in the mathematical sense; it can work equally well with monotone functions of metrics -- e.g., the square of a metric.",4,5.0,ICLR2018 +S1wxhnsef,3,SyMvJrdaW,SyMvJrdaW,"Review of ""Decoupling the Layers in Residual Networks""","The main contribution of this paper is a particular Taylor expansion of the outputs of a ResNet which is shown to be exact at almost all points in the input space. This expression is used to develop a new layer called a “warp layer” which essentially tries to compute several layers of the residual network using the Taylor expansion expression — however in this expression, things can be done in parallel, and interestingly, the authors show that the gradients also decouple when the (ResNet) model is close to a local minimum in a certain sense, which may motivate the decoupling of layers to begin with. Finally the authors stack these warp layers to create a “warped resnet” which they show does about as well as an ordinary ResNet but has better parallelization properties. + +To me the analytical parts of the paper are the most interesting, particularly in showing how the gradients approximately decouple. However there are several weaknesses to the paper (or maybe just things I didn’t understand). First, a major part of the paper tries to make the case that there is a symmetry breaking property of the proposed model, which I am afraid I simply was not able to follow. Some of the notation is confusing here — for example, presumably the rotations refer to image level rotations rather than literally multiplying the inputs by an orthogonal matrix, which the notation suggests to be the case. 
It is also never precisely spelled out what the final theoretical guarantee is (preferably the authors would do this in the form of a proposition or theorem). + +Throughout, the authors write out equations as if the weights in all layers are equal, but this is confusing even if the authors say that this is what they are doing, since their explanation is not very clear. The confusion is particularly acute in places where derivatives are taken, because the derivatives continue to be taken as if the weights were untied, but then written as if they happened to be the same. + +Finally the experimental results are okay but perhaps a bit preliminary. I have a few recommendations here: +* It would be stronger to evaluate results on a larger dataset like ILSVRC. +* The relative speed-up of WarpNet compared to ResNet needs to be better explained — the authors break the computation of the WarpNet onto two GPUs, but it’s not clear if they do this for the (vanilla) ResNet as well. In batch mode, the easiest way to parallelize is to have each GPU evaluate half the batch. Even in a streaming mode where images need to be evaluated one by one, there are ways to pipeline execution of the residual blocks, and I do not see any discussion of these alternatives in the paper. +* In the experimental results, K is set to be 2, and the authors only mention in passing that they have tried larger K in the conclusion. It would be good to have a more thorough experimental evaluation of the trade-offs of setting K to be higher values. + +A few remaining questions for the authors: +* There is a parallel submission (presumably by different authors called “Residual Connections Encourage Iterative Inference”) which contains some related insights. I wonder what are the differences between the two Taylor expansions, and whether the insights of this paper could be used to help the other paper and vice versa? +* On implementation - the authors mention using Tensorflow’s auto-differentiation. My question here is — are gradients being re-used intelligently as suggested in Section 3.1? +* I notice that the analysis about the vanishing Hessian could be applied to most of the popular neural network architectures available now. How much of the ideas offered in this paper would then generalize to non-resnet settings? + +",6,3.0,ICLR2018 +ryIhKEFxM,1,HyMTkQZAb,HyMTkQZAb,Assumptions used in approximations are not well justified,"This paper extends the Kronecker-factor Approximate Curvature (K-FAC) optimization method to the setting of recurrent neural networks. The K-FAC method is an approximate 2nd-order optimization method that builds a block diagonal approximation of the Fisher information matrix, where the block diagonal elements are Kronecker products of smaller matrices. + +In order to approximate the Fisher information matrix for RNNs, the authors assume that the derivative of the loss function with respect to each weight matrix at each time step is independent of the length of the sequence, that these derivatives are temporally homogeneous, that the input and derivatives of the output are independent across every point in time, and that either the one-step cross-covariance of these derivatives is symmetric or that the training sequences are effectively infinite in length. 
Based on these assumptions, the authors show that the Fisher information can be reduced into a form in which the derivatives of the weight matrices can be approximated by a linear Gaussian graphical model and in which the approximate 2nd order method can be efficiently carried out. The authors compare their method to SGD on two language modeling tasks and against Adam for learning differentiable neural computers. + +The paper is relatively clear, and the authors do a reasonable job of introducing related work of the original K-FAC algorithm as well as its extension to CNNs before systematically deriving their method for RNNs. The problem of extending the K-FAC algorithm is natural, and the steps taken in this paper seem natural yet also original and non-trivial. + +The main issue that I have with this paper is the lack of theoretical justification or even intuition for the many approximations carried out in the course of approximating the Fisher information matrix. In many instances, it seemed like these approximations were made purely for convenience and tractability without much regard for (even approximate) correctness. This quality of this paper would be greatly strengthened if it had some bounds on approximation error or even empirical results testing the validity of the assumptions in the paper. Moreover, the experiments do not demonstrate levels of statistical significance in the results, so it is difficult to assert the practical significance of this work. + +Specific comments and questions +Page 2, ""r is is"". Typo. +Page 2, ""DV"". I found the introduction of V without any explanation to be confusing. +Page 2, ""P_{y|x}(\theta)"". The relation between P_{y|x}(\theta) and f(x,\theta) is never explained. +Page 3, ""common practice of computing the natural gradient as (F + \lambda I) \nabla h instead of F^{-1} \nabla h"". I don't see how the former can serve as a replacement for the latter. +Page 3, ""approximate g and a as statistically independent"". Even though K-FAC already exists, it would be good to explain why this assumption is reasonable, since similar assumptions are made for the work presented in this paper. +Page 4, ""This new approximation, called ""KFC"", is derived by assuming...."". Same as previous comment. It would be good to briefly discuss why these assumptions are reasonable. +Page 5, Independence of T and w_t's, temporal homogeneity of w_t's,, and independence between a_t's and g_t's. I can see why these are convenient assumptions, but why are they reasonable? Moreover, why is it further natural to assume that A and G are temporally homogeneous as well? +Page 7, ""But insofar as the w_t's ... encode the relevant information contained in these external variables, they should be approximately Markovian"". I am not sure what this means. +Page 7, ""The linear-Gaussian assumption meanwhile is a more severe one to make, but it seems necessary for there to be any hope that the required expectations remain tractable"". I am not sure that this is a good enough justification for such an idea, unless there are compelling approximation error bounds. +Page 8, Option 1. In what situations is it reasonable to assume that V_1 is symmetric? +Pages 8-9, Option 2. What is a good finite sample size in which the assumption that the training sequences are infinitely long is reasonable in practice? Can the error |\kappa(x) - \zeta_T(x)| be translated into a statement on the approximation error? +Page 9, ""V_1 = V_{1,0} = ..."". 
Typos (that appear to have been caught by the authors already). +Page 9, ""The 2nd-order statistics ... are accumulated through an exponential moving average during training"". How sensitive is the performance of this method to the decay rate of the exponential moving average? +Page 10, ""The additional computations required to get the approximate Fisher inverse from these statistics ... are performed asynchronously on the CPU's"". I find it a bit unfair to compare SGD to K-FAC in terms of wall clock time without also using the extra CPU's for SGD as well (e.g. via Hogwild or synchronous parallel SGD). +Page 10, ""The hyperparameters of our approach..."". What is the sensitivity of the experimental results to these hyperparameters? Moreover, how sensitive are the results to initialization? +Page 10, ""we found that each parameter update of our method required about 80% more wall-clock time than an SGD update"". How much of this is attributed to the fact that the statistics are computed asynchronously? +Pages 10-12, Experiments. There are no error bars in any of the plots, so it is impossible to ascertain the statistical significance of any of these results. +Page 11: Figure 2. Where is the Adam batchsize 50 line in the left plot? Why did the Adam batchsize 200 line disappear halfway through the right plot? + + + + ",5,4.0,ICLR2018 +_BrD2XsKL1h,4,aYbCpFNnHdh,aYbCpFNnHdh,Official Review," +### Overall + +Authors extend CLEVR dataset so as to consider multiple viewpoints, and evaluate current neural network models in that setting. They also update a standard approach to introduce camera viewpoint information in the network so it can better answer visual question from the canonical scene frame even from other perspectives. + +### Positive aspects + +* Authors provide a study on a important topic of Computer Vision: understanding multiple views of a same scene. They do such study on a hard task, which is VQA. Actually, authors provide a more complex version of a simple VQA dataset (*simple* because it is synthetic and has very well established domain limits). +* Authors evaluate different training frameworks (supervised and unsupervised from scratch). +* Authors provided accuracy values for pretraining with NCE, which can be helpful. +* Results seem to be promising. +* In general, text is well written and easy to read. +* It is interesting that even a frozen pretrained network provides good results in such visually different dataset. Although, it was nice that authors trained an encoder from scratch. +* Code already available! + +### Weak aspects and suggestions + +* The problem is interesting, though my main concern is regarding the novelty and contribution of the paper. It seems to be an adaptation of CLEVR dataset, and an adaptation of the FILM model. In addition, authors use camera viewpoint information to ease the identification of the scenes. I have mixed feelings in using such specific kind of information in the model, because in a real world scenario we don't have access to them. I might be wrong, but maybe it is possible to insert a module in their approach to estimate the camera parameters, so as the network itself could learn to predict how viewpoints work and how scenes change with that. I think this could be done by adding such parameters as target information some of the models. For instance, the unsupervised architecture could be trained to predict whether the scenes are the same, but also the camera parameters. Apologies if I miss something here. 
+ +* The proposed architecture seems to be basically an adaptation of the FILM model considering camera viewpoint information. + +* FILM (2018) is the best performing approach in CLEVER to date? There are more recent approaches that could be used in the results section as baselines. + +* It is unclear what happened to the spatial-related questions. They were removed of the dataset? + +* Results are promising, although why do they have such high variance? (7-8% of variance is not negligible by any means); considering that for some experiments it is likely that 2D FILM provides similar performance than 3D one. A statistical test might help to verify whether such results are statistically significant or not. + +* Font size for all images should be quite larger. It is hard to read in the current size. + +* Figure of the post processor does not help much. Authors could detail a little bit more what is inside that $postproc_w$ box. + +* *""Since the post-processor is a learnable module through which the FILM part of the pipeline is able to backpropagate through, it can be seen as learning an appropriate set of transforms that construct 3D feature volumes h0.""* I suggest rewriting this sentence, it is very confusing. + +* *""While we obtained great results, it may not leave a lot of room to improve on top of our methods,""* This sentence is odd. The sentence ""we obtained great results"" can be written in a more objective and scientific way (avoid the usage of adjectives). Another important aspect is: often it is easy to provide first large steps in a task (ImageNet for instance), although it gets much harder to improve on that when results are good (AlexNet vs ResNet, see the performance difference). Another aspect: maybe authors made the task too easy and should have explored more challenging scenarios. + +* *""and we identified some ways in which the dataset could be made more difficult""* Those ideas to make the task more challenging are indeed important. Why authors did not perform experiments in such scenarios? It does not seem very hard to generate such datasets. + +* Is it possible to visualize and understand what the postproc module does? It would be nice to visually explain the $h'$ (64, 16, 14, 14) tensor represents. + +* There could be some qualitative analysis. + +* The dataset extension seems to be a large portion of the work. I think it could have a separate section with more details. + +### Additional questions + +* What happens if other conditioning camera information strategy is used? For instance, simply concatenating or using other simpler fusion techniques. FILM would perform much better than other simpler approaches? + +* *""ResNet outputs... feature maps h of dimensions (1024,14, 14)""* Is this correct? I believe Resnet101 outputs (2048, 14, 14) feature maps. + +* *""in practice, we found $\tau = 0.1$ to produce the lowest softmax loss.""* Which ones you have tested? Why $\tau$ is 1.0 in Table 2? + +* *""Another idea is to allow the viewpoint camera’s elevation to change. ""* That is true. Or even the distance from the camera. Why did authors decide not to include such examples in this work? + +* *""This is to be expected, considering that any camera information that is forward-propagated will contribute gradients back to the postprocessing parameters in the backward propagation, effectively giving the postprocessor supervision in the form of camera extrinsics.""*. Can authors support/prove this claim? 
+",4,3.0,ICLR2021 +SyKcRn9xf,3,Skp1ESxRZ,Skp1ESxRZ,"Interesting work with strong results, but lacks empirical analysis","This paper proposes a method for learning parsers for context-free languages. They demonstrate that this achieves perfect accuracy on training and held-out examples of input/output pairs for two synthetic grammars. In comparison, existing approaches appear to achieve little to no generalization, especially when tested on longer examples than seen during training. + +The approach is presented very thoroughly. Details about the grammars, the architecture, the learning algorithm, and the hyperparameters are clearly discussed, which is much appreciated. Despite the thoroughness of the task and model descriptions, the proposed method is not well motivated. The description of the relatively complex two-phase reinforcement learning algorithm is largely procedural, and it is not obvious how necessary the individual pieces of the algorithm are. This is particularly problematic because the only empirical result reported is that it achieves 100% accuracy. Quite a few natural questions left unanswered, limiting what readers can learn from this paper, e.g. +- How quickly does the model learn? Is there a smooth progression that leads to perfect generalization? +- Presumably the policy learned in Phase 1 is a decent model by itself, since it can reliably find candidate traces. How accurate is it? What are the drawbacks of using that instead of the model from the second phase? Are there systematic problems, such as overfitting, that necessitate a second phase? +- How robust is the method to hyperparameters and multiple initializations? Why choose F = 10 and K = 3? Presumably, there exists some hyperparameters where the model does not achieve 100% test accuracy, in which case, what are the failure modes? + +Other misc. points: +- The paper mentions that ""the training curriculum is very important to regularize the reinforcement learning process."" Unless I am misunderstanding the experimental setup, this is not supported by the result, correct? The proposed method achieves perfect accuracy in every condition. +- The reimplementations of the methods from Grefenstette et al. 2015 have surprisingly low training accuracy (in some cases 0% for Stack LSTM and 2.23% for DeQueue LSTM). Have you evaluated these reimplementations on their reported tasks to tease apart differences due to varying tasks and differences due to varying implementations?",5,2.0,ICLR2018 +r1eVOIOt2m,1,BkesJ3R9YX,BkesJ3R9YX,"A spatial-temporal attention model, missing some baselines. ","The paper propose an end-to-end technique that applies both spatial and temporal attention. The spatial attention is done by training a mask-filter, while the temporal-attention use a soft-attention mechanism. In addition the authors propose several regularization terms to directly improve attention. The evaluated datasets are action recognition datasets, such as HMDB51, UCF10, Moments in Time, THUMOS’14. The paper reports SOTA on all three datasets. + + + +Strengths: + +The paper is well written: easy to follow, and describe the importance of spatial-temporal attention. + +The model is simple, and propose novel attention regularization terms. + +The authors evaluates on several tasks, and shows good qualitative behavior. + + +Weaknesses: + +The reported number on UCF101 and HMDB51 are confusing/misleading. Even with only RGB, the evaluation miss numbers of models like ActionVLAD with 50% on HMDB51 or Res3D with 88% on UCF101. 
I’ll also add that there are available models nowadays that achieve over 94% accuracy on UCF101, and over 72% on HMDB51. The paper should at least have better discussion on those years of progress. The mis-information also continues in THUMOS14, for instance R-C3D beats the proposed model. + +In my opinion the paper should include a flow variant. It is a common setup in action recognition, and a good model should take advantage of these features. Especially for spatial-temporal attention, e.g., VideoLSTM paper by Li. + +In general spatial attention over each frame is extremely demanding. The original image features are now multiplied by 49 factor, this is more demanding in terms of memory consumption than the flow features they chose to ignore. The authors reports on 15-frames datasets for those short videos. But it will be interesting to see if the model is still useable on longer videos, for instance on Charades dataset. + +Can you please explain why you chose a regularized making instead of Soft-attention for spatial attention? + +To conclude: +The goal of spatial-temporal attention is important, and the proposed approach behaves well. Yet the model is an extension of known techniques for image attention, which are not trivial to apply on long-videos with many frames. Evaluating only on rgb features is not enough for an action recognition model. Importantly, even when considering only rgb models, the paper still missed many popular stronger baselines. + +",6,4.0,ICLR2019 +WyIK2qTc1H,4,0oabwyZbOu,0oabwyZbOu,An application of world models to Atari with an unclear motivation,"The authors build on the Dreamer architecture, that is able to learn models of an environment, to build DreamerV2, which learns a model of an environment in latent space. The authors then train their agent in this latent space. DreamerV2 was evaluated on the Atari learning environment and results showed that it was comparable to Rainbow and better, under certain metrics. + +I am unclear on the motivation of this paper. As with previous papers on model-based learning for Atari (i.e. Kaiser et. al (2019)), the goal of learning a model has been to reduce the number of environment steps. However, the authors use the same number of environment steps with the only difference being the model is trained in latent space. Training the model in latent space can speed up learning. Is this the main contribution of the paper? + +There is no analysis as to why using a world model for training might lead to better results than training in the real-world if the same number of environment steps are used. What is the authors' perspective on this? Did DreamerV2 use more steps in the world-model environment than in the real-world environment? + +Given that the latent space is trained based on some reconstruction error (instead of only being useful to a value function as with value prediction networks) it is not immediately obvious that the latent space will be a better place to learn a policy. Have the authors tried training their method on the reconstructions? 
Perhaps this will be slower, but I think it is still relevant since the authors claim that it may be easier to learn a model in latent space: ""Predicting compact representations instead of images can reduce the effect of accumulating errors"" + +In the introduction, the authors say: Several attempts at learning accurate world models of Atari games have been made, without achieving competitive performance (Oh et al., 2015; Chiappa et al., 2017; Kaiser et al., 2019)."" I do not think this is a fair statement because papers such as Kaiser et al., 2019 intentionally use fewer environment steps. + +---After rebuttal--- + +The authors have partially addressed my concerns, however, I am still not quite sure why their method would be better than SimPLE. The authors' ablation study of removing image gradients does not address my main question about where the performance benefit is coming from. I am assuming that the architecture and hyperparameters that the authors use are different than SimPLE. I think one must instead replace predicting the latent state with the image as SimPLE did to see if that makes a difference in performance. Therefore, I will be keeping my review the same.",4,4.0,ICLR2021 +kcXfeQrgCFb,1,0owsv3F-fM,0owsv3F-fM,"Good paper, but further details required","## Summary +The paper proposes a new approach for performing cross-modal domain adaptation, i.e. adapting a policy trained with inputs from modality A (eg low-dimensional environment state) to work with inputs from domain B (eg images). The main use case demonstrated in the paper is the adaptation of policies trained on states in a simulator to work on image inputs, which can be useful for eg real world deployment where states might not be available. While it is a very classic approach to separately train a perception module images --> state (eg in robotics), the main novelty of the presented method is, that this mapping can be learned without the need for paired [image, state] data. + +## Strengths +- the discussed problem of cross-modal domain adaptation is very relevant, since it can allow for the transfer of policies from training in simulation to deployment in real environments +- being able to learn the image --> state mapping without paired data can be impactful since it is often tedious / impossible to get exact state annotations for real world data +- the proposed technical approach is novel to the best of my knowledge +- the method is evaluated on multiple environments and some of the design decisions are ablated + (preliminary) experiments on the robustness of the method are conducted + +## Weaknesses +- **baselines are not described in detail**: the experimental section only mentions that ""We modify state-of-the-art methods in same-modal domain adaptation for comparison"" -- it is important to describe which method for same-modal adaptation was used and how it was modified for the cross-modal case to properly judge the results. +- **no investigation into why baseline does not work**: figure 1 provides one possible explanation (because of not taking dynamics into account), but later the paper mentions it could be because the necessary biases present in image-to-image translation are not present in image-to-state translation, finally it could be because the method was not tuned sufficiently. 
It would be good to show a more detailed investigation of this, particularly since the paper claims to be the first trying to apply domain adaptation techniques to the cross-modal case +- **only tested on visually clean environments**: the environments used for testing the approach are visually clean -- in particular: the state information is sufficient to fully render / reconstruct the scene. This is likely not true for more realistic scenarios (eg think about autonomous driving where the commonly used state representations are certainly not sufficient to render all details in the environment -- the whole point is to reduce the amount of information in the policy's input). I wonder whether the proposed method would struggle with such environments since it constrains the ""latent"" variable of the prediction model to be equal to the pre-defined state while training it to reconstruct the full scene. If there is lots of detail in the scene that is not covered by the information in the state the model might struggle to reconstruct the scene properly. +- **requires action trajectories in the image data domain**: at least the version of the model that was experimentally validated requires access to action trajectories in the image domain (for the action reconstruction loss). Such action annotations might be hard to obtain in the real world -- baselines like CycleGAN do not require these since they learn the mapping purely from state and image data. + +## Questions +- does the proposed approach need access to a differentiable behavior policy? I would think that just action samples would be enough and it would only need a differentiable policy on states (which is anyways available) --> however, the formulation in the paragraph before eq6 talks about a ""differentiable behavior policy"" so it would be good if the authors could clarify whether it is indeed needed +- how stable is the training of the approach? (Cycle)GAN approaches can be notoriously hard to train (see Appendix Sec D5) -- the proposed method also uses a GAN to minimize divergence from the prior. Some discussion on stability of this training or even a quantitative analysis of performance over a range of hyperparameters could help to show that this method might be easier to train than the GAN-based alternatives? +- when training the dynamics model online the simulator needs to be reset to the predicted state of the model --> how can value ranges be handled? ie what to do if the model predicts an invalid output state? +- what are the ""numerical instabilities in Mujoco"" mentioned in Section 4? maybe add a little more explanation? + +## Suggestions to improve the paper +Addressing the weaknesses I listed above can help to improve the paper. 
In particular I would suggest to: +- clearly describe the baselines used +- show *why* the baseline is failing to make clear that this is not an issue of tuning +- include a GAN baseline that operates on short trajectory snippets instead of single states to see whether it is truly an issue of cross-modality or whether merely including dynamics information can help the simpler GAN baselines +- test on environments with (non-static) visual components that are not captured in the state information (eg add moving visual distractors to the Mujoco scenes which are not part of the state) +- ablate the action reconstruction component of the mapping loss to show that the method can work without the need for action annotations in the image domain +- tone down the contribution claims that talk about ""transfer to real-world images"" since the tested scenarios are far from real-world images + +Some further, optional improvements: +- add a baseline that shows RL trained on images from scratch --> this can show how much performance we gain / loose by doing the domain adaptation vs training from scratch (even if we loose some performance it is okay since training from scratch might not be feasible in the real world) +- it seems that some of the differences to prior work require a better understanding of the technical details of the proposed method (particularly paragraph 4), it might be worth considering to move the related work section after approach before experiments +- as mentioned in my summary, it is a quite classic approach eg in robotics to separately train a perception module that maps images to states and then use it to work in the real world with policies that operate on state input (eg motion planners etc), ie perform cross-modal domain adaptation. However, they assume access to a paired dataset of [image, state] tuples, which is potentially hard to obtain. I think the related work section could benefit from adding a discussion about this. +- this sentence is not clear: ""We first formulate the generation process of the real world and simulation"" -- does this talk about how observations are actually generated in the real world or about the generation process of the model used in this paper? maybe reformulate for better clarity? +- Section 3.3 is a bit confusing, it only later became clear to me that training the separate dynamics model is optional, this could be emphasized a bit more. +- the dynamics mismatch experiment should be trained to full convergence (which seems to be ~10k steps for hopper), now it is only trained for 5k steps which equals only half-converged performance so it is unclear whether the mismatched runs will actually reach full performance + +## Overall Recommendation +The proposed method is interesting and the paper certainly has merit. My main concern is that it is currently hard to judge the thoroughness of the experimental evaluation since it remains unclear how the baseline is implemented and why it fails. Therefore I cannot recommend acceptance at this point. If the authors can more convincingly show why prior work fails in the cross-modal case and how their method fixes that I am willing to increase my score.",5,4.0,ICLR2021 +rkqw-ubNg,1,BycCx8qex,BycCx8qex,Final review,"Overall, this is a nice paper. Developing a unifying framework for these newer +neural models is a worthwhile endeavor. + +However, it's unclear if the DRAGNN framework (in its current form) is a +significant standalone contribution. 
The main idea is straightforward: use a +transition system to unroll a computation graph. When you implement models in +this way you can reuse code because modules can be mixed and matched. This is +nice, but (in my opinion) is just good software engineering, not machine +learning research. + +Moreover, there appears to be little incentive to use DRAGNN, as there are no +'free things' (benefits) that you get by using the framework. For example: + +- If you write your neuralnet in an automatic differentiation library (e.g., + tensorflow or dynet) you get gradients for 'free'. + +- In the VW framework, there are efficiency tricks that 'the credit assignment + compiler' provides for you, which would be tedious to implement on your + own. There is also a variety of algorithms for training the model in a + principled way (i.e., without exposure bias). + +I don't feel that my question about the limitations of the framework has been +satisfactorily addressed. Let me ask it in a different way: Can you give me +examples of a few models that I can't (nicely) express in the DRAGNN framework? +What if I wanted to implement https://openreview.net/pdf?id=HkE0Nvqlg or +http://www.cs.jhu.edu/~jason/papers/rastogi+al.naacl16.pdf? Can I implement the +dynamic programming components as transition units and (importantly) would it be +efficient? + + disagree that the VW framework is orthogonal, it is a *competing* way to +implement recurrent models. The main different to me appears to be that VW's +imperative framework is more general, but less modular. + +The experimental contribution seems useful as does the emphasis on how easy it +is to incorporate multi-task learning. + +Minor: + +- It would be useful to see actual code snippets (possibly in an + appendix). Otherwise, its unclear how modular DRAGNN really are. + +- The introduction states that (unlike seq2seq+attention) inference remains + linear. Is this *necessarily* the case? Users define a transition system that + is quadratic, just let attention be over all previous states. I recommend that + authors rephrase statement more carefully. + +- It seems strange to use A() as in ""actions"", then use d as ""decision"" for its + elements. + +- I recommend adding i as an argument to the definition of the recurrence + function r(s) to make it clear that it's the subset of previous states at time + i, otherwise it looks like an undefined variable. A nice terse option is to + write r(s_i). + +- Real numbers should be \mathbb{R} not \mathcal{R}. + +- It's more conventional to use t for a time-step instead of i. + +- Example 2: ""52 feature embeddings"" -> did you mean ""52-DIMENSIONAL feature + embeddings""? +",5,4.0,ICLR2017 +r1rwrnWVg,3,SkuqA_cgx,SkuqA_cgx,Borderline,"This paper introduces a new dataset to evaluate word representations. The task considered in the paper, called outlier detection (also known as word intrusion), is to identify which word does not belong to a set of semantically related words. The task was proposed by Camacho-Collados & Navigli (2016) as an evaluation of word representations. The main contribution of this paper is to introduce a new dataset for this task, covering 5 languages. The dataset was generated automatically from the Wikidata hierarchy. +Entities which are instances of the same category are considered as belonging to the same cluster, and outliers are sampled at various distances in the tree. Several heuristics are then proposed to exclude uninteresting clusters from the dataset. 
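+
+To make the task concrete, here is a minimal sketch of how such outlier-detection groups are typically scored with word embeddings (my own illustration of the usual compactness-score idea, not code or notation from the paper; emb is assumed to map each word to an L2-normalised vector):
+
+```python
+import numpy as np
+
+def predict_outlier(words, emb):
+    # Stack the (unit-norm) embeddings and compute pairwise cosine similarities.
+    vecs = np.stack([emb[w] for w in words])
+    sims = vecs @ vecs.T
+    np.fill_diagonal(sims, 0.0)
+    # Compactness of a word = its total similarity to the rest of the group;
+    # the least compact word is predicted to be the outlier.
+    compactness = sims.sum(axis=1)
+    return words[int(np.argmin(compactness))]
+```
+
+An embedding is then judged by how often it assigns the lowest compactness to the true outlier, and by where the outlier falls in this ranking.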
+ +Developing good ressources to evaluate word representations is an important task. The new dataset introduced in this paper might be an interesting addition to the existing ones (however, it is hard to say by only reviewing the paper). I am a bit concerned by the lack of discussion and comparison with existing approaches (besides word similarity datasets). In particular, I believe it would be interesting to discuss the advantages of this evaluation/dataset, compared to existing ones such as word analogies. The proposed evaluation also seems highly related to entity typing, which is not discussed in the paper. + +Overall, I believe that introducing ressources for evaluating word representations is very important for the community. However, I am a bit ambivalent about this submission. I am not entirely convinced that the proposed dataset have clear advantages over existing ressources. It also seems that existing tasks, such as entity typing, already capture similar properties of word representations. Finally, it might be more relevant to submit this paper to LREC than to ICLR.",5,3.0,ICLR2017 +PUl8LBFiJj4,3,P42rXLGZQ07,P42rXLGZQ07,A different take on training VAEs,"This paper proposes an evolutionary optimization framework for training vartional autoencoders (VAEs) with discrete latents. In contrast to the standard VAE paradigm, the proposed TVAE approach does not require an encoder for amortized inference given the input. The method instead relies on a pool of latent variable samples for each data point to activate the decoder network. The latent variable pools are maintained and iteratively updated to increase the average lower-bound of the marginal log-likelihood of the input data. Experimental results show that non-linear decoders optimized by the TVAE framework outperform their linear counterparts on a denoising task. Further results demonstrate method's competitiveness on zero-shot denoising, where a TVAE decoder is only trained on the noisy input image to reconstruct a smoothed version of the input image. + +The paper is well-written and easy to follow. The work proposes an interesting alternative for optimizing VAEs which does not require an encoder network for amortized inference. I however have a number of concerns that are as follows: + +The method instead imposes the overhead of maintaining and evolving a collection of latent variables for every data point, which can be both sample inefficient and memory-heavy for sizable problems. Then when or why would one trade amortized inference for the proposed approach? + +The paper falls short of comparing the proposed optimization procedure with other alternatives for training standard or discrete VAEs. + +From reading the paper it is not clear how the size of the pool may need to be varied as the nature of the task or the number of latent changes. + +The authors use a greedy approach to update the pool of latent samples. Does it not cause the optimization procedure to get stuck in local modes? Could one instead use MCMC type sampling approaches to enable mode jumps? + +Why only denoising experiments? Can the authors use their approach for data synthesis tasks where the latents can be shown to control different attributes of the data generative process?",6,4.0,ICLR2021 +HklnbLThtr,1,Hyg-JC4FDr,Hyg-JC4FDr,Official Blind Review #1,"The primary contribution of this paper is a principled algorithm for off-policy imitation learning. 
Generative Adversarial Imitation Learning (GAIL), proposed by Ho and Ermon in 2016, is an on-policy imitation learning method that (provably) minimizes the divergence between a target state-action distribution and the policy state-action distribution. Followups to this work (Kostrikov et al., 2019) show that the same algorithm can be applied to an off-policy setting (replacing the on-policy samples with samples from the replay buffer) and the method still works, but is no longer theoretically justified. I believe using importance ratios would make this approach justified as well, but Kostrikov et al. found that using importance ratios actually degrades the performance of their method (due the difficulty associated with estimating importance ratios). This paper attempts to bridge this gap: a method that is theoretically justified, and still works. + +The paper takes most of its inspiration from the recently proposed DualDICE paper (Nachum et al, 2019), where the authors introduce a method for estimating discounted stationary distribution ratios (i.e. d^{\pi}/d^{D}, where \pi is some (known) policy, and D is a given dataset of experience (for example, a replay buffer)). The authors essentially apply the method proposed in DualDICE to estimate d^{\pi}/d^{exp} instead, where d^{exp} is a dataset of expert trajectories. While it would be possible to simply use this term as a reward and then run reinforcement learning, the authors note that the specific form of estimating d^{\pi}/d^{exp} allows them to instead directly a train a value function, which can then be used for updating a policy. + +The authors also argue that their method reduces complexity since it does not require a separate RL optimization routine. However, I think having a separate RL optimization routine has its own advantages - it is relatively easy to implement GAIL on top of any existing RL algorithm, which makes it easy to take advantage of recent advances in RL. For the method proposed in this paper, it would not be straightforward to do so. + +The authors also note that they need a number of practical modifications to their original ValueDICE objective in order to make things work (Section 5: Some practical considerations). Notably, the the original ValueDICE objective only needs access to expert samples and the initial state distribution, and does not need access to the the replay buffer samples (apart from the initial state ones). This would likely not work well in practice - similar to how behavior cloning often does worse than GAIL when learning from a small number of expert examples. In order to combat this, the authors incorporate replay buffer regularization + +The authors provide experiments on a simple synthetic ""ring MDP"", and on four continuous control tasks from OpenAI gym - HalfCheetah, Hopper, Ant and Walker2d. When compared to prior approaches (i.e. Kostrikov et al 2019) in the low-data regime (where behavior cloning fails), the proposed method does significantly better on one task (Ant), slightly worse on one task (walker), and about the same on two tasks (Hopper and HalfCheetah). I do notice the proposed method as being somewhat unstable though - the reward appears to be going down after reaching the max on two of the tasks - HalfCheetah and Ant. Overall, I don't thing experiments are thorough enough to demonstrate that the method is empirically better than competing approaches, but I believe that is not the main point of the paper. + +Overall, my recommendation is a weak accept. 
It is interesting that the authors were able to get a principled method for off-policy imitation learning working (large following from prior work in Nachum et al 2019), but I don't think the method currently offers any significant practical advantages over competing methods. +",6,,ICLR2020 +H19W6GPVl,3,BJbD_Pqlg,BJbD_Pqlg,"Review of ""Human Perception in Computer Vision""","The author works to compare DNNs to human visual perception, both quantitatively and qualitatively. + +Their first result involves performing a psychophysical experiment both on humans and on a model and then comparing the results (actually I think the psychophysical data was collected in a different work, and is just used here). The specific psychophysical experiment determined, separately for each of a set of approx. 1110 images, what the noise level of additive noise would have to be to make a just-noticeable-difference for humans in discriminating the noiseless image from the noisy one. The authors then define a metric on neural networks that allows them to measure what they posit might be a similar property for the networks. They then correlate the pattern of noise levels between neural networks that the humans. Deep neural networks end up being much better predictors of the human pattern of noise levels than simpler measure of image perturbation (e.g. RMS contrast). + +A second result involves comparing DNNs to humans in terms of their pattern errors in a series of highly controlled experiments using stimuli that illustrate classic properties of human visual processing -- including segmentation, crowding and shape understanding. They then used an information-theoretic single-neuron metric of discriminability to assess similar patterns of errors for the DNNs. Again, top layers of DNNs were able to reproduce the human patterns of difficulty across stimuli, at least to some extent. + +A third result involves comparing DNNs to humans in terms of their pattern of contrast sensitivity across a series of sine-grating images at different frequencies. (There is a classic result from vision research as to what this pattern should be, so it makes a natural target for comparison to models.) The authors define a DNN correlate for the propertie in terms of the cross-neuron average of the L1-distance between responses to a blank image and responses to a sinuisoid of each contrast and frequency. They then qualitatively compare the results of this metric for DNNs models to known results from the literature on humans, finding that, like humans, there is an apparent bandpass response for low-contrast gratings and a mostly constant response at high contrast. + +Pros: + * The general concept of comparing deep nets to psychophysical results in a detailed, quantitative way, is really nice. + + * They nicely defined a set of ""linking functions"", e.g. metrics that express how a specific behavioral result is to be generated from the neural network. (Ie. the L1 metrics in results 1 and 3 and the information-theoretic measure in result 2.) The framework for setting up such linking functions seems like a great direction to me. + + * The actual psychophysical data seems to have been handled in a very careful and thoughtful way. These folks clearly know what they're doing on the psychophysical end. + + +Cons: + * To my mind, the biggest problem wit this paper is that that it doesn't say something that we didn't really know already. 
Existing results have shown that DNNs are pretty good models of the human visual system in a whole bunch of ways, and this paper adds some more ways. What would have been great would be: + (a) showing that they metric of comparison to humans that was sufficiently sensitive that it could pull apart various DNN models, making one clearly better than the others. + (b) identifying a wide gap between the DNNs and the humans that is still unfilled. They sort of do this, since while the DNNs are good at reproducing the human judgements in Result 1, they are not perfect -- gap is between 60% explained variance and 84% inter-human consistency. This 24% gap is potentially important, so I'd really like to see them have explored that gap more -- e.g. (i) widening the gap by identifying which images caused the gap most and focusing a test on those, or (ii) closing the gap by training a neural network to get the pattern 100% correct and seeing if that made better CNNs as measured on other metrics/tasks. + +In other words, I would definitely have traded off not having results 2 and 3 for a deeper exploration of result 1. I think their overall approach could be very fruitful, but it hasn't really been carried far enough here. + + * I found a few things confusing about the layout of the paper. I especially found that the quantitative results for results 2 and 3 were not clearly displayed. Why was figure 8 relegated to the appendix? Where are the quantifications of model-human similarities for the data shown in Figure 8? Isn't this the whole meat of their second result? This should really be presented in a more clear way. + + * Where is the quantification of model-human similarity for the data show in Figure 3? Isn't there a way to get the human contrast-sensitivity curve and then compare it to that of models in a more quantitively precise way, rather than just note a qualitative agreement? It seems odd to me that this wasn't done. +",6,4.0,ICLR2017 +vcakzIUU_Qx,2,JbAqsfbYsJy,JbAqsfbYsJy,"I reject this paper since the formulation of the information-theoretic (i.e., divergence minimisation-based) view of action and perception is already established and well-known.","The authors of this paper propose a unified optimisation objective for (sequential) decision-making (i.e., _action_) and representation learning (i.e., _perception_), built on joint (KL) divergence minimisation. As also mentioned by the authors, this is a concept paper and it includes no empirical study. + +In particular, the authors demonstrate how existing ideas and approaches to (sequential) decision-making and representation learning can be expressed as a joint KL minimisation problem between a target and ""actual"" distribution. Such examples are (a) MaxEnt RL, (b) VI, (c) amortised VI, (d) KL control, (e) skill discovery and (f) empowerment, which are all cases of the KL minimisation between a target and an ``actual'' distributions. + +**Concerns**: +1. Although the proposed perspective and language is rich and expressive, I question the novelty of the proposed framework, since the information-theoretic view of decision-making and perception is a rather established and old idea, even the term/idea of perception-action cycle is already defined [1]! +2. The power of latent variables for decision-making and their interpretation is also a known idea [1]. + +**References** + +[1] Tishby, N. and Polani, D., 2011. Information theory of decisions and actions. In Perception-action cycle (pp. 601-636). Springer, New York, NY. 
+ +",3,4.0,ICLR2021 +r1g5G3esKH,1,B1eoyAVFwH,B1eoyAVFwH,Official Blind Review #2,"This paper studies multi-task learning (MTL) from the deep learning perspective where a number of layers are shared between tasks followed by specific heads for each task. One of the main challenges in this problem is to decide the best configuration among a large number of possible ones (e.g., the number of layers , number of neurons, when to stop the shared part of the network). In this paper, the authors fix the network architecture, and learn which filters (among the already learned ones) should be dedicated to (and hence fine-tuned for) a specific, and which ones should be shared between multiple tasks. + +Instead of deciding on other hyper-parameters such as the number of layers, the authors chose to study how to efficiently share the capacity of the network: to decide which filters should be used for which tasks, and which filters should be shared between tasks. +Specifically, this is controlled by task specific binary vectors which get multiplied with feature activations for each task, hence blocking or allowing the signal to pass for a specific filter. In addition, they define a different set of binary vectors for the foreground and background passes. This allows simpler tasks to benefit from features learnt from more complicated tasks such as ImageNet classification while avoiding ‘catastrophic forgetting’ at the same time. + +Moreover, the authors develop a simple yet elegant strategy to reduce their parameter search space (by using the matrix P which controls the percentage of filters used per task + the percentage of filters shared between each pair of tasks) and quickly evaluate the performance of each configuration (using distillation). The advantages of these approaches are well discussed and validated quantitatively. + +The paper is well written and the approach itself appears to be sound and it led to improvement over independent task estimator. However, I am mostly concerned about the experimental setting: there are no comparisons with any other MTL algorithm. + +The authors perform a search over the matrix P, which is similar to neural architecture search over the entire possible ways of sharing the capacity of a network. This could potentially lead to improvement beyond multi-task learning. Experimental comparison on this could be provided. +I think the paper will make a strong case if it is compared with existing deep MTL algorithms including [Misra et al: Cross-stitch networks for multi-task learning]. In addition, the network seems to share a similar spirit with [Mallya et al: PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning], in that they also share the capacity of the network between tasks, and hence a comparison here seems reasonable. + +Overall, I think this paper makes a borderline case. + +Other comments: +In the supplementary material, providing a detailed description of the algorithm (e.g., pseudo code and an accompanying discussion) that calculates the matrices M from P could help reproduce and build upon the experiments reported in the paper. I wonder if M is uniquely defined from M.",6,,ICLR2020 +vEQugCSV-Wv,1,nIqapkAyZ9_,nIqapkAyZ9_,"A simple regularizer, but the performance is yet to be validated","I have reviewed this paper in this year's Neurips. At that time, reviewers and AC have some good points but are not yet addressed (I have compared the two submissions carefully). 
+
+This paper proposes a Feature Embedding Regularizer (named SVMax) to regularize the feature embedding during the CNN learning process. The idea is to push the distribution of high-dimensional feature embeddings towards a uniform distribution across the embedding space. The idea is implemented by adding a regularizer that maximizes the average singular value computed from a mini-batch. Experiments show that many deep metric learning methods can benefit from the proposed SVMax regularizer.
+
+Pros:
+
+1) A simple but effective regularizer.
+2) This paper is well-written and easy to follow.
+
+Cons:
+I re-emphasize some major points raised by the reviewers and AC:
+1) [From other reviewers] The experimental results seem to be worse than SOTA, and the baseline methods do not seem to be well trained. For example, on the CUB dataset, the baseline Trip method can reach above 50% R@1, but your result is only 47.7%. The same phenomenon happens with other baseline methods and the ResNet50 backbone network (to the best of my knowledge, a ResNet50 baseline can achieve 60%+ R@1). So the experimental results leave me confused about the effectiveness of the proposed method.
+
+2) Though the effectiveness of the proposed SVMax regularizer is validated on fine-grained image retrieval datasets, I would also want to see its performance on a broader image retrieval task. For example:
+
+Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking
+
+The reason is that for this broader image retrieval task, the original feature embeddings may already be well distributed across the embedding space, rather than shrunk to a limited embedding space as for the fine-grained datasets.
+
+I think the above experiment needs to be done if we want to draw the conclusion that the SVMax regularizer really works for deep metric learning methods.
+",5,5.0,ICLR2021
+MqwwiU_a9sV,5,piLPYqxtWuA,piLPYqxtWuA,Official Review,"This work presents several improvements over the original teacher-student framework in FastSpeech: 1) training the model with the ground-truth target instead of the output from the teacher, and 2) extracting phoneme duration, pitch and energy from speech and directly taking them as conditional inputs in training, while using predicted values in inference. Importantly, it uses a pretrained forced aligner (MFA) to extract the phoneme durations for training.
+
+Comments:
+
+1, Typo: In the introduction, ""alleviate the one-to mapping"".
+
+2, The ""one-to-many mapping"" is not an issue in general. This setting widely exists in generative modeling, e.g., label-conditioned image synthesis and mel-spectrogram-conditioned waveform synthesis. The problem of FastSpeech (and other non-autoregressive TTS models) is really due to the over-simplified output distributions, which assume conditional independence between frames and frequency bins of the mel spectrogram. As a result, these models don't account for the variations in real data.
+
+3, The introduction, section 2.1, and section 2.2 contain too much duplicated information. One may shorten the text accordingly.
+
+4, The motivation & architecture of FastSpeech 2s (Figure 1 (a)) is similar to the previous text-to-waveform model ClariNet. Both of them use a mel prediction task to guide the training. The authors overclaim that ""FastSpeech 2s is the first attempt to directly generate waveform from phoneme sequence"".
+
+5, The model is similar to a traditional TTS pipeline with a separate duration model, pitch/F0 model, etc. The CWT/iCWT-based pitch predictor is interesting.
+
+6, The MOS of autoregressive Tacotron 2 is relatively low. 
Which implementation did you use? + +7, One may also report the standard deviation, skewness, and kurtosis of pitch in synthesized audio from autoregressive model (Tacotron 2 and Transformer TTS). I assume their results would be closer to GT. + +8, In Table 5, the durations from teacher model are extracted by teacher forcing ground-truth mel spectrogram? One may mention that MFA is pretrained on a much larger dataset, thus may have better generalization than the teacher model trained on small LJSpeech dataset. + +Pros: +- Good sample quality. +- Sufficient ablation study. + +Cons: +- The proposed pipeline is far more complicated than existing end-to-end TTS model. +- Inaccurate & confusing claims (see my comments). +- The novelty is rather limited. ",5,5.0,ICLR2021 +SkxupTwbTX,3,HJg3rjA5tQ,HJg3rjA5tQ,Interesting idea but the paper needs work,"This paper proposes a way to define f-divergences for densities which may have different supports. While the idea itself is interesting and can be potentially very impactful, I feel the paper itself needs quite a bit of work before being accepted to a top venue. The writing needs quite a bit oof polish for the motivations to clearly stand out. Also, some of the notation makes things way more confusing that it should be. Is it possible to use something other than p() for the noise distribution, since the problem itself is to distinguish between p() and q(). I understand the notational overload, but it complicates the reading unnecessarily. I have the following questions if the authors could please address: + +1) The inequality of Zhang et a. (2018) that this paper uses seems to be an easy corollary of the Data Processing Inequality :https://en.wikipedia.org/wiki/Data_processing_inequality Did I miss something? Can the authors specify if that is not the case? + +2) In terms of relevance to ICLR, the applications of PCA, ICA and training of NNs is clearly important. There seems to be a significant overlap of Sec 5.3 with Zhang et al. Could the authors specify what the differences are in terms of training methodology vis-a-vis Zhang et al? It seems to me these are parallel submissions with this submissions focussing more on properties of Spread Divergences and its non deep learning applications, while the training of NNs and more empirical evidence is moved to Zhang et al. + +3) I am having a tough time understanding the derivation of Eq 25, it seems some steps were skipped. Can the authors please update the draft with more detail in the main text or appendix ? + +4) Based on the results on PCA and ICA, I am wondering if the introduction of the spread is in some ways equivalent to assuming some sort of prior. In the PCA case, as an exercise to understand better, what happens if some other noise distribution is used ? + +5) I do not follow the purpose of including the discussion on Fourier transforms. In general sec 3 seems to be hastily written. Similarly, what is sec 3.2's goal ? + +6) The authors mention the analog to MMD for the condition \hat{D}(p,q)=0 \implies p =q. From sec 4, for the case of mercer spread divergence, it seems like the idea is that ""the eigenmaps of the embedding should match on the transformed domain"" ? What is [a,b] exactly in context of the original problem? This is my main issue with this paper. They talk about the result without motivation/discussion to put things into context of the overall flow, making it harder than it should be for the reader. 
I have no doubt about the novelty, but the writing could definitely be improved.",5,4.0,ICLR2019
+uEKobAY-kxh,3,KYPz4YsCPj,KYPz4YsCPj,Official Blind Review #4,"Summary:
+The paper provides an interesting way to inductively represent a temporal network with the proposed Causal Anonymous Walks (CAWs), which work as temporal motifs to represent the network dynamics. The CAWs can be further encoded by the proposed CAW-N, which supports online training. 
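+
+To make the summary concrete, here is a minimal sketch of the plain anonymous-walk relabelling that CAWs build on (my own illustration, assuming a walk is given as a list of node identifiers; the causal/temporal sampling and the timestamps used in the paper are not captured here):
+
+```python
+def anonymise(walk):
+    # Replace each node identity by the index of its first occurrence in the walk,
+    # e.g. ['u', 'w', 'u', 'x'] -> [0, 1, 0, 2], so only the revisit pattern is kept.
+    first_seen = {}
+    relabelled = []
+    for node in walk:
+        if node not in first_seen:
+            first_seen[node] = len(first_seen)
+        relabelled.append(first_seen[node])
+    return relabelled
+```
+
+This removal of node identities is what lets the walks act as reusable motifs across different parts of the network.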
+ +Pros: +Overall, I like the idea to represent the temporal network with the proposed causal anonymous walks. Experimental results also show significant improvements over the existing baselines. +- Representation learning for dynamic graphs is very practical and important. This paper proposed an effective solution to this problem. +- The proposed CAW is novel for capturing the temporal dynamics in temporal networks. The description of the method is easy to follow. +- This paper provides sufficient experimental results which show the effectiveness of the proposed CAW-N model on the link prediction task, including a wide variety of baselines and datasets. + +Cons: +- This paper has been listed on the author's homepage (https://scholar.google.com/citations?user=Ch3YUgsAAAAJ&hl=en), potentially violate the double blind review rules. +- Some figures, e.g. figure 4, are not very useful in explaining the corner cases. Simple sentences are enough for clarification. +- It would be better to include the training/inference time comparisons as the introduction claims ""the model scalability"". +- It would be more convincing if the authors can provide representation visualization comparisons in the rebuttal period. + +Minors: +- missing conclusion section +- Eq.2, I_AW(w; P), definition of P + + +",6,4.0,ICLR2021 +Hkl3sL522m,3,Syx_Ss05tm,Syx_Ss05tm,Adversarial Reprogramming,"This paper extends the idea of 'adversarial attacks' in supervised learning of NNs, to a full repurposing of the solution of a trained net. + +The note of the authors regarding 'Transfer learning' is making sense even to the extend that I fail to see how the proposed study differs from the setting of Transfer learning. The comment of 'parameters' does not make much sense in a semi-parametric approach as studied. The difference might be significant, but I leave it up to the authors to formulate a convincing argument. + +",4,3.0,ICLR2019 +ryl7v3dJcB,3,BJx3_0VKPB,BJx3_0VKPB,Official Blind Review #2,"The paper considers the problem of debiasing word representation in a trained language model using some sort of key-value memory structure with annotations of gender that is constrained to view only a subset of keys. + +The current form of the manuscript contains too few details to evaluate the work in a meaningful way. Please include + +(1) a more detailed description of data input and output for system (i.e. where are gendered bits coming from), and more details of the memory (is it a global fixed memory? then how gendered bits assigned?) + +(2) a more complete definition of bias amplification -- beyond just an example (i.e. what training representation are you comparing to?). A formula would be useful here. + +(3) more complete experimental definition. What datasets exactly are you evaluating on? + +(4) overview sections to motivate your solutions and your setup (i.e. an introduction, related work)",1,,ICLR2020 +B1RdHXTef,3,B1lMMx1CW,B1lMMx1CW,A good applied paper with a novel approach and good experimental results,"This paper presents a practical methodology to use neural network for recommending products to users based on their past purchase history. The model contains three components: a predictor model which is essentially a RNN-style model to capture near-term user interests, a time-decay function which serves as a way to decay the input based on when the purchase happened, and an auto-encoder component which makes sure the user's past purchase history get fully utilized, with the consideration of time decay. 
And the paper showed the combination of the three performs the best in terms of precision@K and PCC@K, and also with good scalability. It also showed good online A/B test performance, which indicates that this approach has been tested in real world. + +Two small concerns: +1. In Section 3.3. I am not fully sure why the proposed predictor model is able to win over LSTM. As LSTM tends to mitigate the vanishing gradient problem which most likely would exist in the predictor model. Some insights might be useful there. +2. The title of this paper is weird. Suggest to rephrase ""unreasonable"" to something more positive. ",7,3.0,ICLR2018 +l1u1J_91-ts,3,3InxcRQsYLf,3InxcRQsYLf,"Weakly evaluated, limited novelty and selective citation","Summary: Authors propose to model video by combining a VQ-VAE encoder-decoder model and a GPT model for the prior. + +The primary contribution as stated by the authors: ""Our primary contribution is VideoGen, a new method to model complex video data in a computationally efficient manner"" +___________ +Pros: +- +An interesting model and an ablation of its components. + +___________ +Cons: +- +- The primary contribution is stated but not validated. The claim is a new method to model complex video efficiently. + - There is no experiments and/or benchmarks validating this claim anywhere in the paper. + - There is work on efficiency in the video generation field that is neither cited nor benchmarked against. + - TGANv2 (https://link.springer.com/article/10.1007%2Fs11263-020-01333-y) and LDVD-GAN (https://www.sciencedirect.com/science/article/abs/pii/S0893608020303397) come to mind. + - ""Computational efficiency is a primary advantage to our method, where we can first use the VQ-VAE to downsample by space time before learning an autoregressive prior"" - TGANv2, LDVD-GAN and DVD-GAN also do this . + +- Some questionable highlights: + - ""VideoGen can generate realistic samples that are competitive with existing methods such as DVD-GAN"" + - A very weak highlight because several existing methods already do this better as shown in Table 1. and DVD-GAN is not the state-of-the-art for this benchmark as shown in the same table. + - "" VideoGen can easily be adapted for action conditional video generation"" + - This is applicable to every video generation model + - ""Our results are achievable with a maximum of 8 Quadro RTX 6000 GPUs (24 GB memory), +significantly lower than the resources used in prior methods such as DVD-GAN"" + - This claim is not experimentally validated. DVD-GAN is also trainable on 8 Quadro RTX 6000 GPUs (24 GB memory). I would go further to argue that DVD-GAN would train faster and result in a higher performance than VideoGen. I would like to see a head to head benchmark or at the very least the wall clock time for training both the GPT prior and the VQ-VAE encoder-decoders. + +- Selective Citation: The video generation and prediction field has been around for a long time now. It is hard to believe that the authors can manage to find and cite every relevant (un)published paper by google and deepmind authors yet they fail to find work published by other groups in this field. They then go on to talk about the slow progress in the field of video generation without acknowledging all the work being done in this field. The following statements highlight this: + - ""However, one notable modality that **has not seen the same level of progress** in generative modeling is high fidelity natural videos. 
"" + - "" The complexity of the problem also demands more compute resources which can be considered as one important reason for the **slow progress** in generative modeling of videos."" + + +- Missing References to published articles (related to the previous point) + - TGAN: Temporal GAN - ICCV 2017 (First appeared on Arxiv - Nov 2016) - https://openaccess.thecvf.com/content_iccv_2017/html/Saito_Temporal_Generative_Adversarial_ICCV_2017_paper.html + - MoCoGAN - CVPR 2018 (First appeared on Arxiv - Jul 2017) - https://openaccess.thecvf.com/content_cvpr_2018/html/Tulyakov_MoCoGAN_Decomposing_Motion_CVPR_2018_paper.html + - Progressive Video GAN - Masters Thesis (First appeared on Arxiv - Oct 2018) - https://arxiv.org/abs/1810.02419 + - MDP-GAN: Markov Decision Process for Video Generation - ICCV 2019 (First appeared on Arxiv - Sep 2019) - https://openaccess.thecvf.com/content_ICCVW_2019/html/HVU/Yushchenko_Markov_Decision_Process_for_Video_Generation_ICCVW_2019_paper.html + - TGANv2: Train Sparsely, Generate Densely - Journal of Computer Vision 2020 - (First appeared on Arxiv - Nov 2018) - https://link.springer.com/article/10.1007%2Fs11263-020-01333-y + - LDVD-GAN: Lower Dimensional Kernels for Video Discriminators - Journal of Neural Networks 2020 - (First appeared on Arxiv - Dec 2019) - https://www.sciencedirect.com/science/article/abs/pii/S0893608020303397 +- If we were to include unpublished preprints on arxiv in this area, this list would at least double in size. + + + +___________ +Specific Points: +- ""However, one notable modality that has not seen the same level of progress in generative modeling is high fidelity natural videos. The complexity of **natural videos** requires modeling correlations across both space and time with much higher input dimensions, thereby presenting a natural next challenge for current deep generative models"" + - The only natural video dataset benchmarked on is BAIR, the rest are all synthetic. Please benchmark on other datasets of natural video such as UCF101 and Kinetics-600 which also have comparative benchmarks at similar spatio-temporal resolutions. + +- ""Can we generate high-fidelity samples from complex video datasets with limited compute?"" + - Please address and expand on this point. It is currently left unanswered. + +___________ +Current recommendation: Rejection +- +All in all, this paper is lacking in novelty and does not do a good job of convincing readers of its primary contributions. The ablation studies provide for the most interesting insights with regard to this work. The BAIR evaluations show that the proposed model is more expensive and has a lower performance than many existing models. The claims of efficiency are also questionable given that the vqvae prior is notoriously expensive to train for image models, let alone video models and there is no head to head comparison or wall clock benchmark to demonstrate otherwise. Lastly, the very selective referencing of work situated around google and deepmind while ignoring related and highly relevant (and famous) work from scientists in other institutes is detrimental to research in this field. I am happy to update my review and score if these issues are addressed. But this work in it's current form is not publishable at any conference.",4,5.0,ICLR2021 +q7cTTyfraZ9,1,PXDdWQDBsCG,PXDdWQDBsCG,Repeated work and limited contributions.,"This paper investigates incorporating shape information in deep neural networks to improve their adversarial robustness. 
It proposes two methods: the first is to augment the input with the corresponding edge map and then adversarially train a CNN on the augmented input. The second is to train a conditional GAN to reconstruct images from edge maps and use the reconstructed images as input to a standard classifier.

1. The description of the proposed defense in Section 3 seems limited. It is not clear why the authors applied a conditional GAN to reconstruct clean images from edge maps. In other words, what is the motivation for designing GSD on top of EAT?

2. The authors use the Canny edge detector to extract edges. Why not use neural-network-based edge extractors [2], as they give better edges? What is the motivation here?

3. Considering the possible obfuscated-gradient issues of white-box attacks [3], the authors should explicitly describe their efforts to evaluate against strong, custom adaptive attacks.

4. In terms of the experiments, the authors claim that they investigated adaptive attacks, but I did not see any quantitative experimental results. They also claim that any adaptive attack would cause perceptible changes to the edges. This is not an excuse for skipping a quantitative study; the authors already considered adversarial perturbations with magnitude as large as 64, and such a magnitude can also cause perceptible changes to images, as Figure 8 shows.

5. For EAT, what is the performance if the model is not adversarially trained? Why use adversarial training in EAT but not in GSD? I believe these analyses are required for an in-depth understanding of how the proposed defense works.

6. Last but not least, the algorithms proposed in this paper look similar (almost identical) to those in paper [1] from the previous year: (a) edge-guided adversarial training (EAT) is essentially adversarial training applied to EdgeNetRob in [1]; (b) the GAN-based shape defense (GSD) is exactly the same as EdgeGANRob in [1]; (c) both of them use the Canny edge detector to extract edges. Can the authors highlight the differences? If this is a separate paper, then given that the previous work [1] already proposed this idea, the contribution of this work seems limited.

[1] Shape Features Improve General Model Robustness. https://openreview.net/forum?id=SJlPZlStwS, 2019.

[2] Richer Convolutional Features for Edge Detection. Liu et al., TPAMI, 2019.

[3] Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. Athalye et al., ICML 2018.
",3,4.0,ICLR2021
edlYnHc1j4a,1,kdm4Lm9rgB,kdm4Lm9rgB,"Model discrepancy between environments plays a role in generalization ","This paper focuses on the generalization issue in reinforcement learning, and specifically aims to address the problems of the domain randomization (DR) technique. Different from standard DR, which treats all sampled environments as equal, this paper proposes to improve performance over all possible environments and over the worst-case environment concurrently. The paper theoretically derives a lower bound on the worst-case performance of a given policy over all environments, and in practice the proposed method, monotonic robust policy optimization (MRPO), carries out a two-step optimization to improve this lower bound so as to maximize the averaged and worst-case policy performance.


This paper is well written and the key concept is clearly introduced.
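For concreteness, the underlying problem, written in my own notation and following the summary above rather than the paper's exact statement, is the robust objective $\max_{\pi} \min_{p \in \mathcal{P}} \rho(\pi \mid p)$, where $\rho(\pi \mid p)$ denotes the expected return of policy $\pi$ in the environment with parameters $p$ and $\mathcal{P}$ is the randomization range; as I read it, the paper lower-bounds the inner minimum via the average performance over sampled environments and then improves that lower bound monotonically.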
The Theorem.1 makes the connections between the averaged and the worst-case performance, such that maximizing the worst-case performance can be solved by maximizing the averaged performance problem with some trajectories from environments with both poor and good-enough performance. The emprical results also support the theorical analysis. + +1. For Lemma 1: The conclusion is based on the assumption that the the worst case $\rho(\pi|p_w) - \max_p \rho(\pi|\rho)$ is bounded (Proof A.1). However, such equation does not strictly holds without bounded reward function. The author should stated the condition. + +2. About the monotonic worst-case performance improvement theorem, the proof says ""... the approximation is made under the assumption that the worst-case environment between two iterations are similar, which stems from the trust region constraint we impose on the update step between current and new policies..."", however, the trust region constraint can only limit the difference between policy updates, the similarity between worst-case environments can not be promised. + +3. In theorem 2, the fomula (50) and (51) in the proof, is this approximation reasonable? Since the policy is updated, the worst-case environment may have changed a lot. Similarly, if the updated policy changes very little, can we make $\pi_{new}=\pi_{old}$ ? + +4. The experiments are slightly inadequate, the effects of tunable hyperparameter k should be further analyzed; In unseen environment, the MRPO algorithm is only tested on one environment. + + + +",7,4.0,ICLR2021 +Byx03WpnnX,2,r1VPNiA5Fm,r1VPNiA5Fm,approximation theory using relu networks,"This paper describes results regarding approximations of certain function families using ReLU neural networks. The authors emphasize two points about these networks: finite width and depth that is logarithmic in the approximation error parameter $\epsilon$. + +The first result concerns approximation of polynomials, which is used as a building block for all subsequent results. This result itself is quite simple and mostly follows from simple observations or known results, though it is possible that these have not been explicitly written in this form anywhere. The other results concern smooth functions, and some kinds of non-smooth functions such as the Weirstrass function. There are two neat observations (i) using the sawtooth function to approximate sinusoidal ones and (ii) using overlapping ""approximation"" to simulate an indicator. + +The paper is refreshingly well-written and pleasant to read. Most of the results are tailored to work for either periodic functions, or can be expressed as: if piecewise polynomials are a good approximation, then so are constant depth neural networks with ReLU. I'm not sure that ICLR is the best venue for these kinds of results, as any connection with learning is at best tenuous, and the kind of approximation results don't seem to have any direct bearing on machine learning.",5,4.0,ICLR2019 +wSluC9D1CJK,2,Du7s5ukNKz,Du7s5ukNKz,Good empirical results but the theoretical justification doesn't seem right,"## Sumary + +This paper tackles a very important problem of reinforcement learning/imitation learning in the presence of noisy rewards/labels and uses contemporary literature to motivate a simple solution (in a good sense). The empirical results look encouraging on the benchmark problems but the analysis is a little wanting and doesn't get at the heart of the matter. 
I think with a little more work this paper can be a very good conference paper, but at this stage I can't recommend publication in ICLR.

**Before Author Response** Of course, if the authors address my concerns, then I'll increase my rating.
**After Author Response** I think the authors made a good effort to address the concerns and I have recommended to accept the paper.

## Contributions

This paper tackles the problem of learning from imperfect rewards/imperfect demonstrations using a new evaluation metric called ""Correlated Agreement"". The proposed metric ""regularizes"" learning under weak supervision by penalizing ""over-agreement"" with the supervision signal. The proposed method is meant to be used in situations where the supervision signal is known to be noisy, but it does not estimate the parameters of the noisy channel that corrupts the supervision signal.

## Strengths

The strongest aspect of the paper, in my opinion, is the strong empirical results on fairly standard RL benchmarks. The paper also clearly explains the concept of learning from ""peer agents"". The analysis of the presented methods covers the most obvious questions one might ask about the proposed method, and the appendices cover those well.

## Weaknesses

There are two problems I see with the analysis in the paper that give me concern:

1. The main result in the paper is Theorem 1, which relies on Lemma 1. Lemma 1 essentially states that the proposed new metric (the Peer RL reward), which subtracts a chance-agreement baseline (the second term in equation 1) from the standard agreement objective (the first term in equation 1), is an affine function of the true reward. Theorem 1 then crucially relies on this result for the ensuing convergence analysis. The problem I see is that an analogous result to Lemma 1 -- and therefore Theorem 1 -- can also be proved for the noisy reward. Specifically, equation (12) in Appendix A.1 shows that even the noisy reward 𝔼[\tilde{r}] is an affine function of the true reward, with the exact same slope and just a different intercept. Given that observation, the proof of Theorem 1 can be carried out /mutatis mutandis/ to prove convergence with the naive noisy reward instead of the peer reward. So to me, the analysis in the current paper doesn't really explain ""why"" PeerRL actually performs better than using noisy rewards. It seems to me that more attention must be paid to the intercept terms in equations (12) and (17), which the authors currently do not consider, in order to really understand the gap in performance. This is the first issue I see in the analysis.

2. The second issue I have is with the ""Correlated Agreement"" objective itself. The ""CA with weak supervision"" objective shown in equation (1) seems to give an unfair advantage to completely random, but high-entropy, weak supervision. For example, in the toy example on Page 4 the authors demonstrate that the ""CA metric"" will be 0.375, thereby punishing full agreement with a weak baseline. But if instead the weak baseline were a random and unbiased coin toss, then the expected CA objective would be 1 - 0.25 - 0.25 = 0.5 > 0.375. So the weak supervision signal scores higher simply because it has higher entropy and therefore a lower chance of random agreement. This aspect should be dealt with more thoroughly in the paper.
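To make the concern in point 2 concrete, here is the calculation I have in mind, in my own notation and under my reading of equation (1), so please correct me if I am misreading it. Write the empirical CA objective for an agent that outputs labels $a_i$ against weak labels $\tilde{y}_i$ as $\mathrm{CA} = \frac{1}{n}\sum_i \mathbb{1}[a_i = \tilde{y}_i] - \frac{1}{n}\sum_i \mathbb{1}[a_i = \tilde{y}_{j(i)}]$, where $j(i)$ is an independently drawn, mismatched index. For an agent that simply copies a binary weak labeler whose label marginal is $(p, 1-p)$, the first term equals 1 and the second term concentrates around $p^2 + (1-p)^2$, so $\mathrm{CA} \approx 1 - p^2 - (1-p)^2 = 2p(1-p)$, which is maximized at $p = 1/2$. The fair-coin case above is exactly $p = 1/2$ (score 0.5), while full agreement with any lower-entropy weak labeler scores less (e.g. 0.375), so the score of perfect agreement is governed purely by the entropy of the weak labels rather than by their quality.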
",6,3.0,ICLR2021 +SJx202Uchm,1,S14h9sCqYm,S14h9sCqYm,"Interesting paper, but does it really need RL?","This paper focuses on the alignment of different Knowledge Graphs (KGs) obtained from multiple sources and languages - this task is similar to the Link Prediction setting, but the objective is learning a mapping from the entities (or triples) in one Knowledge Graph to another. In particular, the paper focuses on the setting where the number of available training alignments is small. + +The model and training processes are slightly convoluted: +- Given a triple in the KG A, the model samples a candidate aligned triple from a KG B, from a (learned) alignment distribution. +- The objective is a GAN loss where the discriminator needs to distinguish between real and generated alignments. +The objective is non-differentiable (due to the sampling step), and it's thus trained via policy gradients. + +Question: to me it looks like the whole process could be significantly simpler, and end-to-end differentiable. For instance, the loss may be the discrepancy between the alignment distribution and the training alignments. As a consequence, the whole procedure would be significantly more stable; there would be no need of sampling; or tricks for reducing the variance of the gradient estimates. What would happen with such a model? Would it be on par with the proposed one? + +The final model seems to be better than the considered baselines. +",5,3.0,ICLR2019 +HyJutQ6gz,3,ryY4RhkCZ,ryY4RhkCZ,A paper that needs some rewriting to be more clear to judge its quality,"This paper presents a methodology to allow us to be able to measure uncertainty of the deep neural network predictions, and then apply explore-exploit algorithms such as UCB to obtain better performance in online content recommendation systems. The method presented in this paper seems to be novel but lacks clarity unfortunately. My main doubt comes from Section 4.2.1, as I am not sure how exactly the two subnets fed into MDN to produce both mean and variance, through another gaussian mixture model. More specifically, I am not able to see how the output of the two subnets get used in the Gaussian mixture model, and also how the variance of the prediction is determined here. Some rewriting is needed there to make this paper better understandable in my opinion. + +My other concerns of this paper include: +1. It looks like the training data uses empirical CTR of (t,c) as ground truth. This doesn't look realistic at all, as most of the time (t,c) pair either has no data or very little data in the real world. Otherwise it is a very simple problem to solve, as you can just simply assume it's a independent binomial model for each (t,c). +2. In Section 4.2.1, CTR is modeled as a Gaussian mixture, which doesn't look quite right, as CTR is between (0,1). +3. A detailed explanation of the difference between MDN and DDN is needed. +4. What is OOV in Section 5.3?",4,3.0,ICLR2018 +H1xFcKa_hX,1,HkxAisC9FQ,HkxAisC9FQ,Interesting idea but poorly written,"This paper explores augmenting the training loss with an additional gradient regularization term to improve the robustness of models against adversarial examples. The authors show that this training loss can be interpreted as a form of adversarial training against optimal L2 and L_infinity adversarial perturbations. This augmented training effectively reduces the Lipschitz constant of the network, leading to improved robustness against a wide variety of attack algorithms. 
+ +While I believe the results are correct and possibly significant, the paper is poorly written (especially for a 10 page submission) and comparison with prior work on reducing the Lipschitz constant of the network is lacking. The authors also made little to no effort in writing to ensure the clarity of their paper. I would like to see a completely reworked draft before opening to the idea of recommending acceptance. + +Pros: +- Theoretically intuitive method for improving the model's robustness. +- Evaluation against a wide variety of attacks. +- Empirically demonstrated improvement over traditional adversarial training. + +Cons: +- Lack of comparison to prior work. The authors are aware of numerous techniques for controlling the Lipschitz constant of the network for improved robustness, but did not compare to them at all. +- Poorly written. The paper contains multiple missing figure references, has a duplicated table (Tables 1 and 3), and the method is not explained well. I am confused as to how the 2-Lip loss is minimized. Also, the paper organization seems very chaotic and incoherent, e.g., the introduction section contains many technical details that would better belong in related works or methods sections. + +-------------------------------------------- + +Revision: + +I thank the authors for incorporating my suggestions and reworking the draft, and I have updated my rating in response to the revision. While I believe the organization is much cleaner and easier to follow, there is still much room for improvement. In particular, the paper does not introduce concepts in a logical order for a non-expert to follow (e.g. Reviewer 1) and leaps into the paper's core idea too quickly. I am strongly in favor of exceeding the suggested page limit of 8 pages and using that space to address these concerns. + +A more pressing concern is the evaluation of prior work. The authors added a short section (Section 5.4) comparing their method to that of (Qian and Wegman, 2018). This is certainly a reasonable comparison and the results seem promising, the evaluation lacks an important dimension -- varying the value of epsilon and observing the change in robustness. This is an important aspect for defenses against adversarial examples as certain defense may be less robust but are insensitive to the adversary's strength. Showing the robustness across different adversary strengths gives a more informative view of the authors' proposed method in comparison to others. The evaluation is also lacking in breadth, ignoring other similar defenses such as (Cisse et al., 2017) and (Gouk et al., 2018).",4,3.0,ICLR2019 +Hyl3-5ICFS,2,SylVNerFvr,SylVNerFvr,Official Blind Review #3,"The paper presents an architecture that captures equivariance to certain transformations that happen in text, like synonym words and some simple transformation over word order. + +* General comments: + +Increasing compositional generalization using equivariance is a very interesting idea. Sections 1-3 are well written and the solution of modeling the translation function as a G-equivariant function is well motivated. + +Section 4 is far less clear. In its current form, it is very hard to understand the model construction as well as the design choices. This section should be significantly improved in order for me to increase my score. A direct by-product of the confusing writing is that the experiments cannot be reproduced. 
+ +The experiments show improvement in one out of four tasks, where the single phrase “Around right” is held from the training set. There are no examples, not qualitative analysis, no ablation experiments. Overall, more evidence needed to convince that the approach is useful. In addition to deeper error analysis, the authors can hold out other phrases (e.g., “around left”, and many others). + +* Specific comments which I hope the authors address: + +1. To the best of my understanding, the authors do not explicitly specify the group G that they want to be invariant to. Is it a product of a few cyclic groups? (a cycle for each set of words that are interchangeable?) + +2. The authors suggest using G-convolution, i.e. the group convolution on G. This is in contrast to the (arguably) more popular choice of using linear layers that are G-equivariant (as in, for example, deep sets (Zaheer et al. 2017), Deep Models of Interactions Across Sets (Hartford et al. 2018),Universal invariant and equivariant graph neural networks (Keriven and Peyré ) and in general convolutional layers for learning images). +I have several questions regarding this choice: +2a. Can the authors discuss the differences/advantages of this approach over the approach mentioned above? It seems like the approach mentioned above will be more efficient (as there is no need to sum over all group elements) +2b. In order to use G-convolution, one has to use functions defined on G. Can the authors explain how they look on the input words as functions on G? +2c. How is the G-Conv actually implemented? +2d. Can the authors provide some intuition to what this operator does? + +3. Is the whole model G-equivariant? The authors might want to clearly state this. To the best of my understanding, this is the main motivation of this construction. + +4. It might be helpful for readers that are not familiar with deep learning for NLP tasks to provide a visualization of the full model (can be added to the appendix) + +5. Why are words represented as infinite one-hot sequences? Don’t we assume a finite vocabulary? This is pretty confusing. + +6. As a part of the G-RNN the authors apply a G-conv to the state h_{t-1}. What is the dimension of this hidden state? How does G act on it? + +7. Please explicitly state the dimensions of each input/output/parameter in the network (this can be combined with the illustration above illustration) + +* Minor comments: + +Section 4.1 pointwise activation are in general equivariant only to permutation representations +Page 2 - typo - ‘all short’-> ‘fall short’ +",6,,ICLR2020 +B1xKIsQcnQ,2,SyxMWh09KX,SyxMWh09KX,Interesting approach for few-shot text classification,"This paper presents a meta learning approach for few-shot text classification, where task-specific parameters are used to compute a context-dependent weighted sum of hidden representations for a word sequence and intermediate representations of words are obtained by applying shared model parameters. + +The proposed meta learning architecture, namely ATAML, consistently outperforms baselines in terms of 1-shot classification tasks and these results demonstrate that the use of task-specific attention in ATAML has some positive impact on few-shot learning problems. The performance of ATAML on 5-shot classification, by contrast, is similar to its baseline, i.e., MAML. I couldn’t find in the manuscript the reason (or explanation) why the performance gain of ATAML over MAML gets smaller if we provide more examples per class. 
It would be also interesting to check the performance of both algorithms on 10-shot classification. + +This paper has limited its focus on meta learning for few-shot text classification according to the title and experimental setup, but the authors do not properly define the task itself.",5,3.0,ICLR2019 +anoh47sWMIn,2,sfgcqgOm2F_,sfgcqgOm2F_,My concerns are mainly about the the significance of the proposed method in practice and the comparison to the previous work,"This paper proposes a novel unbiased bidirectional compressor for vanilla SGD to reduce the communication overhead. Both theoretical and empirical analysis are provided. In overall, the paper is technically sound. + +My concerns are mainly about the the significance of the proposed method in practice and the comparison to the previous work: + +1. Dist-EF-SGD [1] and SignSGD [2] both achieve nearly 32x bidirectional compression, while natural compression only achieves roughly 8x in the experiments. Although this paper focuses on unbiased compressors and Dist-EF-SGD and SignSGD are biased compressors, the gap in the compression ratio is hard to ignore, which makes it hard to claim that this paper achieves SOTA in practice. + +2. For the CIFAR-10 experiments, all the models (the smallest is ResNet50) are heavily over-parameterized, which means that there is huge redundancy in the gradients and models themselves. As a result, it will be easy to converge to small training loss for any compressor. Furthermore, since the experiments lacks comparison to other works (the only baseline with compression is standard dithering), it's hard to justify the importance of the proposed compressor. It will be better if the authors could show results on simpler models such as resnet20. + +3. The paper lacks comparison to other methods. I still highly recommend to compare to dist-EF-SGD [1], since it achieves SOTA performance in communication-efficient distributed SGD. For other compressors, I understand that the other unbiased compressors may not have bidirectional compression. However, SignSGD [2] achieves bidirectional 32x compression without error feedback (error feedback improves the convergence of SignSGD, but not neccessary in some cases). Although SignSGD is biased compressor, it satisfies the requirement ""vanilla distributed SGD with bidirectional compression"" mentioned in Section 4, which makes it a good baseline. However, SignSGD is not compared or cited in this paper. + +4. Most of the experiments are relatively small. For the ImageNet experiments in the appendix, no comparison in training time is provided. + + +References: +[1] Zheng, Shuai, Ziyue Huang, and James Kwok. ""Communication-efficient distributed blockwise momentum SGD with error-feedback."" Advances in Neural Information Processing Systems. 2019. +[2] Bernstein, Jeremy, et al. ""signSGD: Compressed Optimisation for Non-Convex Problems."" International Conference on Machine Learning. 2018.",5,3.0,ICLR2021 +HkeLn38Kn7,1,SyeKf30cFQ,SyeKf30cFQ,review,"This paper gives a model for understanding locally connected neural networks. The main idea seems to be that the network is sparsely connected, so each neuron is not going to have access to the entire input. One can then think about the gradient of this neuron locally while average out over all the randomness in the input locations that are not relevant to this neuron. Using this framework the paper tried to explain several phenomena in neural networks, including batch normalization, overfitting, disentangling, etc. 
+ +I feel the paper is poorly written which made it very hard to understand. For example, as the paper states, the model gives a generative model for input (x,y) pairs. However, I could not find a self-contained description of how this generative model works. Some things are described in Section 3.1 about the discrete summarization variables, but the short paragraph did not describe: (a) What is the ""multi-layer"" deterministic function? (b) How are these z_\alpha's chosen? (c) Given z's how do we generate x? (d) What happens if we have z_\alpha and z_\beta and the regions \alpha and \beta are not disjoint? What x do we use in the intersection? + +In trying to understand the paper, I was thinking that (a)(b) The multilayer deterministic function is a function which gives a tree structure over the z_\alpha's, where y is the root. (I have no idea why this should be a deterministic function, intuitively shouldn't y be chosen randomly, and each z_\alpha chosen randomly conditioned on its parent?) (c) there is a fixed conditional distribution of P(x_\alpha|z_\alpha), and I really could not figure out (d). The paper definitely seems to allow two receptive fields to intersect as in Figure 1(b). + +Without understanding the generative model, it is impossible for me to evaluate the later results. My general comments there is that there are no clear Theorems that summarizes the results (the Theorems in the paper are all just Lemmas that are trying to work towards the final goal of giving some explanations, but the explanations and assumptions are not formally written down). Looking at things separately (as again I couldn't understand the single paragraph describing the generative model), the Assumption in Theorem 3 seems extremely limiting as it is saying that x_j is a discrete distribution (which is probably never true in practice). I wouldn't say ""the model does not impose unrealistic assumptions"" in abstract if you are going to assume this, rather the model just makes a different kind of unrealistic assumptions (Assumptions in Theorem 2 might be much weaker, but it's hard to judge that). + +==== After reading the revision + +The revised version is indeed more clear about how the teacher network works, and I have tried to understand the later parts of the paper again. The result of the paper really relies on the two assumptions in Theorem 2. Of the two assumptions, the first one seems to be intuitive (and it is OK although exact conditional independence might be slightly strong). The second assumption is very unclear though as it is not an assumption that is purely about the model/teacher network (which are the x and z variables), it also has to do with the learning algorithm/student network (f's and g's). It is much harder to reason about the behavior of an algorithm on a particular model and directly making an assumption about that in some sense hides the problem. The paper mentioned that the condition is true if z is fine-grained, but this is very vague - it is definitely true if z is super fine-grained to satisfy the assumption in Theorem 3, but that is too extreme. + +Overall I still feel the paper is a bit confusing and it would benefit from having a more concrete example. 
I like the direction of the work but I can't recommend for recommendation at this stage.",5,3.0,ICLR2019 +H1xR7sfNe,1,HJ0UKP9ge,HJ0UKP9ge,Review: New deep end-to-end attention architecture for machine comprehension with novel aspects and convincing results,"The paper presents an architecture for answering questions about text. The paper proposes a novel architecture which jointly attends over the context and the query. + +1. The paper is clearly written and illustrated. + +2. The architecture is new and incorporates novel and interesting aspects: +2.1. The attention is not summarized immediately but the features are only weighted with the attention to not loose information. +2.2. The approach estimates two directions of attention, by maximizing in two directions of the similarity matrix S – towards the context and towards the query. + +3. The paper extensively evaluates the approach on three datasets SQuAD, CNN and Daily Mail. In all cases showing state-of-the-art performance. It is worth noting that the SQuAD and the CNN/Daily Mail are slightly different tasks and it is positive that the model works well in both scenarios. +3.1. It is worth noting that the paper even compares mainly favorably to concurrent work (including other ICLR 2017 submissions), recently published/listed on the evaluation server for SQuAD + +4. The paper also includes an ablation study and qualitative results. + +5. I think the paper provides a good discussion of related work and I like that it points out the relations to Visual question answering (VQA). It would be interesting to see how the architecture can be adapted and works on the VQA task. + +6. The authors revised the paper based on the comments from reviewers and others. + +7. It would be interesting to see more qualitative results, e.g. in an appendix. +7.1. Fig. 3 seems to miss the predicted answer. +7.2. It would also be interesting to compare the results of different approaches, maybe in a more compact format. + +Given the new architecture with novel aspects and the strong experimental evaluation I recommend to accept the paper. + +",8,5.0,ICLR2017 +rkKuj7zgz,1,HyRnez-RW,HyRnez-RW,Intuitive model for scaling question answering ,"The authors present a scalable model for questioning answering that is able to train on long documents. On the TriviaQA dataset, the proposed model achieves state of the art results on both domains (wikipedia and web). The formulation of the model is straight-forward, however I am skeptical about whether the results prove the premise of the paper (e.g. multi-mention reasoning is necessary). Furthermore, I am slightly unconvinced about the authors' claim of efficiency. Nevertheless, I think this work is important given its performance on the task. + +1. Why is this model successful? Multi-mention reasoning or more document context? +I am not convinced of the necessity of multi-mention reasoning, which the authors use as motivation, as shown in the examples in the paper. For example, in Figure 1, the answer is solely obtained using the second last passage. The other mentions provide signal, but does not provide conclusive evidence. Perhaps I am mistaken, but it seems to me that the proposed model cannot seem to handle negation, can the authors confirm/deny this? I am also skeptical about the computation efficiency of a model that scores all spans in a document (which is O(N^2), where N is the document length). Can you show some analysis of your model results that confirm/deny this hypothesis? + +2. 
Why is the computational complexity not a function of the number of spans? +It seems like the derivations presents several equations that score a given span. Perhaps I am mistaken, but there seems to be n^2 spans in the document that one has to score. Shouldn't the computational complexity then be at least O(n^2), which makes it actually much slower than, say, SQuAD models that do greedy decoding O(2n + nm)? + +Some minor notes +- 3.3.1 seems like an attention computation in which the attention context over the question and span is computed using the question. Explicitly mentioning this may help the reading grasp the formulation. +- Same for 3.4, which seems like the biattention (Seo 2017) or coattention (Xiong 2017) from previous squad work. +- The sentence ""We define ... to be the embeddings of the l words of the sentence that contains s."" is not very clear. Do you mean that the sentence contains l words? It could be interpreted that the span has l words. +- There is a typo in your 3.7 ""level 1 complexity"": there is an extra O inside the big O notation.",7,4.0,ICLR2018 +tgkKGCCTyR,1,ztMLindFLWR,ztMLindFLWR,The paper proposes to explore powerful aggregators to improve the expressiveness of the GNN. It is quite difficult to read the paper. Not sure that the numerical results show significant improvement. ,"Summary of the paper: The main objective of the paper is to improve the expressiveness of the GNN by exploring powerful aggregators. The requirements to build more powerful aggregators are analysed. It is closely related to finding strategy for preserving the rank of hidden features, and implies that basic aggregators correspond to a special case of low-rank transformations. + +Strengths: The idea is promising. A new GNN formulation is proposed: the aggregation is represented as the multiplication of hidden feature matrix of neighbours and the aggregation coefficient matrix. + +Weaknesses: The strength mentioned above (multiplication of hidden features values and the aggregation) is also a weakness: I have an impression that already known results are presented in a much more complex way. The paper is not easy to follow in general ( e.g., the sentence ""The difference is that each dimension of hidden features is aggregated with an independent weighted aggregator which works like a comb"".) + +The paper needs to be throughly read: use \citep instead of \cite where it is necessary. + +The improvements reported in the experimental section seem to be not really significant. + +Questions: Could you provide an intuition for the definition of the distinguishing strength? (Section 3.1). + +",5,3.0,ICLR2021 +T-b-dTEcdmv,1,t86MwoUCCNe,t86MwoUCCNe,Clean Distributed Mean Estimation approach,"The paper considers distributed mean estimation in two variations (mean estimation and variance reduction), applicable for instance in distributed learning where several machines needs to figure out the mean of their locally computed gradients. + +The paper measures the quality of an estimator in terms of the input variance, where earlier work has implicitly assumed that the input across the machines had mean zero, and instead measured quality in terms of the inputs +In that sense the approach takes in this paper generalizes previous work. + +The authors provide matching upper and lower bounds for the the two problems considered, as well as a practical implementation of the general form of algorithms presented. Finally, experiments back up the quality of the approach considered. 
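To spell out that distinction in symbols (my paraphrase of the setup, not the authors' exact statement): with inputs $x_1, \dots, x_n$ held by the $n$ machines and $\bar{x} = \frac{1}{n}\sum_i x_i$, the guarantees here scale with the input variance $\frac{1}{n}\sum_i \lVert x_i - \bar{x} \rVert^2$, whereas bounds that scale with $\frac{1}{n}\sum_i \lVert x_i \rVert^2$ are only tight when the inputs are already centered around zero, which is exactly the implicit assumption the paper removes.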
+ +Pros: +- I think the definition of the problems is natural and clean and the right one to consider (instead of assuming zero centered inputs). +- The approach makes application of these algorithms much simpler as the zero mean assumption is removed and does not need to be handled separately +- The general latticed based algorithms are natural and very reasonable. +- The efficient algorithm instantiation of the general approach is nice. +- It is great that the authors provide matching upper and lower bounds and in general the works seems very thorough. +- The experiments show the applicability of the general approach. + +Cons: +- The actual algorithm used does not match the optimal bounds given. +- Given the nature of the problem the constants may be relevant instead of using O notation in particular in the actual algorithm presented and used in experiments. + +The cons i have listed i think are all small and overall i think this is a good paper as it provides a clean practically applicable version of the problem, the bounds shown are tight and an actual new algorithm is provided and shown to have good practical qualities. + +Question. +Definition 9, the packing radius. Maybe i misunderstand. Is it supposed to be the smallest r such that two balls of radius r centered around any two different lattices points do not intersect? Because that is not what i read from the definition, but that is used in the proofs. +",7,4.0,ICLR2021 +3dWM9MaVXpb,1,KTlJT1nof6d,KTlJT1nof6d,Empirical result studying spectral initialization and Frobenius decay on factorized NN,"This paper studies initialization and regularization in factorized neural networks (reparameterize a weight matrix by the product of several weight matrices). The authors proposed spectral initialization, that is to initialize the factorized matrices using the SVD of the un-factorized matrix. The authors also proposed Frobenius decay that is to regularize the Frobenius norm of the product of the factorized weight matrices. The motivation is to simulate the routines for non-decomposed counterparts. The authors empirically showed the effectiveness of spectral initialization and Frobenius decay in different applications: compressed model training, knowledge distillation, and multi-head self-training. + +I think it’s important to study the initialization and regularization for factorized neural networks. A priori, it needs different initialization and regularization methods due to different architecture compared with its non-decomposed counter-part. This paper gave very simple and natural solutions and was able to show its effectiveness in experiments. + +I also have some questions as below: +1. In the experiments in section 5 (knowledge distillation), default initialization is used instead of spectral initialization. I wonder if SI leads to a bad performance here. If that’s the case, it requires more explanation of why SI fails in this setting. +2. In Figure 1, it seems FD is a stronger regularizer compared with default weight decay. It seems if the regularization coefficient is carefully tuned for each regularizer, the benefits of FD is actually not very significant. Also, what’s ""no decay (normalized)""? +3. In section 2, the definition of the factorized CNN is not very clear to me. It might be good to give more detailed definitions here. +4. Spectral initialization requires computing SVD of the weight matrix. If the matrix dimension is high, this step can be very time-consuming. 
I wonder if there is any more efficient way to construct the factorized matrices so that their product is still as i.i.d. Gaussian matrix. Because we don't need to compute the SVD for an arbitrary matrix, what we need is only to make sure that the product of the factorized matrixes is distributed as i.i.d. Gaussian. ",6,3.0,ICLR2021 +H17N5b5lf,1,rJ1RPJWAW,rJ1RPJWAW,"Very nice paper showing how large networks can actually be ""simple"", in spite of their large capacity.","Summary: +This paper presents very nice experiments comparing the complexity of various different neural networks using the notion of ""learnability"" --- the learnability of a model (N1) is defined as the ""expected agreement"" between the output of N1, and the output of another model N2 which has been trained to match N1 (on a dataset of size n). The paper suggests that the learnability of a model is a good measure of how simple the function learned by that model is --- furthermore, it shows that this notion of learnability correlates well (across extensive experiments) with the test accuracy of the model. + +The paper presents a number of interesting results: +1) Larger networks are typically more learnable than smaller ones (typically we think of larger networks as being MORE complicated than smaller networks -- this result suggests that in an important sense, large networks are simpler). +2) Networks trained with random data are significantly less learnable than networks trained on real data. +3) Networks trained on small mini-batches (larger variance SGD updates) are more learnable than those trained on large minibatches. + +These results are in line with several of the observations made by Zhang et al (2017), which showed that neural networks are able to both (a) fit random data, and (b) generalize well; these results at first seem to run counter to the ideas from statistical learning theory that models with high capacity (VC dimension, radamacher complexity, etc.) have much weaker generalization guarantees than lower capacity models. These results suggest that models that have high capacity (by one definition) are also capable of being simple (by another definition). These results nicely complement the work which studies the ""sharpness/curvature"" of the local minima found by neural networks, which argue that the minima which generalize better are those with lower curvature. + +Review: +Quality: I found this to be high quality work. The paper presents many results across a variety of network architectures. One area for improvement is presenting results on larger datasets (currently all experiments are on CIFAR-10), and/or on non-convolutional architectures. Additionally, a discussion of why learnabiblity might imply low generalization error would have been interesting (the more formal, the better), though it is unclear how difficult this would be. + +Clarity: The paper is written clearly. A small point: Step 2 in section 3.1 should specify that argmax of N1(D2) is used to generate labels for the training of the second network. Also, what dataset D_i is used for tables 3-6? Please specify. + +Originality: The specific questions tackled in this paper are original (learnability on random vs. real data, large vs. small networks, and large vs. small mini-batch training). But it is unclear to me exactly how original this use of ""learnability"" is in evaluating how simple a model is. It seems to me that this particular use of ""learnability"" is original, even though PAC learnability was defined a while ago. 
+ +Significance: I find the results in this paper to be quite significant, and to provide a new way of understanding why deep neural networks generalize. I believe it is important to find new ways of formally defining the ""simplicity/capacity"" of a model, such that ""simpler"" models can be proven to have smaller generalization gap (between train and test error) relative to more ""complicated"" models. It is clear that VC dimension and radamacher complexity alone are not enough to explain the generalization performance of neural networks, and that neural networks with high capacity by these definitions are likely ""simple"" by other definitions (as we have seen in this paper). This paper makes an important contribution to this conversation, and could perhaps provide a starting point for theoreticians to better explain why deep networks generalize well. + +Pros +- nice experiments, with very interesting results. +- Helps explain one way in which large networks are in fact ""simple"" + +Cons +- The paper does not attempt to relate the notion of learnability to that of generalization performance. All it says is that these two metrics appear to be well correlated.",7,4.0,ICLR2018 +sTco47Y0ify,2,qVyeW-grC2k,qVyeW-grC2k,"Interesting and thorough analysis, unclear whether this is a useful benchmark","*Summary*: This paper proposes a new benchmarks for the host of recently-proposed transformer variants focused on efficiency and scaling to longer sequence lengths (xformers). The authors reimplement and study the performance of 10 xformers on their benchmark. Furthermore, the authors conduct a study of the memory consumption and speed of the models on their text classification benchmark. + +*Strengths*: Detailed and thoughtful comparison of 10 models aimed at more-or-less alleviating the same problems. The tasks span a wide range of sequential data modalities. + +*Weaknesses*: It’s not entirely clear to me that LRA is best-positioned as a benchmark, rather than an analysis tool; the authors themselves seem to also note this (“Hence, the results provided in this paper are not meant to be a final authoritative document on which xformer is the best”). This toolkit seems more useful to me as an analysis tool---the choice of tasks itself also reflects the benchmark’s values. For instance, a NLP researcher might care more about performance of ListOps, since it’s possible that these results have more relevance for the type of structure found in natural language. + +Instead of seeking to rank models (calling LRA a “benchmark” and having an “average” LRA score directly feeds into this), I’d like to see this toolkit shift toward being more customizable for individual user questions and values; similarly, this paper might take a more analytical approach to empirically studying the performance of these 10 models on this starter set of tasks, versus trying to rank them. + +Despite this lack of clarity of goals, I think that this toolkit and study offer useful contributions. + +*Recommendation*: 7 . While it remains unclear to me that this paper has a useful benchmark contribution, the analysis of existing models is valuable. Furthermore, the observation that inherent tradeoffs in performance and speed make no model the one-size-fits-all option is important; in light of this, I’d like to see the authors move towards making their toolkit better for determining what the “right” option is for a given user’s use case. 
+ +Comments and Questions: + +Figure 2: y axis label should be “Span”, not “Apan” + +Some abstraction is taken with respect to hyperparameters---a single set of hyperparameters is used across all models. Do you have a sense of how much this can potentially impact performance? For instance, for a single task, comparing results when you use this single set of hyperparameters vs individually tuning models for the task. + +Do you think that future developers of xformers should “hillcimb” on LRA?",7,4.0,ICLR2021 +ryxapKiBhX,1,SJxJtiRqt7,SJxJtiRqt7,"A good idea, poor development and results.","The authors present a novel method for generating images from sounds using a two parts model composed by a fusion network, aka. multi-modal layers, for learning sound and visual features in a common semantic space, and two conditional GANs for converting sound features into visual features and those into images. To validate their approach they created an ad-hoc dataset, based on Flickr-SoundNet dataset, which contains 104K pairs of sounds and images with matching scene content. Their model was trained as two separate models, the fusion network was trained to classify both images and sounds minimizing their cross-entropy and their L1 distance, while the two conditional GANs were trained until convergence penalizing the discriminator to prevent fast convergence. + +Although the idea of generating images from sounds with the aid of Generative Adversarial Networks is quite novel and interesting, the paper exhibits several problems starting with the lack of clarity explaining the purpose of the proposed method and the contributions of the work itself. Overall, the idea is good but not well developed. Introduction should present more clearly the problem and framework. + +In the related work section the authors omitted some relevant recent prior works such as “Look, Listen and Learn” paper by Arandjelović and Zisserman presented on ICCV’17, “Objects that Sound” by Arandjelović and Zisserman presented on ECCV’18, “Audio-Visual Scene Analysis with Self-Supervised Multisensory Features” by Owens and Efros presented on ECCV’18, and “Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input” by Harwath et al. also presented presented on ECCV’18. These works propose different methods for aligning visual and sound features. + +There are also several concerns on the validity of the results: 1) none of the results achieved by training their multi-modal layers were validated against a baseline, e.g. evaluating the quality of the learned visual features against VGG or a simple GAN instead of two stacked conditional GANs, 2) it is not clear why they learned features minimizing L1 loss + Cross-Entropy while using L2 distance to address the quality of their learned features, a simple way of doing so would be evaluating their retrieval capabilities using any standard measure from the retrieval community, e.g. the normalized discriminative cumulative gain (nDCG) or the classical mean-average precision (mAP) as proposed in “Objects that Sound”, 3) the authors assume that using a conditional GAN is suitable for generating images from visual features, but they don’t provide any quantitative results supporting this claim, they only provide a few successful qualitative results and elaborate their model from there. 4) Ablation is completely missing: it would be interesting to prove the effective contribution for i) the multi-modal fusion ii) the two-steps of image generation iii) the L_ losses for the two GANs. 
+ +There are many missing citations throughout the paper, in particular: 1) the concatenation of visual and sound features followed by a fusion network for learning features in a common semantic space was already proposed on “Look, Listen and Learn”, 2) when the authors describe their strategy for sound features extraction in section four, they never mentioned that the idea of using pool5 layer features was already introduced by SoundNet authors, and 3) in section 5.3 when they mention that using a conditional GAN to convert between two different feature domains it might be that the discriminator may converge too rapidly while the generator does not learn sufficiently. + +Finally although using an ad-hoc extremely simplified dataset with pairs of images and sounds matching scene content, the complete model is able to generate images which achieve only a 8,9% matching rate for the top 3 predicted classes. Given that the dataset was created with 100% matching on the top 3 scores for sound and images, the results are definitely poor. +",4,5.0,ICLR2019 +CmnOLsrs213,3,PoP96DrBHnl,PoP96DrBHnl,Review,"### Summary of Contributions + +The paper proposes the gradient descent TD difference learning (GDD) algorithm which adds a term to the MSPBE objective to constrain how quickly a value function can change. They argue that their approach has a quicker convergence rate, and empirically demonstrate in several examples with linear function approximation that it substantially improves over existing gradient-based TD methods. + +### Review + +I like the simplicity of the proposed method, and its intuitive interpretation as a value-based trust region. However, I have the following questions and concerns: + +1) There doesn't seem to be any information regarding how many independent runs were performed in the empirical evaluation, and there was no no reported statistical significance testing. Can the authors clarify this information, and comment on the significance of the results? + +2) While it led to improvements over GTD2, it largely didn't improve over regular (semi-gradient) TD apart from Baird's counterexample, which was designed to make TD fail. As such, I don't think the addition of a new parameter was convincingly justified. Some of the results seemed to suggest that the improvement grew as the state space/complexity increased, that it may be the case that the evaluation falls a bit short on exploring more complex environments. While the breadth of the ablation studies is really nice, we observe similar trends in many neighbouring figures that the space in the main text from showcasing the many different configurations could be summarized with representative examples, and the additional space could have been used to provide some additional experiments/insights (like those suggested in the discussion). + +3) From how modular the addition of the term is to the objective, have the authors tried incorporating the regularization to semi-gradient TD? Is there anything about the semi-gradient update that bars its use? TD generally performed really well in the paper's evaluation (outside of Baird's counterexample) that it would make a stronger case if the extension was demonstrated to be more generally applicable, and that it consistently improved over the methods it was applied to. This sort of ties into what was described in 2), where what was presented seems to fall a bit short, and how the space could have showcased a bit more. 
+ +4) While the paper's focus was on the case of linear function approximation, can the authors comment on how readily the approach can be extended to the non-linear case? GTD methods have not seen as much adoption as their approximate dynamic programming counterparts when combining TD methods with non-linear function approximation, that it can raise questions as to how the methods scale to more complicated settings. + +Given the above, I am erring toward rejection at this time. I think 1) is a rather significant issue that needs to be addressed, and I'm willing to raise my score if that, and my other concerns, can be sufficiently addressed. + +----- Post Discussion ----- + +Taking the other reviews and the authors' response into account, I still maintain my score. While I agree that it's good to be thorough in something clear and simple, it can still be done to a point of redundancy, and consequently seem less thorough in the overall picture and claims made. I'm still largely unsure on the choice to only apply the supposedly modular extension to GTD2, and not try it with TD which seemed like a clearer winner (apart from Baird's counterexample). As others suggested, there are additional methods which might be good to compare to, and other evaluation metrics might make more sense for the claims being made. Many of my concerns were largely brushed off as future work, that little got addressed- without having to carry out the experiments, high level comments/current thoughts could be provided regarding how readily the approach can extend to the scenarios suggested, or if there are nuances that need to be worked out, etc.",5,4.0,ICLR2021 +BygR5oLw27,2,ByxkCj09Fm,ByxkCj09Fm,An interesting paper but can still be improved.,"This paper proposes a new soft negative log-likelihood loss formulation for multi-class classification problems. The new loss is built upon the taxonomy graph of labels, which is provided as external knowledge, and this loss provides better semantic generalization ability compared to a regular N-way classifier and yields more accurate and meaningful super-class predictions. + +This paper is well-written. The main ideas and claims are clearly expressed. The main benefits of the new loss are caused by the extra information contained by the taxonomy of labels, and this idea is well-known and popular in the literature. Based on this reason, I think the main contribution of this paper is the discussion on two novel learning settings, which related to the super-classes. However, the formulation of the new soft NLL loss and the SG measurement involves lots of concepts designed based on experiences, so it’s hard to say whether these are the optimal choices. So, I suggest the authors discuss more on these designs. +Another thing I concern about is the source of label taxonomy. How to efficiently generate the taxonomy? What if the taxonomy is not perfect and contains noises? Will these significantly affect the models’ performance? I think it’s better to take these problems into consideration. +In conclusion, I think this is an interesting paper but can still be improved.",5,3.0,ICLR2019 +H1gI2HOCYB,2,SkeYUkStPr,SkeYUkStPr,Official Blind Review #2,"This paper proposes a method to cluster subjects based on the latent lifetime distribution. The proposed model clusters by maximizing the empirical divergence between different clusters. + +The problem setting of this paper is not clear to me. I am not sure which variables are observed and which are not. 
For example, in the Friendster experiment, the data of 5 months from joining is used for clustering. However, the termination windows are chosen to be 10 months. Therefore, it is clear that the observed data will not contain the termination signals, and I do not believe the training of the model is possible, without observing any termination signals. In the paper, do we consider only one type or multiple types of events? Is $M_{k, i}$ a vector that represents the attributes or properties of an event? + +Some details of the model are not clear to me. In Equation (2), the input of the neural network differs in length across different subject $u$, because the number of observed events for each subject is different. +How does the proposed neural network take inputs of different lengths? +How the non-decreasing function $\xi^{(u)}$ is defined in Section 3.2? Is it a function of the observed data for each subject? + +How the empirical distribution $\hat{S}_i$ in Equation (4) is computed is also not clear to me. How $\hat{S}_i$ is a vector? Is it constructed by concatenating $\hat{S}_k(W_1, W_2; D)[t] $ with different $t$? How to normalize $\hat{S}_i$ such that it is a valid probabilistic distribution? Since $\hat{S}_i$ is high dimensional, it looks very challenging to estimate the joint distribution. + +The overall objective function is given by Equation (4), is it correct? In Equation (4), why should we compute the minimum values across all possible pairs of clusters rather than the summation of all pairs? If Equation (4) is the overall objective function, then it looks like the model does not contain a component that maximizes a likelihood function. How is it guaranteed that the model will fit the data? It looks like the model will converge to a trivial solution that $\beta$ is a constant such that $\beta = 1$ for one cluster and $\beta = 0$ for another cluster, if the likelihood function is not involved. This will give a maximum divergence between distributions. + +In summary, it seems that numerous technical details are missing and the paper might contain technical flaws. I do not suggest the acceptance of this paper. + +",6,,ICLR2020 +rylUQbwIiB,3,HkghoaNYPB,HkghoaNYPB,Official Blind Review #1,"This paper describes ""AlgoNets"", which are differentiable implementations of classical algorithms. Several AlgoNets are described, including multiplication algorithm implemented in the WHILE programming language, smooth sorting, a smooth while loop, smooth finite differences and a softmedian. + +The paper additionally presents RANs (similar to GANs but with an AlgoNet embedded) and Forward AlgoNets (where the the AlgoNet is embedded in a feedforward net). + +The smooth implementations normally amount to replacing hard functions with soft equivalents, for example ""if"" conditions are replaced by logistic sigmoids. + +The research direction in this paper is very interesting and could lead to important advancements, however a strong argument needs to be presented to the readers about why this way of making algorithms smooth is better than other published or obvious techniques. + +The argument could be theoretical, proving for example faster convergence under certain assumptions, or it could be empirical, showing that the method achieves better results than other techniques on some benchmarks. 
However, I could not see any such arguments in this paper.",1,,ICLR2020
+ +Generally this paper is clearly written, addressing an interesting problem, presents some relevant supporting experiments, and seems original (I'm less familiar with other debiasing work, so I leave a better originality estimate up to the other reviewers). + +The regularizer is described as a neural network, but that seems unnecessary. If it's parameterized as a neural network, and the weights are updated, then I think this is just part of the model and not really a regularizer. In addition, the motivation doesn't seem great -- it's not clear that we should be directly comparing (the mutual information of) a single word embedding and a contextual word embedding (which is built from the full sentence). All that said, empirically it seems to work well at debiasing the word embeddings, so it's a valuable contribution. + +I think the full list of debiasing dimensions isn't included -- maybe I missed it somewhere? That should definitely be included in the paper (not just as a citation), and if there isn't space it should be added to the appendix. + +The three fine-tuning datasets could be improved, CoLA especially is known to have really high variance even just fine-tuning BERT multiple times with different random seeds. Since there are two other fine-tuning datasets there is sufficient evidence that the approach works well. As this is a fairly general approach, clearly written up, and seems to work well, I recommend it for acceptance. The experiments are a little light, and the regularization approach is a little unorthodox, and I would increase my score if there were further experiments (on other fine-tuning tasks and measuring other types of bias) and the regularization was better motivated. + +Edit: after reading the author response, my score remains unchanged. ",6,4.0,ICLR2021 +H1th_uZNg,2,HJTzHtqee,HJTzHtqee,A solid empirical study,"This paper proposes a compare-aggregate framework that performs word-level matching followed by aggregation with convolutional neural networks. It compares six different comparison functions and evaluates them on four datasets. Extensive experimental results have been reported and compared against various published baselines. + +The paper is well written overall. + +A few detailed comments: +* page 4, line5: including a some -> including some +* What's the benefit of the preprocessing and attention step? Can you provide the results without it? +* Figure 2 is hard to read, esp. when on printed hard copy. Please enhance the quality. +",7,5.0,ICLR2017 +ByPQQOX1G,1,ryepFJbA-,ryepFJbA-,"A simple regularization term for training GANs is introduced, with good numerical performance.","Summary +======== +The authors present a new regularization term, inspired from game theory, which encourages the discriminator's gradient to have a norm equal to one. This leads to reduce the number of local minima, so that the behavior of the optimization scheme gets closer to the optimization of a zero-sum games with convex-concave functions. + + +Clarity +====== +Overall, the paper is clear and well-written. However, the authors should motivate better the regularization introduced in section 2.3. + + +Originality +========= +The idea is novel and interesting. In addition, it is easy to implement it for any GANs since it requires only an additional regularization term. Moreover, the numerical experiments are in favor of the proposed method. + + +Comments +========= +- Why should the norm of the gradient should to be equal to 1 and not another value? 
Is it possible to improve the performance if we introduce an additional hyper-parameter instead?
A lot of space is allocated to a very detailed technical presentation of “modelling interactions”, while the high-level picture of how the model functions remains hard to grasp. For example, Figure 2 is confusing because it has a “Stacked Span Prediction” pathway that leads nowhere. Reading the text, I find that the output of this part is apparently used for “modelling interaction between programs and number entities”, which is in the right part of the figure. These basic high-level architectural decisions are hard to understand because the reader is overwhelmed by technical details, such as sliding windows and scaling factors for various attentions.
On the other hand, the proposed approach appears to be incremental: ADPS adds a simple sequential update structure (of a context vector) to DPS, which can be described by only two equations (6 and 7). The simplicity of the changes proposed (over DPS) is not a limitation in itself, but it should be accompanied by an in-depth theoretical analysis, a convincing qualitative discussion, or _extensive_ experiments demonstrating the practical relevance of the proposed approach.
With a single-crop testing scheme, ResNet101 yields a top-1 error of 22% and a top-5 error of 6% (see Table 5 of Xie et al., 2017, aka ResNeXt). However, the authors report 23.6% and 7.1%, respectively, for their ResNet101. The performance reported by the ResNe(X)t authors weakens the empirical results of the LM-architecture.",5,3.0,ICLR2018
Since the parameter \beta in the objective of RAE is tunable, for a fair comparison the authors need to find the best \sigma for CV-VAE in order to support the conclusion that explicit regularization is better than CV-VAE.
Although the paper shows many comparisons between the chaotic systems (GRUs & LSTMs) and the stable system (the proposed CFN model), the reviewer is not fully convinced by the main claim of this paper, namely the nuance that chaotic behaviour gives a dynamical system rich representational power but also makes the system too unstable. In the paper, the LSTM shows very sensitive behaviour, even when a very small amount of noise is added to the input. However, it still performs surprisingly well despite this chaotic behaviour.
Although it is a common baseline, some choices are not clear: why use a FFNN instead of a CNN, which performs better on this dataset; how the data are presented as temporal series – this applies to Temporal MNIST too; why the performance on Temporal MNIST – which should be a more suitable dataset – is worse than on standard MNIST; and what the meaning of the right column of Figure 5 is, since it is just a linear combination of the GOps results. For the second dataset, some points are not clear either: why the labels and the pictures seem not to match (in Appendix E); and why there are more training iterations with spikes compared with the non-spiking case. Overall, the paper is mathematically sound, except for the meaning of the “future updates”, which probably deserves a clearer explanation. Moreover, I don’t see why the learning rule equations (14-15) are described in the appendix while they are constantly referred to in the main text. The final impression is that the problem of the dynamical range of the hidden layer activations is not fully resolved by the empirical solution described in Appendix D: perhaps this problem affects CNNs more than FFNs.
+ +Why developing a federated learning algorithm is a promising approach? Please elaborate. + +The second part of the intro turns into a detail technical analysis of the algorithm components, and so far we haven’t seen the algorithm so it all remains a technical abstract discussion that takes away the main messages. + +The algorithm is in the appendix, so the description and analysis is made on an item that has not been presented in the main text. + +The way the result is presented makes it look like the proposed method is a concatenation of other results, rather than the solution of a technical challenge in the problem. + +Numerical results are well presented,",6,3.0,ICLR2021 +rJggAvorhQ,2,H1lIzhC9FX,H1lIzhC9FX,The expanded generator will also raise the storing problem as that in episodic memory strategy,"This paper attempts to mitigate catastrophic problem in continual learning. Different from the previous works where episodic memory is used, this work adopts the generative replay strategy and improve the work in (Serra et al., 2018) by extending the output neurons of generative network when facing the significant domain shift between tasks. + +Here are my detailed comments: +Catastrophic problem is the most severe problem in continual learning since when learning more and more new tasks, the classifier will forget what they learned before, which will be no longer an effective continual learning model. Considering that episodic memory will cost too much space, this work adopts the generative replay strategy where old representative data are generated by a generative model. Thus, at every time step, the model will receive data from every task so that its performance on old tasks will retain. However, if the differences between tasks are significant, the generator cannot reserve vacant neurons for new tasks or in other words, the generator will forget the old information from old tasks when overwritten by information from new tasks. Therefore, this work tries to tackle this problem by extending the output neurons of the generator to keep vacant neurons to retain receive new information. As far as I am concerned, this is the main contribution of this work. + +Nevertheless, I think there are some deficiencies in this work. + +First, this paper is not easy to follow. The main reason is that from the narration, I cannot figure out what is the idea or technique of other works and what is the contribution of this paper. For example, in Section 4.1, I am not sure the equation (3), (4), (5), (6) are the contributions of this paper or not since a large number of citations appear. + +Second, the authors mention that to avoid storing previous data, they adopt generative replay and continuously enlarge the generator to tackle the significant domain shift between tasks. However, in this way, when more and more tasks come, the generator will become larger and larger. The storing problem still exists. Generative replay also brings the time complexity problem since it is time consuming to generate previous data. Thus, I suggest the authors could show the space and time comparisons with the baseline methods to show effectiveness of the proposed method. + +Third, the datasets used in this paper are rather limited. Three datasets cannot make the experiments convincing. In addition, I observe that in Table 1, the proposed method does not outperform the Joint Training in SVHN with A_10. I hope the author could explain this phenomenon. 
Furthermore, I do not see a legend in Figure 3, and thus I cannot figure out what the curves represent.
",5,4.0,ICLR2021 +5w4Riya4Skg,3,uV7hcsjqM-,uV7hcsjqM-,"Review for ""Contrastive Code Representation Learning""","This paper studies the self-supervised code functional representation learning and proposes a method called ContraCode. ContraCode utilizes some code functionality invariant transformations to generate positive pairs from the same code and negative pairs from different codes. After that, these codes pairs will be used to do the contrastive pre-training. Experiment results based on two tasks are reported. + +Pros: +- The task of code functional representation learning is important and valuable. +- The transformations proposed in this paper may produce some vaviance to the code while maintaining the same functionality. + +Cons: +- The superiority of the proposed method is unclear. Many self-supervised code representation learning methods are mentioned in the introduction, such as [Ben-Nun et al., 2018; Feng et al., 2020; Kanade et al., 2020]. However, this paper fail to discuss of the differences (especially the advantages) between ContraCode and other self-supervised methods empirically. +- Since no addtional supervision is evolved, unsupervised feature leanring models are good competing baselines.. The authors are strongly recommended to compare the performance of ContraCode with other unsupervised methods under the same training dataset (both augmented). +- The key question is the whether the self-supervision generated by such transformation really makes any difference. Some transformation only change the formatting, which usually resulting the same feature representation because the formatting information is usually not considered in most of the feature learning methods for code. It appears that by applying the set of transformation, the code would not differ from its previous appearance much. Consequently, the feature representations generated by some unsupervised method from the original code and its transformed counterpart could be very similar to each other EVEN IF no self-supervision is enforced, which means self-supervision is not necessary. Please clarify this be providing empirical evidences such as the portion of the changed lines or tokens from the original code, the similarity between the original code and its transformed counterpart over any two different pieces of code based on the features learned in some unsupervised way (with the same scale of training data). +",4,4.0,ICLR2021 +S1DWSMU4l,3,HyoST_9xl,HyoST_9xl,nice new training method for deep networks,"Training highly non-convex deep neural networks is a very important practical problem, and this paper provides a great exploration of an interesting new idea for more effective training. The empirical evaluation both in the paper itself and in the authors’ comments during discussion convincingly demonstrates that the method achieves consistent improvements in accuracy across multiple architectures, tasks and datasets. The algorithm is very simple (alternating between training the full dense network and a sparse version of it), which is actually a positive since that means it may get adapted in practice by the research community. + +The paper should be revised to incorporate the additional experiments and comments from the discussion, particularly the accuracy comparisons with the same number of epochs. ",8,3.0,ICLR2017 +BJez1UBa27,2,H1gMCsAqY7,H1gMCsAqY7,Very exciting work,"This paper presents a straightforward looking approach for creating a neural networks that can run under different resource constraints, e.g. 
less computation with a lower-quality solution versus more computation with a high-quality solution, while all the networks share the same filters. The idea is to share the filters of the cheapest network with those of the larger, more expensive networks and to train all those networks jointly with weight sharing. One important practical observation is that the batch-normalization parameters should not be shared between those filters in order to get good results. However, the most interesting and surprising observation, which is the main novelty of the work, is that even the highest-quality vision network gets substantially better with this training methodology compared to being trained alone without any weight sharing with the smaller networks, when trained for object detection and segmentation purposes (but not for recognition). This is a highly unexpected result and provides a new, unanticipated way of training better segmentation models. It is especially nice that the paper does not pretend that this phenomenon is well understood but leaves its proper explanation for future work. I think a lot of interesting work is to be expected along these lines.",9,5.0,ICLR2019
(If it is random sampling, the reported variance, which is close to 0, seems very strange.)
+ + +***Cons and constructive feedback*** + +In order from start to finish. + +In the abstract should differential be differentiable? + +I think a good paper to cite would be “Avoiding Pathologies in Very Deep Networks” (Duvenaud et al., 2014) who analyze deep kernels in Gaussian processes. While the underlying models are different, the kinds of qualitative results in this paper are very similar to the submission. + +I am concerned about the use of the Fourier spectrum to model the ReLU nonlinearity. Will there not be issues with the Gibb’s phenomenon? The discontinuous gradient will mean that a spectrum exists, but reconstructions are poor. + +Paragraph below equation 3: uniformely -> uniformly + +Equation 4: using t_j is confusing given that you use t in eqn 1. Please change to another symbol + +Eqn 6: Please define the autocorrelation symbol in the main text. + +Eqn 6: Please define z versus z_j + +Section 3.2 discussion: I would assume that while higher order autocorrelations would broaden the spectrum they would also smooth it out. For high orders it would like Gaussian-like in shape. This would not necessarily lead to blue-shifting. + +Section 3.3: therfore -> therefore + +Section 3.4 trivial -> trivially + +Section 3.5: Exponential downweighting. ResNets have combinatorially more medium length paths than short or long ones. So the average weight of a medium path is far higher than short or long ones. I would have liked to have seen a deeper analysis of this effect. + +Experiments: I found these very interesting. What is the motivation for only focussing on networks at initialization? I would have loved to have seen what a pertained network looks like. + +Are ensembles covered within the scope of this theory? They seem to have good performance but since each member is trained individually there is no smoothing of the training function, although the test loss function is smoother when all member models are combined. + + +***Post rebuttal review*** + +Having read the rebuttal, I am very happy with the author responses. My main concerns about the Gibb's phenomenon and the choice to consider blueshifting at initialization have been thoroughly addressed. It is clear to me that the authors have thought long and hard about the rebuttal and used it to improve their submission. Therefore I maintain that this is still a clear accept. + ",8,4.0,ICLR2021 +BkxSJJoJ9H,2,H1xJhJStPS,H1xJhJStPS,Official Blind Review #1,"I think it is an intriguing paper, but unfortunately left me a bit confused. I have to admit is not a topic I'm really versed in, so it might be that this affected my evaluation of the work. But also, as a paper submitted to ICLR I would expect the paper to be self-contained and be able to provide all the details needed. + +I do appreciate the authors providing in the appendix the proofs of the other theorems even if they come form other works. + + +The paper introduces C-EP, an extension of a previously introduced algorithm EP, such that it becomes biologically plausible. In particular EP is local in space but not in time (you need the steady state of the recurrent state after the first stage at the end of the second stage to get your gradients). I think this is fair, and the need for biological plausibility is well motivated in the beginning of the work. + +My first issue is with the proof for the equivalence between EP and C-EP. This is done by taking the limit of eta goes to 0. I think I must be missing something. 
But the proof relies on eta being small enough such that \theta_i = \theta (i.e. theta does not change). Given this state evolves the same way as for EP, because we basically not changing theta. +Yet the crux of my issue is exactly here. The proof relies on the fact that we don't change theta. So then when you converged on the second phase, isn't theta the same as theta_0? So you haven't actually learned anything!? Basically looking at the delta isn't this just misleading? +Ok lets assume that on the last step you allow yourself to change eta to be non-zero. (I.e. we are just after the delta in theta, and what to show we can get the same delta in theta as EP which is how the proof is phrased). Then in that difference aren't you looking at s_{t+1} and s_t rather than s_{t+1} and s_0, which is what EP would do? In EP you have s^\beta_* - s_*. This is not what you get if you don't update theta and apply C-EP? + +I think there might be something I'm missing about the mathematical argument here. + +At a higher-level question, we talk about the transition function F as being a gradient vector field, i.e. there exist a phi such that F is d phi/d theta. Why is this assumption biologically plausable ? Parametrizing gradient vector fields in general is far from trivial, and require very specific structure of the neural implementation of F to be true. Several works have looked at parametrizing gradient vector fields (https://arxiv.org/abs/1906.01563, https://arxiv.org/pdf/1608.05343.pdf) and the answer is that without parametrizing it by actually taking the gradient of a function there is not much of a choice. +Incidentally, here we exploit that F = sigma (Wx), with W symmetric. This is a paramtrization of a gradient vector field, i.e. of xU, where UU^T =W I think. But if you want to make F deep than it becomes non-trivial to restrict it to gradient vector field. Is the assumption that we never want to move away from vanilla RNNs? And W symmetric is also not biologically plausible. In C-EP you say is not needed to be symmetric, but that implicitly means there exist no phi and everything that follows breaks, no? + +I'm also confused by how one has access to d phi / ds and d phi / d theta. Again I feel like I'm missing information and the formalism is not introduced in a way that it is easy to parse. My understand is that you have an RNN that updates the state s. And the transfer function of this RNN is meant to be d phi / ds, which is trues if the recurrent weight is symmetric. Fine. But then why do we have access to d phi/ dtheta? Who is this function? Is the assumption that d s / dtheta is something we can compute in a biologically plausible way? Is this something that is obvious? + + + +",3,,ICLR2020 +B1x5phA1T7,3,HyN-M2Rctm,HyN-M2Rctm,Normalization method that assumes multi-modal distributions,"The authors proposed a normalization method that learns multi-modal distribution in the feature space. The number of modes $K$ is set as a hyper-parameter. Each sample $x_{n}$ is distributed (softly assigned) to modes by using a gating network. Each mode keeps its own running statistics. + +1) In section 3.2, it is mentioned that the MN didn't need and use any regularizer to encourage sparsity in the gating network. Is MN motivated to assign each sample to multiple modes evenly or to a distinct single mode? It would be better to provide how the gating network outputs sparse assignment along with the qualitative analysis. 
+ +2) The footnote 3 showed that individual affine parameters doesn't improve the overall performance. How can this be interpreted? If the MN is assuming multi-modal distribution, it seems more reasonable to have individual affine parameters. + +3) The overall results show that increasing the number of modes $K$ doesn't help that much. The multi-task experiments used 4 different datasets to encourage diversity, but K=2 showed the best results. Did you try to use K=1 where the gating network has a sigmoid activation?",5,4.0,ICLR2019 +Hkeo4WgCYr,2,rkeUcRNtwS,rkeUcRNtwS,Official Blind Review #3,"The paper introduces a new method in the family of local perturbation-based interpretations for deep networks and more specifically for fine-grained classification tasks. Compared to other saliency map methods (gradient-based, propagation-based, etc), this family has the advantage of needing only black-box access to the model. The introduced method GLAS, scans over an image and lights/shadows each part of the image to assign an importance score to different regions of the image based on the change in model's prediction. The motivation of this work is to increase the inherently low speed of (some) methods in this family and to give better explanations by + + + +I vote for rejecting this paper as the contributions to what already exists in the literature are not clear and the provided experimental results are not convincing. + +Compared to previous perturbation-based methods, the main advantage seems to be speed. There are perturbation-based methods that do not suffer from low speed e.g. Dabkowski & Gal and give real-time perturbation-based saliency maps. The authors do not mention this work (and similar works) and do not compare both their speed and their performance against it. + +One important problem (probably the most important) with the perturbation-based saliency maps is the fact that the perturbations might push a given image out of the true data manifold and therefore give an invalid interpretation of the model. The authors do not discuss the matter and how their method would address this issue. Intuitively, the introduced algorithm, more specifically the RGLAS algorithm, seems to suffer from this issue not any less than other existing methods. + +The experimental results seek to demonstrate the superiority of the introduced method over rival methods. The justification behind the provided visual examples is their focus on more discriminative features (in human eyes). This does not necessarily mean that a given saliency map is more ""truceful""; i.e. having a more visually appealing saliency map has nothing to do with a more truthful explanation of a model's decision making. The results focused on the mistakes of the model seem more convincing and interesting. + + The objective results first focus on the target localization metric which has traditionally been used in the literature. Although it is much faster to execute, the introduced method is only marginally superior to other methods. The most important problem, however, is that as mentioned above, there are fast methods in the literature and therefore a fair objective comparison is should include other methods as well. Secondly, the IOU measure is used. GLAS is not compared to other methods in this metric. + +The authors mention the effectiveness of their work for ""fine-grained"" classification tasks while throughout the paper there is no convincing evidence or discussion that the method is curated for such tasks. 
As mentioned above, changing the scale parameter to get more visually appealing saliency maps is not enough evidence. + +All in all, the contribution of this work over other existing methods in this family is not enough for this venue. + + +A few questions and suggestions: +* How should one adjust the scale parameter? In other words, what is the hyper-parameter search scheme for this method that would make it robust against a human-biased choice of hyper-parameters, which could result in visually more appealing saliency maps that do not necessarily explain the model? +* The explanation of RGLAS is not clear. +* For a general reader, metrics such as IOU should be explained more clearly. +* The paper has many, many typing errors.",1,,ICLR2020 +SJgdT5JY5S,3,BJx7N1SKvB,BJx7N1SKvB,Official Blind Review #3,"This paper analyzed the asymptotic training error of a simple regression model trained on random features for a noisy autoencoding task and proved that a mixture of nonlinearities can outperform the best single nonlinearity on such tasks. + +Comments: +1. The paper is well written and provides sound derivations for the theory. + +2. Since this area is outside my expertise, I'm not sure whether merely extending the work of Pennington & Worah (2017) to non-Gaussian data distributions is significant enough or not. + +3. Except for Fig 4, the other figures seem out of context. There is no explanation of the purpose of those figures in the main text. It is a bit hard for the audience to figure out what to look at in the figures or what the figures try to prove. + +4. In ""..., and our analysis actually extends to general such distributions, ... "", ""general"" should be ""generalize"". + +5. In ""And whether these products generate a medical diagnosis or a navigation decision or some other important output, .."", ""whether"" should be ""no matter"". + +6. ""..., they may not be large in comparison to the number of constraints they are designed asked satisfy."" should be ""... they are designed to satisfy"". +",6,,ICLR2020 +5h66gfRhjS,3,bQNosljkHj,bQNosljkHj,Unreadable,"This paper is basically unreadable. The sentence structure / grammar is strange, and if that were the only issue it could be overlooked. The paper also does not describe or explain the motivation and interpretation of anything, but instead just lists equations. For example, eta is the parameter that projects a spherical geodesic onto the ellipsoid one, and an ellipsoid geodesic prevents updates of the core-set towards the boundary regions where the characteristics of the distribution cannot be captured. However, what are these characteristics, and how can they motivate how to choose eta?",3,4.0,ICLR2021 +b_aq1LJhSKh,5,n7wIfYPdVet,n7wIfYPdVet,Good Paper with Sound Technique and Sufficient Experimental Support,"This paper pinpoints the key issues of Auxiliary Learning: (1) how to design useful auxiliary tasks, and (2) how to combine auxiliary tasks into a single coherent loss. Motivated by these issues, this paper proposes a novel Auxiliary Learning framework, named AuxiLearn. The paper is well organized and clearly written overall. + +Pros: +1. The motivation is straightforward. +2. The paper makes sound technical contributions. Adopting bi-level optimization in Auxiliary Learning makes sense. +3. The theoretical analysis supports the efficiency of the proposed method. +4. The experimental results and experimental analysis are plausible. + +Cons: +1. 
The efficiency of the hypergradient, which is the main shortcoming, should be discussed. +2. How to determine the number of iterations J in Alg. 2 is not given. +",7,3.0,ICLR2021 +1-KP5_GnsC1,1,tEw4vEEhHjI,tEw4vEEhHjI,Interesting idea for covariance correction to improve UQ,"Working under the Bayesian Neural Network setting, the authors proposed a way to inflate the resulting posterior covariance, so as to get improved predictive uncertainty quantification for new data points that are far from the training set, while also maintaining similar performance when these new data are close to the training set. + +Overall I think the writing is clear, but I had to revisit previous sections in order to tie up and understand all the various approximation schemes used. I find the derivation of the double-sided cubic spline and the use of infinitely many ReLU's to boost the posterior covariance novel and interesting. This method has the potential to be applied to other learning algorithms to get improved confidence statements, such as Variational Bayes, where it is known to produce overconfident output. + +Despite the authors' claim of doing an extensive theoretical analysis, I find the theoretical arguments quite heuristic in some places. It would be more informative to quantify the approximation errors incurred when using the various approximation methods discussed in this paper, in particular network linearization through Taylor's theorem. It seems to me that you are just treating the neural network $f$ as a typical differentiable function and ignoring the network structure within it by just writing $\approx$. + +The level-wise RGPR covariance kernel in (6) does not seem to be correct. For level 1, you have +$\boldsymbol{h}^{(1)} (\boldsymbol{x}_{*})$. + +For level 2 however, $\boldsymbol{h}^{(2)}$ is obtained by $g(W\boldsymbol{h}^{(1)}+\boldsymbol{c})$ where $W$ is the weight matrix at level 1, $\boldsymbol{c}$ is the bias and $g$ is some activation function applied entry-wise, e.g., ReLU. Since the entries of $W$ are part of $\boldsymbol{\theta}$, the parameters for the entire network, $W$ is random because $\boldsymbol{\theta}$ is assigned a prior. This implies that $\boldsymbol{h}^{(1)}$ and $\boldsymbol{h}^{(2)}$ are dependent, and likewise for higher levels. Hence the covariance kernel of $\hat{f}=\hat{f}^{(0)}+\cdots+\hat{f}^{(L-1)}$ is not just the sum of the individual kernels but something more complicated, because they are now dependent due to $\boldsymbol{h}_{*}$. Can the authors please clarify this? + +I find the addition of a kernel function to do the correction quite ad hoc and not very Bayesian. Is it possible to incorporate this term into the prior? + +Some other comments: +1. Section 2.2, line 6, $c_d$ should be $c_D$. Also, is this the same $D$ as the dimension of $\boldsymbol{\theta}$, the network weights? Then taking $D\to\infty$ means you have infinite weights? + +2. (3) does not seem to cover $0$. + +3. In Proposition 1, $\boldsymbol{\mu}$ and $\mathrm{\Sigma}$ are the mean and covariance of the approximate posterior of $\boldsymbol{\theta}$, and for $\boldsymbol{g}_{*}$, $\boldsymbol{0}$ should be $\boldsymbol{\mu}$. + +",5,4.0,ICLR2021 +SJ7PzWDeM,1,rkr1UDeC-,rkr1UDeC-,Promising and interesting direction to scale distributed training,"This paper provides a very original & promising method to scale distributed training beyond the current limits of mini-batch stochastic gradient descent. 
As the authors point out, scaling distributed stochastic gradient descent to more workers typically requires larger batch sizes in order to fully utilize computational resources, and increasing the batch size has diminishing returns. This is clearly a very important problem, as it is a major blocker for current machine learning models to scale beyond the size of models and datasets we currently use. The authors propose to use distillation as a mechanism of communication between workers, which is attractive because prediction scores are more compact than model parameters, model-agnostic, and can be considered more robust to out-of-sync differences. This is a simple and sensible idea, and the empirical experiments convincingly demonstrate the advantage of the method in large-scale distributed training. + +I would encourage the authors to experiment in broader settings, in order to demonstrate the general applicability of the proposed method, and also to help readers better understand its limitations. The authors only provide a single positive data point: that co-distillation was useful in scaling up from 128 GPUs to 258 GPUs, for a particular language modeling problem (commoncrawl) which others have not previously studied. In order for other researchers who work on different problems and different system infrastructure to judge whether this method will be useful for them, however, they need to understand better when codistillation succeeds and when it fails. It would be more useful to provide experiments with smaller and (if possible) larger numbers of GPUs (16, 32, 64, and 512?, 1024?), so that we can more clearly understand how useful this method is in the regime where mini-batch stochastic gradient descent continues to scale. Also, more diversity of models would help in understanding the robustness of this method to the choice of model. Why not consider ImageNet? Goyal et al. report that it took an hour for them to train ResNet on ImageNet with 256 GPUs, and the authors may demonstrate it can be trained faster. + +Furthermore, the authors briefly mention that staleness of parameters up to tens of thousands of updates did not have any adverse effect, but it would be good to know how the learning curve behaves as a function of this delay. Knowing how much delay we can tolerate will motivate us to design different methods of communication between teacher and student models.",8,4.0,ICLR2018 +rJAgO6KlM,2,rkmu5b0a-,rkmu5b0a-,review for MGAN,"Summary: + +The paper proposes a mixture of generators to train GANs. The generators have tied weights, except that the first layer mapping the random codes is generator-specific; hence no extra computational cost is added. + + +Quality/clarity: + +The paper is well written and easy to follow. + +clarity: The appendix, not the main paper, states how the weight tying is done, which might confuse the reader; it would be better to state in the main text that the weight tying keeps the first layer free. + +Originality: + + Using multiple generators for GAN training has been proposed in many previous works that are cited in the paper; the difference in this paper is the weight tying between the generators of the mixture, with the first layer kept free for each generator. 
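To make my reading of the tied-weight setup concrete, here is a minimal sketch of the architecture as I understand it (module and variable names, layer sizes, activations and the bias-free first layer are my own assumptions, not the authors' code): K free input layers that map z, followed by one shared trunk.

import torch
import torch.nn as nn

class TiedMixtureGenerator(nn.Module):
    # K "generators" that share every layer except the first linear map of the noise.
    def __init__(self, k=10, z_dim=100, hidden=256, out_dim=784):
        super().__init__()
        # One free (untied) input layer per mixture component; no bias, per my reading,
        # which is why I ask below about also learning a mean.
        self.input_layers = nn.ModuleList(
            [nn.Linear(z_dim, hidden, bias=False) for _ in range(k)]
        )
        # Shared trunk: identical weights for every component.
        self.trunk = nn.Sequential(
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
            nn.Tanh(),
        )

    def forward(self, z, component):
        h = self.input_layers[component](z)  # the only component-specific computation
        return self.trunk(h)                 # everything after this is shared

gen = TiedMixtureGenerator()
z = torch.randn(8, 100)
x0 = gen(z, component=0)  # sample from "generator" 0
x3 = gen(z, component=3)  # same trunk, different linear map of z
print(x0.shape, x3.shape)

If this sketch matches the implementation, then selecting a component only changes the linear map applied to z before the shared network, which is why I read the model as a single generator with a learned multimodal prior on z; I expand on this point below.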
+ +To hammer this point, note that Section 3 discusses removing ""invalid"" expressions: log(0) or sqrt(-2). However, it is the manipulation of infinity and imaginary numbers that could be considered to be one of the greatest achievements of symbolic mathematics over the last couple of hundred years. It is reasonable to expect neural nets to do this one day, because humans can, but this should come with results. It's too early to make the claim in the paper title. + +Sentences such as ""This suggest (sic) that some deeper understanding of mathematics has been achieved by the model."" and ""These results are surprising given the incapacity of neural models to perform simpler tasks ..."" are speculative, potentially inaccurate and likely to increase hype. This hype is not needed. + +Hype and over-claiming aside, I did enjoy reading this paper. The public commenters have already asked important questions about methodology and related work on neural programming that the authors have addressed in comments. I look forward to these being incorporated in the revised pdf. + +A big part of the paper is about generating the datasets, and I therefore sympathise with the comment about requesting either a dataset release or the generating code. I see no obvious ethical concerns in this case, and the authors have already kindly offered to do this. This is a commendable and important service to our community and for this alone I would be inclined to vote for acceptance at ICLR. + +The paper is clear and well written. However (i) it would be good to show several examples of input and output sequences (as done already in this website) and (ii) the Experiments section needs work. I'll expand on this next. + +The seq2seq transformer with 8 heads, 6 layers and dimensionality 512 is a sensible choice. The authors should however explain why they expect this architecture to be able to map the sequences they adopt. That is, it is well known that a deep neural network is just a skeleton for an algorithm. By estimating the parameters, we are coming up with (fitting) the algorithm for the given datasets. What is the resulting algorithm? Why are 6 layers enough? Here some visualization would be helpful. See for example https://arxiv.org/pdf/1904.02679.pdf and https://arxiv.org/pdf/1906.04341.pdf For greater understanding of the problem, it may be useful to also try sparse transformers eg https://arxiv.org/abs/1805.08241 + +Beam search is a crucial component of the current solution. However, the authors simply cite Koehn 2004 for this. First, that work used language models to compute probabilities for beam search. I assume no language models are used in this case. What I'm getting to is that there are not enough details about the beam search in this paper. The authors should include pseudocode for the beam search and give a few examples. The paper (even better thesis) of Koehn is a good template for what should be included. This is important and should be explained. + +For Mathematica, it would be useful to state it does other things and has not been optimized for the two tasks addressed in this paper only. It would also be useful, now that you have more time, to run it for a week or two and get answers not only for 30s but also for 60s. How often does it take longer than 30s? How do you score it then? + +Please do include train and test curves. This would be helpful too. I will of course consider revising my score once the paper is updated. + +Thanks for constructing this dataset and writing this paper. 
It is very interesting and promising. + + + + +",6,,ICLR2020 +HkxtvrkCKr,2,HygQ7TNtPr,HygQ7TNtPr,Official Blind Review #2,"This paper proposes two rules for efficient training of quantized networks by investigating the scale of the logit values and gradient flow. The authors claim that accuracy degradation of recent quantization methods results from the violation of these two rules. + +One of my main concerns is that the analysis of the rules for weight and activation quantization are separated. E.g., the analysis of weight quantization in Section 3.2 is based on eq (1a)-(1d) where no activation quantization is considered. In this case, does the analysis still hold when applying weight and activation are quantized simultaneously? + +Moreover, the analysis is only suited for a limited range of quantization methods. In the proposed SAT, the authors propose to multiplies the normalized weight with the square root of the reciprocal of the number of neurons in the linear layer, to make up for the variance difference caused by quantization. However, this increase indeed depends on the initialization of the weights. If the weights are not sampled from ""a Gaussian distribution of zero mean and variance proportional to the reciprocal of the number of neurons"" as at the end of page 5, then this recipe may not work any longer. Moreover, the proposed SAT seems to be only suited for the specific quantization function for Dorefa-Net in (5), what about many other recent quantization functions that do not need this kind of clamping? + +Others: +1. The citation format is wrong. +2. In the abstract, ""Recent quantization approaches violates ... and results ..."" => ""Recent quantization approaches violate ... and result ..."" +3. What is the ""scaling factor in Eq. (3)"" before the subsection ""Efficient Training Rule II (ETR II)""? +4. Keep the same number of decimal places in the tables.",3,,ICLR2020 +I07Ms4UK0e3,4,BvrKnFq_454,BvrKnFq_454,Adam-type step size adjustment on the fly,"Summary: This paper proposes the Expectigrad algorithm that normalizes the exponential moving average (EMA) of first moments on the fly. This avoids normalizing historical gradients by future gradients. The normalization factor is an unweighted average, instead of an EMA, of the historical second moments. For the special case where the EMA constant of the first moment is zero, the paper shows that Expectigrad converges to the optimum on the online convex problems proposed by Reddi et al. (2018) for which the vanilla ADAM fails to converge. For general stochastic smooth convex problems with bounded stochastic gradients, the paper shows that the convergence rate of mini-batch Expectigrad is $O(1 / \sqrt{T} + 1 / b)$ where $b$ is the mini-batch size. + +Pros: +(1) The idea of normalizing on the fly is interesting because it is more sensitive when the gradient dynamic is non-stationary. + +(2) The algorithm performs well on the examples considered in the paper. + +Concerns: +(1) Theorem 1 assumes $\beta = 0$ ""for simplicity"". I think this is too weak to justify the algorithm. When $\beta = 0$, the algorithm is almost identical to AdaGrad, except that AdaGrad does not take the sparsity into account and sets $n_{t} = t$. On the Reddi problem, the gradient is not sparse, and thus Expectigrad with $\beta = 0$ is equivalent to AdaGrad, if I am not mistaken. Reddi et al. (2018) has already shown that AdaGrad does not diverge on the counterexamples. 
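To spell out why I read the $\beta = 0$ case this way, here is a toy paraphrase of the two updates (my own notation and simplifications, e.g. I ignore bias correction and the exact epsilon placement; this is meant only to illustrate the comparison, not to reproduce the authors' pseudocode):

import numpy as np

def adagrad_step(x, g, state, lr=0.1, eps=1e-8):
    # AdaGrad divides by the root of the running *sum* of squared gradients.
    state["s"] = state.get("s", np.zeros_like(x)) + g ** 2
    return x - lr * g / (eps + np.sqrt(state["s"])), state

def expectigrad_step_beta0(x, g, state, lr=0.1, eps=1e-8):
    # My reading of the beta = 0 case: no first-moment averaging, and the
    # denominator is the root of the *average* of past squared gradients,
    # counting only the steps where the coordinate's gradient was nonzero.
    state["s"] = state.get("s", np.zeros_like(x)) + g ** 2
    state["n"] = state.get("n", np.zeros_like(x)) + (g != 0)
    mean_sq = state["s"] / np.maximum(state["n"], 1)
    return x - lr * g / (eps + np.sqrt(mean_sq)), state

On the Reddi et al. counterexample the gradients are dense, so the per-coordinate counter simply counts iterations and the two updates coincide up to a time-dependent rescaling of the effective step size; that is the sense in which I view the $\beta = 0$ result as already covered by the existing AdaGrad analysis.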
In order to justify the algorithm, it is necessary to prove a similar result for the case $\beta > 0$, since it is set to be $0.9$ in the experiments. Reddi et al. (2018) have results of this kind (which only requires $\beta < \sqrt{\beta_2}$ where $\beta_2$ is the EMA constant of the second moment). + +(2) The paper claims (in the bottom of page 6) that Expectigrad has strictly better complexity than Yogi. I do not see why this is the case. The convergence rate of Yogi is $O(1 / T + 1 / b)$ (their Corollary 4) while that of Expectigrad if $O(1 / \sqrt{T} + 1 / b)$. The latter is worse. To take one step further, in order to have $E ||\nabla f(x)||\le \epsilon$, we need to set the convergence rate as $\epsilon^2$ since the bound is proved for $E ||\nabla f(x)||^2$. For Yogi, the mini-batch size $b$ needs to be $O(1 / \epsilon^2)$ and the number of iterations $T$ needs to be $O(1 / \epsilon^2)$ as well. So the overall complexity is $O(1 / \epsilon^4)$ which matches the complexity of SGD. However, for Expectigrad, the mini-batch size $b$ needs to be $O(1 / \epsilon^4)$ and the number of iterations $T$ needs to be $O(1 / \epsilon^2)$. The overall complexity is $O(1 / \epsilon^6)$, which is much higher than SGD. I do not understand why Expectigrad is better than Yogi or even SGD. + +(3) The footnote in page 2 states that ""This limitation is not specific to our work but affects convergence results for all first-order methods, including SGD"". This is incorrect. Under assumption 2 of the bounded stochastic gradient, SGD converges for non-smooth functions.",5,3.0,ICLR2021 +HkxOeWlCYS,1,H1gNOeHKPS,H1gNOeHKPS,Official Blind Review #2,"The authors propose the Neural Multiplication Unit (NMU), which can learn to solve a family of arithmetic operations using -, + and * atomic operations over real numbers from examples. They show that a combination of careful initialization, regularization and structural choices allows their model to learn more reliably and efficiently than the previously published Neural Arithmetic Logic Unit. + +The NALU consists of two additive sub-units in the real and log-space respectively, which allows it to handle both additions/subtractions and multiplications/divisions, and combines them with a gating mechanism. The NMU on the other hand simply learns a product of affine transformations of the input. This choice prevents the model from learning divisions, which the authors argue made learning unstable for the NALU case, but allows for an a priori better initialization and dispenses with the gating which is empirically hard to learn. The departures from the NALU architecture are well justified and lead to significant improvements for the considered applications, especially as far as extrapolation to inputs outside of the training domain. + +The paper is mostly well written (one notable exception: the form of the loss function is not given explicitly anywhere in the paper) and well executed, but the scope of the work is somewhat limited, and the authors fail to properly motivate the application or put it in a wider context. + +First, divisions being difficult to handle does not constitute a sufficient justification for choosing to exclude them: the authors should at the very least propose a plausible way forward for future work. More generally, the proposed unit needs to be exposed to at least 10K examples to learn a single expression with fewer than 10 inputs (and the success rate already drops to under 65% for 10 inputs). What would be the use case for such a unit? 
Even the NMU is only proposed as a step on the way to a more modular, general-purpose, or efficient architecture, its value is difficult to gauge without some idea of what that would look like. + +",6,,ICLR2020 +BJxzPorY3Q,1,ByGq7hRqKX,ByGq7hRqKX,Promising results in cross-task transfer. Missing references to prior works,"This work proposes to train an RL-based agent to simultaneously learn Embodied Question Answering and Semantic Goal Navigation on the ViZDoom dataset. The proposed model incorporates visual attention over the input frames, and also further supervises the attention mechanism by incorporating an auxiliary task for detecting objects and attributes. + +Pros: +-Paper was easy to follow and well motivated +-Design choices were extensively tested via ablation +-Results demonstrate successful transfer between SGN, EQA, and the auxiliary detection task + +Cons: +-With the exception of the 2nd round of feature gating in equation (3), I fail to see how the proposed gating -> spatial attention scheme is any different from the common inner-product based spatial attention used in a large number of prior works, including [1], [2], and [3] and many more. +-The use of attribute and object recognition as an auxiliary task for zero-shot transfer has been previously explored in [3] + + +Overall, while I like the results demonstrating successful inductive transfer across tasks, I did not find the ideas presented in this work to be sufficiently novel or new. + +[1] Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering, Huijuan Xu, Kate Saenko +[2] Drew A. Hudson, Christopher D. Manning, Compositional Attention Networks for Machine Reasoning +[3] Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks, Tanmay Gupta, Kevin Shih, Saurabh Singh, Derek Hoiem",5,5.0,ICLR2019 +pV6GmG90MmX,1,HkUfnZFt1Rw,HkUfnZFt1Rw,An interesting yet limited comparative review,"Using 7500 LFR-generated graphs as a benchmarks suite, the authors compare 25 graph clustering measures, determining the best measure for every area of the parameter space. The paper is well written, mathematically sound and interesting, and definitely useful to the graph theory community. However, as acknowledged by the authors, the study is limited by the structure of the benchmark suites, which is restricted to networks that can be generated by LFR rules. Overall, I rate it as a weak accept. + +Pros: +- the analysis is clear and grounded, with a sufficient level of mathematical details +- the authors point out a clear winner out of the set of compared metrics + +Cons: +- the amount of novelty in the manuscript is limited +- the benchmark suite is very specific, and no real world example (that would have added great value to the submission) is provided +",6,4.0,ICLR2021 +BkOtNnKxM,2,SygwwGbRW,SygwwGbRW,Exciting idea but disappointing implementation (revised),"*** Revision: based on the author's work, we have switched the score to accept (7) *** + +Clever ideas but not end-to-end navigation. + +This paper presents a hybrid architecture that mixes parametric (neural) and non-parametric (Dijkstra's path planning on a graph of image embeddings) elements and applies it to navigation in unseen 3D environments (Doom). The path planning in unseen environments is done in the following way: first a human operator traverses the entire environment by controlling the agent and collecting a long episode of 10k frames that are put into a chain graph. 
Then loop closures are automatically detected using image similarity in feature space, using a localization feed-forward ResNet (trained using a DrLIM-like triplet loss on time-similar images), resulting in a topological graph where edges correspond to similar viewpoints or similar time points. For a given target position image and agent start position image, a nearest neighbor search-powered Dijkstra path planning is done on the graph to create a list of waypoint images. The pairs of (current image, next waypoint) images are then fed to a feed-forward locomotion (policy) network, trained in a supervised manner. + +The paper does not discuss at all the problems arising when the images are ambiguous: since the localisation network is feed-forward, surely there must be images that are ambiguously mapped to different graph areas and are closing loops erroneously? The problem is mitigated by the fact that a human operator controls the agent, making sure that the agent's viewpoint is clear, but the method will probably fail if the agent is learning to explore the maze autonomously, bumping into walls and facing walls. The screenshots on Figure 3 suggest that the walls have a large variety of textures and decorations, making each viewpoint potentially unique, unlike the environments in (Mirowski et al, 2017), (Jaderberg et al, 2017) and (Mnih et al, 2016). + +Most importantly, the navigation is not based on RL at all, and ignores the problem of exploration altogether. A human operator labels 10k frames by playing the game and controlling the agent, to show it how the maze looks like and what are the paths to be taken. As a consequence, comparison to end-to-end RL navigation methods is unclear, and should be stressed upon in the manuscript. This is NOT a proper navigation agent. + +Additional baselines should be evaluated: 1) a fully Dijkstra-based baseline where the direction of motion of the agent along the edge is retrieved and used to guide the agent (i.e., the policy becomes a lookup table on image pairs) and 2) the same but the localization network replaced by image similarities in pixel space or some image descriptor space (e.g., SURF, ORB, etc…). It seems to me that those baselines would be very strong. + +Another baseline is missing: (Oh et al, 2016, “Control of Memory, Active Perception, and Action in Minecraft”). + +The paper is not without merit: the idea of storing experiences in a graph and in using landmark similarity rather than metric embeddings is interesting. Unfortunately, that episodic memory is not learned (e.g., Neural Turing Machines or Memory Networks). + +In summary, just like the early paper released in 2010 about Kinect-based RGBD SLAM: lots of excitement but potential disappointment when the method is applied on an actual mobile robot, navigating in normal environments with visual ambiguity and white walls. The paper should ultimately be accepted to this conference to provide a baseline for the community (once the claims are revised), but I street that the claims of learning to navigate in unseen environments are unsubstantiated, as the method is neither end-to-end learned (as it relies on human input and heuristic path planning) nor capable of exploring unseen environments with visual ambiguity.",7,5.0,ICLR2018 +QvLIOScP5Q,3,Ux5zdAir9-U,Ux5zdAir9-U,A refinement of synthetic knowledge graph benchmarks,"This work proposes a method for generating synthetic datasets for testing path-based (knowledge) graph completion. 
Until recently, there were not many good benchmarks for evaluating reasoning with learned rules or learned knowledge, but there has been a fair amount of work on developing benchmarks for this lately. The synthetic datasets generated here are distinguished by the ability to produce datasets that share a controllable amount of rules. This permits the benchmarks to be used to evaluate multitask learning, robustness to distribution shift, etc. As an illustration of this, the paper includes experiments with a variety of baseline methods showing how (a) the generalization ability of various methods grows and then declines as the number of tasks is increased and (b) fine-tuning on diverse tasks improves accuracy. They also show that in a continual learning setting, the baseline methods all exhibit catastrophic forgetting. + +There is value in being able to control the ""relatedness"" of these synthetic tasks in a principled way (the experiments are a good illustration of how this may be used), so I am leaning towards acceptance. My hesitation is that it's a bit on the incremental side, and seems oversold in a few places, as follows: + +The work suggests that the proposed benchmarks examine the ""compositional generalization"" abilities of models, but it is not clear to me how this is so. The relations are drawn from a fixed set of types. It's true that the rules may examine new combinations of relations, but the end result is just that one of the existing relations is inferred to hold between the start and end of the path. The fact that those rules hold for all bindings is a kind of generalization, but I would not call it ""compositional."" Compositional generalization would mean something more like learning how to map a grammatical structure to the logical form correctly, irrespective of what was in the constituent phrases. + +Also, the work says ""GraphLog is the only dataset specifically designed to test logical generalization capabilities on graph data."" That seems a little bit of a stretch. CLUTTR seems to feature a very similar set of synthetic benchmarks, the only difference is that CLUTTR *additionally* produces some text, but the abstract focus of the task was an essentially similar knowledge graph completion task. Placing this work side-by-side with CLUTTR, the difference is really (just) in being able to tune the degree of shared structure between multiple datasets. That's valuable, I agree. But this one difference is only discussed by a table featuring four X's for each of the variants of the task setup that this feature enables. The lack of a proper discussion of related work obscures the actual extent of the contribution.",6,3.0,ICLR2021 +rJAgO6KlM,2,rkmu5b0a-,rkmu5b0a-,review for MGAN,"Summary: + +The paper proposes a mixture of generators to train GANs. The generators used have tied weights except the first layer that maps the random codes is generator specific, hence no extra computational cost is added. + + +Quality/clarity: + +The paper is well written and easy to follow. + +clarity: The appendix states how the weight tying is done , not the main paper, which might confuse the reader, would be better to state this weight tying that keeps the first layer free in the main text. + +Originality: + + Using multiple generators for GAN training has been proposed in many previous work that are cited in the paper, the difference in this paper is in weight tying between generators of the mixture, the first layer is kept free for each generator. 
+ +General review: + +- when only the first layer is free between generators, I think it is not suitable to talk about multiple generators, but rather it is just a multimodal prior on the z, in this case z is a mixture of Gaussians with learned covariances (the weights of the first layer). This angle should be stressed in the paper, it is in fine, *one generator* with a multimodal learned prior on z! + +- Taking the multimodal z further , can you try adding a mean to be learned, together with the covariances also? see if this also helps? + +- in the tied weight case, in the synthetic example, can you show what each ""generator"" of the mixture learn? are they really learning modes of the data? + +- the theory is for general untied generators, can you comment on the tied case? I don't think the theory is any more valid, for this case, because again your implementation is one generator with a multimodal z prior. would be good to have some experiments and see how much we loose for example in term of inception scores, between tied and untied weights of generators. +",7,5.0,ICLR2018 +A9lAyKvzMlt,1,u846Bqhry_,u846Bqhry_,Official Blind Review #1,"**Overview:** + +The paper considers the issue of gradient distortion in long-tailed recognition including shifted gradient direction towards data-rich classes and the enlarged variance introduced by data-poor classes. It proposes to disentangle the data-rich and data-poor classes and train a model via a dual-phase learning process. The experimental results proved the effectiveness of the method. + +**Strengths:** + +The observation is interesting and the method is reasonable. The memory retentive loss via graph matching in the second phase makes sense, and the method is well evaluated in four commonly used datasets. + + +**Weaknesses, questions, and suggestions:** + +1. One of my main concerns is that the separation of the data-rich and data-poor classes is unnatural because the long-tailed distribution is continuous. The main idea of the method is kind of similar to that of Gidaris et al.[1] as the authors also cited, but in the few-shot setting in their paper, the separation of base and novel classes is more natural and reasonable. +The issue of gradient distortion is addressed and is the motivation of the method, however, the variety in gradient does not necessarily lead to worse performance, and even when the data-rich classes are extracted for training in Phase I, the imbalance and shifting to the relatively rich classes still exists while being less informative because of fewer classes and less data. The trade-off is not easy to balance and could be sensitive to the disentanglement points as in Figure 3, where 294 and 864 are carefully selected for Places-LT and ImageNer-LT This introduces limitation to the generality of the method. + +2. Are the formations of equation (2) and (5) presented correctly? Maybe the authors intend to present \sum_i{exp{...}} as the denominator? + +3. The presentation and clarity need to be improved. For example, + +- The caption for Figure 1 is too long and could be more concise, e.g., to use 'titles' to distinguish CIFAR100 and CIFAR100-LT; to use grad_rich / grad_poor or something as the legend and save any extra explanation. Some of the details can be embedded in the main paper, and I’m not sure how are the gradient statistics calculated, layer-by-layer, or directly use the average of the whole network. It would be better to see it clarified in the paper or appendix, and sorry if I missed it. 
+ +- There are double y-axis in Figure 2 and 3, but there is no legend or caption showing each plot is assigned to which axis. + +- What does 'extended parameters' mean in the experiment section? + +4. The performance improvement seems not significant compared with previous works, e.g. ImageNet-LT on ResNet-10 and iNaturalist. + +5. I’m not sure I’ve correctly comprehended the construction of the exemplar memory bank and the reason why use s(c_j + \Delta, X_1) to search for the new entry. More explanation for Equation (4) and a clearer presentation of the algorithm (e.g. use an Algorithm module) is preferred. + +[1] Gidaris et al., Dynamic few-shot visual learning without forgetting, in CVPR 2018 + +**Post-Rebuttal** + +After reading the rebuttal and other reviewers' comments, I am actually on the fence for this submission. On the one hand, it provides several interesting observations about the phase transition in long-tailed recognition, which would be valuable to the community. On the other hand, its experimental evaluation needs to be strengthened. The authors are encouraged to include more many-shot/medium-shot/few-shot analysis across the dual phases. + +Therefore, I upgrade my score to 6 (marginally above acceptance threshold). + +",6,5.0,ICLR2021 +2rpJ-VjoSVL,3,IG3jEGLN0jd,IG3jEGLN0jd,Well written theoretical topic modeling paper,"This submission considers contrastive learning approach to representation learning under topic modeling assumptions. It proves that the proposed procedure can recover a representation of documents that reveals their underlying topic posterior information in case of linear models. It is experimentally demonstrated that the proposed procedure performs well in a document classification task with very few training examples in a semi-supervised setting. + +The idea behind the proposed procedure is to split randomly sampled documents into two parts and either keep them as is with the positive label or replace one of the parts with a randomly sampled document and assign the negative label. These data are then passed to the binary cross entropy loss which is optimised to find a transformation that assigns the probability of co-occurence of the two parts. + +The main theoretical result comprises the population case, where infinite amount of data is sampled from the described generative model under an additional anchor word assumption, and proves that in such case the outcome of the proposed contrastive learning procedure is linearly related to all moments of the topic posterior up to a cerain degree. The finite sample case is futher analysed. + +Synthetic experiments illustrate performance of the proposed procedure when the data is sampled from a topic model with varying degree of sparsity. It is observed, as expected, that the quality of topic recovery is better when the degree of sparsity is higher. + +The algorithm is further compared on the AG news topic classification dataset with standard benchmarks and it is observed that the performance of the proposed procedure is good and overall comparable with word2vec, being slightly better in the lower sample size setting. The authors thereofre conclude that the proposed procedure can be interesting in semi-supervised learning applications with low amount of data. + +Overall, the submission is a solide pice of work and well written. 
The results are theoretically sound and the observed improvement in the low sample size setting could be of interest in some challenging applications and might be worth further investigating. This line of research could be interesting to the theoretical topic modelling community. + +typo p. 7: all of these methods all of these methods +",7,4.0,ICLR2021 +odDKQp_erRV,3,INhwJdJtxn6,INhwJdJtxn6,Review,"The paper studies a pre-training approach to reinforcement learning. The objective is, first to pre-train a model considering that, without reward, interaction with an environment is cheap, and second, to fine-tune a policy given a particular reward function. + +As a first contribution, the paper proposes two strategies to use a pre-trained policy for discovering an efficient task-dependent policy: i) the action set is expanded such that the policy can choose to follow the pre-trained policy and ii) exploration may be done by following the pre-trained policy on t timesteps, t being randomly chosen. As a second contribution, the paper proposes a criterion to pre-train a policy based on a coverage criterion. The principle is to encourage a policy to generate trajectories that are going through as many states as possible. + +Experiments are made on different environments. First, the pre-trained policies are evaluated based on how much reward they are able to collect, and compared to other unsupervised approaches. Second, the final policy is compared to epsilon-greedy and epsilon-z-greedy approaches. In the two cases, the proposed approach outperform the baselines. + +== Comments: +First of all, the paper clearly lacks of details, and it is difficult to be sure about what is really done, and what is the final algorithms. As far as I understand, instead of proposing a really new approach, the paper is more stacking two approaches (i.e NGU and R2D2, just changing a little bit the action space) and it does not really provide any justification about what is done. From my point of view, it is more a paper investigating if coverage may be a good unsupervised training criterion than a paper presenting a new model. + +Concerning the exploration/exploitation of pre-train policies, the paper does not really describe how the 'flights' are generated (section 3). I would advise the authors to provide more details on this aspect of their algorithm. + +Concerning the coverage approach, I am not convinced that the paper allows us to draw any conclusion on the interest of using such a criterion during the pre-training phase. Indeed, the authors are mainly evaluating their approach on Atari games, on which there is a clear relationship between the length of the episodes and the final score achieved by an agent. This is what is shown in Section 5.1: coverage is a good surrogate objective for solving atari games. The article is thus lacking evaluation on other types of environments, and the performance obtained by the model on Atari games is mainly due to the use of the NGU model which has been developed more specifically for this type of environment. + +To conclude, I think that the paper is failing to provide evidence that the coverage approach is a good approach to unsupervised pre-training of policies. Moreover, I have the feeling that the coverage criterion may be good for particular types of environments (like Atari games), but not for some others, making the proposed approach very specific to particular problems. Combined with the lack of novelty, and the lack of details, I recommend to reject the paper. 
+ +Considering the answers from the authors, I decided to not change my score. ",4,4.0,ICLR2021 +r1gblS6iFr,1,B1liraVYwr,B1liraVYwr,Official Blind Review #2,"Contributions: + +The main contribution of this paper lies in the proposed LocalGAN for neural response generation. The key observation is that for a given query, there always exists a group of diverse responses that are reasonable, rather than a single ground-truth response. Therefore, the local semantic distribution of responses given a query should be modeled. Besides the original GAN loss, the proposed LocalGAN adds an additional local-distribution-oriented objective, resulting in a hybrid loss for training, which claims to achieve better performance on response generation datasets. + +Strengths: + +I think the proposed model contains some good intuitions, that is, the generated responses should be modeled as a local distribution, rather than a single ground-truth output during training. The motivation of this paper is therefore clear. Experimental results in Table 1 seems encouraging. + +However, I would have to say that the current draft is poorly presented. There are a lot of unclear parts that should be more carefully clarified, with details below. + +Weaknesses: + +(1) Writing: I think the language in this paper is repetitive, and can be much more precise and concise. Also, there are typos here and there throughout the whole draft. I would suggest the authors doing a careful proofreading before next submission. + +Minor: in the line before Eqn. (4), change ""SIMPLY"" to ""simply"". + +(2) Clarity: Overall, the presented method is unclear. + +a) It is not entirely clear what the authors mean by saying ""this paper has given the theoretical proof of the upper-bound of the adversarial training ..."". I am not sure whether Eqn. (6) is totally correct, or at least how useful it is. +b) The notations throughout the paper is a little bit confusing. The authors should normalize all the notations to be consistent. +c) It is not clear what Eqn. (3) truly means. What is the value for s? The KL divergence should take two distributions as input, but here, the input are two triplets. +d) In the line below Eqn. (6), what is \tilde{R}_q? This is not defined. +e) The proposed method relies on the use of R_q. However, how to define, or learn R_q is not clear. In the dataset, given a given query q, how do we find R_q? +f) It is not clear why Deep Boltzmann Machines are needed here. I'd like the authors to more clearly clarify this design. Further, since DBM is used, then how the final model is trained together? Now, the models contains both adversarial learning, and contrastive-divergence-based algorithms for DBM training. This seems make the whole model training more unstable. +g) Generally, I think Section 3 and Section 4 are hard to follow. Further, I did not see how useful Lemma 1 & 2 and Theorem 1 are. The final objective Eqn. (17) is also confusing. + +(3) Experiments: My biggest concern about the experiments is that human evaluation should be conducted, given the subjective nature of the task. This is lacked in the current draft. Only reporting numbers like Table 1 is not convincing. + +** This paper provides a link that actually links to a github repo. I am not sure whether this violates the policy of ICLR submissions or not. But at least from my point of review, this link should be anonymized. 
**",3,,ICLR2020 +SyxSXmjEYB,1,SJl9PTNYDS,SJl9PTNYDS,Official Blind Review #1,"The paper introduces a new geometric convolution that can be applied to point clouds approximating a 2D manifold in R^3. Similar to previous work, the continuous mathematical description of the method involves a kernel that is a function on R^2, which is pushed to the tangent spaces of the manifold via a choice of frame u. A local neighbourhood around the kernel is parameterized by the tangent space via the exponential map and matched against the kernel. The frame u, which determines the orientation of the kernel, is chosen by first picking an origin and then setting u1 to the direction of steepest ascent of the function that measures geodesic distance to the origin. The other basis vector u2 is computed as u1 x n where n is the normal vector. In a practical implementation, the point cloud is voxelized in order to compute the distance function and it's gradient. The gradient is then interpolated back to the point cloud. If I understood correctly, a convolution is then directly applied to the point cloud, though I could not find a description of how this is done. + +The problem of defining geometric convolutions on point clouds is interesting and relevant, and something like what is presented in this paper might work well. Indeed the experiments demonstrate that the method produces decent scores on modelnet and S3DIS. The method is not very fast, due to the voxelization preprocessing which is required, which at .5s precludes real-time applications. Also, as discussed below, there are some issues related to singularities in the vector fields u1,u2 that are not adequately discussed. My main concern, and the reason I have given a ""weak reject"" rating, is that the clarity and completeness of the paper is insufficient and I find the paper hard to follow (see below). I would actually like to see an improved version published, and if the authors can address the concerns I would increase my score. + + +# Writing +I find that the paper is hard to read because of the way it is structured, as well as numerous small mathematical and expository issues (discussed below) and missing information. + +The present paper builds on the PTC method of Schonsheck et al., but this method is not properly explained in the paper. PTC is mentioned in 1.1, but only described at a very high level. Since I had not read this paper before (I have now) the present paper was hard to follow on a first reading. + +Section 1.2 is a large sub-section of the related work section that actually discusses the proposed NPTC. I think it's better not to spend so much of the related work section on this. Moreover, section 3.1.1. ""General idea of NPTC"", contains another high level description, which is somewhat redundant (though at the same time, both sections leave lots of questions unanswered - see below). The following sections, 3.1.2 and 3.1.3 explain the computation of the distance function and the vector fields, but contain no clear explanation of the whole algorithm. + +Many important questions about how the method is actually implemented are left unanswered, even though the main contribution of this paper is at the implementation level rather than the theoretical level. +- How is voxelization done? Count the number of points per voxel? How is the voxel size chosen? How do you make sure that you obtain a ""narrow band"" around the surface defined by the point cloud? What kind of datastructure do you use to store a sparse voxel grid? 
+- How is the filter parameterized? In the theory, k is a function on R^2. How do we evaluate it at an arbitrary point in R^2? +- How is convolution done? I think it is done on points not voxels, similar to eq. 4, but this is never stated in the paper. How do you approximate the integral over the tangent space in Eq 2, if eq 2 is what the conv is based on? + +I would remove section 2.1 where the notion of a connection is formally introduced. The reader will either know this concept, in which case it is unnecessary, or not, in which case the description is far too compact (one must read a textbook to really understand this notion). It may be better to include only a short intuitive introduction to parallel transport, refering the reader to other texts for mathematical definitions of basic notions like connections. + +The fact that the voxelization of the point cloud is sparse is only mentioned very late in the paper. For a while I was thinking the idea would be to construct a dense voxel grid, which would be extremely memory intensive. Also, the phrase ""narrow-band"" is often used with a very different meaning, as in a small part of the spectrum. It was not clear to me what this was referring to until halfway through the paper. Better to explain the general idea of the algorithm in the introduction. + + +# Singularities +I looked at the papers citing the main related work (the PTC paper by Schonscheck et al., 2018), and of the four listed in google scholar, one seems relevant in that it prominently cites Schonscheck et al. and claims to improve upon it, but this work is not cited in the present paper: +Cohen, Weiler, Kicanaoglu, Welling, Gauge Equivalent Convolutional Networks and the Icosahedral CNN. +This work highlights one potential downside of the PTC approach, which is that it is not possible to find a global frame u (called in w in their paper) without singularities on manifolds like the sphere. Moreover, the choice of frame is fundamentally arbitrary and should not affect the results. In the PTC approach, a certain choice of frame with singularities is made, which leads to filter orientations that are 1) essentially arbitrary, and 2) change smoothly in most places but abruptly near the singularities. The results are heavily dependent on the choice of frame. + +The present paper dismisses these issues by saying that ""We remark that possible singularities will lead to no convolution operation at those points. These are isolated points on a closed manifold and do not effect experiment results."" +However, this is not actually demonstrated in the paper, and one might reasonably believe that this problem affects more than a small finite set of points. Near a singularity (e.g. the poles of the sphere in fig 2d), nearby filters will be oriented very differently. If we draw a small circle around the singularity, the orientation of the filter changes 360 degrees as we travel around the circle. As a result, the meaning of the patterns in the output feature map is very different near the poles than it is near the equator. Nevertheless, the next layer will try to match these patterns with the same set of filters / perform weight sharing across different positions. In my opinion this issue should be more frankly discussed. + +Fortunately, in some applications with point clouds this issue can be avoided. 
For point clouds coming directly from a (RGB+)depth sensor, the manifold is essentially like a wrinkeled sheet (homeomorphic to a disk), and one can easily find vector fields / frames without singularities on such manifolds. (this is not necessarily true for point clouds that represent a whole scene, e.g. in SLAM). I think this is what is mentioned in paragraph 3 of 3.1.1. However, it is not clear to me if the frame used in this paper is without singularities even when it is possible to find such a frame. The reason is that the frame is defined in terms of the distance function rho to an arbitrary origin. If I understand correctly, one will always have a singularity at the origin. This could in principle easily be solved though by simply using the x and y direction of the camera plane as a frame (projecting it from R^3 to each tangent space). However, if the authors decide to go that route, one would also like to see a comparison to simply applying 2d convolutions to the RGB+D images. + + +# Experiments + +The experiments on Modelnet40 show that, when compared with other point-cloud methods, the proposed method outperforms. I do not know if there are better point-based methods than those listed in the paper, but on the official Modelnet40 leaderboard (which is incomplete) there are several methods that score better (97%+). Since this difference could possibly be attributed to a loss of information due to the discretization into a point cloud, I still think these results are quite good. Similarly on S3DIS the method scores quite well, though does not reach state of the art. + +I could not find the architecture for the classification network, and only a high-level description of the segmentation architecture. More details on the training and evaluation procedures could be added for both experiments. + + +# Misc + +Some more things I noticed below. Many of these are not serious problems, just things that can be improved / fixed. + +- p2: +-- ""Spatial mesh-based methods are more intuitive"" - this is subjective +-- f(v) where v is a tangent vector in T_xe M is not defined. One would expect f to be a function on the manifold. Typically one writes f(exp_x v) instead. +-- script M is not defined (though it is intuitive that this is the manifold). Later in 2.1 it is written ""Let M be a 2 dimensional differential manifold"". Later still (3.1) the notation is switched to P without notice. +-- TxM is not defined. Later it is explained this is the tangent plane. Neither is g_x (it is obvious this should be the metric, but only if one knows differential geometry). +-- ""and do not effect experimental results"" -> affect +-- ""singularities from a given vector field can be overcome ...."" I don't understand what this is saying. How can we avoid the issues with singularities? +-- consider writing log instead of exp^{-1} +-- ""while uses tangent planes"" -> using +- p3: +-- You have f(z) in the subscript of the indicator I_argmax ... +-- In the explanation of pointnet it is not clear where there are any parameters in the kernel k if k(xi, xj) = delta(xi, xj) +-- I don't understand the explanation of edge convolution. If f(x_j) = 1 (for all j(?)) then k(xi, xj) = MLP(f(xi), f(xi) - f(xj)) would equal MLP(1, 0) which makes no sense. +-- Description of figure 1: ""are in parallel"": need to say along what curve they are parallel. If I understand correctly, they are only in parallel if they are attached to different points on a minimizing geodesic emanating from the origin. 
+- p4: +-- ""embed in R^3"" -> embedded +-- In 2.1 it is explained that T_x M is the tangent plane, but you've already discussed T_x M several times before. +-- A vector field is not just any assignment M -> TM. It is a section of TM, ie a map M -> TM that, when composed with the projection TM -> M, yields the identity on M. +-- The interval ""I"" mentioned below eq 5 has not been defined. ""I"" was used as indicator function before. +-- ""X is the unique section of Gamma(TM)"". The concept of ""section"" has not been introduced. Moreover, one could say ""section in Gamma(TM)"" or ""section of TM"" but not ""section of Gamma(TM)"" since Gamma(TM) *is* the set of sections of TM. Probably best to say ""unique vector field"". +-- ""there will be a geodesic connecting x0 and x1"" - only true if M is connected. +- p5: +-- ""how a distance functions are"" -> function +-- ""easliy"" +-- below eq 6 ""f(x) is a strictly positive function"", but f(x) does not appear in eq 6 +-- in 3.1., script P has not been defined. I first thought it is just a new letter for M, but I think it is the point cloud? Needs to be made explicit. +-- ""Geodesic curve represents, in some snse, the shortest path"". This is false. A geodesic is a *straightest* curve, but a geodesic need not be distance minimizing in any sense. +-- ""ascend direction"" -> ascent +-- In the first paragraph of 3.1.1 it is not mentioned explicitly that y plays the role of an ""origin"". This makes the last sentence confusing. +-- ""The value f(v) is computed by f(v) = f(z) where z in P is the closest point to v"". Firstly, this sentence is in the middle of a paragraph about the vector fields u1,u2, but seems unrelated. Secondly, what are v and z? I guess v is a point in M and z in P? Or is v a tangent vector? In that case, do you mean closest in the embedding space R^3? +- p6: +-- ""consider point clouds in R^3"". This assumption was already made before, right? +-- ""it is not straightforwardly compute distance function"" -> ""straightforward to compute the distance function"" +-- ""Here Lambda is chosen as certain point on the point cloud"" -> ""a certain point"". Also, Lambda was used before as a subset of Omega-bar (below eq 6), but here it is a point. I guess this point is the origin. In other places (earlier) you also refer to the origin, e.g. by y. Needs to be made consistent. +-- ""Show that the directly"" - remove ""the"" +-- ""Finally, we interpolate the distance function from the voxels to point cloud"" -> ""the point cloud"". On first reading, this (page 6!) was the point where I realized (guessed) that you are going to define the convolution on the point cloud not voxel grid. This needs to be explained much earlier. +-- ""it is nature to"" -> natural +- p7 +-- ""in our implement"" -> implementation +-- ""Appendix."" -> ""Appendix A"" or ""the Appendix."" +- p8 +-- ""Scene semantics segmentation"" -> semantic +",3,,ICLR2020 +f_lYl-vwbB,3,PghuCwnjF6y,PghuCwnjF6y,Better comparison with regular hyperparameter search is required ,"This paper proposes a new dataset which contains experiment / model details coupled with optimizer information so as to model the behavior of optimizer, and their effect on performance on test set. The paper is not very difficult to follow, but I am not super convinced of an actual practical use cases. + +I think that the authors should provide a concrete examples for real life test time applications. 
I suppose the meta-learning algorithm for the optimizer would take the experiment definition, and map this information to an optimal optimizer, but I think that it would be easier for the reader if this information could be made more explicit in the paper, perhaps with a concrete example. + +I also think that the comparison between the proposed meta-learning approach and the regular hyperparameter search on a given dataset should be made clearer. Right now it is limited to figure 3, and in my opinion the details on how the random search is carried out are not clear enough. What is the range of hyperparameters that are sampled? What are the distributions from which the hyperparameters come? + +Also, it is hard to conclude, only from the experiments provided in Figure 3, that the proposed meta-learning approach would be preferable over the standard hyperparameter search given just the two tasks explored in this particular figure. Ideally a third dimension of tasks should also be added to the figures so that we know that this meta-learning approach generalizes over a variety of tasks. (The same comments apply to figure 4, if I understand correctly, which does similar experiments on more realistic models/tasks.) + +If I am missing something that is already in the paper, I apologize, but without further experimental evidence which suggests that the proposed meta-learning scheme would be clearly preferable over standard hyperparameter search, it is hard to see a clear-cut application for this paper. + +I appreciate the ambitious task that this paper is trying to tackle, but I feel more convincing experimental evidence and a better presentation of the experiments are required to consolidate the case that this paper is trying to make. + +",5,2.0,ICLR2021 +r1g-eWsqKH,1,Hkx7_1rKwS,Hkx7_1rKwS,Official Blind Review #2,"Summary +The present work proposes a new algorithm, ""Follow the Ridge"" (FR), that uses second order gradient information to iteratively find local minimax points, or Stackelberg equilibria, in two-player continuous games. The authors show rigorously that the only stable fixed points of their algorithm are local minimax points and that their algorithm therefore converges locally exactly to those points. They show that the resulting optimizer is compatible with heuristics like RMSProp and Momentum. They further evaluate their algorithm on polynomial toy problems and simple GANs. + +Decision +I think that this is a solid paper that addresses the well-defined goal of finding an optimizer that only converges to local minimax points. This is established based on both theoretical results and numerical experiments. Since there has been a recent interest in minimax points as a possible solution concept for GANs, I believe the paper should be accepted. + +The paper occasionally makes claims that the solutions of GANs should consist of local minimax points (""We emphasize that GAN training is better viewed as a sequential game rather than the simultaneous game, since the primary goal is to learn a good generator.""), which are not backed up by empirical results or reference to existing literature. If anything, the empirical results in this paper do not show improvement of the resulting generator (with the exception of the 1-dimensional example that has a particular rigidity since low discriminator output can easily restrict the movement of generator mass based on first order information). 
The right solution concept for GANs is not what the paper is about, but before publication the authors should remove these claims, identify them as speculative, or substantiate them. + +Suggestions for revision +(1) In the last displayed formula on page 4 it should be the gradient w.r.t. x. +(2) Remove, substantiate, or mark as speculative the claims regarding the right notion of solution concept for GANs. + +Questions to the authors +(1) You write "" There is also empirical evidence against viewing GANs as simultaneous games (Berard et al., 2019). "". Could you please elaborate on why Berard et al. provides empirical evidence against viewing GANs as simultaneous games? +(2) The batch size for MNIST of 2000 is much larger than the values I have seen in other works. What is the effect of using more realistic batch sizes in training? +(3) When measuring the speed with which consensus optimization and FR converge, shouldn't you allow consensus optimization five times as many iterations, since you are using five iterations of CG to invert the Hessians in each step? +(4) You mention that you use CG to invert the Hessian, but the Hessian is not positive definite? Do you apply CG to the adjoint equations?",6,,ICLR2020 +rkxvKVc0YS,3,HyxCRCEKwB,HyxCRCEKwB,Official Blind Review #2,"Summary +The present work proposes to combine GANs with adversarial training, replacing the original GAN loss with a mixture of the original GAN loss and an adversarial loss that applies an adversarial perturbation to both the input image of the discriminator, and to the input noise of the generator. The resulting algorithm is called robust GAN (RGAN). Existing results of [Goodfellow et al 2014] (characterizing optimal generators and discriminators in terms of the density of the true data) are adapted to the new loss functions and generalization bounds akin to [Arora et al 2017] are proved. Extensive experiments show a small but consistent improvement over a baseline method. + +Decision +The authors do a thorough job at characterizing the proposed method using both theoretical analysis and wide-ranging experimental studies. My main criticism of the paper in its present form is the lack of motivation for the proposed method. Why, out of the many possible ways to impose additional regularization, should one use adversarial training to regularize GANs? While it is remarkable that the experimental results seem to be improving consistently, the improvement is quite small. Similarly, while theoretical results are provided, a discussion of what they mean for the performance of RGAN is sorely lacking, leaving me unconvinced that adversarial training leads to an improvement over GANs when compared with simpler methods of regularization. Therefore I vote to reject the paper in its present form. + +Suggestions for improvement on the experiments +My main concern with the experiments is that a similar small improvement over the baseline could be achieved by tuning the hyperparameters in an alternative simpler regularization method. For instance, instead of using an adversarial perturbation, one could simply use a random perturbation applied to both the random noise and the discriminator input at testing time. The former would amount to a variant of the truncation trick [Brock et al 2019], while the latter would amount to using instance noise. These are established methods to improve GAN performance, and to make a case for adversarial training of GANs one would need to show improvements compared to these simpler strategies, in my opinion. 
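+ +A minimal sketch of the random-perturbation baseline described above (illustrative only; the function name, noise scales, and clipping range are assumptions, not taken from the paper): +import numpy as np +def perturbed_gan_inputs(z, x, sigma_z=0.5, sigma_x=0.05, clip=2.0, rng=None): +    # Non-adversarial baseline: truncation-style clipping plus noise on the latent code z, +    # and instance noise added to the discriminator input x (all magnitudes illustrative). +    rng = np.random.default_rng() if rng is None else rng +    z_pert = np.clip(z + sigma_z * rng.standard_normal(z.shape), -clip, clip) +    x_pert = x + sigma_x * rng.standard_normal(x.shape) +    return z_pert, x_pert +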
+ +My main suggestion for the theoretical part is to make a stronger case of what (if anything) these theoretical results say about the performance of RGAN compared to the usual GAN. In particular, the generalization bound does not seem to depend on lambda, (which interpolates between the original GAN and RGAN). What is to be inferred from these results regarding the performance of RGAN? + +Questions to the authors +(1) I assume you perform adversarial training in practice by backpropagating in image/noise space? How does this affect performance? How would the convergence plots look like if wall-clock time, or the number of model evaluations were used on the x-axis? + +(2) Did you try investing a similar computational budget to tune hyperparameters for simpler regularization methods as mentioned above and compare the resulting improvement? + +(3) Is the value (I presume, standard deviation) given after each the inception score computed for different multiple iterations of the same run or multiple runs with different initialization and random seed?",3,,ICLR2020 +8-3DEPdJmj,5,a-xFK8Ymz5J,a-xFK8Ymz5J,Review of DiffWave,"Summary: + +The authors adapt the recent trend of work on denoising diffusion probablistic models to the task of conditional or unconditional waveform generation. +Using the same principles as in (Ho et al 2020), as well as a Wavenet-like non causal model, the authors provide state of the art results for both tasks, as evaluated on spoken digits dataset (for unconditional and conditional generation) and on the LJ speech dataset (for deep vocoding). +The model proposed is faster to evaluate than WaveNet, and has less parameters than WaveGlow. It achieves a slightly higher MOS than WaveFlow for a comparable model size. The generation speed is comparable to previous methods. + +## Review + +The paper is clear, well structured and the authors provide many experiments to validate their approach. +While serious, the paper does lack novelty, as the method is completely taken from (Ho et al. 2020). The architecture is similar to Wavenet, but non causal. Note that this exact same non causal wavenet architecture has already been used for source separation [Rethage et al. 2018, Lluis et al. 2019]. + +One limitation is that unconditional, or weakly conditioned generation (i.e. not conditioned on a mel-spectrogram) is only evaluated on single digit generation, which is relatively limite. While the samples shows an improvement over WaveNet, it seems the proposed architecture would still struggle to generate longer sequences, like an entire sentence for instance. +It would be interesting to add WaveGlow or WaveFlow to the SC09 comparison. + +Overall I recommend acceptance as the paper show that denoising diffusion process can be used for waveform generation, even though the paper does not bring further novelty. + +References: +Rethage et al. 2018: A Wavenet for Speech Denoising +Lluis et al. 2019: End-to-end music source separation: is it possible in the waveform domain?",7,3.0,ICLR2021 +Byxpp1v0cS,3,S1et8gBKwH,S1et8gBKwH,Official Blind Review #4,"This paper presents a semi-supervised approach to learn the rotation of objects in an image. The primary motivation is that for rotation estimation datasets may not always be fully labeled, so learning partially from labeled and partially for unlabeled is important. The approach is to use a CVAE with a supervised loss and an unsupervised loss and to jointly train the network. Limited experiments that show performance are presented. 
+ +First, the paper solves a very interesting problem with potentially wide applications. The paper is reasonably well-written. + +Unfortunately, I don't believe that the contributions of the paper meet the standards of ICLR. I justify my opinion below. The experiments are also very weak. + +- While the high level goal of ""pose estimation"" is clear. Even after reading the paper multiple times, I did not understand the setting well. It appears like the paper looks at the problem of 2D orientation estimation of objects in images. However, this setting is restrictive and not very practical in reality. We mostly care about 3D pose estimation. It would have been good to see results on 3D rotations at the very least. + +- Contribution: It is unclear to me what the primary contribution(s) of the paper is. The entire section on CVAE's and losses are quite standard in literature. The interesting part is in combining the supervised and unsupervised parts of the method for the task for pose estimation. But in the end this is a simple weighted loss function (equation 5). So I wonder what is the novelty? What are the new capabilities enabled by this approach? + +- Related Work: + +Implicit 3D Orientation Learning for 6D Object Detection from RGB Images, ECCV 18 + + +- I would have loved to see a description of the differences in the loss functions (1) and (2). Perhaps this can help elevate the contribution more? + +- I also missed justification of why the particular design choice is suitable for this problem? Would direct regression using a simple CNN work better? + +- In equation (4), how are the two losses balanced? + +- The dataset generation part is just confusing. ModelNet40 is rendered but only 2D rotation is predicted? What does 2D rotation mean for a 3D object? + +- Could this method be tested on a dataset like dSprites (https://github.com/deepmind/dsprites-dataset) which has 3D rotations? + +- Regarding experiments: I was disappointed to see no comparisons with other approaches or even a simple baseline. A CNN that directly regresses orientation could help put the tables and plots in perspective. + +Overall, the problem is important (if lifted to 3D) with important applications. However, the paper does not say anything new about how to solve the problem and the experiments are weak. In its current state, I am unable to recommend acceptance.",1,,ICLR2020 +2zHSMvTZkyE,2,RcjRb9pEQ-Q,RcjRb9pEQ-Q,This work proposes a new perturbation method for generating unrestricted adversarial examples through introduction of stylistic and stochastic modifications.,"The paper presents a new method for generating unrestricted adversarial examples. Based on Style-GAN, this work separates stylistic and noise modifications so as to control higher-level aspects and lower-level aspects of image generation. + +By handling style and noise variables separately and changing the different levels of synthesis networks, the model can input various types of perturbations in generating adversarial images. As the authors claim, the style variables from different layers affect different aspects of images. Generation of adversarial images are tested in both un-targeted and targeted attacks. Overall, +the paper is well-motivated, well-written, and the method is evaluated with three tasks, classification, semantic segmentation, and object detection. 
+ +On the other hand, although the different layers of the networks are concerned with different aspects of the images and the proposed method can generate a variety of images, we may not be able to intentionally control specific aspects of the images. This is an incremental work on top of Style-GAN so that the novelty of the paper is not very high. + +Please make clear how the parameter values are determined. For example, how did you select the step sizes?",6,2.0,ICLR2021 +HJg4ytqaYS,1,S1elRa4twS,S1elRa4twS,Official Blind Review #3,"The paper studies batch meta learning, i.e. the problem of using a fixed experience from past tasks to learn a policy which can quickly adapt to a new related task. The proposed method combines the techniques of Fujimoto et al. (2018) for stabilizing batch off-policy learning with ideas from Rakelly et al. (2019) for learning a set invariant task embedding using task-specific datasets of transitions. They learn task-specific Q-values which are then distilled into a new Q function which is conditioned on the task embedding instead of the task ID. The embedding is further shaped using a next-state prediction auxiliary loss. The algorithmic ideas feel a bit too incremental and the experimental evaluation could be stronger--I'd recommend trying the method on more complicated environments and including ablation studies. + +Specific comments: +1. I disagree that the cheetah and hopper environments are ""challenging""--they're one of the simplest MuJoCo environments. +2. The problem of adapting to run at a specific speed when the meta-learner observes the dense rewards is actually not a meta learning problem because the meta learner can uniquely identify the target speed from a single transition. This is because the current speed is part of the observation, and so given the value of the dense reward at this state, it is simple to calculate the target speed. Hence these environments are effectively the same as directly giving the agent the target speed as an input. Given this interpretation, I'm not sure what is ""meta"" about this environment. The problem then reduces to the question of whether the agent can generalize from the 16 or 29 training tasks. That this should be the case is not surprising considering the one-dimensional nature of the task space. +3. It would also be useful to see some ablations. For example, is the auxiliary prediction task necessary? Would it be possible to side step the distillation process and directly learn Q_S from the buffer as done e.g. in Rakelly et al. (2019)? Could you show some data that the corrections from Fujimoto et al. (2018) are important in the batch setting? + +------------------------------------------------------------------------------------------------------------ +Thanks for your comments. I still think this is too incremental, and my concerns regarding the environments and using the dense reward as a feature which identifies the task haven't changed and so I'm keeping my score as is.",3,,ICLR2020 +xGePHk3XOWe,3,o2N6AYOp31,o2N6AYOp31,Interesting and simple (which is positive) method - but experiments are lacking,"The general idea of the paper is interesting: when using an AE one can use the constraint that ""interpolated images"" should also correspond to ""interpolated latent codes"". While the idea is interesting, the experimental results are not really that compelling. + +The proposed approach in section 4 is an interesting and well motivated extension to Interpolative AEs. 
The idea is quite simple and the proposed setting to use synthesized and augmented images for the above mentioned interpolation constraint seems interesting to explore. + +The main weakness of the paper are the experimental results in my view. + +While qualitatively (and potentially hand-picked) examples seem to show that the proposed approach is working well, the experiments are not sufficient to convince me as a reviewer about the power of the approach. Let me be more specific + +- In general, quantitative results are rare and thus it is close to impossible to assess the performance of the proposed method. In essence mostly qualitative results are shown that are obviously anecdotal only. An exception is table 2, where FID and an error is shown. While I understand the FID score, I am lacking comparisons to FID scores for these models trained to the ""same"" domain. Otherwise it is unclear how good the numbers really are. Also, the error numbers where not entirely clear to me what they correspond to. + +- An important ingredient and component of the approach is the way the synthesized images are obtained via augmentation. While the reader get a vague idea about what kind of augmentation is used, there is no experiment that shows which kind of augmentation is necessary and which kind of augmentation will break the system. In fact, without such an ""ablation-type"" experiment the paper is not particularly insightful. To me some experiments around this essential component the paper is incomplete and should not be accepted. + +- Finally, somewhat linked to the previous comment, the paper does not really show failure modes (with the somewhat too obvious failure mode given in fig 3 right) to understand the limitations of the proposed method + + +So overall the proposed method is interesting and simple (which is positive) - but the experimental results are not convincing and complete enough to justify acceptance at ICLR. + +Update after the rebuttal: + +Thanks for the responses. Given that the other reviewers also raise serious issues I will stick with my initial rating. The paper seems not to be ready for publication at ICLR",5,4.0,ICLR2021 +HkgwSoPq2Q,2,B1lxH20qtX,B1lxH20qtX,"An interesting idea of dynamical ""self-assembly"" but unclear implications of the proposed message passing ","The paper describes training a collection of independent agents enabled with message passing to dynamically form tree-morphologies. The results are interesting and as proof of concept this is quite an encouraging demonstration. + +Main issue is the value of message passing +- Although the standing task does demonstrate that message passing may be of benefit. It is unclear in the other two tasks if it even makes a difference. Is grouping behavior typical in the locomotion task or it is an infrequent event? + - Would it be correct to assume that even without message passing and given enough training time the ""assemblies"" will learn to perform as well as with message passing? The graphs in the standing task seem to indicate this. Would you be able to explain and perform experiments that prove or disprove that? + - The videos demonstrate balancing in the standing task and it is unclear why the bottom-up and bidirectional messages perform equally well. I would disagree with your comment about lack of information for balancing in the top-down messages. The result is not intuitive. + - Given the above, does message passing lead to a faster training? 
Would you be able to add an experimental evidence of this statement?",7,3.0,ICLR2019 +R1vdH15I01f,2,1TIrbngpW0x,1TIrbngpW0x,Interesting idea but the results are not convincing,"This paper proposes an independent mechanism that divides hidden representations and parameters into multiple independent mechanisms. The authors claim that the mechanism benefits the computation of sparse tensors; it does learn better inductive biases than a sizeable monolithic model. This idea is particularly similar to Recurrent Independent Mechanisms (RIM) [1], mentioned in the paper. The main contribution of this work is introducing competition between independent mechanisms. The authors evaluate their models on the image transformer model, speech enhancement, and NLP tasks. + +The main thing that has been missing in this paper is the fine-details of the approaches. + +I think the paper has exciting ideas on reconstructing the architecture of the Transformer. However, the proposed models only give marginal improvements; thus it is tough to find the reasons for using this model. + +To have a stronger claim, I suggest adding a comparison to RIM on Sequential MNIST Resolution Task to show that independent mechanisms can benefit the generalization performance. + +Strengths: +- The authors introduce a novel transformer architecture to split the parameters into independent sections for a better inductive bias + +Weaknesses: +- The performance improvement is not consistent in all experiments. The competition mechanism benefits only some tasks. +- The evaluation results are not convincing + +Several issues and questions for the authors: +1. Can you show the dynamics of the competition between mechanisms over time? +2. In terms of evaluation, I didn't see any baseline for the Image Transformer model. It isn't easy to understand the significance of the proposed method. +3. Please consider citing the relevant papers [2] [3] + +***Post-Rebuttal*** + +> I want to thank the author for addressing my concerns. However, the authors did not address the issues of the evaluation. I think the paper can be further improved by providing more convincing results and analysis. For me, the significance of the proposed method is minimal. Thus, I will not change my score. + +References +1. Recurrent Independent Mechanisms https://arxiv.org/pdf/1909.10893.pdf +2. LeCun, Y., Cortes, C., & Burges, C. J. (2010). MNIST handwritten digit database. +3. Krizhevsky, A. (2009). Learning Multiple Layers of Features from Tiny Images.",5,5.0,ICLR2021 +HJe5K0lT37,2,rkgpy3C5tX,rkgpy3C5tX,"Limited baselines in comparison, cost not clear","The authors proposed a meta-learning approach which amortizes hierarchical variational inference across tasks, learning an initial variational distribution such that, after a few steps of stochastic optimization with the reparametrization trick, they obtain a good task-specific approximate posterior. The optimization is performed by applying backpropagation through +gradient updates. Experiments on a contextual bandit setting and on miniImage net show how the proposed approach can outperform a baseline based on the method MAML. Although in miniImagenet the proposed method does not produce +gains in terms of accuracy, it does produce gains in terms of uncertainty estimation. + +Quality: + +The derivation of the proposed method is rigorous and well justified. The experiments performed show that the proposed method can result in gains. 
However, the comparison is only with respect to MAML and other techniques could have also be included to make it more meaningful. For example, + +Gordon, Jonathan, et al. ""Decision-Theoretic Meta-Learning: Versatile and +Efficient Amortization of Few-Shot Learning."" arXiv preprint arXiv:1805.09921 +(2018). + +or the methods included in the related work section, or Garnelo et al. 2018. + +The authors do not comment on the computational cost of the proposed method. + +Clarity: + +The paper is clearly written and easy to read. + +Novelty: + +The proposed method is new up to my knowledge. This is one of the first methods to do Bayesian meta-learning. + +Significance: + +The experimental results show that the proposed method can produce gains. However, because the authors only compare with a non-Bayesian meta-learning method (MAML), it is not clear how significant the results are. Furthermore, the computational cost of the proposed method is described well enough.",5,3.0,ICLR2019 +r1mT8HDgf,1,S1EwLkW0W,S1EwLkW0W,"Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients","Summary: +The paper is trying to improve Adam based on variance adaption with momentum. Two algorithms are proposed, M-SSD (Stochastic Sign Descent with Momentum) and M-SVAG (Stochastic Variance-Adapted Gradient with Momentum) to solve finite sum minimization problem. The convergence analysis is provided for SVAG for strongly convex case. Numerical experiments are provided for some standard neural network structures with three common datasets MNIST, CIFAR10 and CIFAR100 compared the performance of M-SSD and M-SVAG to two existing algorithms: SGD momentum and Adam. + +Comments: +Page 4, line 5: You should define \nu clearly. + +Theorem 1: In the strongly convex case, assumption E ||g_t ||^2 \leq G^2 (if G is a constant) is too strong. In this case, G could be equal to infinity. If G is not infinity, you already assume that your algorithm converges, that is the reason why this assumption is not so good for strongly convex. If G is infinity (this is really possible for strongly convex), your proof would get a trouble as eq. (40) is not valid anymore. + +Also, to compute \gamma_{t,i}, it requires to compute \nabla f_{t,i}, which is full gradient. By doing this, the computational cost should add the dependence of M, which is very large as you mentioned in the introduction. According to your rate O(1/t), the complexity is worse than that of gradient descent and SGD as well. + +As I understand, there is no theoretical results for M-SSG and M-SVAG, but only the result for SVAG with exact \eta_i^2 in the strongly convex case. Also, theoretical results are not strong enough. Hence, the experiments need to make more convincingly, at least for some different complicated architecture of deep neural network. As I see, in some dataset, Adam performs better than M-SSD, some another dataset, Adam performs better than M-SVAG. Same situation for M-SGD. My question is that: When should we use M-SSD or M-SVAG? For a given dataset, why should we not use Adam or M-SGD (or other existing algorithms such as Adagrad, RMSprop), but your algorithms? + +You should do more experiments to various dataset and architectures to be more convincing since theoretical results are not strong enough. Would you think to try to use VGG or ResNet to ImageNet? + +I like the idea of the paper but I would love if the author(s) could improve more theoretical results to convince people. 
Otherwise, the results in this paper could not be considered as good enough. At this moment, I think the paper is still not ready for the publication. + +Minor comments: +Page 2, in eq. (6): You should mention that “1” is a vector. +Page 4, line 4: Q in R^{d} => Q in R^{d x d} +Page 6, Theorem 1: You should define the finite sum optimization problem with f since you have not used it before. +Page 6, Theorem 1: You should use another notation for “\mu”-strongly convex parameter since you have another “\mu”-momentum parameter in section 3.4 +Page 4, Page 7: Be careful with the case when c = 0 (page 4) and mu = 1 (page 7-8) with dividing by 0. +",4,4.0,ICLR2018 +SJevAu1qhX,1,H1xD9sR5Fm,H1xD9sR5Fm,Well written and interesting; experiments could be improved,"In this paper the authors distinguish between two families of training objectives for seq2seq models, namely, divergence minimization objectives and max-margin objectives. They primarily focus on the divergence minimization family, and show that the MRT and RAML objectives can be related to minimizing the KL divergence between the model's distribution over outputs and the ""exponentiated payoff distribution,"" with the two objectives differing in terms of the direction of the KL. In addition, the authors propose an objective using the Hellinger distance rather than the KL divergence, and they conduct experiments on machine translation and summarization comparing all the considered objectives. + +The paper is written extremely clearly, and is a pleasure to read. While the discussion of the relationship between RAML and MRT (and MRT and REINFORCE) is interesting and illuminating, many of these insights appear to have been discussed in earlier papers, and the RAML paper itself notes that it differs from REINFORCE style training in terms of the KL direction. + +On the other hand, the idea of minimizing Hellinger distance is I believe novel (though related to the alpha-divergence work cited by the authors in the related work section), and it's nice that training with this loss improves over the other losses. Since the authors' results, however, appear to be somewhat below the state of the art, I think the main question left open by the experimental section is whether training with the Hellinger loss would further improve state of the art models. Even if it would not, it would still be interesting to understand why, and so I think the paper could be strengthened either by outperforming state of the art results or, perhaps through an ablation analysis, showing what aspects of current state of the art models make minimizing the Hellinger loss unnecessary. + +In summary, + +Pros: +- well written and interesting +- a new loss with potential for improvement over other losses +- fairly thorough experiments + +Cons: +- much of the analysis is not new +- unclear if the proposed loss will improve the state of the art, and if not why + +Update after author response: thanks for your response. I think the latest revision of the paper is improved, and even though state of the art BLEU scores on IWSLT appear to be in the mid 33s, I think the improvement over the Convolutional Seq2seq model is encouraging, and so I'm increasing my score to 7. 
I hope you'll include these newer results in the paper.",7,4.0,ICLR2019 +S1lPeozCYH,1,H1e5GJBtDr,H1e5GJBtDr,Official Blind Review #2,"This paper proposes axial attention as an alternative of self-attention for data arranged as large multidimensional tensors, which costs too much computational resource since the complexity of traditional self-attention is quadratic in order to capture long-range dependencies for full receptive fields. The axial attention is applied within each axis of the data separately while keeping information along other axes independent. Therefore, for a d-dimensional tensor with N = S^d, axial attention saves O(N^{(d−1)/d}) computation over standard attention. The proposed axial attention can be used within standard Transformer layers in a straightforward manner to produce Axial Transformer layers, without changing the basic building blocks of traditional Transformer architecture. The authors did experiments on two standard datasets for generative image and video models: down-sampled ImageNet and BAIR Robot Pushing, and they claim that their proposed method matches or outperforms the state-of-the-art on ImageNet-32 and ImageNet-64 image benchmarks and sets a significant new state-of-the-art on the BAIR Robot Pushing video benchmark. + +Reasons to accept: + +1. Simple, easy-to-implement yet effective approach to adapt self-attention to large multidimensional data, which can save considerable computation for efficiency, while still have competitive performance. +2. Clear writing, with sufficient but not redundant introduction of background knowledge and explanation of both the advantages and drawbacks of existing models (too large computational complexity on high-dimensional data). + +Suggestions for improvement: + +1. It would be better if the authors can provide more analysis or case study to show the reason why Axial attention (Axial Transformer) can reach good performance even if it omits considerable operations compared to traditional Transformers, or to show why the attention operations within axis are important instead of attention operations between axis. +2. Definition of “axis” should be more clear in section 3 (there could be some ambiguities of “axis”). +",6,,ICLR2020 +rJgx7oKM5B,3,r1eh30NFwB,r1eh30NFwB,Official Blind Review #3,"This paper proposes adding additional flow layers on the decoder of VAEs. The authors make two claims +1. The proposed model achieves better image quality than a standalone Glow. +2. The proposed model is faster to train than Glows. +The intuition is a VAE can learn a distribution close enough to be target distribution, and the Glow only needs to do much less work than standalone Glow, hence faster. Some positive results are reported in the experiments, including better image quality, faster training time, and the Glow indeed sharpens the output of VAEs. + +The paper indeed has some good results, particularly they can achieve it only with single-scale Glows with additive coupling layers. However, I think the claims are not sufficiently supported. Taking point 1 as an example, it is not clear to me why VAE+Glow is better than a standalone Glow. Imagine two models + +M1: VAE+Glow (proposed in the paper) +M2: Glow0+Glow (standalone Glow) + +sharing the last ""Glow"" part. M1 is better than M2 implies ""VAE"" is more powerful than ""Glow0"", which I doubt. Similarly, for point 2, it is not clear to me why ""VAE"" is faster than ""Glow0"". 
I think comparing the proposed model with IAF make more sense, because the proposed model just adds flows to the decoder and the prior. But the relationship with Glows needs to be considered more thoroughly. + +Another confusing detail for me is the two-stage training in Sec. 3.4. The explanation ""likely because the Glow layer is unable to train efficiently with a changing base distribution"" doesn't make sense. Because IAF does successfully train their q-net without 2-stage training. There might be other reasons? + +The baselines are not strong enough. Most importantly, Flow++ [1] reports a likelihood 3.08 on Cifar10 with standalone flows, which should also be a part of the baseline. I also wonders whether the proposed model benefits from deeper model, like standalone flows do. Will standalone flows surpasses the proposed model as the number of layers goes to infinity? + +[1] Ho, Jonathan, et al. ""Flow++: Improving flow-based generative models with variational dequantization and architecture design."" arXiv preprint arXiv:1902.00275 (2019). + +Finally, the paper is somewhat incremental. Particularly comparing with VAE-IAF, where this paper just adds flow layers to not only q but also p. + +Update +===== + +After reading the rebuttal I found some of my concerns are unaddressed. (regarding to the two-stage training, and the novelty) + +Point 1 is still not answered. The answer I expect to have is how ""The interaction between two models when they are being stacked may affect the overall performance in such a way that it is more than just the sum of its parts."" Why are these two models perform particularly well when combined? The purpose of my initial question is for some in-depth analysis and intuition / theory. + +Therefore I will keep my score unchanged.",3,,ICLR2020 +Skp1e-zgM,1,rkTBjG-AZ,rkTBjG-AZ,The novelty in this paper is below what is expected for a publication at ICLR. I recommend rejection.,"The author present a language for expressing hyperparameters (HP) of a network. This language allows to define a tree structure search space to cover the case where some HP variable exists only if some previous HP variable took some specific value. Using this tool, they explore the depth of the network, when to apply batch-normalization, when to apply dropout and some optimization variables. They compare the search performance of random search, monte carlo tree search and a basic implementation of a Sequential Model Based Search. + +The novelty in this paper is below what is expected for a publication at ICLR. I recommend rejection.",4,5.0,ICLR2018 +Syl23DhcFr,1,BygKZkBtDH,BygKZkBtDH,Official Blind Review #1,"** Summary ** +In this paper, the authors propose a new variant of Transformer called Tied-multi Transformer. Given such a model with an N-layer encoder and an M-layer decoder, it is trained with M*N loss functions, where each combination of the nth-layer of the encoder and the mth-layer of the decoder is used to train an NMT model. The authors propose a way to dynamically select which layers to be used when a specific sentence comes. At last, the authors also try recurrent stack and knowledge to further compress the models. + +** Details ** +1. The first question is “why this work”: +a. In terms of performance improvement, in Table 2, we can see that dynamic layer selection does not bring any improvement compared to baseline (Tied(6,6)). When compared Tied(6,6) to standard Transformer, as shown in Table 1, there is no improvement. Both are 35.0. +b. 
In terms of inference speed, in Table 2, the method can achieve at most (2773-2563)/2998 = 0.07s improvement per sentence, which is very limited. +c. In terms of training speed, compared to standard Transformer, the proposed method takes 9.5 time of the standard Transformer (see section 3.4, training time). +Therefore, I think that compared to standard Transformer, there is not a significant difference. +2. The authors only work on a single dataset, which is not convincing. +3. In Section 5, what is the baseline of standard RS + knowledge distillation? +",1,,ICLR2020 +88PQ9ekLtnT,2,48goXfYCVFX,48goXfYCVFX,Interesting Problem but Important Technical Components Missing,"This paper tackles ingredient recommender systems problem. This paper proposes the Interpretable Relational Representation Model (IRRM) to achieve both usefulness and interpretableness. There are two variants of the model, first is to model latent relation between two ingredients, the second is to leverage external knowledge base and results from TransE to learn relational representations. + +The problem setting is interesting: using ingredient recommender systems to help chefs to create better or more creative ingredient combinations. This topic relatively recessives little attention in recommender system community but it seems worth a while pursuing this direction. + +I like the way the paper models the problem but there are several technical components are too important to miss: + +1. The baseline methods chosen are too simple to be compared against. FREQ, PMI and TFIDF are simple rule-based methods and NMF is the only commonly used method in recommender system but matrix factorization is too weak a baseline given that the proposed method is a neural network approach. This is also related to second point. + +2. Connections of this problem to Sequential Recommendation is largely missing. It seems to me this problem is actually closed related to sequential or session-based recommendations. Because the problem can be formulated as given a list of ingredients the recipe has interacted with, what is the next ingredient that this particular recipe is most likely to engage next? Here ingredient is the item and recipe is the user in sequential recommendation, so all the methods that have been developed for sequential recommendation can be used for this problem. Worth noting methods in this category is 2018 SASRec paper (https://arxiv.org/abs/1808.09781) and 2020 SSE-PT paper (https://dl.acm.org/doi/abs/10.1145/3383313.3412258). + +3. Many ablation studies are missing. Going over the entire paper, I am not entirely sure what components are most important to the good performance of the model. Many neural network design choices seem too arbitrary to me, e.g. why we need to add p and q in Figure 1. A rigorous approach should be able to convince readers that each choice of model design is the best among all alternatives and without it, the performance would suffer. + +4. Regarding qualitative study, I was just wondering if it is possible to ask crowd-sourcing chefs to rate some newly created recipes for unseen recipes. 
This way the offline results would be convincing because we are not quite sure that better offline ranking results would necessarily lead to better online performance for this particular problem.",5,5.0,ICLR2021 +HyeuCKuCtB,2,Bye6weHFvB,Bye6weHFvB,Official Blind Review #2,"## Paper Summary + +While cast slightly differently in the intro, it seems to me that this paper learns a goal-conditioned value function that is used at test time to construct visual plans by selecting an appropriate sequence of data points from the training data. Similar to prior work they learn a local distance metric without supervision using a temporal proximity objective and then construct a graph of the training data points using this metric. The main novelty that this paper introduces seems to be the idea to distill the results of planning algorithms run at training time into a global, goal-conditioned value function, which allows to reduce the required planning time at test time. The authors perform experiments on constructing visual plans for a simulated toy navigation task, a robotic rope manipulation task and the StreetLearn navigation task. The paper reports favorable results under a time-constrained test setting but does not include strong baselines that were designed for this setting. + +## Strengths + +- bootstrapping a learned local distance metric to a global distance metric to reduce test-time planning cost is an interesting problem +- the paper has nice visualizations / analysis on the toy dataset +- the learning procedure for the local distance metric is clearly described +- the paper uses a large variety of different visualizations to make concepts and results clearer + +## Weaknesses + +(1) missing links to related work: the author's treatment of related work does not address the connections to some relevant papers (e.g. [1-3]) or is only done in the appendix (especially for [4]). It is not clearly delineated between techniques and ideas that are introduced in other papers (see (2) below) and the novel parts of this work. This makes it hard to understand the actual contributions of this paper. + +(2) only minor contribution: the core parts of this paper build heavily on prior work: time-contrastive objectives for distance learning have been introduced in [5] and also been used in a very similar setup as here in [4], further [4, 3] also use semi-parametric, graph-like representations for planning with a learned local distance metric. The major contribution seems to be to distill the plans derived with either (a) n-step greedy rollout or (b) Dijkstra graph-search into a value-function so that planning does not need to be performed at test time. This somewhat small contribution is in contrast to the claims from the introduction that this paper ""pose[s] the problem of unsupervised learning a plannable representation as learning a cognitive map of the domain"". + +(3) comparison to weak baselines: the main comparison in the experimental section is to a version of [4] where the authors constrain the planning horizon to a single step, which means effectively greedily using the local metric from Sec. 3.1. To be clear: this is in no way the method of [4]: they use Dijkstra-based planning at test time and it is clear that a ""version"" of [4] that does not use planning is not able to work. To me this seems rather like an ablation of the proposed method than a real baseline. The baseline that plans greedily with embeddings based on visual similarity has even less hope of working. 
The paper lacks thorough comparison to (a) baselines with the same semi-parametric structure that perform planning at test time (like the real method of [4]) and (b) methods that generate reactive policies without constructing a semi-parametric memory (e.g. off-policy RL). Only then a thorough comparison of pros and cons of planning at training/test time is possible (see detailed suggestions below). + +(4) lack of qualitative samples for generated plans: for both the rope and the StreetLearn domain the authors do not provide thorough evaluation. For the rope domain only a single qualitative rollout is shown, for the StreetLearn domain no qualitative samples are provided for either the proposed method or the comparisons. (see suggestions for further evaluation below) + +(5) explanation of core algorithmic part unclear: the explanation of how the local metric is used to learn the global value function is somewhat unclear and the used notation is confusing. Key problems seem to be the double-introduction of symbols for the local metric in Alg. 2 and the confusing usage of the terms ""global embedding"" and ""value function"" (see detailed questions below) + +(6) terms used / writing structure makes paper hard to follow: the connection between used concepts like ""global embedding"", ""plannable representation"" and ""goal-conditioned value function"" are not clear in the writing of the paper. The authors talk about concepts without introducing them clearly before (e.g. problems of RL are listed in the intro without any prior reference to RL). + +(7) lacks detail for reproducing results: the paper does not provide sufficient detail for reproducing the results. Neither in the main paper nor in the appendix do the authors provide details regarding architecture and used hyperparameters. It is unclear what policy was used to collect the training data, it is unclear how the baselines are working in detail (e.g. how the 1-step planning works) and how produced plans are checked for their validity. + + +## Questions + +(A) What policy is used to collect the training data on each environment? +(B) What is the relation between the ""global embedding"" \Phi and the ""goal-conditioned value function"" V_\Phi(x, x_prime) in Algorithm 2? +(C) What is the difference between the local metric function \phi and the reward function in Algorithm 2? Are they the same? +(D) If they are the same, how can the local metric accurately estimate rewards for states x and x_g that are far apart from one another as would naturally be the case when training the value function? +(E) What does the notation N(1, \eps) in line 5 of Algorithm 2 mean? +(F) What is the expectation over the length of trajectories between start and goal on the StreetLearn environment (to estimate what percentage of that the success horizon of 50 steps is)? + + +## Suggestions to improve the paper + +(for 1) please add a more thorough treatment of the closest related works on semi-parametric memory + learned visual planning + learned distance functions (some mentioned below [1-5]) to the main part of the paper, clearly stating differences and carving out which parts are similar and where actual novelty lies. + +(for 2) please explain clearly the added value of distilling the training-plans into a value function for O(1) test-time planning and point out that this is the main difference e.g. to [4] and therefore the main contribution of the paper. 
+ +(for 3) in order to better understand the trade-offs between doing planning at test time (like [3,4]) or learning an O(1) planner contrast runtime and performance of both options (i.e. compare to the proper method of [4]). This will help readers understand how much speed they gain from the proposed method vs how much performance they loose. It might also make sense to include an off-policy RL algorithm (e.g. SAC) that uses the local metric as reward function (without constructing the graph) to investigate how much planning via graph-search can help at training time. Another interesting direction can be to investigate the generalization performance to a new environment (e.g. new street maze, new rope setup) after training on a variety of environment configurations. [3] showed that explicit test-time planning performs better than ""pure"" RL, it would be interesting how the proposed ""hybrid"" approach performs. + +(for 4) please add randomly sampled qualitative results for both environments and all methods to the appendix. It can additionally be helpful to add GIFs of executions to a website. It might also be interesting to add a quantitative evaluation for the plans from the rope environment as was performed in Kurutach et al. 2018. + +(for 5) please incorporate answers to questions (B-E) into the text in Sec 3.2 explaining Algorithm 2. It might also help to structure the text in such a way as to follow the flow of the algorithm. + +(for 6) restructure and shorten the introduction, clarify terms like ""inductive prior within image generation"" or ""non-local concepts of distances and direction"" or ""conceptual reward"" or ""planning network"", clarify how the authors connect the proposed representation learning objective and RL. Avoid sentences that are a highly compressed summary of the paper but for which the reader lacks background, like in the intro: ""training a planning agent to master an imagined “reaching game” on a graph"". + +(for 7) add details for architecture and hyperparameters to the appendix, add details for how baselines are constructed to the appendix. add details about data collection and evaluation for all datasets to the appendix (e.g. how is checked that a plan is coherent in StreetLearn). It might also help to add an algorithm box for the test time procedure for the proposed method. + + +## Minor Edit Suggestions +- Fig 2 seems to define the blue square as the target, the text next to it describes the blue square as the agent, please make coherent +- for Fig 7: the numbers contained in the figure are not explained in the caption, especially the numbers below the images are cryptic, please explain or omit + + +[Novelty]: minor +[technical novelty]: minor +[Experimental Design]: Okay +[potential impact]: minor + +################ +[overall recommendation]: weakReject - The exposition of the problem and treatment of related work are not sufficient, the actual novelty of the proposed paper is low and the lack of comparison to strong baselines push this paper below the bar for acceptance. 
+[Confidence]: High + + +[1] Cognitive Planning and Mapping, Gupta et al., 2017 +[2] Universal Planning Networks, Srinivas et al., 2018 +[3] Search on the Replay Buffer: Bridging Planning and Reinforcement Learning, Eysenbach et al., 2019 +[4] Semi-Parametric Topological Memory for Navigation, Savinov et al., 2018 +[5] Time-Contrastive Networks, Sermanet et al., 2017 + + +### Post-rebuttal reply ### +I appreciate the author's reply, the experiments that were added during the rebuttal are definitely a good step forward. The authors added comparison to a model-free RL baseline as well as proper comparison to a multi-step planning version of SPTM. However, these comparisons were only performed on the most simple environment: the open room environment without any obstacle. These evaluations are not sufficient to prove the merit of the proposed method, especially given that it is sold as an alternative to planning methods. The method needs to be tested against fair baselines on more complicated environments; the current submission only contains baselines that *cannot* work on the more complicated tasks. I therefore don't see grounds to improve my rating.",3,,ICLR2020 +H1l2wrZ_cr,4,BJgWE1SFwS,BJgWE1SFwS,Official Blind Review #4,"1.The goal of the paper is to connect flexible choice modeling with a modern approach to ML architecture to make said choice modeling scalable, tractable, and practical. +2. The approach of the paper is well motivated intuitively, but could more explicitly show that PCMC-Net is needed to fix inferential problems with PCMC and that e.g. SGD and some regularization + the linear parameterization suggested by the original PCMC authors isn't scalable in itself. +3. The approximation theorem is useful and clean, and the empirical results are intriguing. While consideration of more datasets would improve the results, the metrics and baselines considered demonstrate a considerable empirical case for this method. + +My ""weak accept"" decision is closer to ""accept"" than ""weak reject."" (Edit 11/25/19: I raised my score in to accept in conjunction with the author's improvements in the open discussion phase) + +Improvement areas(all relatively minor): +- While I personally enjoy the choice axioms focused on by the PCMC model and this paper, stochastic transitivity, IIA, and regularity are probably more important to emphasize than Contractibility. Because the properties of UE and contractibility were not used, it may be more appropriate to use this space to introduce more of the literature on neural-nets-as-feature-embeddings stuff. +- This paper could be improved by generalizing to a few other choice models- in particular the CDM (https://arxiv.org/abs/1902.03266) may be a good candidate for your method. This is more a suggestion for future work if you expand this promising initial result. +- Hyper-parameter tuning: I noticed that several of your hyper parameters were set to extremal values for the ranges you considered. If you tuned the other algorithms' hyper parameters the same way, it could be the case that the relative performance is explained by the appropriateness of those ranges. Would be interesting to have a more in-depth treatment of this, but I do understand that it's a lot of work. + + +Specific Notes: +Theorem 1 is nice, and the proof is clean, but doesn't explicitly note that a PCMC model jointly specifies a family of distributions \pi_S for each S \in 2^U obtained by subsetting a single rate matrix Q indexed by U. 
It's clear that PCMC-Net will still approximate under this definition, as \hat q_ij approximates each q_ij because \hat q_ij doesn't depend on S. While the more explicit statement is true with the same logic in the theorem, the notational choice to have ""X_i"" represent the ""i-th"" element in S is confusing at first, as e.g. X_1 is a different feature vector for S = {2,3} and S={1,3}. I don't see this issue as disqualifying, but it took me a while to realize that there wasn't more than a notational abuse problem when I returned to the definitions where the indexing depended on the set S under consideration. + + +Typos/small concerns: +-Above equation (1), the number of parameters in Q_S is |S|(|S|-1) rather than (|S|-1)^2, as each of the |S| alternatives has a transition rate to the other |S|-1 alternatives. +-Below equation (3), I think you mean j \in S rather than 1<= j <= |S|, as S may not be {1,2,...,|S|}. Later I noticed that you always index S with {1,\dots,|S|}, but using i \in S in combination with 1<=j<=|S| was a bit confusing. +-X_i as the i-th element of S is a bit of an abuse of notation, as it surpasses dependence on S +-In Figure 1, you show X_0 in a vector that is referred to as ""S."" It is my understanding that X_0 represents user features. As the user is not in the set, this is confusing. The use of a vertical ellipsis to connect \rho(X_0) to \rho(X_1) is also confusion, as \rho(X_1) is input into the Cartesian product while X_0 is input into the direct sum. + +Overall, nice job! Really enjoyed the paper and approach, good to see connections made between these literatures so that progress in discrete choice can be used at scale. +",8,,ICLR2020 +BJeCW-u0tS,2,rJlnOhVYPS,rJlnOhVYPS,Official Blind Review #1,"After reading the reviews and the comments, I confirm my rating. + +================= + +The paper proposes an unsupervised framework to address the problem of noisy pseudo labels in clustering-based unsupervised domain adaptation (UDA) for person re-identification. The noise derives from the limited transferability of source-domain features, the unknown number of target-domain identities, and the imperfect results of the clustering algorithm. + +The proposed framework, Mutual Mean-Teaching (MMT), performs pseudo label refinery by optimizing the neural networks under the joint supervisions of off-line refined hard pseudo labels and on-line refined soft pseudo labels. Inspired by the teacher-student approaches (Reference: Tarvainen & Valpola, 2017; Reference: Zhang et al., 2018b), the proposed MMT framework provides robust soft pseudo labels in an on-line peer-teaching manner to simultaneously train two same networks. The networks gradually capture target-domain data distributions and thus refine pseudo labels for better feature learning. + +The main contribution is proposing an unsupervised framework (MMT) capable of tackling the noise problem in state-of-art UDA methods for person re-identification, via producing reliable soft labels in order to achieve better performance. Since the conventional triplet loss cannot properly work with soft labels, a softmax-triplet loss is proposed to enable training with soft triplet labels for mitigating the pseudo label noise. + +The proposed MMT is evaluated on Market1501, DukeMTMC-reID, and MSMT17 datasets with four adaptation tasks: Market-to-Duke, Duke-to-Market, Market-to-MSMT, and Duke-to-MSMT. 
It outperforms the state-of-the-art methods with significant improvements in terms of mean average precision (mAP) and cumulative matching characteristic (CMC). In addition, ablation studies are conducted to evaluate each component in the proposed MMT framework.",8,,ICLR2020
+Bya5vnbVg,2,HJrDIpiee,HJrDIpiee,Review,"The paper presents a deep RL algorithm with eligibility traces. The authors combine DRQN with eligibility traces for improved training. The new algorithm is evaluated on two problems, with a single set of hyper-parameters, and compared with DQN.

The topic is very interesting. Adding eligibility traces to RL updates is not novel, but this family of algorithms has not been explored for deep RL. The paper is written clearly, and the related literature is well-covered. More experiments would make this promising paper much stronger. As this is an investigative, experimental paper, it is crucial for it to contain a wider range of problems, different hyper-parameter settings, and comparisons with vanilla DRQN, DeepMind's DQN implementation, as well as other state-of-the-art methods. ",4,5.0,ICLR2017
+rJOKM41-M,3,r1lUOzWCW,r1lUOzWCW,Good overview; main contribution is theoretical proof,"The main contribution of the paper is that the authors extend some work of Bellemare: they show that MMD GANs [which include the Cramer GAN as a subset] do possess unbiased gradients. They provide a lot of context for the utility of this claim, and in the experiments section they provide a few different metrics for comparing GANs [as this is a known tricky problem]. The authors finally show that an MMD GAN can achieve comparable performance with a much smaller network used in the discriminator.

As previously mentioned, the big contribution of the paper is the proof that MMD GANs permit unbiased gradients. This is a useful result; however, given the lack of other outstanding theoretical or empirical results, it almost seems like this paper would be better shaped as a theory paper for a journal. I could be swayed to accept this paper, however, if others feel positive about it.

",6,2.0,ICLR2018
+HJg9OkW_hQ,1,S1lKSjRcY7,S1lKSjRcY7,A paper reviewing and improving different types of gradient estimators,"The paper studies estimators of gradients of expectations taken with respect to the distribution parameters. The paper studies two main types of estimators, Finite Difference and Continuous Relaxation, and makes several improvements to existing estimators.

My rating of the paper in different aspects: quality 6, clarity 8, originality 6, significance 4.

Pros:
1. The paper gives a nice introduction to FD and CR estimators. The improvements over previous estimators are concrete -- it is generally clear to see the benefit of these improvements.

2. The first method reduces the running time of the RAM estimator. The second method (IGM) reduces the bias of the GM estimator. The first improvement avoids many function evaluations when the probability is extreme. The second improvement helps to correct the bias introduced by the continuous approximation of \zeta_i itself.

Cons:
1. The paper content is a little disjointed: the improvement over RAM has little relation to the later improvements. It seems the paper is stacking different things together.

2. All these improvements are not very significant considering a few previous papers on this topic. Some arguments are not rigorous (see details below).

3. A few important papers are not well discussed and are omitted from the experiment section. 
+

Detailed comments

1. The REBAR estimator [Tucker et al., 2017] and the LAX estimator [Grathwohl et al., 2018] use a continuous approximation and correct it to be unbiased. These papers in this thread are not well discussed in the paper. They are not compared against in the experiments either.

2. In equation 7 and above: what does 4 mean? When beta \neq 4, do you still get unbiased estimation? My understanding is that the estimator is unbiased only when beta=4. (Correct me if I'm wrong.)

3. The paper argues that the variance of the estimator is mostly decided by the variance of q(zeta)^-1 when the function is smooth. I feel this argument is not very clear. First, what do you mean by saying the function is smooth? That the derivative is near-constant in [0, 1]?

4. In the PWL development, the paper argues that we can choose alpha_i \approx 1/(q_i(1-q_i)) to minimize the variance. However, my understanding is that the smaller alpha_i is, the smaller the variance.

",6,4.0,ICLR2019
+Hky8MaWVx,1,BkIqod5ll,BkIqod5ll,"Important problem, but lacks clarity and I'm not sure what the contribution is.","This work proposes a convolutional architecture for any graph-like input data (where the structure is example-dependent), or more generally, any data where the input dimensions are related by a similarity matrix. If instead each input example is associated with a transition matrix, then a random walk algorithm is used to generate a similarity matrix.

Developing convolutional or recurrent architectures for graph-like data is an important problem because we would like to develop neural networks that can handle inputs such as molecule structures or social networks. However, I don't think this work contributes anything significant to the work that has already been done in this area.

The two main proposals I see in this paper are:
1) For data associated with a transition matrix, this paper proposes that the transition matrix be converted to a similarity matrix. This seems obvious.
2) For data associated with a similarity matrix, the k nearest neighbors of each node are computed and supply the context information for that node. This also seems obvious.

Perhaps I have misunderstood the contribution, but the presentation also lacks clarity, and I cannot recommend this paper for publication.

Specific Comments:
1) On page 4: ""An interesting attribute of this convolution, as compared to other convolutions on graphs is that, it preserves locality while still being applicable over different graphs with different structures."" This is false; the other proposed architectures can be applied to inputs with different structures (e.g. Duvenaud et al., Lusci et al. for NN architectures on molecules specifically). ",3,3.0,ICLR2017
+dzqFNC7_H8p,4,4mkxyuPcFt,4mkxyuPcFt,The paper mainly shows that adversarial robustness can be decomposed into small variance directions and large variance directions of the data manifold.,"In this paper, the authors mainly show that adversarial robustness can be disentangled into small-variance directions and large-variance directions. Theoretically, they also investigate the excess risk and the optimal saddle point of the minimax problem of latent space adversarial training.

Positive:
1. They find that regular adversarial example attacks tend to lie in small-variance directions of the data.
2. They find that generative adversarial example attacks move toward the large-variance directions of the data.
3. 
They explore standard adversarial training as well as latent space adversarial training to deal with the on-manifold and off-manifold issue.
4. The theoretical analysis may be useful for applying original/latent adversarial training to increase model robustness.


Negative:
1. The theoretical analysis is mainly based on probabilistic principal component analysis, a linear generative model. The extension to nonlinear models, which may be more common in practice, is unclear.
2. In addition to LeNet and ResNet, it would be more convincing to test two extra models to confirm the theoretical findings. The analysis relies on the eigenvalues; what if the original features are in a high-dimensional space? Computing the eigenvalue decomposition may be expensive.
3. In Table 1, it seems that using both regular adversarial examples and generative adversarial examples sometimes does not improve test accuracy; could the authors discuss this further?

In summary, the authors provide a theoretical analysis of the attacking mechanisms of the two kinds of adversarial examples: regular and generative adversarial examples. They show that adversarial robustness can be disentangled along directions of the data manifold. Such findings may be useful in designing defense algorithms.
",6,3.0,ICLR2021
+B1g6gslaFr,1,BJlZ5ySKPH,BJlZ5ySKPH,Official Blind Review #3,"This paper proposes a new attention mechanism for the unsupervised image-to-image translation task. The proposed attention mechanism consists of an attention module and a learnable normalization function. Sufficient experiments and analysis are done on five datasets.

Pros:
1. The proposed method seems to generalize well to different datasets with the same network architecture and hyper-parameters compared to previous works. This could benefit other researchers who want to apply the method to other data or tasks.
2. The translated results seem more semantically consistent with the source image compared to other methods, although the scores are not the best on photo2portrait and photo2vangogh. The results also look more pleasing.

Cons:
1. The CAM loss is one of the key components in the proposed method. However, there is only the reference and no detailed description in the paper. More intuitive descriptions are necessary for easy understanding.
2. The local and global discriminators are not explained until the result analysis. It's a bit confusing when I see the local and global attention map visualization results. It would be better to mention them in the method section.
3. I wonder why some translations are not done at all in the results without CAM in Figure 2(f). Because without CAM, the framework would be somewhat similar to MUNIT or DRIT. I suppose the hyper-parameters are not suitable for this setting.
4. The generator model architecture in Figure 1 is confusing. The adaptive residual blocks only receive the gamma and beta parameters. I suppose that the encoder feature maps are also fed into the adaptive residual blocks.
5. In Figure 3, the comparison of the results using each normalization function is reported. In my view, the results using only GN in the decoder with CAM look more natural. I wonder why the proposed method only consists of instance norm and layer norm? I suppose group norm might help with a predefined group.
6. In the ablation study, the CAM is evaluated for the generator and discriminator together. I would recommend doing this ablation study for the generator and discriminator separately to see whether it is necessary for the generator or the discriminator. 
+7. It would be good to see some discussion of the attention mechanism compared with other related works. For example, [a, b] predict attention masks for unsupervised I2I, but apply them at the pixel/feature spatial level to keep semantic consistency.
[a] Unsupervised-Attention-guided-Image-to-Image-Translation. NIPS'18
[b] Exemplar guided unsupervised image-to-image translation with semantic consistency. ICLR'19

My initial rating is above borderline.",6,,ICLR2020
+Sye5BLIyG,1,SJDJNzWAZ,SJDJNzWAZ,"An interesting attempt for time-event information fusion, but can be much improved.","Quality above threshold.
Clarity above threshold.
Originality slightly below threshold.
Significance slightly below threshold.

Pros:
This paper proposes an RNN for event sequence prediction. It provides two construction choices for combining time (duration) information with events. Experiments on various datasets were conducted and most details are provided.

Cons (concerns):

1. Event sequence prediction is a hard problem as there is no clear way to fuse event features with time information. It is a nice attempt that, in this work, duration is used for the event representation. However, the choices are not ""principled"" as claimed in the paper. E.g., the duration is simply a scalar, but the ""time mask"" approach converts it to a multi-dimensional vector while there is not much information to regularize it.

2. Event-time joint embedding sounds sensible as it essentially remaps the original value to some segments. E.g., 10 minutes and 11 minutes might have the same effect on the next event in one dataset, while 3 days and a week might have a similar effect on next-event prediction. But the way the experiments are designed and analyzed does not provide such insights.

3. The experimental results are not persuasive as no baselines besides RNN-based methods are provided. Parametric and nonparametric methods both exist for this event prediction problem in previous work. In the results provided, no significant difference between the listed model choices is found, partly because only using event type and duration is not enough. Other information, such as time of day and day of week, matters a lot. ",4,4.0,ICLR2018
+HJxlWY32KS,2,rke3U6NtwH,rke3U6NtwH,Official Blind Review #3,"This paper extends DiffPool for hierarchical graph representation learning (in particular, graph classification). The authors empirically show that for several data sets, the approach outperforms quite a few recently proposed strong competitors.

The proposed approach is reasonable, but not very innovative. The prior work DiffPool uses GCNs to parameterize the node embedding matrix Z and the cluster assignment matrix S. This paper computes a number of Zs and Ss, each of which results from a different hyperparameter choice of GCN, and then combines them through concatenation and a feed-forward transform. The significance of the contribution is a bit marginal.

The experimental results appear to be exciting, in light of the substantial boost in classification accuracy on the D&D data set. However, the experiment design and the reporting of results are doubtful. A major concern is the copying of results reported by prior work into Table 2. It is unclear whether these numbers were obtained under the same experimental setting. 
For example, the number for SAGPool + D&D comes from Lee et al., 2019, but the DiffPool number in that paper is 66.95, which is significantly different from the one shown here, 80.64, copied from Ying et al., 2018.

Another concern is the missing numbers for SAGPool. Although the authors explain the difficulty of obtaining these numbers, their absence leaves the empirical evaluation incomplete.

Minor comments/questions:

- In the first sentence of 4.2, the word ""ter90o8ims"" is a typo of ""terms"".

- What are the neural networks f_c in (3) and f_p in (5)?

- The text after (5) reads f_g instead of f_p.
",3,,ICLR2020
+rEL8AqSt5Z,4,uRuGNovS11,uRuGNovS11,This paper should not be accepted by ICLR2021,"This paper proposes a robust Bayesian deep metric learning framework against noisy labels, inspired by BLMNN (Wang & Tan, 2018), deep metric learning (Hoffer & Ailon, 2015; Hu et al., 2015; Wang et al., 2017; Lu et al., 2017; Do et al., 2019), and Bayes by Backprop (Blundell et al., 2015). Directly applying the variational Bayes learning of Wang & Tan (2018) in deep learning is challenging since it requires sampling from a distribution over the neural network parameters. Instead, this paper adapts the variational inference of Blundell et al. (2015), which allows one to efficiently sample the parameters of a Bayesian neural network using a backpropagation-compatible algorithm. The experimental results on several noisy data sets show that the proposed method can generalize better compared to the linear BLMNN (Wang & Tan, 2018) and point estimation-based deep metric learning (Hoffer & Ailon, 2015; Lu et al., 2017), especially when the noise level increases.

Pros:
1. This paper is well-organized and well-written.
2. Adapting the variational inference of Blundell et al. (2015) for Bayesian DML sounds good.
3. The theoretical analysis is complete (though meaningless).

Cons:
1. The novelty is limited (just using a sliding window instead of a growing window). It is more like a combination of variational inference (Blundell et al., 2015) and DML.
2. The results of Theorem 1 are meaningless. Some papers, [1] Robustness and Generalization for Metric Learning and [2] Deep Metric Learning: The Generalization Analysis and an Adaptive Algorithm, may help you to understand this point. ",5,4.0,ICLR2021
+SklXBpOQcB,3,BJxVT3EKDH,BJxVT3EKDH,Official Blind Review #3,"This paper proposes a domain-specific corpus-based approach for generating semantic lexicons for the low-resource Amharic language. Manual construction of lexicons is especially hard and expensive for low-resource languages. More importantly, the paper points out that existing dictionaries and lexicons do not capture cultural connotations and language-specific features, which is rather important for tasks like sentiment classification. Instead, this work proposes to automatically generate a semantic lexicon using distributional semantics from a corpus.

The proposed approach starts from a seed list of sentiment words in 3 pre-determined POS classes. This is followed by deriving a PPMI matrix from the word-context co-occurrence matrix (context size = 2). Then, for a given word, the cosine distance to the centroid of each seed class is computed, and words that are more similar than a given threshold are added to the original list. This process is repeated for a pre-specified number of iterations. 
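(For concreteness, my reading of the expansion loop is roughly the sketch below; the function and variable names are mine, and details such as how the PPMI vectors are built are my assumptions rather than the paper's exact procedure.)

import numpy as np

def cosine_similarity(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def expand_lexicon(ppmi_vectors, seed_words, threshold=0.4, n_iterations=5):
    # ppmi_vectors: dict mapping each vocabulary word to its PPMI row vector (numpy array)
    # seed_words: initial seed list for one sentiment/POS class
    lexicon = set(w for w in seed_words if w in ppmi_vectors)
    for _ in range(n_iterations):
        # centroid of the current lexicon in PPMI space
        centroid = np.mean([ppmi_vectors[w] for w in lexicon], axis=0)
        newly_added = [w for w, v in ppmi_vectors.items()
                       if w not in lexicon and cosine_similarity(v, centroid) >= threshold]
        if not newly_added:  # nothing passed the threshold; stop early
            break
        lexicon.update(newly_added)
    return lexicon

If this is roughly right, it would help to report how sensitive the final lexicon is to the similarity threshold and to the number of iterations.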
They apply the generated lexicons to subjectivity detection and sentiment classification task in a small annotated Amharic corpus of facebook posts + +Strengths: + +1. NLP tasks on low-resource language is very challenging. This paper proposes a efficient, unsupervised way of gathering semantic lexicons that perform reasonably well on a downstream task. + +Weaknesses: + +1. Related work: The paper needs to cite and mention a much more broader literature. Currently, it mentions just 3 related work, and contrasts itself with one. But given there is so much work on distributional semantics (in English and other languages), they deserve mentioning. Few example of ones are 1. The distributional inclusion hypotheses and lexical entailment (Geffet and Dagan 2005), Distributed representations of words and phrases and their compositionality (Mikolov et al., 2013), Improving hypernymy detection with an integrated path-based and distributional method (Shwartz et al., 2017), Distributional Inclusion Vector Embedding for Unsupervised Hypernymy Detection (Chang et al., 2018). However, this is by no means a complete list and you should refer to these papers to get a list of other relevant work. +2. Novelty of work: The novelty of the work is rather limited and the paper should try more low-dimensional embedding based approaches which has been proven to be very effective for a wide variety of tasks. In section 3, the paper mentions there are primarily two kinds of approaches (count based vs embedding based) — The paper should motivate why it chose one over another. +3. Baselines: The paper would benefit from having some comparison with learned baselines trained from some distantly supervised data from, for example the SWN and SOCAL lexicons. +4. Organization of the paper: The writing and the organization of the paper needs to be better. For example, equation 3, 4, 5 are rather simple and can be condensed to one equation (no need to show averaging of the seed embeddings). Also in equation 3, would \vec{w} would be \vec{x}?. The results and discussions did not have either and over-all there are many grammatical errors. +5. I am not sure what the subjectivity detection task is in Table 3. Is it a standard task -- if not, the paper should define it first. + +Overall, the paper is a nice effort but in its current form it is not ready for ICLR and I hope the comments will help make the paper better for upcoming NLP workshops and conferences.",1,,ICLR2020 +rJg8GDbTFB,2,rygG7AEtvB,rygG7AEtvB,Official Blind Review #2,"The paper proposes a method to learn mixed-strategies Nash equilibrium in multi-player games. To do so they describe a gradient-descent method that aims at minimizing a Kikaido-Isoda function (which is zero if an equilibrium is found). The paper offer proofs of the convergence towards a stationary Nash equilibrium in the case of convex cost functions. It also provides an application with a strategy approximation made with a deep neural network. They authors exemplify the strengths of the method on toy problems that are quite standard in the domain. + +I liked the paper very much but I have some concerns. First, I feel that the framework is based on a variational approach which would be well suited for a 0-order optimisation (like a black box or an evolutionary method). I wonder why the authors wanted to use a gradient based approach that adds a second layer of approximation and additional meta parameters to tune. 
+ +Second, I felt the theoretical proofs are not using much more than standard algebra and the convex assumption was a bit unrealistic in most of multi-agent problems. I'd like the authors to comment on this. + +I was also wondering how this work can be related to other papers that try to learn a Nash equilibrium (or an \epsilon-Nash) on the bases of a difference in some norm between the value of the current policy and the value of the NE. For instance ""Learning Nash Equilibrium for General-Sum Markov Games from Batch Data"" by Perolat et al. This work addresses discrete action spaces but seems similar in spirit to me. Could the authors comment on this ? + +",6,,ICLR2020 +tge1iOd-HOE,3,tkAtoZkcUnm,tkAtoZkcUnm,Paper is marginally above the acceptace threshold,"***** Paper's Summary ***** + +The authors proposed an algorithm named Neural Thompson Sampling (NeuralTS) for solving contextual multi-armed bandit problems. NeuralTS uses deep neural networks for dealing with exploration and exploitation. In the paper, the authors proved the sub-linear regret of NeuralTS, which is also verified using experiments. + + +***** Paper's Strengths ***** + +As NeuralTS uses the deep neural network, it can be used for estimating non-linear reward function in the contextual bandits problem. + +The authors proved the sub-linear regret using the recent theoretical results from deep learning. The regret upper bound is similar (in terms of the number of contexts and effective dimension) to the regret bound of existing methods. + +The performance of NeuralTS matches with the state-of-the-art baselines. In some cases, the performance is even better than the existing methods. + + +***** Paper's Weaknesses ***** + +The weak point of the paper is its novelty. The result incorporates recent deep learning and contextual bandits results [Zhou et al. 2019] (paper in ICML 2020) with the existing Thompson Sampling variant for contextual bandits problem. + +Since the neural network (parameters m and L) is fixed before using the algorithm, it may not be possible to estimate any arbitrary reward function with the fixed neural network. Therefore, NeuralTS can have linear regret for the cases where the reward function can not be estimated. + + +***** Comments ***** + +It is difficult to understand the second part of Assumption 3.4. More clarity may help readers. + +The experiments can be repeated 50 or more times to get a better confidence interval. + + +***** Questions for the Authors ***** + +Please address the above weaknesses of the paper. + +How are the values of '\nu,' 'L,' and 'm' set in the experiments? + +Why does NeuralTS need T as input? + + +***** Post Rebuttal ***** + +I thank authors for the clarifications! After reading the rebuttal and comments of other reviewers, I am increasing my score. ",7,3.0,ICLR2021 +rkTiSFH4l,2,ryXZmzNeg,ryXZmzNeg,Review,"The authors propose to sample from VAEs through a Markov chain [z_t ~ q(z|x=x_{t-1}), x_t ~ p(x|z=z_t)]. The paper uses confusing notation, oversells the novelty, ignoring some relevant previous results. The qualitative difference between regular sampling and this Gibbs chain is not very convincing, judging from the figures. It would be a great workshop paper (perhaps more), if the authors fix the notation, fix the discussion to related work, and produce more convincing (perhaps simply upscaled?) figures. + +Comments: + - Rezende et al's (2014) original VAE paper already discusses the Markov chain, which is ignored in this paper + - Notation is nonstandard / confusing. 
At page 1, it’s unclear what the authors mean with “p(x|z) which is approximated as q(x|z)”. +- It’s also not clear what’s meant with q(z). At page 2, q(z) is called the learned distribution, while p(z) can in general also be a learned distribution. +- It’s not true that it’s impossible to draw samples from q(z): one can sample x ~ q(x) from the dataset, then draw z ~ q(z|x). +- It's not explained whether the analysis only applies to continuous observed spaces, or also discrete observed spaces +- Figures 3 and 4 are not very convincing. +",3,4.0,ICLR2017 +r1lL5EbTnm,3,Bye5OiR5F7,Bye5OiR5F7,"ultimately, I am not sure there is anything ""Wasserstein"" going on in this new GAN algorithm.","The authors propose a new GAN procedure. It's maybe easier to reverse-engineer it from the simplest of all places, that is p.16 in the appendix which makes explicit the difference between this GAN and the original one: the update in the generator is carried out l times and takes into account points generated in the previous iteration. + +To get there, the authors take the following road: they exploit the celebrated Benamou-Brenier formulation of the W2 distance between probability measures, which involves integrating over a vector field parameterized in time. The W2 distance which is studied here is not exactly that corresponding to the measures associated with these two parameters, but instead an adaptation of BB to parameterized measures (""constrained""). This metric defines a Riemannian metric between two parameters, by considering the resulting vector field that solve this equation (I guess evaluated at time 0). The authors propose to use the natural gradient associated with that Riemannian metric (Theorem 2). Using exactly that natural gradient would involve solving an optimal transport problem (compute the optimal displacement field) and inverting the corresponding operator. The authors mention that, equivalently, a JKO type step could also be considered to obtain an update for \theta. The authors propose two distinct approximations, a ""semi-backward Euler formulation"", and, next, a simplification of the d_W, which, exploiting the fact that one of the parameterized measures is the push foward of a Gaussian, simplifies to a simpler problem (Prop. 4). That problem introduces a new type of constraint (Gradient constraint) which is yet again simplified. + +In the end, the metric considered on the parameter space is fairly trivial and boils down to the r.h.s. of equation 4. It's essentially an expected squared distance between the new and the old parameter under a Gaussian prior for the encoder. This yields back the simplification laid out in p.16. + +I think the paper is head over heels. It can be caricatured as extreme obfuscation for a very simple modification of the basic GAN algorithm. Although I am *not* claiming this is the intention of the authors, and can very well believe that they found it interesting that so many successive simplifications would yield such a simple modification, I believe that a large pool of readers at ICLR will be extremely disappointed and frustrated to see all of this relatively arduous technical presentation produce such a simple result which, in essence, has absolutely nothing to do with the Wasserstein distance, nor with a ""Wasserstein natural gradient"". + +other comments:: + +*** ""Wasserstein-2 distance on the full density set"": what do you mean exactly? that d_W(\theta_0,\theta_1) \ne W(p_{\theta_0},p_{\theta_1})? 
Could you elaborate where this analogy breaks down? + +*** It is not clear to me why the dependency of \Phi in t has disappeared in Theorem 2. It is not clear either in your statement whether \Phi is optimal at all for the problem in Theorem 1. + +*** the ""semi-backward Euler method"" is introduced without any context. The fact that it is presented as a proposition using qualitative qualifiers such as ""sufficient regularity"" is suspicious. ",3,5.0,ICLR2019 +VROGuIPtvNR,1,TiXl51SCNw8,TiXl51SCNw8,Learnable Quantization Bits,"This paper basically proposed to learn the quantization bits (precision) in each layer. Specially, weights are constructed with binary representation as $W_s = \[W_s^1,...,W_s^b\]$. During training, $W_s^i$ is relaxed to $ \in \[0, 2\]$. And a group sparsity is imposed to all $W_s^i$ for all weights in a layer, leading to certain $W_s^i \to 0$, thus cancelling the bit allocation in $i$-th. Experimental results is promising. + +Pros: +1. It is interesting to see that weights are represented in binary format, while each bit is trained in a full-precision scheme. + +Cons: +1. Training process is intricated: one has to tune the penalty in group sparsity. Also, training is separated in several steps: training and post-training finetuning. + +Questions: +1. After determining the quantization bit in (""fake"") quantization training (although $W_q$ is quantized but $W_s^i$ is not exactly binary, which is the exactly weight we want) using Eq.5. Author mention in ""Re-quantization and precision adjustment"" that $W_q^{'}$ is converted to binary value. But how to deal with the precision loss here? i.e. from $W_s^i \in \[0,2\]$ to $\{0, 1\}$ +2. Author mentioned that DoReFa-Net is adopted to finetune the trained model. Since DoReFa-Net use tanh to constrain value to $[0,1]$. it seems there is no connection to the proposed quantization scheme (Eq.6). How to exactly finetune ? +3. Why is necessary for $W_s$ to be separated into postive and negative part ($W_p$, $W_n$) in processing ? +4. Since $W_s^i$ is float and trainable, is it necessary to incorporate a trainable $s$ ?",6,4.0,ICLR2021 +BylSgJYp2Q,2,BJg4Z3RqF7,BJg4Z3RqF7,image reconstruction from noisy samples ,"This is a very interesting paper that achieves something that seems initially impossible: +to learn to reconstruct clear images from only seeing noisy or blurry images. + +The paper builds on the closely related prior work AmbientGAN which shows that it is possible to learn the *distribution* of uncorrupted samples using only corrupted samples, again a very surprising finding. +However, AmbientGAN does not try to reconstruct a single image, only to to learn the clear image distribution. The key idea that makes this is possible is knowledge of the statistics of the corruption process: the generator tries to create images that *after they have been corrupted* they look indistinguishable from real corrupted images. This surprisingly works and provably recovers the true distribution under a very wide set of corruption distributions, but tells us nothing about reconstructing an actual image from measurements. + +Given access to a generative model for clear images, an image can be reconstructed from measurements by maximizing the likelihood term. This method (CS-GAN) was introduced by Bora et al. in 2017. Therefore one approach to solve the problem that this paper tackles is to first use AmbientGAN to get a generative model for clear images and then use CS-GAN using the learned GAN. 
If I understand correctly, this is the 'Conditional AmbientGAN' approach that is used as a baseline. This is a sensible approach given prior work. However, the authors show that their method ('Unpaired Supervision') performs significantly better compared to the Conditional AmbientGAN baseline. This is very surprising and interesting to me. Please discuss this a bit more ? As far as I understand the proposed method is a merging of AmbientGAN and CS-GAN, but much better than the naive separation. Could you give a bit more intuition on why ? + +I would like to add also that the authors can use their approach to learn a better AmbientGAN. After getting their denoised images, these can be used to train a new AmbientGAN, with cleaner images as input , which should be even better no ? + +In the appendix where is the proposed method in fig 5- 8 ? + +Does the proposed method outperform Deep Image Prior ? + + +",8,4.0,ICLR2019 +tc_jTy4F7oN,2,1Q-CqRjUzf,1Q-CqRjUzf,interesting and useful findings but not comprehensive enough empirical validation,"The paper investigates two methods to reduce churn in neural network classification prediction. Churn is when two networks trained on the same data produce outputs that disagree, due to randomness in the training process. The authors identify several sources of randomness, from underlying hardware differences to parameter initialization and more. The authors propose two ways to mitigate churn. One is to use entropy minimization to favor more confident predictions. The second is to use co-distillation, a form of online ensemble learning. The authors show that both together do a good job of reducing churn on three data sets. + +The paper does a decent job of making an important point about churn, investigating its prevalence, and proposing a solution. The approach is sound and promising. For a narrow result like this, specific to one metric of neural networks, I would like much more empirical validation that the authors provide. Only three data sets and three baselines does not seem like enough, given that the experiments provide the main take-home message of the paper. + +I'd like a deeper discussion about why churn is bad. Can the authors give a concrete example where churn will make a machine learning system more undesirable? For example, imagine a facial recognition system. What if new data or new training lead to a new model that is just as accurate but makes mistakes on different people than before. In what application is that inherently bad? Can you formulate the problem with churn more formally? In the current paper, it's mostly assumed to be undesirable. To an extent, I agree, but I'd like to understand more clearly why it is undesirable. I think the comparison to reproducible scientific experiments is a little loose. A machine learning algorithm is not a scientific experiment. I don't think the authors need to cite quite so many papers about the much more general problem of reproducibility in science. + +This finding is interesting and instructive: ""churn observed in Table 2 is not merely caused by the discontinuity of the arg max"". + +This finding is fascinating: ""Even with extreme measures to eliminate all sources of randomness, we continue to observe churn due to unavoidable hardware non-determinism."" + +I have a question about the minimum entropy procedure. Doesn't it depend on the confidence scores being accurate? 
For example, if some scores were overconfident, the minimum entropy procedure would tend to select those predictions and reduce accuracy. Imagine that confidence scores are normally distributed: some confidence scores are accurate but some are under- or over-confident. Minimum entropy will tend to pick the over-confident ones even though the confidence is in error. Is this a real danger and do the authors observe this at all with their technique? + +This is more a question for Cormier et al. than for the current authors, but why use the word ""churn"" instead of ""disagreement""? Is there a difference? From what I can tell, churn and disagreement are the same thing, and churn has a different English meaning. Disagreement seems like the better term for this. + +The following paper explored the use of disagreement as a model selection tool and I think may have also proven Lemma 1: + +https://papers.nips.cc/paper/2603-co-validation-using-model-disagreement-on-unlabeled-data-to-validate-classification-algorithms + +Minor comments and typos: + +Why is Table 2 on page 7 when it is referred to on page 3? Why is Table 2 referred to before Table 1? + +""linear warmup and join updates"" +Did you mean?: +linear warmup and joint updates + +""any of the participating model can be used for inference"" +Did you mean?: +any of the participating models can be used for inference + +worst cast bound +worst-case bound + +runs.We +runs. We + +Intuitively,encouraging +Intuitively, encouraging",5,2.0,ICLR2021 +S1xA6on62m,3,BJlXUsR5KQ,BJlXUsR5KQ,LEARNING NEURON NON-LINEARITIES WITH KERNEL-BASED DEEP NEURAL NETWORKS,"The paper investigates the problem of designing the activation functions of neural networks with focus on recurrent architectures. The authors frame such problem as learning the activation functions in the space of square integrable functions by adding a regularization term penalizing the differential properties of candidate functions. In particular, the authors observe that this strategy is related to some well-established approaches to select activation functions such as ReLUs. + + +The paper has some typos and some passages are hard to read/interpret. The write-up needs to be improved significantly. + +While some of the observations reported by the authors are interesting it is in general hard to evaluate the contributions of the paper. In particular the discussion of Sec. 2 is very informal, although ti describes the key technical observations used in the paper to devise the model (Sec. 3) that is then evaluated in the experiments (Sec. 4). In particular, it is unclear whether the authors are describing some known results - in which case they should add references - or original contributions - in which case they should report their results with more mathematical rigour. Indeed, in the abstract, the authors state that a representation theorem is given, but in the text they provide only an informal discussion of such result. + +Overall, it is hard to agree with the authors' conclusion that ""the KBRN architecture exhibits an ideal computational structure to deal with classic problems of capturing long-term dependencies"": the theoretical discussion does not provide sufficient evidence in this sense. + +Some minor points: + +Confusing notation: why were the alpha^k replaced with the \chi^k between Sec. 2 and Sec. 3? + +Unclear motivation for some design choices. For instance 1) the justification given by the authors to neglect the linear terms from both g(x) and k(x) in Sec. 3 is unclear. 
2) why was the \ell_1 norm used as penalty for the regularizer R(\chi) in Sec.3? One could argue that \ell_1 is used to encourage sparse solutions, but the authors should explain why sparsity is desirable in this setting. + +",5,3.0,ICLR2019 +Hygat6E1qH,2,BkgHWkrtPB,BkgHWkrtPB,Official Blind Review #1,"The paper deals with where the information is in a deep network and how information is propagated when new data points are observed. The authors measure information in the weights of a DNN as the trade-off between network accuracy and weight complexity. They bring out the relationships between Shannon MI and Fisher Information and the connections to PAC-Bayes bound and invariance. The main result is that models of low information generalize better and are invariance-tolerant. + +The paper is very well written and concepts are theoretically-well documented. + +In Definition 3.1 for the ‘Information in the Weights’, how does the complexity of the task vary with \beta? Is the Pareto curve provided in the paper? ",8,,ICLR2020 +WIx0rccmV6T,4,HgLO8yalfwc,HgLO8yalfwc,Good paper that generalizes policy regularization in regularized MDPs,"This paper shows a formulation of regularized Markov Decision Processes (MDPs), which is slightly different from that of Geist et al. (2019). Then, the authors propose a novel inverse reinforcement learning under regularized MDPs. One of the contributions is that policy regularization considered here is more general than that of Yang et al. (2019). + +This paper is written very well and is of publishing quality. I think it is sufficiently significant to be accepted. Still, I have the following questions. + +1. The proposed method is based on the relationship between imitation learning and statistical divergence minimization. If my understanding is correct, Bregman divergence plays a role in generalizing generalized adversarial imitation learning. However, as the authors mentioned in Section 6, Bregman divergence does not include f-divergence, which is also studied in imitation learning. Would you discuss the connection to the formulation using f-divergence in more detail? + +2. I am interested in the relationship between the proposed method and Lee et al. (NeurIPS2018). Is the proposed method nearly the same as Lee et al. (2018) when Tsallis entropy is selected as regularization? If not, does the proposed method outperform Lee et al. (2018) in the MuJoCo control tasks? + +3. The authors claim that the solutions provided by Geist et al. (2019) are intractable in the Introduction. However, it is shown that the reward baseline term in Corollary 1 is intractable except for some well-studied setups. Does it imply that the proposed method faces the same difficulty when applied with arbitrary policy regularization? + +4. The experimental results shown in Figure 3 is interesting, but I have a few concerns. In some cases, the averaged Bregman divergence of RAIRL-NSM (\lambda = 1) was larger than that of Random. Would you show the example of the learned policy for the readers’ understanding? Besides, is the same policy regularization used in Behavior Cloning? Finally, are exp, cos, and sin the meaningful regularizer? + +5. To derive the practical algorithms, the authors consider the same form of the policy regularization used by Yang et al. (2019), which is given by - \lambda E[\phi(\pi(a))]. Is it possible to derive the algorithm in which the regularizer is given by \Omega(\pi)? 
+",8,4.0,ICLR2021 +HkeNJPwAFS,2,ryx4PJrtvS,ryx4PJrtvS,Official Blind Review #2,"This paper tackles the problem of black-box hyperparameter optimization when multiple related optimization tasks are available simultaneously, performing transfer learning between tasks. Different tasks correspond to different datasets and/or metrics. Gaussian copulas are used to synchronize the different scales of the tasks. + +I have several reservations with this paper. First and foremost, it seems to be lacking a fair and trivial baseline (I will describe it below) that justifies the apparently unnecessary complicated path followed in this paper. Second, there are a few small incorrect or improperly justified technical details throughout the paper. + + +1) Mistaken/unjustified technical details: + +- In equation 1, the last term seems to be constant. For each task, the function psi is not parametric, so its gradient is also not parametric and the input is the inverse of z, i.e., y, which is also fixed. So why is it included in the cost function? This sort of probabilistic renormalization is important in e.g. warped GPs because the transformation is parametric. In this case, I don't see the point. It can be treated as a normalization of the input data, prior to its probabilistic modeling. + +- Before equation 1, the text says ""by minimizing the Gaussian negative log-likelihood on the available evaluations (x, z)"" But then, equation 1 is not the NLL on z but on y. + +- In section 4.2 the authors model the residuals of the previous model using a powerful Matern-5/2 GP. Why modeling the residuals this way and not the observations themselves? The split of modeling between a parametric and non-parametric part is not justified. + +- One of the main points of the variable changes is to normalize the scales of the different tasks. However, equations 1 adds together the samples of the different tasks (which, as pointed out by the authors might have different sizes). Even if the scales of the outputs are uniform, the different dataset sizes will bias the solutions towards larger datasets. Why would that be a good thing? This is not mentioned and doesn't seem correct: there should not be a connection between a dataset size and the prior influence of the corresponding task. In fact, this will have the same effect as if the cost had different scales for different tasks, which is precisely the problem that the authors are trying to avoid. + + +2) Trivial baseline + +Given that the authors are trying to aggregate information about the optimal hyperparameters from several tasks, they should not compare with single-task approaches, but with the simplest way to combine all the tasks. For instance: + a) Normalize the outputs of every task. This can be accomplished in the usual way by dividing by the standard deviation, or even better, by computing the fixed transform z = psi(y), separately for each task. + b) Collect the z of all tasks and feed them into an existing GP black-box Bayesian optimizer. + +This is a very simple way to get ""transfer learning"" and it's unclear that the extra complexities of this paper (copulas, changes of variable with proper renormalization when the transformation is parameter free, etc) are buying much else. + + +Minor improvements: + +- Page 2: ""is the output of a multi-layer perceptron (MLP) with d hidden nodes"" Is d really the number of hidden nodes of the MLP? Or the number of outputs? Given that d is also the size of w, it seems it's actually the latter. 
+ +- Explain why the EI approach is used for the second model (with the GP), but not for the first model. + +Edit after rebuttal: +“The term is not constant over z” -> Sure, it’s not constant over z. But z is constant. So the term is constant. + +“The NLL is minimized in z and there is indeed no y in equation 1.” -> Sure, there’s no y in the equation, that’s correct. But it is still the NLL of y, and not the NLL of z. + +About the new baseline: Instead of simply renormalizing using mean and standard deviation, I suggested above using the same z=psi(y) that is used in the paper for the normalization. Is that where the advantage of the proposed method is coming from? + +""Note that this is orthogonal to the scale issues we focus on: larger tasks will have larger gradient contributions but the scaling we propose still allows us to learn tied parameters across tasks as their scales are made similar. "" Both issues affect the scaling of the task, so I don't see how they can be orthogonal. Their scales are not made similar precisely because of the different sample sizes. +",3,,ICLR2020 +B1RFuqfEx,3,Bk2TqVcxe,Bk2TqVcxe,Well motivated model for relationship prediction in abstract scenes with good experimental analysis in a controlled setting.,"+ Understanding relations between objects is an important task in domains like vision, language and robotics. However, models trained on real-life datasets can often exploit simple object properties (not relation-based) to identify relations (eg: animals of bigger size are typically predators and small-size animals are preys). Such models can predict relations without necessarily understanding them. Given the difficulty of the task, a controlled setting is required to investigate if neural networks can be designed to actually understand pairwise object relations. The current paper takes a significant step in answering this question through a controlled dataset. Also, multiple experiments are presented to validate the ""relation learning"" ability of proposed Relation Networks (RN). + ++ The dataset proposed in the paper ensures that relation classification models can succeed only by learning the relations between objects and not by exploiting ""predator-prey"" like object properties. + ++ The paper presents very thorough experiments to validate the claim that ""RNs"" truly learn the relation between objects. + 1. In particular, the ability of the RN to force a simple linear layer to disentangle scene description from VAE latent space and permuted description is very interesting. This clearly demonstrates that the RN learns object relations. + 2. The one-shot experiments again demonstrate this ability in a convincing manner. This requires the model to understand relations in each run, represent them through an abstract label and assign the label to future samples from the relationship graph. + +Some suggestions: + +- Is g_{\psi}(.) permutation invariant as well. Since it works on pairs of objects, how did you ensure that the MLP is invariant to the order of the objects in the pair? +- The RNs need to operate over pairs of objects in order to identify pairwise interactions. However, in practical applications there are more complicated group interactions. (eg. ternary interaction: ""person"" riding a ""bike"" wears ""helmet""). Would this require g(.) of RN to not just operate on pairs but on every possible subset of objects in the scene? More generally, is such a pairwise edge-based approach scalable to larger number of objects? 
+- The authors mention that "" a deep network with a sufficiently large number of parameters and a large enough training set should be capable of matching the performance of a RN"". This is an interesting point, and could be true in practice. Have the authors investigated this effect by trying to identify the minimum model capacity and/or training examples required by a MLP to match the performance of RN for the provided setup? This would help in quantifying the significance of RN for practical applications with limited examples. In other words, the task in Sec. 5.1 could benefit from another plot: the performance of MLP and RN at different amounts of training samples. +- While the simulation setup in the current paper is a great first-step towards analyzing the ""relation-learning"" ability of RNs, it is still not clear if this would transfer to real-life datasets. I strongly encourage the authors to experiment on real-life datasets like Coco, visual genome or HICO as stated in the pre-review stage. +- Minor: Some terminologies in the paper such as ""objects"" and ""scene descriptions"" used to refer to abstract entities can be misleading for readers from the object detection domain in computer vision. This could be clarified early on in the introduction. +- Minor: Some results like Fig. 8 which shows the ability of RN to generalize to unseen categories are quite interesting and could be moved to the main draft for completeness. + +The paper proposes a network which is capable of understanding relationships between objects in a scene. This ability of the RN is thoroughly investigated through a series of experiments on a controlled dataset. While, the model is currently evaluated only on a simulated dataset, the results are quite promising and could translate to real-life datasets as well.",7,4.0,ICLR2017 +ByxXM0cuKH,1,SJxSDxrKDr,SJxSDxrKDr,Official Blind Review #1,"Summary: the paper introduces a novel protocol for training neural networks that aims at leveraging the empirical benefits of adversarial training while allowing to certify the robustness of the network using the convex relation approach introduced by Wong & Kolter. The key ingredient is a novel algorithm for layer-wise adversarial (re-)training via convex relaxations. On CIFAR-10, the proposed protocol yields new state-of-the-art performance for certifying robustness against L_inf perturbations less than 2/255, and comparable performance over existing methods for perturbations less than 8/255 (where the comparison excludes randomized-smoothing based approaches as proposed by Cohen et al.). + +The proposed methodology seems original and novel. The concept of latent adversarial examples, the layer-wise provable optimization techniques and the sparse representation trick are interesting in their own regard and could be valuable ingredients for future work in this direction. The improvement over the state-of-the-art on CIFAR-10 for perturbations less than 2/255 is significant (although I wouldn't call it substantial). For perturbations less than 8/255 the picture is less clear. The authors' explanation that they couldn't achieve state-of-the-art certified robustness because of smaller network capacity makes sense, however, it also highlights that their protocol doesn't scale as well as previous approaches. + +I am not concerned about the missing comparison with randomized smoothing-based approaches (I find the rationale provided in Section 2 convincing). 
+ +The discussion of the relatively weak performance of previous provable defenses on page 3 is a bit vague, e.g. the statement that ""the way these methods construct the loss makes the relationship between the loss and the network parameters significantly more complex than in standard training"", thus causing the ""resulting optimization problem to be more difficult"". To me, these are one and the same thing, and a bit more rigour in the argumentation would be advisable here, in my opinion. + +------------- + +I acknowledge I have read the authors' response and also the other reviews/comments which confirm my opinion that this paper is worthy to be published at ICLR.",8,,ICLR2020 +r1xHGWLRKr,2,SyxC9TEtPH,SyxC9TEtPH,Official Blind Review #3,"The paper presents an invertible generative network, for conditional image generation. The model is an extension of Real NVP with a conditioning component. Experiments are performed for image generation on two tasks: class conditional generation on MNIST and image colorization conditioned on a grey scale image (luminance). Comparisons are performed with a conditional VAE and a conditional GAN (Pix2Pix). An ablation study motivates the importance and role of the different components. +The model itself is a relatively simple extension of Real NVP, where a condition vector is added to the initial model as an additional input to the NN components of the invertible blocks. In the experiments conditioning may be a simple class indicator (MNIST) or a more complex component corresponding to a NN mapping of an initial conditioning image (colorization). The experiments show that this model is able to generate good quality images, with an important diversity, showing that the conditioning mechanism works well. The quantitative comparison also shows that the proposed model is competitive with two baselines taken in the VAE and GAN families. The model works well for the non-trivial task of colorization. +The authors claim is that they are the first to propose conditional invertible networks. The main contribution is probably the implementation of the model itself. They make use of several “tricks” that improve a lot on the performance as demonstrated by the ablation study. As such more details and motivations for these different ideas that improve the performance and stability of the model would be greatly helpful. It looks like these are not details, but requirements to make the whole thing work. The Haar component for example should be better motivated. There is no comparison in the ablation study with an alternative, simpler decomposition. +The baselines are probably not the strongest models to date, and better results could certainly be obtained with other VAE or GAN variants. For example, there have been several works trying to introduce diversity for GANs. This is not redhibitory, but this should be mentioned. Besides a short description of the two baselines, would make the paper more self-contained. +The quantitative comparison with the VAE baseline, shows that the two models are quite similar w.r.t. different measures. This could be also commented. +The notations for the Jacobian do not integrate the conditioning, this could be corrected. +Concerning the interpretation of the axis for the MNIST experiment, it is not clear if they are axis in the original space or PCA axis. If this is the first option, more details are needed in order to understand how they were selected. + + +------post rebuttal ----- + +The authors clarified several of the raised points. 
I keep my score. + +",6,,ICLR2020 +r1xGNNd2FS,1,SyevDaVYwr,SyevDaVYwr,Official Blind Review #1,"This paper focuses on instance-dependent label noise problem, which is a new and important area in learning with noisy labels. The authors propose confidence-scored instance-dependent noise (CSIDN) to overcome strong assumptions on noise models. They clearly define confidence scores and justify their availability. To solve CSIDN model, they propose instance-level forward correction with theoretical guarantees. Their experiments on both synthetic and real-world datasets show the advantage of this algorithm. + +Pros: + +1. This paper is clearly written and well-structured in logic. For example, in Section 2, they introduce from class-conditional noise to instance-dependent noise first, which paves the way for confidence-scored instance-dependent noise. This make readers easy to follow the main contribution, namely the new noise model. + +2. This paper pushes the knowledge boundary of learning with noisy labels, since it focuses on more realistic and challenge topic ""instance-dependent label noise"". The authors leverage the idea of confidence scores, and propose confidence-scored instance-dependent noise (CSIDN). Compared to previous solutions, CSIDN is a tractable instance-dependent noise model, which enjoys several benefits, such as multi-class classification, rate-identifiability and unbound-noise. + +3. This paper proposes an algorithm to solve CSIDN inspired by forward correction called instance-level forward correction (ILFC). Their algorithm has been verified in both synthetic datasets and real-world datasets. The empirical results show the advantage of ILFC. + +(Minor) cons: + +1. Section 3 is a bit dense in understanding the estimation of transition matrix. The authors are encouraged to polish this section. + +2. Although ILFC outperform CT and LQ in real-world datasets, the authors need to add the reasults of MAE and FC to more thoroughly verify the performance of ILFC.",8,,ICLR2020 +logkY-LQ8RL,2,pGIHq1m7PU,pGIHq1m7PU,This paper violates the rule of anonymity,"In Appendix I.1 it says ""We are machine learning scientists from the University of Munich..."", which violates the rule of anonymity. Please let me know if I should continue on reviewing this paper. Thanks!",1,3.0,ICLR2021 +B1LwiG9gz,2,HkGcX--0-,HkGcX--0-,Good paper,"The proposed approach is straight forward, experimental results are good, but don’t really push the state of the art. But the empirical analysis (e.g. decomposition of different cost terms) is detailed and very interesting. ",7,4.0,ICLR2018 +SJLNIX9lG,3,rJVruWZRW,rJVruWZRW,Not exciting,"The authors propose an RNN that combines temporal shortcut connections from [Soltani & Jang, 2016] and Gated Recurrent Attention [Chung, 2014]. However, their justification about the novelty and efficacy of the model is not well demonstrated in the paper. The experiment part is modest with only one small dataset Penn Tree Bank is used. The results are not significant enough and no comparisons with models in [Soltani & Jang, 2016] and [Chung, 2014] are provided in the paper to show the effectiveness of the proposed combination. To conclude, this paper is an incremental work with limited contributions. + +Some writing issues: +1. Lack of support in arguments, +2. Lack of referencing to previous works. 
For example, the sentence "By selecting the same dropout mask for feedforward, recurrent connections, respectively, the dropout can apply to the RNN, which is called a variational dropout" mentions "variational dropout" without any citation. Similarly, the sentence "NARX-RNN and HO-RNN increase the complexity by increasing recurrent depth. Gated feedback RNN has the fully connection between two consecutive timesteps" mentions a number of models without any references at all.
I still believe the paper has made a good contribution thus I would stick with my original rating.",6,4.0,ICLR2021 +#NAME?,4,US-TP-xnXI,US-TP-xnXI,A good and novel idea on structured prediction. ,"Recently, multiple research papers focus on task transformations by bridging the gap between different tasks[1,2,3,4]. The original idea may go back to [5]. This paper follows this line of research ideas by reducing a structured prediction problem to a translation problem. The general idea is novel and very interesting. By defining several manually-designed rules, multiple structured outputs are transformed into the output of the translation model. The writing is clear and well-structured. + +Pros: + - A novel and interesting idea for formulating structured prediction tasks to translation problems. This idea is well-motivated in low-resource scenarios and multi-task learning settings. + - The general framework is easy to implement (only requiring some scripts). + + + +Cons: + - My main concern with the proposed approach is the decoding process. If the translated sequence has a nice structure, the transformation process is well performed. However, the translated results might be invalid for a specific task. For example, in CoNLL NER, a nested or overlapping structure might be generated. It may need specially designed rules to filter them out. However, this paper does not have many discussions on this point. I would like to know more about this part. + - I also would like to know the effectiveness of different pre-trained language models. In this paper, a T5-base model is utilized. It might be beneficial to know the empirical effectiveness of different kinds of language models. + - Some words are not precise. For example, the phrase ""generative models"" are frequently used to illustrate the translation model. However, in the ML field, generative models may indicate the models that have a generative process of data and model the joint distribution of observed samples. + + +I am willing to increase my score if some of the questions are well clarified by the authors. + +[1] Strzyz et al. Viable Dependency Parsing as Sequence Labeling, NAACL 2019 + +[2] Yu et al. Named entity recognition as dependency parsing, ACL 2020 + +[3] Gómez-Rodríguez et al. Constituency parsing as sequence labeling, EMNLP 2018 + +[4] Li et al. A Unified MRC Framework for Named Entity Recognition. ACL 2020 + +[5] Vinyals et al. Grammar as a Foreign Language. + +",6,4.0,ICLR2021 +HkgRdsvQqr,3,S1x0CnEtvB,S1x0CnEtvB,Official Blind Review #1,"The paper presents a meta-learning algorithm to automatically detemine the depth of neural network through a policy to add depth if this bring improvement on accuracy. + +I have conserved opinion based on the technique being used here is extremely simple, basically is an implementation of naive greedy algorithm in such a scenario, which implies the problem may not be intrinsically hard, or even useful. The paper consists of detailed narrative about how these procedure are conducted, but still, it is really hard for me to find the true merit to appreciate, and why this brings a nontrivial and usefull contribution. The tables, visualization figures also didnot imply too much about whether this is more than overfitting on previous works with hand-chosen depth. 
",3,,ICLR2020 +vSikmE43uxO,1,1s1T7xHc5l6,1s1T7xHc5l6,Review,"Equivariant Steerable CNNs for 2D/3D rotation+reflection+translation groups have generally been implemented as a filter transform/expansion step followed by a standard convolution. The filter expansion step involves taking a linear combination of steerable basis filters. These basis filters are pre-computed before network training by solving a linear system or by sampling the continuous analytical solution (this can take a few minutes). Depending on the chosen group representation wrt which the network layer is equivariant, a different filter basis will emerge, but in general one can see that the basis filters come out as rotated and flipped copies of some basis filters, with the occasional sign flip (this has been stated in some earlier works). The precise way in which a basis filter is to be rotated and flipped to obtain the steerable filter basis for the 2D case had not been worked out before, to my knowledge, and this is one of the contributions of this paper. The analysis is done for each (input, output) representation type chosen from {trivial, irreducible, regular} representations. Having worked this out, the paper proposes to use this as a way of implementing the filter expansion step, starting from a basis filter and rotating/flipping it to obtain an expanded filter bank. + +The proposed method (FILTRA) does not require a precomputation step if I understand correctly, which is a significant practical advantage. Experiments further show that the method is similar or faster at filter expansion. Finally, the method is validated by training networks on benchmark tasks and shown to perform similarly to or better than the steerable CNN implementation of Weiler & Cesa, which is the best existing implementation. + +The paper briefly mentions that filters cannot be rotated exactly on a discrete grid, but I didn't figure out how the authors propose to deal with this issue. How exactly are the filters rotated? + +I think the method proposed in the paper is useful, as it is both faster and better than existing steerable CNN implementations. The paper itself is fairly well written and technically correct as far as I can tell, but may be challenging to read for those who are not yet knowledgeable about steerable CNNs. Those readers however are unlikely to be interested in learning about the implementation details of steerable CNNs anyway, so perhaps this is fine. The reason I am not giving a higher rating is that I think that although this is a useful contribution to the literature on steerable CNNs, which are being used in an increasing number of applications, the paper does not represent a major breakthrough and although the calculations are non-trivial, does not contain highly unexpected or deep theoretical results. + +Typos: +Cadestrian -> Cartesian +irreduciable -> irreducible +equity -> equality + +Edit: +Having read the reviews, rebuttal and updated paper, I have decided to maintain my score of 6.",6,4.0,ICLR2021 +ByxIDYM0KS,2,Bklu2grKwB,Bklu2grKwB,Official Blind Review #2,"The rebuttal did not address my concerns convincingly. There were also simple fixes that the authors could have implemented but they decided not to update the paper. I will keep my original assessment. + +-------------- + +The premise of the work is very interesting: RNNs that are permutation-invariant. Unfortunately, the paper seems rushed and needs a better justification for not having a RNN memory that is associative. 
It also should cast the contributions in light of other existing work (not cited). The paper says ""In this section and the remainder of the paper, we focus on the latter [commutative RNN memory operator], namely introducing a constraint (or equivalently, regularizer) that is commutative"", but it never talks about the impact of a RNN memory using a non-associative operator. Being commutative is easy, isn't Equation (2.4) commutative if \Theta = W? Being associative is hard, since non-linear activations are not easily amenable to associativity. + +Section 4: ""The above example demonstrates that RNNs can in some cases be a natural computational model for permutation invariant functions."" => Janossy pooling (Murphy et al., 2019) gives an alternative way to use RNNs, with a way to make their method tractable. Actually, my guess to why the RNNs experiments work well, even without an associative memory, is because the training examples come in multiple permuted forms, which is the data-augmentation version of the pi-SGD optimization described in Janossy pooling. + +On page 1, ""consider the problem of computing the permutation invariant function f(x_1, . . . , x_n) = max_i x_i"", what follows is not a proof of necessity. It is an informal argument that either should be made formal or should be described as informal. + +There is a lot of missing related work for sets: +Murphy, Ryan L., Balasubramaniam Srinivasan, Vinayak Rao, and Bruno Ribeiro. ""Janossy pooling: Learning deep permutation-invariant functions for variable-size inputs."" ICLR 2019. +Wagstaff, Edward, Fabian B. Fuchs, Martin Engelcke, Ingmar Posner, and Michael Osborne. ""On the limitations of representing functions on sets."" ICML 2019. +Lee, Juho, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. ""Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks."" ICML 2019. + +Also missing related work for graphs: +Bloem-Reddy, Benjamin, and Yee Whye Teh. ""Probabilistic symmetry and invariant neural networks."" arXiv:1901.06082 (2019). +Murphy, Ryan L., Balasubramaniam Srinivasan, Vinayak Rao, and Bruno Ribeiro. ""Relational Pooling for Graph Representations."" ICML 2019. + +The paper has an interesting question but needs to build on prior work. As of now, I am unconvinced that not having an associative operator for the RNN memory will lead to a good nearly permutation invariance function (unless there is data augmentation, per Janossy pooling). +",1,,ICLR2020 +S1xHuO_7g,1,SkwSJ99ex,SkwSJ99ex,"Interesting analysis, flawed paper, no answer to reviewer questions","This paper looks at the idea of fusing multiple layers (typically a convolution and a LRN or pooling layer) into a single convolution via retraining of just that layer, and shows that simpler, faster models can be constructed that way at minimal loss in accuracy. This idea is fine. Several issues: +- The paper introduces the concept of a 'Deeprebirth layer', and for a while it seems like it's going to be some new architecture. Mid-way, we discover that 1) it's just a convolution 2) it's actually a different kind of convolution depending on whether one fuses serial or parallel pooling layers. I understand the desire to give a name to the technique, but in this case naming the layer itself, when it's actually multiple things, non of which are new architecturally, confuses the argument a lot. 
+- There are ways to perform this kind of operator fusion without retraining, and some deep learning framework such as Theano and the upcoming TensorFlow XLA implement them. It would have been nice to have a baseline that implements it, especially since most of the additional energy cost from non-fused operators comes from the extra intermediate memory writes that operator fusion removes. +- Batchnorm can be folded into convolution layers without retraining by scaling the weights. Were they folded into the baseline figures reported in Table 7? +- At the time of writing, the authors have not provided the details that would make this research reproducible, in particular how the depth of the fused layers relates to the depth of the original layers in each of the experiments. +- Retraining: how much time (epochs) does the retraining take? Did you consider using any form of distillation? +Interesting set of experiments. This paper needs a lot of improvements to be suitable for publication. +- Open-sourcing: having the implementation be open-source always enhances the usefulness of such paper. Not a requirement obviously. + +",4,4.0,ICLR2017 +H1eP1XmtcH,3,H1e552VKPr,H1e552VKPr,Official Blind Review #4,"This work proposes a subgraph attention mechanism on graphs. Compared to the previous graph attention layer, the node in the graph attends to its subgraph. The subgraph is represented by an aggregated feature representation with a sampled fixed-size subgraph. The methods are evaluated on both node classification and graph classification problems. + +I have major concerns about the novelty, and experiments in this work. + +1. The motivation is not clear. Using a subgraph or neighborhood to represent a node is reasonable. However, this work samples a subset of nodes from the one-hop neighborhood and aggregates them for attention mechanism. It is very similar to a GCN + GAT. The sampling process even loses some neighborhood information in the graph. + +2. The experimental setups are very strange. In Table 2, the methods are compared to GCN and GAT on node classification problems. The performance of GAT is too low and even lower than that reported in GAT. Can authors explain this? It is highly recommended to use the same experimental settings as in GCN and GAT. The same problem exists in Table 3. Can authors provide a performance comparison based on the same settings in GIN? + +3. The performance improvements are very unstable and marginal. In Table 3, the proposed methods can not compete with previous methods especially on large datasets like IMDB-MULTI. I wonder how the proposed methods perform on very large datasets such as reddit. + +4. Can authors provide comparisons with a simple GCN+GAT? ",1,,ICLR2020 +H1x4jhziFS,1,H1edEyBKDS,H1edEyBKDS,Official Blind Review #2,"The authors describe a method for training plug and play language models, a way to incorporate control elements into pre-trained LMs. In contrast to existing work, which often trains conditioned upon the control element, the authors emphasize that their method does not require re-training the initial LM. This is exciting and a great research direction. It is evaluated in a number of different settings. + +1. The authors claim that this method is a baseline for controlled text generation (see e.g. the title). However, there does not appear to be any evaluation with any existing work that performs controlled text generation. 
I don't see how this can be proposed as a baseline for controlled text generation if there is no comparison to other methods. I imagine the authors will emphasize that this is not fair - because their method doesn't require retraining the language model - but it is still relevant to demonstrate whether there is a gap in performance or not. As it stands, there is only one baseline - the unconditional language model - and to me this is mostly a way to calibrate the evaluators, not a way to compare their model against other models.
The weights for the line graph in WLGCL are computed based on the node degree of the original graph, which implies the node degree in the line graph is always 2. The WLGCL can be implemented for different kinds of graph convolution, which rule incorporates graph connectivity, node features and edge features. +Experiments compared the performance of the proposed model with existing GNN methods on graph classification tasks and computational complexity with other methods. +1. The WLGCL introduces the weights for edges in line graph convolution, which reduces the computational cost. The performance of WLGCL on some graph classification datasets are good. +2. The WLGCL is a weighted version of the line graph neural networks (LGNNs) as studied previously in [Chen, Li, Bruna, Supervised Community Detection with Line Graph Neural Networks, ICLR 2019]. +Besides saving the computational cost and removing biased degree information, what are other benefits over LGNN? Is there a significant improvement of the test accuracy against LGNN on various types of graph datasets? Maybe saying “biased topological information” here is misleading as what change the WLGCL makes compared to LGNN is the node degree. +3. The experiments, like Table 1, compare with some existing GNNs methods. The author should compare with more existing GNNs, such as GAT. The new datasets, Open Graph Benchmarks, should also be tested to show the performance of the proposed GNN model. +4. What is the performance of WLGCL with the normalisation of the adjacency matrix on graph classification tasks? +5. The study of the test accuracy vs depth of the GNN with WLGCL indicates the WLGCL may work in deep nets. Will increase the depth further be beneficial or not? Is there any interpretation from information theory?",5,4.0,ICLR2021 +BknmMUteM,2,B1nxTzbRZ,B1nxTzbRZ,From what I can tell the paper is correct but might lack in novelty or impact,"The authors introduce the task of ""defogging"", by which they mean attempting to infer the contents of areas in the game StarCraft hidden by ""the fog of war"". + +The authors train a neural network to solve the defogging task, define several evaluation metrics, and argue that the neural network beats several naive baseline models. + +On the positive side, the task is a nice example of reasoning about a complex hidden state space, which is an important problem moving forwards in deep learning. + +On the negative side, from what I can tell, the authors don't seem to have introduced any fundamentally new architectural choices in their neural network, so the contribution seems fairly specific to mastering StarCraft, but at the same time, the authors don't evaluate how much their defogger actually contributes to being able to win StarCraft games. All of their evaluation is based on the accuracy of defogging. + +Granted, being able to infer hidden states is of course an important problem, but the authors appear to mainly have applied existing techniques to a benchmark that has minimal practical significance outside of being able to win StarCraft competitions, meaning that, at least as the paper is currently framed, the critical evaluation metric would be showing that a defogger helps to win games. 
+ +Two ways I could image the contribution being improved are either highlighting and generalizing novel insights gleaned from the process of building the neural network that could help people build ""defoggers"" for other domains (and spelling out more explicitly what domains the authors expect their insights to generalize to), or doubling down on the StarCraft application specifically and showing that the defogger helps to win games. A minimal version of the second modification would be having a bot that has access to a defogger play against a bot that does not have access to one. + +All that said, as a paper on an application of deep learning, the paper appears to be solid, and if the area chairs are looking for that sort of contribution, then the work seems acceptable. + +Minor points: +- Is there a benefit to having a model that jointly predicts unit presence and count, rather than having two separate models (e.g., one that feeds into the next)? Could predicting presence or absence separately be a way to encourage sparsity, since absence of a unit is already representable as a count of zero? The choice to have one model seems especially peculiar given the authors say they couldn't get one set of weights that works for both their classification and regression tasks +- Notation: I believe the space U is never described in the main text. What components precisely does an element of U have? +- The authors say they use gameplay from no later than 11 minutes in the game to avoid the difficulties of increasing variance. How long is a typical game? Is this a substantial fraction of the time of the games studied? If it is not, then perhaps the defogger would not help so much at winning. +- The F1 performance increases are somewhat small. The L1 performance gains are bigger, but the authors only compare L1 on true positives. This means they might have very bad error on false positives. (The authors state they are favoring the baseline in this comparison, but it would be nice to have those numbers.) +- I don't understand when the authors say the deep model has better memory than baselines (which includes a perfect memory baseline)",4,1.0,ICLR2018 +S1g1j74b9r,3,S1lVhxSYPH,S1lVhxSYPH,Official Blind Review #3,"The paper presents a quantization method that generates per-layer hybrid filter banks consisting of full-precision and ternary weight filters for MobileNets. + +Strength: +(1) The paper proposes to only quantize easy-to-quantize weight filters of a network layer to ternary values while also preserving the representational ability of the overall network by relying on few full-precision difficult-to-quantize weight filters. +(2) The proposed method maintains a good balance between overall computational costs and predictive performance of the overall network. Experimental results show that the proposed hybrid filter banks for MobileNets achieves savings in energy and reduction in model size while preserving comparable accuracy. +(3) The description is clear in general. + +Weakness: +(1) Though the paper claims that recent works on binary/ternary quantization either do not demonstrate their potential to quantize MobileNets on ImageNet dataset or incur modest to significant drop in accuracy while quantizing MobileNets with 4-6-bit weights, it may worth comparing to the methods that achieved start-of-art results on other datasets to demonstrate the efficiency of the proposed method. +(2) Figure 1 and Figure 2 is a little blurry. +(3). How is about the performance compared to latest work? 
Is it possible to apply current framework to MobileNetV2 ? If can, what's performance? +",3,,ICLR2020 +SkxBRWYCFS,1,BJliakStvH,BJliakStvH,Official Blind Review #3,"The submission considers estimating the constraints on the state, action and feature in the provided demonstrations, instead of learning rewards. The authors use the likelihood as MaxEnt IRL methods to evaluate the ""correctness"" of the constraints, and find the most likely constraints given the demonstrations. While the problem is challenging (NP-hard), suboptimality of the proposed algorithm is analyzed. Experiments are provided to demonstrate the performance of the proposed method. + +The problem considered is interesting, and the authors provide a straightforward but empirically effective method. However, the motivation is a little unclear to me. Specifically, what will be the practical cases, where the learning the constraints is important and necessary? Can authors further motivate this topic by providing more real-world applications? + +",6,,ICLR2020 +HkgJVxk6tS,2,H1g8p1BYvS,H1g8p1BYvS,Official Blind Review #1,"Summary: This paper hypothesizes that even though we are able to achieve very impressive performance on benchmark datasets as of now (e.g. image net), it might be due to the fact that benchmarks themselves have biases. They introduce an algorithm that selects more representative data points from the dataset that allow to get a better estimate of the performance in the wild. The algorithm ends up selecting more difficult/confusing instances. + +This paper is easy to read and follow (apart from some hickup with a copy of three paragraphs), but in my opinion of limited use/impact. + +Comments: +1) There is a repetition of the "" while this expression formalizes.."" paragraph and the next paragraph and the paragraph ""As opposed to .."" is out of place. Please fix +2) I am not sure +- What applications the authors suggest. They seem to say that benchmark authors should run their algorithm and make benchmarks harder. To me it seems that benchmarks become harder because you remove most important instances from the training data (so Table 4 is not surprising - you remove the most representative instances so the model can't learn) +- how practically feasible it is. Even if in previous point I am wrong, the algo requires retraining the models on subsets (m iterations). How large is this m? +3) Other potential considerations: +- When you change the training size, the model potentially needs to be re-tuned (regularization etc) (although it might be not that severe since the size of the training data is preserved at t) +- How do u chose the values of hyperparams (t, m,k eta), how is performance of your algorithm depends on it +4) I don't see any good baselines to compare with - what if i just chose instances that get the highest prediction score on a model and remove these. How would that do? For NLP (SNLI) task i think this would be a more reasonable baseline than just randomly dropping the instances, +5) I wonder if you actually retrain the features after creating filtered dataset, new representation would be able to recover the performance.  + +I read authors rebuttal and new experiments that show that the models trained on filtered data generalize better are proving the point, thanks. Changing to weak accept +",6,,ICLR2020 +H1lTTDulG,1,r1nzLmWAb,r1nzLmWAb,lacking in terms of novelty,"The paper proposed a combination of temporal convolutional and recurrent network for video action segmentation. 
Overall, this paper is well written and easy to follow.
Their method only attempts to address 2), is that correct? If so, can the authors state that explicitly?
The newly proposed method has the following steps: + 1) Train a conditional density model for a regression outcome y given an input x on training data. The authors use conditional normalizing flows for this task. + 2) For each set of points (x_t, y_t) in a validation set, pass an input x_t to the model in step 1 to learn an induced CDF for the output F_t(). Calculate F_t(y_t), i.e. the induced CDF evaluated at the actual output, and use another flow-based density model to learn the distribution of the CDF values. If the model in step 1 is perfectly calibrated, this density should be uniform, but in practice it seldom is. + 3) For future data points, the composition of densities in steps 1) and 2) provide a new, uncertainty-calibrated density. + +The overall problem area is interesting, important, and underexplored. There has been a lot of work on calibration estimates for classification tasks, but there are less methods for regression. Using flows here is a cool idea. Any exact density model can be used for the method above, and normalizing flows are a good solution because they are monotonic by construction. + +That being said, after reading the paper, I'm not convinced that the proposed method is a significant improvement over the method by Kuleshov et al. that it's building off of. Kuleshov et al. proposes a similar 2-step process, but instead of explicitly learning a distribution over the induced CDF values, it uses an isotonic regression to calibrate the CDF. Looking at the experimental results, it seems that the isotonic recalibration performs on par with the flow recalibration in terms of test set calibration error. + +Put another way, why should someone use flow-based recalibration instead of isotonic recalibration? A possible answer mentioned in the paper is that flow-based recalibration can be used to compute distribution statistics, such as the mean, while isotonic recalibration cannot compute these statistics. (Side question: why can't the isotonic recalibration be used to compute the mean? It seems like isotonic recalibration explicitly transforms an inverse-CDF to another inverse-CDF. Can't distribution statistics be imputed from this transformed inverse-CDF?). + +Even though flow recalibration can compute distribution statistics, the MSE never improves after flow recalibration; in some instances, it gets worse. What is the benefit of having distribution statistics? On the one hand, the worsened MSE might be expected behavior. Is uncertainty calibration expected to behave like a regularizer? If so, that should be stated and discussed in the paper. If not, then of course we shouldn't expect improvements in MSE after recalibration, because we're changing the model that had the best training-set performance. This could explain the results we see. In any case, there should be some justification in the paper for a) why computing distribution statistics is important, b) whether we should expect recalibration to behave like a regularizer, c) a discussion about the tradeoff between calibration performance and model error, and d) an illustration of scenarios where distribution statistics are crucial [and e), why the Kuleshov et al method can't be used to calculate test error]. + +Additionally, the paper proposes a way to visualize recalibration results. To be honest, I found the CDF performance plot confusing and hard to interpret. How should we interpret the x-axis (are predictions standardized, and if not, what units are they in)? 
I found the standard qq-plot-like calibration graph a lot more interpretable. What does the new visualization answer? I think the new visualization should be better explained (it also didn't help that the legend in Figure 5 blocked the middle of the graph). + +Overall, I think the paper proposes an interesting model, but it doesn't adequately justify when/why the model should be used over the existing method. I think there could definitely be scenarios where it is useful -- I just don't think the paper has adequately and convincingly illustrated them. + +Pros: +- Interesting and underexplored problem area +- Thorough experiments +- Normalizing flows are an interesting and new model for this problem + +Cons: +- Doesn't justify meaningful ways the method is different from existing methods +- Visualizations are confusing",5,3.0,ICLR2021 +c0kkpuT9kpg,1,fmtSg8591Q,fmtSg8591Q,Official Blind Review #3,"This paper focuses on episodic, factored Markov decision processes and proposes the FMDP-BF algorithm, which is computation efficient and improves the previous result of the FMDP algorithm. The author also provides a theoretical lower bound for the FMDP-BF algorithm and shows that the FMDP-BF algorithm's regret is near-optimal. Besides, this paper applies the FMDP-BF algorithm in RLwK and provides theoretic analyses of regret. However, I still have some suggestions about this paper. +Firstly, the definition of factor MDPs is complex and challenging to understand. It is better to add some examples for the factor MDP setting. +Secondly, there is no experimental support for the FMDP algorithm, and it is better to perform some experiments with FMDP-BF, FMDP-CH algorithm. +Finally, I have a concern about the FMDP algorithm. The UCBVI-CH algorithm, UCBVI-BF algorithm, and FMDP-CH algorithm only maintain one value function V in each episode. The only difference is the form of a bonus term. However, the FMDP-BF algorithm needs to keep two value function V to compute the bonus term, and I wondered whether it is necessary to keep two value functions in each episode.",7,4.0,ICLR2021 +#NAME?,1,xtKFuhfK1tK,xtKFuhfK1tK,a new method for distribtued GNN,"This paper proposed a new distributed training method for GNNs. Specifically, unlike traditional distributed training methods for CNNs where data points are independent, nodes in a graph are dependent on each other. Thus, this dependence incurs communication between different workers in the distributed training of GNNs. This paper aims to reduce the communication cost in this procedure. Here, this paper proposed to sample more neighbor nodes within the same worker while reducing the sampling probability for the neighbor nodes on other workers. It also provides some theoretical analysis and conducts the experiments to verify the proposed method. + +1. The idea is simple. It is just a trad-off between intra-worker sampling and inter-worker sampling. In fact, it does not address the real challenge in distributed training of GNNs. Even though sampling more intra-worker neighbor nodes can reduce the communication cost, it will impair the prediction performance. A good solution should reduce communication costs and try to make the prediction performance as good as possible. However, this method only focuses on the former one. + +2. In the proof of Theorem 1, this paper assumes there exists a constant $D_1$, and further claims that $D$ is small. However, no evidence is provided to verify $D$ is small. Thus, the claim in Theorem 1 does not hold. 
Moreover, without any knowledge regarding $D$, the bound for $s$ is useless. + +3. Regarding experiments, an important baseline is missed. Specifically, the method only using intra-worker neighbor nodes should be used. Otherwise, the current experimental results cannot support the efficacy of the proposed method. ",4,5.0,ICLR2021 +PMEYgqXAIEK,3,0qbEq5UBfGD,0qbEq5UBfGD,Benchmarking different clustering losses in semi-supervised time series data clustering,"This paper benchmarked three different existing losses, DB/Propotype/Sihouette, on time series clustering. Although there is no technique innovation, this is an interesting topic and the results look good to me. However, there are some concerns: + +1. I am not an expert in time series clustering, especially in the semi-supervised clustering case. As there is a labeled dataset, the reason to solve this problem as two steps is not clear to me: 1. learning the semi-supervised representation and 2. do the clustering. Why not just benchmarking by semi-supervised learning metric, like accuracy? + +2. The effect of the clustering method looks unclear. In other words, will the conclusion be held on different clustering methods, rather than K-means. + +3. Different losses have different performance rankings in different numbers of labels and datasets (Figure 2). Picking up the best and claiming the advantage does not form a fair comparison. It will be more interesting if the author can propose a method, which can consistently win other methods. + +Minor concerns: + +1. the bar plot should have space between each category. +2. it will be more convincing if baselines from other papers are considered. + + + +",5,2.0,ICLR2021 +S1lL5vpRhm,2,ByMVTsR5KQ,ByMVTsR5KQ,This paper proposes WaveGAN for unsupervised synthesis of raw-wave-form audio,"This paper proposes WaveGAN for unsupervised synthesis of raw-wave-form audio and SpecGAN that based on spectrogram. Experimental results look promising. + +I still believe the goal should be developing a text-to-speech synthesizer, at least one aspect.",6,3.0,ICLR2019 +0K9KEhtM4p7,4,TV9INIrmtWN,TV9INIrmtWN,Interesting approach to learning a hard attention controller using curiosity; not yet clear it is useful on challenging tasks,"This work presents a method for learning a hard attention controller using an information maximization approach. As the authors point out, such a method could be very useful for reasoning in terms of high-dimensional observations, like vision. In brief, the method learning to choose the next attention position to be the most informative by maximizing the uncertainty of the next observation. Uncertainty is quantified using a spatial memory model that is trained to reconstruct and predict the scene. The authors validate this approach by showing that the resulting attention mechanism can be used for two simple downstream tasks. The resulting agent outperforms others trained using baseline attention mechanisms: a hard attention mechanism that is trained on task reward (""environment""; similar to Mnih et al 2014), as well as models that attend to random positions or to the agent's location. + +This work is well motivated, easy to read, and appears technically correct. The information maximization objective is a sensible way to learn an attention mechanism, and I found the exposition very easy to follow. + +The main appeal of the method is that it's trained independently of the task, and for this reason might be useful on many tasks. 
And indeed, the authors highlight that this is the promise of this line of research ("" ...our approach is unsupervised in terms of the task +and can be applied generally to downstream tasks in the environment""). Accordingly, my main concern is that the paper presents little evidence that the attention mechanism presented here will work in a task-agnostic manner. The paper only shows results on two simple 2D environments, and one of these evaluations has caveats that I feel significantly weaken the paper's case. + +In particular, I don't find the comparison to baselines on the PhysEnv environment to be fair. This is for two reasons: +(1) while the strongest baseline method is trained on data from the learning agent, the proposed method is trained on data from an expert policy (as described in appendix section A) instead of from random transitions. In the typical RL setup, we can't generally assume that expert demonstrations will be available when training an agent from scratch, so this restricts the applicability of the attention mechanism for RL. The resulting attention policy is likely to indirectly leak task information to the agent, which makes it hard to compare to models trained without expert data on task reward alone. +(2) the authors report that the strongest baseline (called ""environment""; Mnih et al 2014) doesn't perform well because the resulting policy is entropic. It's difficult to evaluate whether this behavior is due to problems with the baseline method or with the hyperparameter settings of the RL algorithm used to optimize it. For example, PPO uses a policy entropy bonus, and the authors don't report tuning this hyperparameter, but it will presumably play a large role in the entropy level of the learned attention mechanism. Because of this, it's hard to know whether the proposed methods outperforms the baseline because the baseline is poorly tuned or because the proposed method is generally better. I'm generally surprised by how poor the results of the ""environment"" method shown here are, given the it performs comparably on the other task and given that it's trained on task reward. More analysis or discussion would be very useful. + +At a more fundamental level, I'm not convinced that the task-agnostic strategy for information maximization proposed here is the correct one for all tasks. This method will suffer from the ""noisy TV"" problem faced by many curiosity-based methods (as described e.g. in the introduction to Savinov et al 2019: https://arxiv.org/abs/1810.02274) and will attend to regions of high variability whether or not they're task relevant. In settings where the task of interest involves direct interaction with a relatively small number of pixels, but other things in the scene are also changing, there is no guarantee that an information maximization strategy will attend to the most task-relevant pixels. Without evaluation on more challenging tasks, it's hard to know how good a strategy information maximization is. The paper should address these issues explicitly. These results would be even stronger if shown on a perceptually harder task, such as one involving natural scenes, 3D content, or more realistic dynamics. + +Other questions and comments (less central to my evaluation): +- What happens if the attention policy and the agent are trained simultaneously? If this approach works, it might allow the attention model to be trained on PhysEnv without requiring expert demonstrations. +- How does the ""random"" attention baseline perform on the two tasks? 
On PhysEnv, the random baseline gives the second best reconstruction results, so it would be very interesting to see how well it works for control. +- The claims made about human cognition in the introduction need to be better justified with references to the literature. The one reference provided (Barrouillet et al 2004) is primarily about working memory spans and cognitive load (i.e. internal bottlenecks) rather than about perceptual bottlenecks and the need to build world models or use hard attention mechanisms, as the surrounding text implies. +- The reconstruction results in section 6.1 are a good sanity check of the proposed model, but the baselines used here are very weak. This is because the proposed method uses an attention mechanism that's trained for reconstruction (via the infomax objective), while the other methods are trained either for an RL task (which may be only loosely correlated with reconstruction) or are heuristic. These results would be much more compelling with stronger baselines (and model ablations). +- Generally, the paper would benefit from more analysis of the contribution of model components. E.g. how important is the architecture of the dynamic memory module to the reconstruction and RL results? + +Minor: +- Section 4.1: ""This quantity is the amount of surprise in the dynamics model and we will use of this again when training the glimpse agent."" -> ""...we will use this again..."" + +- Section 4.2: ""In addition to reconstruction loss,"" -> ""In addition to the reconstruction loss,"" + +- Section 4.2: ""The total loss for as single step"" -> ""The total loss for a single step"" + +In summary: the authors present an interesting application of information maximization-based curiosity to hard attention control. A hard attention mechanism trained in a purely unsupervised fashion (as proposed here) that performs well on many downstream tasks would be very useful. As it stands, I am in favor of rejecting this paper because of the limitations of the evaluation and analysis. My concerns would be addressed by evaluating on more challenging, benchmark tasks and ideally on tasks with more challenging visual structure, and with a thorough analysis of how the model performs in settings where information maximization is uncorrelated with the task (as in the ""noisy TV"" problem). ",4,4.0,ICLR2021 +ry22qzclM,3,ryk77mbRZ,ryk77mbRZ,Running an RNN for one step from noisy hidden states is a valid regularizer,"In order to regularize RNNs, the paper suggests to inject noise into hidden units. More specifically, the suggested technique resembles optimizing the expected log likelihood under the hidden states prior, a lower bound to the data log-likelihood. + +The described approach seems to be simple. Yet, several details are unclear, or only available implicitly. For example, on page 5, the Monte Carlo estimation of Lt is given (please use equation number on every equation). What is missing here are some details on how to compute the gradient for U and Wl. A least zt is sampled from zt-1, so some form of e.g. reparameterization has to happen for gradient computation? Are all distributions from the exponential family amendable to this type of reparamterization? With respect to the Exp. Fam.: During all experiments, only Gaussians are used? why cover this whole class of distributions? Experiments seem to be too small: After all the paper is about regularization, why are there no truely large models, e.g. like state-of-the-art instances? 
What is the procedure at test time?",3,3.0,ICLR2018 +rkx9a33nKS,1,B1xfElrKPr,B1xfElrKPr,Official Blind Review #2,"This paper illustrates the TP-Transformer architecture on the challenging mathematics dataset. The TP-Transformer combines the transformer architecture with tensor-product representations. The experiments show a dramatic improvement of accuracies compared with SOTA models. Moreover, the paper also explains the reason why the TP-Transformer can learn the structural position and relation to other symbols with a detailed math proof. + +Overall, this paper is nice as it makes a milestone for math problem solving from unique perspectives. To be specific, the paper makes the following contributions: + +1. Demonstrate a novel architecture TP-Transformer in details; +2. Achieve a better accuracies in the challenging mathematics dataset than the SOTA transformer models; +3. Illustrate in fundamental math that why TP-Transformer can learn the structural position and relation, and solve the binding problems of stacked attention layers. + +Here are a few minor questions that may further improve the paper: + +1. The conclusion states that TP-Transformer beats the previously published SOTA by 8.24%. However, it does not match to the experiment results (see section 4). + +2. In figure 5, there are 4 tasks in the bottom with accuracies lower than 0.5. It would be nice to provide more insights on this. + +3. It would be interesting to see whether it transferable to the other downstream tasks (such as natural language understanding) besides the experiments on the challenging mathematics dataset.",6,,ICLR2020 +Skl9pF8vKB,1,r1genAVKPB,r1genAVKPB,Official Blind Review #1,"This paper's contribution is a sample complexity lower bound for linear value-based learning and policy-based learning methods. The bound being exponential in the planning horizon is bad news, and has some implications with respect to further analysing sample complexity in RL. + +The gist of this paper is that one can craft a hard MDP which requires visiting every state at least once, and that since this MDP's state space is exponential in the MDP's horizon, then there exists a set of MDPs which require an exponential (in the horizon) number of trajectories to be solved. As a consequence, further analysis of sample complexity in RL may need some much stronger assumptions. + +The writing of the paper is good, I was able to understand everything (I think). As far as I can tell, this is novel work. Unfortunately I am currently unable to see why this contribution is valuable. I have set my score to weak reject but I am very open to having my mind changed, as I feel I may have missed some critical element. + +I have two criticisms: +A- I don't understand why this bound is significantly different than previous bounds. +B- I don't understand why this is bad news for representation learning, nor how this failure mode of linear features translates to the ""deep"" case. + +In the same spirit, I find rather odd the way the paper is introduced. Discussions of representations usually involve some discussion of generalization, but that's not what this paper is about. Deep neural networks/representation learning are only useful if there is an opportunity for generalization. 
+ + +With respect to A, I am either grossly misunderstanding past bounds and/or your bounds, or something is wrong with the way complexities are compared: +- In Wen & Van Roy, the ""polynomial"" sample complexity is in the number of states, it is related to |S|x|A|xH^2 (Theorem 3 of Wen & Van Roy) +- In this paper, Theorem 4.1 states that the sample complexity is exponential because it is of the form 2^H. One *critical* assumption for this bound is precisely that |S| >= 2^H. Thus the bound that you propose is still polynomial in |S|. +I am thus puzzled, how is this bound significantly different? + + +With respect to B, I don't see how this bound has much to do with good representations, or even representations at all. +In Lemma A.1, you essentially craft a set of features that, being mutually orthogonal, are in some sense ""mutually linearly separable"", making learning the mapping from those features to a value function ""trivial"" once data is obtained. This is barely different from saying that you assume there is a magical learner that learns in O(1) given the data, because in either case, you need to visit _every_ of the 2^H state in order to solve the MDP, because by construction of your problem, there is _no hope_ of generalization*. Since learning features or creating ""good"" features has everything to do with generalization (otherwise we'd just to tabular), I don't see how this bound is relevant to representations. (We already have Wolpert's no free lunch theorem to tell us that there are always some problems that ML just can't be general enough to solve efficiently. What is more interesting is understanding how we can efficiently learn where there _is_ structure to a problem.) +* There is no hope of generalization, unless something about the observation space (which is left undiscussed in the paper) contains *information* about the agent being to the unique path to the reward. In such a case, I can see a probabilistic argument being made where in the worst case the agent needs to visit all 2^H states, but in the average case, the agent may learn to ignore paths where it can generalize that there is no reward. This is not entirely unreasonable, think of e.g. AlphaGo, where very few states end in victory, where there is an exponential number of states in the horizon, yet learning is totally reasonable because of structure in observation. This is where I don't agree with a statement like: ""Since the class of linear functions is a strict subset of many more complicated function classes, including neural networks in particular, our negative results imply lower bounds for these more complex function classes as well."" +",6,,ICLR2020 +HkeINOFOn7,1,ByeDojRcYQ,ByeDojRcYQ,"Interesting paper, some concerns with formalism and objective. Missing baselines. ","This paper proposes a distributed policy gradient method for learning policies with large, collaborative, homogeneous swarms of agents. + +Formalism / objective: +The setting is introduced as a ""collaborative Markov team"", so the objective is to maximise total team reward, as expressed in equation (3). This definition of the objective seems inconsistent with the one provided at line (14): Here the objective is stated as maximising the agent's return, L_n, after [k] steps of the agent updating their parameters with respect to L_n, assuming all other agents are static. I think the clearest presentation of the paper is to think about the algorithm in terms of meta-learning, so I will call this part the 'inner loop' from now on. 
+Note (14) is a very different objective: It is maximising the return of an agent optimising 'selfishly' for [k] steps, rather than the ""collaborative objective"" mentioned above. This seems to break with the entire premise of collaborative optimisation, as it was stated above. +My concern is that this also is reflected in the experimental results: In the food gathering game, since killing other agents incurs ""a small negative reward"", it is never in the interest of the team to kill other team-mates. However, when the return of individual agents is maximised both in the inner loop and the outer loop, it is unsurprising that this kind of behaviour can emerge. Please let me know if I am missing something here. + +Other comments: +-The L_n(theta, theta_n) is defined and used inconsistently. Eg. compare line (9), L_n(theta_n, theta), with line below, L_n(theta, theta_n). This is rather confusing +-In equation (10) please specific which function dependencies are assumed to be kept? My understanding is that \theata_n is treated as a function of theta including all the dependencies on the policies of other agents in the environment? +-Related to above, log( pi_\theta_n ( \tan_n)) in line 16 is a function of all agents policies through the joint dependency on \theta. Doesn't that make this term extremely expensive to evaluate? +-Why were the TRPO_kitchensink and A3C_kitchensink set up to operate on the minimum reward rather than the team reward as it is defined in the original objective? It is entirely possible that the minimum reward is much harder to optimise, since feedback will be sparse. +-The survival game uses a discrete action space. I am entirely missing MARL baseline methods that are tailored to this setting, eg. VDN, QMIX, COMA etc to name a few. Even IQL has not been tried. Note that MADDPG assumes a continuous action space, with the gumble softmax being a common workaround for discrete action spaces which has not been shown to be competitive compared to the algorithms mentioned above. +-Algorithmically the method looks a lot like ""Learning with Opponent Learning Awareness"", with the caveat that the return is optimised after one step of 'self-learning' by each agent rather than after a step of 'Opponent-learning'. Can you please elaborate on the similarity / difference? +-Equation (6) and C1 are presented as contributions. This is the standard objective that's commonly optimised in MARL when using parameter sharing across agents.",5,4.0,ICLR2019 +SQDxMuUejI,2,6BRLOfrMhW,6BRLOfrMhW,Review,"The paper proposes a generalization of the sandwiched Bloom filter model that maintais a set of score partitions instead of just two and an algorithm for optimizing parameters of the partition under the target false-positive rate. Authors evaluate partitioned Bloom filter on three datasets and demonstrate that delivers better false positive rates under for a given model size compared to the baselines. + +I find the paper quite innovative and the experimental results impressive. My main concern is regarding the paper clarity. +1. How can a region's FPR $f_i$ be greater than 1? Perhaps, this is something very obvious, but I couldn't get this immediately and perhaps some other readers might struggle here too. +2. The learned model size appears in the optimization objective, but is considered given. I wonder what will change if we allow to trade-off the learned model power for a larger number of regions and larger backup Bloom filters in each region? Does it even make sense to ask such a question? 
Would the results change if a different model is used? I think it is possible that for a given variant of a learned Bloom filter, a different model may result into different values of optimal parameters and a difference performance, thus these should ideally be optimized independently for each of the baselines. I also think that size of the pickle file is arguably not the best estimate for the learned model size if indeed a different model is used for each of the filters, e.g. a neural network might admit a decent compression rate if a lower precision number format is used etc. Thus, it is important to separate the impact made by a learned model from an algorithmic improvement. +3. Is there any variance caused by observing particular distributions G and H? Is it small enough to ignore or confidence interals for each of the curves might actually overlap? I would also be interested in understanding behaviour of all considered models as the sample size changes. + +I also feel like authors can cite *Meta-learning neural Bloom filters. Rae et al, 2019* as it considers a relevant (although a different as well) setting. + +Nevertheless, I think the kind of analysis presented in the paper very useful for the community and for further development of learned data structures. I recommend acceptance and I will gladly raise my rather conservative score if authors could clarify the points mentioned above.",7,3.0,ICLR2021 +DTAk5yTsn08,1,zQTezqCCtNx,zQTezqCCtNx,This paper proposes to use channel suppressing to enhance adversarial training. ,"########################################################################## +Summary: + +This paper uncovers interesting phenomenons of adversarial training, i.e., more uniformly distributed adversarial data activations than those of natural data. To force the behaviors (in this paper, channels activations) of adversarial data to be similar to those of natural data, the authors explicitly suppress the redundant channels by reweighing the channel activations. + +########################################################################## +Reason for score. + +Overall, I vote for accepting. I like the uncovered phenomenons of larger and more uniformly distributed activations of adversarial data than those of natural data. +Technically, this paper proposed effective training strategies (i.e., channel-wise activation suppressing (CSA)) to enhance adversarial training. + +########################################################################## +Pros: + +1 This paper provides the understanding of adversarial training from the channel activation perspective, showing that adversarial training can reduce the magnitude of the activation of the adversarial data, but fail to break the uniform activations by the adversarial data. + +2 Figure 2 shows the efficacy of the proposed CSA methods for breaking the adversarial data's uniform activations. Compared with standard adversarial training, CSA can further suppress the redundant channel activations. + +3 The experiment evaluations are comprehensive, showing CSA strategies' efficacy across various adversarial training methods, network structures, and attack methods. + +########################################################################## +Cons: + +1 What is the side effect of redundant channel activations? Specifically, what is the side effect of uniform activations of the adversarial data? Would you mind explaining more? 
+ +2 Although CSA successfully suppresses the redundant channels of the adversarial data, CSA also seems to suppress the activations of natural data? Is this the reason for the improvement on natural accuracy? +",7,5.0,ICLR2021 +Bylx9EfY5B,3,Byx55pVKDB,Byx55pVKDB,Official Blind Review #4,"This simple paper shows that the normalization of softmax causes a loss of information compared to using the unnormalized logits when trying to do OOD and adversarial example detection. The main reason for this is of course the normalization used by the softmax. The paper is mostly empirical following this specific observation, and uses a number of examples on MNIST and CIFAR to show the improvement in performance by using unnormalized logits instead of softmax. + +While interesting, it is to be noted that methods such as ODIN and temperature scaling specifically include a temperature to exactly overcome this same issue with softmax. The lack of comparison to such baselines makes this paper quite incomplete, especially as it is an empirical paper itself. ",3,,ICLR2020 +_PoHiGTLFAR,3,NomEDgIEBwE,NomEDgIEBwE,a very well done paper proposing methods to improve transformation invariance in contrastive learning with strong empirical results.,"Summary: + +- Proposes new method to improve transformation invariance in contrastive representation learning and demonstrates utility on downstream tasks +- Proposes using feature averaging from multiple transformations at test time leading to further improvements +- Introduces Spirograph dataset to explore the importance of learning feature invariances in the context of contrastive learning + +Clarity +- Paper very well written and easy to follow. Figures supplement the text well +- Experiments include error bars to show statistical significance of results +- Supplementary material clarifies experimental setup and very comprehensive +- Perhaps consider changing “Self-supervized” -> “self-supervised”? + +Novelty/Significance +- New contrastive objective with gradient regularizer term to encourage transformation invariance and the Spirograph dataset are well motivated +- Results over contrastive baselines suggest the contributions are important improvements to the contrastive training recipe + +Questions/Comments/Clarifications +- “Unfortunately, directly changing the similarity measure hampers the algorithm.” - Please add a citation to validate this claim. Is it coming from the experiments in SimCLRv1 - Chen et al, 202 + +- “However, there are many ways to maximize the InfoNCE objective without encouraging strong invariance in the encoder.” - Please add a citation to validate this claim + +- Eq 9 and 10: (Fe(α, β, x) − Fe(α’ , t, x))^2. Should t not be β? + +- “It may be beneficial, however, to aggregate information from differently transformed versions of inputs to enforce invariance more directly” -> it is unclear why invariances need to enforced at test time if the learned representation is already invariant + +- The authors point this out but the gradient regularization term is unfortunately encouraging invariance only to differentiable + transforms and this is a key limitation + +- Worth pointing out for certain tasks likely near OOD detection, you may want to be transformation covariant rather than invariant. Would be interesting to see results of the proposed method on OOD detection benchmarks following https://arxiv.org/abs/2007.05566 +- One major limitation is lack of baselines beyond vanilla contrastive training. 
For example, it would have been good to compare test time feature averaging with test time augmentation ensembling. Similarly instead of gradient regularization, the model could directly predict augmentation parameters and have the gradients of that loss penalized/ reversed. Adding such additional baselines, would solidify the improvements as best in class. +- It is also unclear how much train/test time compute model adds. + +Overall, this is a very nicely written paper and very solid contribution. If the authors address my concerns, would be happy to increase my score.",7,4.0,ICLR2021 +lw40isQIKem,1,C3qvk5IQIJY,C3qvk5IQIJY,More details should be provided,"In this paper, the authors proposed to analyze the over-parameterization in GANs optimized by the alternative gradient descent method. Specifically, considering a GAN with an over-parameterized G and a linear D, the authors proposed the theorem 2.1 to provide a theoretical convergence rate in GAN’s training. + +However, this paper is problematic. The details are as follows: + +1. The gap between the paper’s title and theoretical claims. To my best knowledge, the OVER-PARAMETERIZATION means that the model will overfit to the training data and cannot generalize to test data. Thus, the generalization bound between training error and the test error is the main concern in this topic. However, in this paper, the authors mainly focus on the convergence rate during the training. I think this topic is more correlated to the non-convex optimization problem, rather than the over-parameterization problem. + +2. The theoretical claims in Sec 2 are not convincing. First, what is the data distribution? In the whole Sec 2, there is no detailed explanation about the data distribution, which is one of the most important parts of the analysis. Only the `Numerical Validation` part mentioned that the data distribution is a univariate Gaussian. Though we assume that the data is Gaussian, simply minimizing the distance between the mean of data and the ones of G is not enough: the variance in the data is not taken into consideration. I’m not sure that using a linear discriminator is enough for G to model the data distribution; generally, we assume the discriminator has infinity capacity. + +3. The theoretical derivation is ambitious. In Theorem 2.1, why $V$ is not optimized in Eqn. (2) & (3)? In Theorem 2.3, why $f$ is a general mapping? In GANs, it should be a parametric mapping from z to x. Besides, Theorem 2.3 claims that it is a general minimax problem. However, it is still restricted to a linear discriminator. Clarity needs to be improved. + +4. Finally, the experimental results cannot fully validate the authors’ claims. In Fig. 4, with smaller k, G’s capacity is not sufficient to capture the data distribution. In this case, comparing the convergence rate is unfair. Further, the paper does not provide any bound on the gap between training error and test error. I don’t know the purpose of Fig. 3 and Fig. 5. + +Three related work: +[*1] talks about the existence of the equilibrium of GANs’ minimax problem. If the equilibrium does not exist, then the convergence rate cannot validate any claims as I mentioned above. +[*2] also uses control theory to understand GAN’s training and [*3] adopts control theory to improve the training dynamics of deep models. The relationship should be discussed. + + +[*1] Farnia, Farzan, and Asuman Ozdaglar. ""GANs May Have No Nash Equilibria."" arXiv preprint arXiv:2002.09124 (2020). + +[*2] Xu, Kun, et al. 
""Understanding and Stabilizing GANs’ Training Dynamics using Control Theory."" + +[*3] An, Wangpeng, et al. ""A PID controller approach for stochastic optimization of deep networks."" Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.",4,5.0,ICLR2021 +B1xJ3ZoaFS,1,Syxwsp4KDB,Syxwsp4KDB,Official Blind Review #2,"POS-DISCUSSION +I thank the authors for their answer. I updated my score assuming ryxAY34YwB does not exist, and would encourage authors to discuss in more details the relationship with MeanSum if this gets accepted + +PRE-DISCUSSION + +This is an important contribution for the field of unsupervised summarization. ""Unsupervised *"" is trendy in NLP so this is a timely contribution. Furthermore, doing this for summarization is important because of the cost of getting gold summaries and the model used in translation is harder (impossible?) to adapt to this setting where there is information loss in one direction. + +However, I find major drawbacks in the current state of this paper. They are best related to the three contributions the author claim: + - Contribution3: the use of BPE. ""BPE for X"", with X being an NLP task can hardly count as a contribution today. If we are counting who did it first, then this is taken at least by Liu & Lapata 2019 through their use of BERT + - Contribution1: leveraging the lead bias for pre-training. This is a great idea! However, this seems to be covered by an accompanying paper (ICLR submission ryxAY34YwB) which is not referenced. Because of common paragraphs and experimental setting I am assuming there is an overlap of the author sets in two papers. PLEASE CORRECT IF THIS IS NOT THE CASE. As you don't get to claim the same contribution twice, this contribution should go all to the benefit of the other paper. + - Contribution2: the use of combining reconstruction loss and theme loss for summarization is another great idea. However, the paper that introduced this for summarization (as far as I know) is not cited nor compared too (MeanSum: https://arxiv.org/abs/1810.05739). This seems like a major issue considering the similarity in the approach (including the use of the straight-through Gumbel softmax estimator). + +Other comments: + + - Being a growing topic of study, I appreciated in particular the care taken to report a number of other approaches. Could you please clarify which version of ROUGE was used in each case? There are significant differences in the different implementations being used. + - Please also specify the version of ROUGE you used. + - Your numbers in Table 2 do not coincide with Table 3 of ryxAY34YwB (eg: LEAD-3 for CNN/DM). Can you explain? + - Your ablation study (Sect 4.1) focuses on CNN/DM (NOTE: the caption of Table 4 says NYT, but the number correspond to CNN/DM. I guess this is an error), where the topic & reconstruction loss indeed helps. However this is not the case for NYT, where LEAD-3 actually beats any of your approach. This is not mention nor discussed. + - The example of Fig 4 reveals a major problem. The summary states an incorrect fact: the gov accountability had indeed released a report earlier that week; but this was NOT a few hours before the reported incident. What happened a few hours before was a report on Fox News. + + +In a summary: a good idea combining ideas of ryxAY34YwB and adapting MeanSum. 
However, this is in my opinion not enough material for a full paper.",8,,ICLR2020 +Zrwgwm_GFH6,4,cL4wkyoxyDJ,cL4wkyoxyDJ,Review for Towards Counteracting Adversarial Perturbations to Resist Adversarial Examples,"The paper proposes a defense against adversarial examples. The idea of the defense is to counteract transformation generated by PGD attack. This is done by running one step of PGD on network input several times (and using different target labels) and then average the result. + +As described below, I think there are multiple serious flaws in the evaluation and the defense likely won’t work. Thus I recommend rejecting the paper. + +Issues with the paper: +* Experiment section is lacking rigor which is necessary to properly evaluate defense against adversarial examples. Points below explain it in more detail. +* One clear indication of a problem with evaluation is the fact that accuracy under attack is increasing (see table 2) when seemingly stronger attack is used. [i.e. PGD with larger number of steps] +* The whole premise of the defense idea is to counteract very specific thing which PGD does, thus it’s unlikely to help against more sophisticated attacks or simply different attacks. +* Evaluation procedure implies that the attacker has no knowledge about the defense and even no ability to query the defended model. Which is very strong restrictions on the attacker, moreover they are impractical from security standpoint [even in black box case attacker usually has an ability to query the model] +* Authors cite (Carlini & Wagner, 2017) to justify oblivious-box attack setup. However (Carlini & Wagner, 2017) actually proposes white-box attack and provides justification why white-box should be used. +* https://arxiv.org/pdf/1802.00420.pdf (which authors cite) shows how to break defenses based on input transformations. Nevertheless authors do not address why the proposed defense (which is also input transformation) is not broken. +* Authors use Guo et al.’s (2018) as one of the baselines, despite that this is a broken defense (per https://arxiv.org/pdf/1802.00420.pdf ) +* No comparison with adversarial training, in particular authors should consider comparing to A. Madry’s paper https://arxiv.org/pdf/1706.06083.pdf which also performs experiments on CIFAR dataset. + + +Feedback on how to improve paper: +* The evaluation should be performed in the assumption that adversary is either aware of the attack (white-box case) or able to query the defended model (black box attack). +* Use https://arxiv.org/abs/1902.06705 as a guide on how to properly evaluate models for adversarial robustness. And redo evaluation following this guide. +* In particular, consider running gradient free attacks on the whole model (baseline model + defense on top of it) and try to make an attack which will break the proposed defense. +",1,5.0,ICLR2021 +HJLqaM7bz,3,H1UOm4gA-,H1UOm4gA-,interesting contribution,"The paper introduces XWORLD, a 2D virtual environment with which an agent can constantly interact via navigation commands and question answering tasks. Agents working in this setting therefore, learn the language of the ""teacher"" and efficiently ground words to their respective concepts in the environment. The work also propose a neat model motivated by the environment and outperform various baselines. 
+ +Further, the paper evaluates the language acquisition aspect via two zero-shot learning tasks -- ZS1) A setting consisting of previously seen concepts in unseen configurations ZS2) Contains new words that did not appear in the training phase. + +The robustness to navigation commands in Section 4.5 is very forced and incorrect -- randomly inserting unseen words at crucial points might lead to totally different original navigation commands right? As the paper says, a difference of one word can lead to completely different goals and so, the noise robustness experiments seem to test for the biases learned by the agent in some sense (which is not desirable). Is there any justification for why this method of injecting noise was chosen ? Is it possible to use hard negatives as noisy / trick commands and evaluate against them for robustness ? + +Overall, I think the paper proposes an interesting environment and task that is of interest to the community in general. The modes and its evaluation are relevant and intuitions can be made use for evaluating other similar tasks (in 3D, say). ",6,4.0,ICLR2018 +sLeBmX8CPT5,3,jsM6yvqiT0W,jsM6yvqiT0W,Official Review 2,"**Justification for Score** +The paper is missing a thorough literature review and there are many missing citations and comparisons. Overall, unfortunately there are multiple misleading (and sometimes false) claims throughout the paper. The evaluation performance is also a little unfair as they use ECE as an objective and then only show better performance on ECE. Overall, the paper is well written but the content does not have good enough quality as there seem to be many unsupported claims. + +## Review + +### Summary +This paper proposes a post-hoc calibration method which aims to preserve the accuracy of the classifier as well as improve its uncertainty calibration. The core of the method lies in formulating a general form of the commonly used scaling method Temperature Scaling (TS) and more recent variation of it, Local TS (Local TS). The proposed method Neural Rank-Preserving Transforms (NRPT) maintains the accuracy but also shows better calibration performance compared to TS and Local TS. + +### Strengths +* It is important that a post-hoc calibration does not decrease the accuracy. I like the justification of the authors about this: "" a calibrator that does not maintain the accuracy may attempt to improve the accuracy at the cost of hurting (or not improving) the calibration"". Though, I have to point out that a method which can change the rank performance also has advantages compared rank-preserving methods (i.e. TS or NRPT): The literature has shown that accuracy and calibration performance can jointly be improved - the authors don't really acknowledge this point (see below for references and examples of such methods). + +* A rank preserving method has the advantage of optimizing loss functions which do not have to ""care"" about accuracy. In this case, the authors propose to use the ECE loss. A method which does not preserve the rank would completely decay the accuracy in an attempt to optimize the ECE. That being said, I still think that optimizing on the ECE and only showing improvements on the ECE metric is not surprising (see below for more discussion on this point). So it does have the advantage of optimizing ECE, but the paper does not really show the benefit here as it only shows that optimizing the ECE performs better on ECE (this is expected) but performs worse on the other metrics (see Tab. 1). 
Also, I would like to point out that I include DECE when I talk about ECE in this review. + +* Better calibration performance compared to the Baseline, TS and LTS. The authors also compared against deep ensembles which have been shown to improve calibration, though they only perform better on ECE for the case where NRPT is optimized using the ECE (NRPT-E). NRPT (without -E) performs significantly worse than ensembles larger than 4 (at least more than half better). + +* Even though NRPT does not have overall better performance compared to deep ensembles (i.e. only is better at ECE metric and only when ECE is specifically optimized during training), it does have the advantage of being computationally faster. + + +### Weaknesses +1. The major weakness of this paper is its view of the current post-hoc calibration literature seems to be outdated. A re-occurring theme throughout the paper is that post-hoc methods are very simple and always talks about Temperature Scaling (TS) as an example. The post-hoc literature has come far beyond simple methods such as TS. Even though it is true that many recent variants of TS have shown good improvements, there are many other non-""simple"" methods which have much (1) larger capacity, (2) can improve accuracy (i.e. they do not have accuracy loss) and (3) greatly improve calibration performance. + +2. The related work paragraph also only talks about three post-hoc calibration methods. Even though, the literature grows very fast and it can be hard to always keep track of all new papers, there have definitely been much more than the three variants cited in this paper (see below for references). Later on, the paper does cite the work [1] and claims that they show that ""overfitting cannot be easily fixed by applying common regularizations such as L2 on the calibrator"". Despite, this statement not being entirely true (see below), it seems that the authors are aware of this paper. [1] has also presented a calibration method (i.e. Dirichlet calibration - mentioned in the title of [1]), so why has this method not been compared or cited for its calibration method? +[1] has thoroughly compared against multiple other calibrators (in addition to TS and Matrix scaling), so why were none of these other calibrators cited, used and compared against? + +3. This is why I find that there are many misleading statements throughout the paper which imply that the (large) family of post-hoc calibration methods can be summarized as ""simple"" and limited to TS or Matrix Scaling. One example of such a statement can be found in the appendix: ""Existing post-calibration methods such as temperature scaling recalibrate a trained model using rather simple calibrators with one or few parameters, which can have a rather limited capacity."" The reason I find them misleading is that for a reader/reviewer who is not familiar with post-hoc calibration, it might seem that TS (and its variants) are the only and best methods for post-hoc calibration. Recent years have shown multiple other solutions which are not limited to such simple methods. + +4. Here is a list of some recent and relevant approaches: GP [3], Histogram Binning [5], I-Max histogram binning [4], Beta Calibration [2], Dirichlet Calibration [1], Matrix Scaling with ODIR regularization [1], BTS [6], Isotonic Regression [9] and MnM [8]. These post-hoc calibration methods range from relatively old methods to much newer methods. 
Even though I certainly don't expect to see all of them cited and compared against, the authors should do a much more thorough search of the literature or not make statements which imply that the post-hoc literature is only limited to the methods which they compare against. + +5. Even though the authors compared against deep ensembles, the performance is only better in one case: the ECE metric and only when NRPT optimizes the ECE loss (i.e. NRPT-E). This is no surprise and cannot be used to claim that the method is better than deep ensembles. A method which optimizes the metric A directly will perform better at this metric A. Of course, there are some exceptions such as NLL training without regularization. But overall this is not really an impressive result, especially given that the ECE has lately been criticized (see above). + +6. State-of-the-art-performance: ""Local Temperature Scaling (LTS, see (4)), a generalization of temperature scaling that is observed to achieve state-of-the-art performance on a variety of computer vision tasks (Ding et al., 2020)"" This paper claims that LTS obtains state-of-the-art performance and by performing better than LTS implying that they have beaten state-of-the-art in post-hoc calibration. The authors of this current paper do not claim that they obtain state-of-the-art but they imply this by mentioning that they perform better than a method which is previously state-of-the-art. I looked into the LTS paper and they also did not make the claim of obtaining state-of-the-art performance. So where is this claim coming from? It is hard to claim that LTS obtains state-of-the-art performance. After a brief look in the LTS paper, it seems that they too ONLY compare against TS variants (please correct me if I am wrong). As I have listed above, there are multiple other methods which have shown to improve calibration and have a thorough comparison against multiple other post-hoc calibration methods (and not only limited to TS and some of its variants). Again, I find this a little misleading that the authors of this paper seem to imply many statements to make their method seem good. + +7. Fig. 1 and Matrix Scaling: The authors seem to want to show the benefit of NRPT by showing superior performance compared to TS (""simple"") and Matrix Scaling (""complex calibrators can often overfit""). Despite, the fact that these are poor baselines to compare against (see list of calibrators above), there has been a new proposal to fix the overfitting issues of Matrix Scaling. [1] proposed to use ODIR regularization (""simple L2 regularization"") to improve the performance of Matrix Scaling. Firstly, the authors cite this work which means they are aware of this paper and this technique to reduce the overfitting. So then why has it not been used to in Fig. 1? And why was it chosen to compare against Matrix Scaling (without regularization) if there does exists a better solution for matrix scaling? Secondly, the authors instead make a misleading statement: ""It is further observed that the overfitting cannot be easily fixed by applying common regularizations such as L 2 on the calibrator (Kull et al., 2019)"" This statement is not true. This paper shows in Table 3 (of [1]) and Table 4 (of [1]) that Matrix Scaling with ODIR (i.e. L2 regularization) is sometimes the best performing method. Even for the rest of the cases where it does not perform best, it performs similarly. This shows that ""simple regularization"" can help address the over-fitting issue. 
This regularization has also been used in [4] with Matrix Scaling for the ImageNet classifiers and has shown to be the best performing method in terms of accuracy (showing that it helps with reducing the overfitting) Again, this is a mis-leading statement and the literature does not support this. + +8. Based on the previous (possibly) false claim, the next claim that ""This empirical evidence seems to suggest that complex calibrator with a a large number of parameters are perhaps not recommended in designing post-calibration methods"" is also not entirely true. As mentioned above, highly expressive methods can improve calibration. + +9. The authors mention that the small size of the validation set can cause highly expressive methods to overfit, thought [3,4] have both shown to greatly improve calibration performance with as little as 1000 samples for ImageNet classifier calibration and even shows better performance than methods using significantly more samples. Therefore, it is not true that a limited dataset will lead to overfitting and these results show that regularization and other techniques can handle a low data regime for calibration. + +10. The authors also claim that ""matrix scaling is not guaranteed to maintain the accuracy"". This claim is true, but the authors fail to mention that Matrix Scaling (and its regularized variants) can also improve the classification accuracy performance. The high capacity allows it to improve accuracy as well. After using L2 regularization the accuracy performance can be greatly improved. +Alternatively, [5], also presented Vector Scaling (which has less capacity than Matrix Scaling but more than TS) and they also show that it too can change the accuracy performance and actually improve the accuracy in most cases. So even though I do see the benefit of having method which preserves the rank, it should clearly be pointed out that it has the disadvantage of not being able to improve the accuracy. For example, compared to [3] which improves the accuracy performance, NRPT has the disadvantage with respect to accuracy. So despite this above statement being true, the authors use this statement to justify why methods like TS which preserve the accuracy are better, though TS actually has a disadvantage compared to the many cases where these rank-changing methods improve the accuracy. + +11. ECE: The authors cite [10] to use their presented debiased estimator. Though, they completely ignored the main message of that paper. [10] has shown that continuous output scaling methods (e.g. TS) have trouble estimating the ECE and instead propose to use binning or quantized output methods instead. [1] also shows how the ECE is underestimated when using too few bins during evaluation. They show that the ECE of scaling methods are non-verifiable. So even though the authors are aware of this paper, they do not at all comment on the use of ECE for continuous output scaling methods such as the one presented. Other recent works [3,4] have also discussed this and proposed solutions such as increasing the number of bins when estimating the ECE of scaling methods. How many bins were used to estimate the ECE? As the authors are aware of this work, they should discuss this weakness of ECE in their paper and more importantly use approaches used in other recent works in the literature to address these issues to some extend. Again, it seems that the authors need to do a more thorough literature search on post-hoc calibration. + +12. 
The method also uses ECE during training: Even though this might be an advantage of rank preserving methods, it is not surprising that the method using ECE as a loss performs the best at ECE. Table 1 shows that ""NRPT+E"" is only better at ECE and no other metric. As discussed earlier, the ECE metric has major flaws and problems, so using this as a loss also comes with its problems. So I do not see this variant of NRPT as very useful. It would be interesting to how ""NRPT-E"" performs on other metrics and tasks which are not directly optimized during training, as using ECE as the objective and the evaluation is unfair. + +13. Tab.1 : Why are the uncalibrated ECEs so high? These are unusually high ECE even for uncalibrated networks? There is no surprise then that even simple methods such as TS can show great improvement when the networks are so miscalibrated. Many other works using similar networks report much lower uncalibrated ECE. Maybe the ECE was evaluated differently? + +14. ECE as objective: As you use ECE during the learning, you should be providing all details about ECE (e.g. the number of bins, the bin widths or bin edges ). + +### Minor comments +* Some references as using arxiv versions of the paper (e.g. Guo2017 is a ICML paper from 2017, so shouldn't be using an arxiv reference) + + +### References (all can be found on arxiv) + +[1] Meelis Kull, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. ""Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration"" + +[2] Meelis Kull, Telmo Silva Filho, and Peter Flach. ""Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers"". + +[3] Jonathan Wenger, Hedvig Kjellström, and Rudolph Triebel. ""Non-parametric calibration for classification"" + +[4] Kanil Patel, William Beluch, Bin Yang, Michael Pfeiffer, Dan Zhang. ""Multi-Class Uncertainty Calibration via Mutual Information Maximization-based Binning"" + +[5] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. ""On calibration of modern neural networks"" + +[6] B. Ji, H. Jung, J. Yoon, K. Kim, and y. Shin. Bin-wise temperature scaling (BTS): ""Improvement in confidence calibration performance through simple scaling techniques"" + +[7] Jize Zhang, Bhavya Kailkhura, and T Han. ""Mix-n-Match: Ensemble and compositional methods for uncertainty calibration in deep learning."" + +[8] Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. ""Obtaining well-calibrated probabilities using Bayesian binning."" + +[9] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. + +[10] Ananya Kumar, Percy S Liang, and Tengyu Ma. ""Verified uncertainty calibration""",2,5.0,ICLR2021 +H1eC-EI6n7,3,H1xipsA5K7,H1xipsA5K7,"interesting, technical results on learning one hidden layer NN","This paper pushes forward our understanding of learning neural networks. The authors show that they can learn a two-layer (one hidden layer) NN, under the assumption that the input distribution is symmetric. The authors convincingly argue that this is not an excessive limitation, particularly in view of the fact that this is intended to be a theoretical contribution. Specifically, the main result of the paper relies on the concept of smoothed analysis. It states that give data generated from a network, the input distribution can be perturbed so that their algorithm then returns an epsilon solution. 
+ +The main machinery of this paper is using a tensor approach (method of moments) that allows them to obtain a system of equations that give them their “neuron detector.” The resulting quadratic equations are linearized through the standard lifting approach (making a single variable in the place of products of variables). + +This is an interesting paper. As with other papers in this area, it is somewhat difficult to imagine that the results would extend to tell us about guarantees on learning a general depth neural network. Nevertheless, the tools and ideas used are of interest, and while already quite difficult and sophisticated, perhaps do not yet seem stretched to their limits. ",7,4.0,ICLR2019 +BkeuIKaB2X,1,ryeh4jA9F7,ryeh4jA9F7,"Nice idea, insufficient baselines","The authors focus solely on universal adversarial perturbations, considering both epsilon ball attacks and universal adversarial patches. They propose a modified form of adversarial training inspired by game theory, whereby the training protocol includes adversarial examples from previous updates alongside up to date attacks. + +Originality: I am not familiar with all the literature in this area, but I believe this approach is novel. It seems logical and well motivated. + +Quality and significance: The work was of good quality. However I felt the baselines provided in the experiments were insufficient, and I would recommend the authors improve these and resubmit to a future conference. + +Clarity: The work was mostly clear. + +Specific comments: +1) At the top of page 5, the authors propose an approximation to fictitious play. I did not follow why this approximation was necessary or how it differed from an stochastic estimate of the full objective. Could the authors clarify? + +2) The method proposed by the authors is specifically designed to defend against universal adversarial perturbations, yet all of the baselines provided defend against conventional adversarial perturbations. Thus, I cannot tell whether the gains reported result from the inclusion of ""stale"" attacks in adversarial training, or simply from the restriction to universal perturbations. This is the main weakness of the paper. + +3) Note that as a simple baseline, the authors could employ standard adversarial training, for which the pseudo universal pertubations are found across the current SGD minibatch. + + +",5,3.0,ICLR2019 +uFQYo2VFfwn,1,FsLTUzZlsgT,FsLTUzZlsgT,review,"This paper advocates for studying the effect of design choices in deep learning via their effect on entire *learning curves* (test error vs num samples N), as opposed to their effect only for a fixed N. This is a valid and important message, and it is indeed an aspect that is often overlooked in certain domains. However, although this paper addresses an important issue, there are methodological concerns (described below) which prevent me from recommending acceptance. In summary, the paper oversimplifies certain important aspects in both the setup and the experiments. + +Concerns: +1. My main concern is that the discussion of learning curves ignores the effect of model size. Prior work (including Kaplan 2020 and Rosenfeld 2020) has shown that learning curves exhibit quantitatively different behavior when models are overparameterized vs. underparameterized. In particular, learning curves are only known to exhibit clean power-law behavior when model-size is not the bottleneck (e.g. if model-size is scaled up correspondingly to data size). 
There is no discussion of the model-size issue in the present work. This may be problematic, since data from small N are used to extrapolate to large N, but the model size is held fixed. +Concretely: a full discussion of how to evaluate and interpret learning curves should account for the effect of model-size. + +2. The curve-fitting procedure is non-standard, and produces some questionable extrapolations. +This is concerning because one of the stated contributions of this paper is to propose an experimental methodology. Specifically: + +A. If the true parametric form is a power-law with some \gamma != 0.5, why are the learning curves plotted assuming \gamma=0.5 (Figure 1)? In the regression estimate (Equation 7), why is the exponent \gamma encouraged to be close to 0.5? +Note that the theoretical justification for \gamma=0.5 (Table 2) is weak -- it only includes parametric bounds. Non-parametric rates are in general different from \gamma=0.5. + + +B. Several of the curves in Figure 1 predict cross-overs which we do not expect to occur at N=infty. For example, Figure 1g predicts that an ensemble of 6 ResNet18s will be better than 1 ResNet50 at N=\infty, which we do not expect. + +C. In general, the curves are extrapolated from only 5 points -- it would be more convincing to see more data-sizes tested. + +3. Regarding experimental setup and conclusions: + +A. Why are there experiments for CIFAR-100 but not CIFAR-10? Most of the current experiments have high error rates (~20%), so it would have been nice to see how the curve-fits perform down to low error-rates (< 5%) as we would see on CIFAR-10. + +B. The claim that ""pretraining does not bias the classifier"" is too strong to be supported by the experiments. Certainly this does not hold for any arbitrary pre-training dataset, but perhaps it holds for ""natural"" pre-training datasets close to the ones tested here. In general, several of the experimental claims are too strong in this way -- they make universal statements, but are only tested in a few limited respects. Further experiments would give more evidence to these claims. For example, it is speculated on pg 11 that \gamma does not depend much on the model architecture. Does this continue to hold for MLPs? (Only convnets are tested in this paper). + + +Summary: The motivation of this paper is very good, but the proposed experimental methodology is somewhat lacking. This paper would be much improved by more thorough experiments and analysis, and more nuanced discussion of the experimental conclusion. + + +Comments/clarifications which do not affect the score: +Why are the experiments done using the Ranger optimizer? Would any conclusions differ if we use standard optimizers (SGD/Adam)? +I would suggest moving Section 3.2 to the appendix, since the mechanics of least squares is likely familiar to readers. This would open more space for further discussion of Figure 1 experiments. + +--- +Edit after rebuttal: Changed score from 5 to 6 (see below)",6,4.0,ICLR2021 +rygBjki6tB,1,HJxV5yHYwB,HJxV5yHYwB,Official Blind Review #2,"After Responses: +I understand the differences that authors pointed to the relevant literature. However, it is still lacking comparisons to these relevant methods. The proposed method has not been compared with any of the existing literature. Hence, we do not have any idea how does it stand against the existing approaches. Hence, I believe the empirical study is still significantly lacking. I will stick to my decision. 
Main reason is as follows; I believe the idea is interesting but it needs a significant empirical work to be published. I recommend authors to improve empirical study and re-submit. +------- +The submission is proposing a method for multi-objective RL such that the preference of tasks learned on the fly with the policy learning. The main idea is converting the multi-objective problem into single objective by scalar weighting. The weights are learned in a structured learning fashion by enforcing them to approximate the Pareto dominance relations. + +The submission is interesting; however, its novelty is not even clear since authors did not discuss majority of the existing related work. + +Authors can consult the AAMAS 2018 tutorial ""Multi-Objective Planning and Reinforcement Learning"" by Whiteson&Roijers for relevant papers. It is also important to note that there are other methods which learn weighting. Optimistic linear support is one of such methods. Hence, this is not the first of such approaches. Beyond RL, it is also studied extensively in supervised learning. For example, authors can see ""Multi-Task Learning as Multi-Objective Optimization"" from NeurIPS 2018. + +The manuscript is also very hard to parse and understand. For example, Definition 2 uses but not define ""p"" in condition (2). Similarly, Lemma 1 states sth is ""far greater"" than something else. However, ""far greater"" is not really defined. I am also puzzled to understand the relevance of Theorem 1. It is beyond the scope of the manuscript, and also not really new. + +Authors suggest a method to solve multi-objective optimization. However, there is no correctness proof. We do not know would the algorithm result in Pareto optimal solution even asymptotically. Arbitrary weights do not result in Pareto optimality. + +Proposing a new toy problem is well-received. However, not providing any experiment beyond the proposed problem is problematic. Authors motivate their method using DOOM example. Why not provide experimental results on a challenging problem like DOOM? + +In summary, I definitely appreciate the idea. However, it needs better literature search. Authors should position their paper properly with respect to existing literature. The theory should be revised and extended with convergence to Pareto optimality. Finally, more extensive experiments on existing problems comparing with existing baselines is needed.",1,,ICLR2020 +rk5c3reEe,1,ry_4vpixl,ry_4vpixl,my review,"This is a nice proposal, and could lead to more efficient training of +recurrent nets. I would really love to see a bit more experimental evidence. +I asked a few questions already but didn't get any answer so far. +Here are a few other questions/concerns I have: + +- Is the resulting model still a universal approximator? (providing large enough hidden dimensions and number of layers) +- More generally, can one compare the expressiveness of the model with the equivalent model without the orthogonal matrices? with the same number of parameters for instance? +- The experiments are a bit disappointing as the number of distinct input/output +sequences were in fact very small and as noted by the authr, training +becomes unstable (I didn't understand what ""success"" meant in this case). +The authors point that the experiment section need to be expanded, but as +far as I can tell they still haven't unfortunately. 
+",5,3.0,ICLR2017 +__GtS1g236w,2,PBfaUXYZzU,PBfaUXYZzU,Not enough novelty and originality,"This paper proposes a simple and general-purpose evaluation framework for imbalanced data classification that is sensitive to arbitrary skews in class cardinalities and importances. + +I think the problem this paper deals with is very important and of great interest to the wide range of readers. +In addition, the paper is generally clearly written and easy to follow. + +However, I found the novelty and the originality of this paper is not enough for the ICLR standard. +Although I am not aware of the paper that presents exactly the same concept as this paper, I feel the generalization this work presents is too straightforward that new insights, findings, and benefits brought by this paper to the community are very limited. +For example, the micro average (or simply called ""Accuracy"" in this paper) can be regarded as a special form of equation (6), where weights or importance is given by the relative frequency of each class. +This importance criteria can be regarded as opposite of one of the presented criteria, ""Rarity"". +There may be some cases where the rarer a class is, the more important the class is as explained in the paper, but there may be other cases where more frequent classes are more important. As such, the concept of including importance of each class into an evaluation metric has been implicitly considered. It is true that the paper provides general form that includes the aforementioned case, but I'm afraid the formulation is not novel enough to bring a new value to the community as I mentioned above. + +Another weakness of the paper lies in the experimental analysis. +Overall, the analysis is not convincing enough to verify the benefit of the proposed metric. +For example, at the end of ""WBA_rarity vs. Class-insensitive Metrics"" in page 7, the authors states ""This result validates that WBA_rarity provides a more sensitive tool for assessing classification performance"". I do not agree with this statement because it is no surprise that different evaluation metrics give different evaluation results, and this alone is not the ground for the validity of the proposed metric. The same argument can be applied to the ""Impact of WBA in Model Training"" in page 8. It is natural that a model trained with a specific criteria results in performing well on the criteria. This is again not the ground for the claim ""our new framework is more effective than Balanced Accuracy – ... also in training the models themselves."" + +The pros and cons of this paper can be summarized as follows. + + +#### Pros +1. The paper deals with practically important topic, and presents a simple, easy and intuitive solution. +1. The paper is clearly written and easy to understand. + +#### Cons +1. The novelty and the originality of this paper is not enough that there is only limited benefits to the community, which does not satisfy the ICLR standard. +1. The experimental analysis is not convincing enough to support the usefullness of the proposed metric.",3,5.0,ICLR2021 +S1gWD_G9Yr,1,rJxAo2VYwr,rJxAo2VYwr,Official Blind Review #2,"The paper proposes a new adversarial attack for the targeted blackbox model Unlike previous approaches which use the output layer possibly with some additional terms and regularization, the proposed approaches only rely on intermediate features. In fact, the adversarial example is based on a single intermediate layer. 
The adversarial examples are built by training, for each target class, a binary classifier for the class based only on the features of that layer. + +- There are multiple similar prior works that use intermediate layers in some way, and the paper does a good job to explain the differences between the proposed approach and these. + +The paper proposes three variants. The simplest one appears rather weak in numerical experiments. The two other methods seem to achieve a different tradeoff, between focusing on the general error rate or on the targeted success rate. + +In particular, FDA+fd is similar in some ways to Zhou et al. (2018) which uses an additional loss term to maximize the distance of the features at various layers. A comparison with this method (which uses the output layer) would be interesting, and may constitute a relevant additional baseline, at least for non-targeted metrics like error (which is where FDA+ms shines). + +Similarly, a comparison to AA is provided in Figure 2 in the case of 10 classes, but not in Table 1 for the 1000 classes experiment. This casts some doubt on the significance of the result in the 1000 classes setting. A major issue with the proposed method is that training binary classifiers in large multi-class settings is quite costly, which is bound to hinder one's ability to identify 'high performing attack settings' In particular, the paper mentions that the 10 classes are used for that, but no evidence is provided that these settings are actually good for all classes. + +The conclusions are rather consistent accross different pairs of source/target network. + +The paper does not mention adversarial training. As this has become a common defence practice in the literature, it would be interesting to understand to what extent the proposed methods are thwarted by adversarial training. This could increase the significance of the paper if working with intermediate feature representation makes the attack more robust to this type of defense. + +In summary, I find the experiments satisfying, although they can definitely be made more convincing by adding additional strong baselines and demonstrating how much performance can really be attained in the 1000 classes scenario. + +- Generally, not using the output layer at all is an interesting approach that deserves in my opinion to be discussed and investigated further. + +- The paper is clearly written, and notation is rigorous.",8,,ICLR2020 +BylrP5Q92m,2,HkfPSh05K7,HkfPSh05K7,Very interesting idea; needs more details and better evaluation,"The authors improve a retriever-reader architecture for open-domain QA by iteratively retrieving passages and tuning the retriever with reinforcement learning. They first learn vector representations of both the question and context, and then iteratively change the vector representation of the question to improve results. I think this is a very interesting idea and the paper is generally well written. + +I find some of the description of the models, methods and training is lacking detail. For example, their should be more detail on how REINFORCE was implemented; e.g. was a baseline used? + +I am not sure about the claim that their method is agnostic to the choice of machine reader, given that the model needs access to internal states of the reader and their limited results on BiDAF. + +The presentation of the results left a few open questions for me: + + - It is not clear to me which retrieval method was used for each of the baselines in Table 2. 
+ - Why does Table 2 not contain the numbers obtained by the DrQA model (both using the retrieval method from the DrQA method and their method without reinforcement learning)? That would make their improvements clear. + - Moreover, for TriviaQA their results and the cited baselines seem to all perform well below to current top models for the task (cf. https://competitions.codalab.org/competitions/17208#results). + - I would also like to see a better analysis of how the number of steps helped increase F1 for different models and datasets. The presentation should include a table with number of steps and F1 for different step numbers they tried. (Figure 2 is lacking here.) + - In the text, the authors claim that their result shows that natural language is inferior to 'rich embedding spaces'. They base this on a comparison with the AQA model. There are two problems with this claim: 1) The two approaches 'reformulate' for different purposes, retrieval and machine reading, so they are not directly comparable. 2) Both approaches use a 'black box' machine reading model, but the authors use DrQA as the base model while AQA uses BiDAF. Indeed, since the authors have an implementation of their model that uses BiDAF, an additional comparison based on matched machine reading models would be interesting. +- Generally, it would be great to see more detailed results for their BiDAF-based model as well. +",6,4.0,ICLR2019 +SkrzwsKeG,1,SJcKhk-Ab,SJcKhk-Ab,Cool theoretical contribution with rather unrelated experiments,"tl;dr: + - The paper has a really cool theoretical contribution. + - The experiments do not directly test whether the theoretical insight holds in practice, but instead a derivate method is tested on various benchmarks. + +I must say that this paper has cleared up quite a few things for me. I have always been a skeptic wrt LSTM, since I myself did not fully understand when to prefer them over vanilla RNNs for reasons other than “they empirically work much better in many domains.” and “they are less prone to vanishing gradients”. + +Section 1 is a bliss: it provides a very useful candidate explanation under which conditions vanilla RNNs fail (or at least, do not efficiently generalise) in contrast to gated cells. I am sincerely happy about the write up and will point many people to it. + +The major problem with the paper, in my eyes, is the lack of experiments specific to test the hypothesis. Obviously, quite a bit of effort has gone into the experimental section. The focus however is comparison to the state of the art in terms of raw performance. + +That leaves me asking: are gated RNNs superior to vanilla RNNs if the data is warped? +Well, I don’t know now. I only can say that there is reason to believe so. + +I *really* do encourage the authors to go back to the experiments and see if they can come up with an experiment to test the main hypothesis of the paper. E.g. one could make synthetic warpings, apply it to any data set and test if things work out as expected. Such a result would in my opinion be of much more use than the tiny increment in performance that is the main output of the paper as of now, and which will be stomped by some other trick in the months to come. It would be a shame if such a nice theoretical insight got under the carpet because of that. E.g. today we hold [Pascanu 2013] dear not because of the proposed method, but because of the theoretical analysis. + +Some minor points. 
+- The authors could make use of less footnotes, and try to incorporate them into the text or appendix. +- A table of results would be nice. +- Some choices of the experimental section seem arbitrary, e.g. the use of optimiser and to not use clipping of gradients. In general, the evaluation of the hyper parameters is not rigorous. +- “abruplty” -> “abruptly” on page 5, 2nd paragraph + +### References +[Pascanu 2013] Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio. ""On the difficulty of training recurrent neural networks."" International Conference on Machine Learning. 2013.",8,4.0,ICLR2018 +qG05RqeoeCq,3,J_pvI6ap5Mn,J_pvI6ap5Mn,interesting results but a limited coverage,"The authors proposed a transfer learning scheme for graph neural networks. The proposed method ego-graph information maximization allows learning transferable models. The authors studied structure-respecting node features and provided a theoretical analysis of the transferability of GNNs. The proposed method significantly outperforms state-of-the-art methods. + +Clarity: +Overall, this paper reads fine. There are some typos and missing definitions of symbols, e.g., `sp' in eq 1 and $U^T$ in eq 3. D function is defined by another D in eq 2. In definition 2.3, 'Ordered' ego-graph is not defined. 'Title 2.2. ANALYSI', 'structural equivalence', and 'structural different' are typos. The average structural difference denoted by $\bar{d}(,)$ +are not defined. The clarity of this manuscript needs to be improved. + +Strengths/Quality/Significance (pros): +The interesting observation that the functions learned by GNNs can be viewed as functions to map a subgraph centered at a node to a class label since most GNNs have a few layers and their receptive field of a node output is a k-hop ego-graph. + +The authors studied structure-respecting node features, e.g., degrees, spectral embeddings, to show that graph filters of GNNs is transferable. Based on the structure-respecting node features, the authors provide the analysis of transferability solely depending on the graph structure. The analysis showed that the performance gap of transferred models is bounded by a function of the ordered eigenvalues of the graph Laplacian of ego-graphs. + +The proposed method achieved significant improvement against baseline approaches. + +Weaknesses (cons) & Questions: + +The writing should be improved. The manuscript should be self-contained. As mentioned above, there are functions, and variables that are introduced without definitions such as reconstruction loss, sp, $\bar{d}(,)$ and so on. + +The analysis is limited to graph structures. To benefit most GNNs in real-world applications, the transferability of GNNs needs to be analyzed with node features as well. + +In this paper, the analysis of the transferability of GNNs is limited to node classification. It is not clear whether the proposed method is effective in other tasks on graphs such as link prediction, graph classification. + +Even in the synthetic experiments, the performance gain is obtained only in the transferrable feature settings. 
+ +--- Post Rebuttal --- +I read the author response and I keep the original rating due to the limited operating range of the proposed method.",6,4.0,ICLR2021 +Wrqc9ZeUN3J,4,UuchYL8wSZo,UuchYL8wSZo,Convincing paper about learning transferrable reps from correlated/synthetic images via competitive interaction,"The paper intends to contribute a novel task (Cache, as realized in AI2-THOR), the architecture of a strong Cache agent which learns reusable representations which allow significant transfer performance, and novel methods for evaluating the quality of dynamic image representations. The first and third contributions are directly related to the conference topics, and the second provides additional evidence in favor of the paper’s core idea: training on interactive gameplay allows learning flexible representations (in the sense of supporting many tasks via transfer) even when images are highly correlated and synthetic. + +The key strength of the paper is the very general core idea it advances and how this idea is explored via a novel task. The paper is easy to read and convincing. The partition into details needed for the paper’s core argument and details specific to the experiments in the appendix is well done. (If anything even more could have been pushed to the appendix.) + +One weakness of the paper is that a specific notion of “flexible” (which is mentioned in the title and twice in the abstract but nowhere else) is not advanced or integrated with the core idea. How does gameplay relate to flexibility? Why might flexibility be harder to achieve via passive learning or reinforcement learning with fixed reward functions? Because the authors place stress on the idea of how play and interaction contribute to representation learning (rather than a new method), slightly more space should be given to developing the general idea. The idea is not specific to vision, but only vision-related representations are considered. Sketching how the idea ought to work for text or audio would be useful if the focus really is on this very general idea. + +Recommendation: strong accept. The philosophical aims of the paper make it stand out amongst the mass of related work that is otherwise very engineering focused. The experiments are soundly executed in a way that ends up clearly demonstrating the core idea. + +Questions for the authors: +- A step where the hider needs to retrieve the object they hid would seem appropriate. Are there certain limitations of the AI2-THOR environment that make adding this step (which would seem to expose more of the richness of the simulated world through fixed rules of the game) infeasible to add? +- Inversely, do the authors feel that it was important that the hider manipulate the object into the desired location? How much of the richness of the simulated world comes through in the task feels relevant to the core ideas of the paper, but the paper currently does not address this kind of detail in the design of the Cache game within AI2-THOR. + +Section-by-section reactions: (to see how opinions change over time) + +Title+Abstract: +- The notion of “flexible” seems to be at the heart of this paper’s intended contribution. Hopefully it will be defined in the body text. Uh oh, it looks like “flex” only ever appears on the first of the submission’s 36 pages. Hopefully a synonym will get defined later. + +Introduction: +- Excellent motivation. +- Good that representations of interest (SIRs/DIRs) are named and distinguished. 
Many other papers, in the interest of highlighting end-to-end training, would forget to do this. +- Good explicit list of contributions, excellent that two are specifically centered on representation learning. +- Missed opportunity to highlight a distinct role for “flexible” representations. (I don’t quite know what it should mean beyond supporting transfer well. A representation that could easily be scaled up or down in dimensionality by stripping channels in a well defined order might be considered flexible in another sense. Likewise, one that was defined in terms of pluggable input modules to work with novel combinations of familiar input types might be considered differently flexible. What kind of flexibility do you want?) + +Related Work: +- Another take on learning visual representations via interactive gameplay is seen in https://arxiv.org/abs/1812.03125 where the authors learn a SIR (trained on a proxy task of predicting videogame memory state) that supports the use of low-continuous-space exploration strategies like rapidly-exploring random trees. The representations are learned offline/passively, but they are learned as to improve the efficiency of the very exploration process that builds that dataset for offline learning. + +Playing Cache in a Simulation: +- It seems notable that the hiding agent is never asked to retrieve the object they have hidden. Without this step, the hiding agent may find ways of manipulating objects in a way that makes them simply unretrievable (e.g. the object is pushed into a corner in a way that causes it to glitch out of the room, etc.). A step like this would require the hider to learn a finer grained representation of the hiding location that gives itself a clue as to how it should be retrieved (e.g. “under the couch in a place you’ll never be able to see but will be there if you actually reach for it). + +Learning to Play Cache: +- Great! + +Experiments: +- All well done. + +Discussion: +- “We believe that it is time for a paradigm shift via a move towards experiential, +interactive, learning.” -- something similar has been said by many other researchers in many different decades, so it would be good to say what’s different about the situation in 2021. The difference now seems to be the availability of simulators with visual fidelity comparable enough to reality to demonstrate meaningful sim2real transfer. Are there other bullet points that could be added to a why-now argument?",9,3.0,ICLR2021 +wlufE3AkFG,2,b6BdrqTnFs7,b6BdrqTnFs7,Review,"The paper proposes a new regularization method that constrains the mapping between the inputs and output spaces for achieving compositional generalization in simple grounded environments like gSCAN. The problem is interesting and important and the paper is corroborated by good experiments with 25% accuracy increase and also generalization to longer commands. However, the paper has clarity issues with descriptions that are sometimes vague or not precise enough and quite frequent language mistakes. It also doesn't discuss almost at all existing works and mention the only very briefly, making it harder to judge the strength of the new approach as there's not enough context. + +In addition, the idea presented feels somewhat too specific to the particular gSCAN task. For instance, it considers particularly disentangling the representation of each step to direction, action and manner components. 
I would hope that a general approach will discover a disentangled representation that allows compositional generalization on its own, rather than being hand-engineered for the particular dataset, especially given its relative simplicity. + +For the entropy regularization idea, while it may allow for compositional generalization, it may reduce the model ability to capture trends in the training data, and so it may produce too ""extreme"" representations that can’t account for correlations within the data, and therefore I suspect it may not work well for more complex problems beyond gSCAN. + +Comments and questions +- The related work sections gives background about the subject and used dataset but has only one short like about competing approaches. It will be good to move the general text from the related work section to the introduction section and instead add a bit more detailed description about prior approaches for the problem and how they differ from the new method. +- The experiments section doesn’t provide any details about the baselines either, so overall a more detailed comparison to existing approaches or putting the paper in the context of prior work is really missing. +- Page 2: ""Another related work is independent disentangled representation (Higgins et al., 2017; Locatello et al., 2019), but they do not address compositional generalization."" -> Why aren’t they? The main advantage of disentangled representation is their ability to generalize to combinations of properties outside the training distribution. For instance, if you encode separate features for color and shape, you may learn to generalize to any combination of them even if the training data didn’t cover all of them, since your representation inherent compositional structure that separates out these two properties may prevent capturing spurious correlations of the two properties and encourage generalization to combinations of them. +- The description of the task is not clear enough. A more formal/mathematical definition of the task will be useful, especially if letters are presented for e.g. the input command, output sequence etc it will be easier to refer back to them later in the paper. Also further description/ a couple of examples on what the actions and the manners are will be useful for those not familiar enough with gSCAN. +- ""We also assume automatic collision prevention… This makes us focus on addressing grounded compositional generalization problem."" - is it fair to assume that or does it simplify the problem? Are alternative approaches assume that? How is it useful? +- Page 3: ""For example, when ""red"" and ""square"" do not appear together in any training sample, a model might learn that square is not red. However, this causes errors for compositional generalization in test. To avoid such case..."" - Do we want to avoid such cases? A model that will learns that there is no correlation at all between pairs of properties will not work in the real world. Rather than avoiding learning the correlations it will be useful if the model will be still able to learn them but at the same time allocate some smaller probability for the case of combinations that are less common. +- The description of entropy regularization isn’t completely cleared to me. What are x_i and y_i. Does each y_i depend on all x or only xi? +- It sounds like basically entropy regularization reduces the capacity of the representation by adding noise + l2 loss on the activations. 
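In code terms, the mechanism I am picturing is roughly the following (my own paraphrase with invented names, not the authors' implementation): inject Gaussian noise into the hidden activations and put l2 pressure on them, which limits how much information the layer can carry.

import torch

def noisy_l2_penalty(h, sigma=0.1, weight=1e-3):
    # add Gaussian noise to the hidden activations and penalize their energy
    h_noisy = h + sigma * torch.randn_like(h)
    penalty = weight * h_noisy.pow(2).mean()
    return h_noisy, penalty
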
This sounds like a quite general regularization technique but is unclear to me why this encourages compositionality in particular. Further explanation of that will be helpful. +- A small comment, since the generalization to length work well but one-shot learning isn't and is still an open problem, it may make more sense to put the length subsection first and then the one-shot learning one. + +Some Typos +- What are the components in output -> in the output +- other changes with ablation study -> with an ablation study +- to understand command and the environment -> the command and +- redundant dependency on input -> on the input +- Grounded SCAN (gSCAN) dataset -> the Grounded SCAN dataset +- but agent needs -> but the agent need or but agents need +- of agent and the environment -> of the agent +- to change position -> to change its position +- on addressing grounded compositional generalization problem -> addressing the problem +- Input contains command -> the input +- Output contains -> the output +- Information of color -> of the color +- design entropy regularization layer -> a/the layer +- finds correct object -> find the correct object +",5,4.0,ICLR2021 +H1g8Iq1Z5S,3,rkgAb1Btvr,rkgAb1Btvr,Official Blind Review #2,"The paper presents a method to detect out-of-distribution or anomalous data points. They argue that Fourier networks have lower confidence and thus better estimates of uncertainty in areas far away from training data. They also argue for using “large” initializations in the first layers and sin(x) as the activation function for the final hidden layer. + +The paper does not seem to have any significant logical reasoning on why their specific architecture works, but ""describes"" what they did. It is not clear what the novelty is, besides that they found an architecture that seems to work. Additionally while Fourier networks have lower confidence, that does not necessarily mean they are more accurate estimates of uncertainty. However the reviewer does acknowledge that the estimates are mostly likely better than ReLU networks that are well known for having terrible estimates of uncertainty. +",1,,ICLR2020 +woYZQkDY2lV,1,GMgHyUPrXa,GMgHyUPrXa,An interesting idea but the conclusions of this work are not very significant,"Summary of contributions: + +In this paper, authors performed a neural architecture search to improve LISTA algorithm for solving the lasso on synthetic data. The main motivation of this paper is to investigate the effectiveness of unrolling in comparaison with a more ""black box"" architectures. The ""averaged"" top architectures found by the study give better performance than LISTA when the two models are trained on the same dataset. + +Strengths: + +-The idea of using NAS to study unrolled models seems to be novel to the best of my knowledge. + +-This idea is interesting. I definitely agree that the relevance of unrolled architectures should be questioned. + +-The authors did a good job to describe clearly their idea and the experiments they conducted (at the exception of the last paragraph ""transferring found patterns to other unrolling"" which lacks clarity in my opinion). + +Weaknesses: + +-I think that the overall conclusions of the paper are not very significant. The authors only focus on a very simple problem : the lasso on synthetic data (with known dictionaries). 
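(To fix notation for the points below, and this is only my recap of the standard setup in case I am mis-reading the paper: the problem being unrolled is $\min_z \tfrac{1}{2}\|x - Dz\|_2^2 + \lambda\|z\|_1$; plain ISTA iterates $z^{t+1} = \mathrm{ST}_{\eta\lambda}\big(z^t - \eta D^\top(Dz^t - x)\big)$; and LISTA simply replaces $\eta D^\top$ and $I - \eta D^\top D$ by learned matrices and learns the thresholds.)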
In such a simple setting I do not see why one would consider using this trainable model rather than more standard algorithms for solving the lasso (which are significantly faster than ISTA and have convergence guarantees and/or duality gap). + +-I am not convinced by the results regarding transferability. Even if the searched model outperforms the other methods in the simplest setting, the searched model seems to be very sensitive to perturbations. Looking at line (i) of table 1 with perturbed dictionary, the gap between searched model and LFISTA is small and the searched model performs as good as Dense Lista. More generally, the searched model seems to perform as good as Dense LISTA in most of the perturbed settings. Again this question the utility of this model. + +-The authors only consider the problem where the underlying optimization problem is known. I think it would be much more interesting to study the case when this underlying optimization problem is not known and has to be learnt (that aspect is briefly discussed at the end of the related work section). For example what if the dictionary is not known ? Lista offers the possibility to learn the dictionaries with end-to-end training. What would be the results of the NAS in the setting with unknown dictionary, how would unrolled models perform in that setting ? + +Suggestions to authors: + +-It would be interesting to consider the case where the underlying optimization problem is not known a priori. In that situation I think that unrolled models make more sense (unrolled weighted-l1sparse coding for a simple task such as denoising, experiments with learned dictionaries ...). In that setting I would be more curious regarding the results of the NAS study. + +-I think it would be relevant to compare the unrolled models to other algorithms for solving the lasso (lars...). In my opinion, it would clarify the conclusion of the paper. (A comparison in term of inference speed, accuracy, computational cost would be interesting). +",4,3.0,ICLR2021 +rJ81OAtgM,1,B1zlp1bRW,B1zlp1bRW,"The paper presents interestings results about consistency of learning OT/Monge maps although weak and stochastic learning algorithms able to scale, however some parts should deserve more discussion and experimental evaluation is limited.","Quality +The theoretical results presented in the paper appear to be correct. However, the experimental evaluation is globally limited, hyperparameter tuning on test which is not fair. + +Clarity +The paper is mostly clear, even though some parts deserve more discussion/clarification (algorithm, experimental evaluation). + +Originality +The theoretical results are original, and the SGD approach is a priori original as well. + +Significance +The relaxed dual formulation and OT/Monge maps convergence results are interesting and can of of interest for researchers in the area, the other aspects of the paper are limited. + +Pros: +-Theoretical results on the convergence of OT/Monge maps +-Regularized formulation compatible with SGD +Cons +-Experimental evaluation limited +-The large scale aspect lacks of thorough analysis +-The paper presents 2 contributions but at then end of the day, the development of each of them appears limited + +Comments: + +-The weak convergence results are interesting. However, the fact that no convergence rate is given makes the result weak. +In particular, it is possible that the number of examples needed for achieving a given approximation is at least exponential. 
+This can be coherent with the problem of Domain Adaptation that can be NP-hard even under the co-variate shift assumption (Ben-David&Urner, ALT2012). +Then, I think that the claim of page 6 saying that Domain Adaptation can be performed ""nearly optimally"" has then to be rephrased. +I think that results show that the approach is theoretically justified but optimality is not here yet. + +Theorem 1 is only valid for entropy-based regularizations, what is the difficulty for having a similar result with L2 regularization? + +-The experimental evaluation on the running time is limited to one particular problem. If this subject is important, it would have been interesting to compare the approaches on other large scale problems and possibly with other implementations. +It is also surprising that the efficiency the L2-regularized version is not evaluated. +For a paper interesting in large scale aspects, the experimental evaluation is rather weak. + +The 2 methods compared in Fig 2 reach the same objective values at convergence, but is there any particular difference in the solutions found? + +-Algorithm 1 is presented without any discussion about complexity, rate of convergence. Could the authors discuss this aspect? +The presentation of this algo is a bit short and could deserve more space (in the supplementary) + +-For the DA application, the considered datasets are classic but not really ""large scale"", anyway this is a minor remark. +The setup is not completely clear, since the approach is interesting for out of sample data, so I would expect the map to be computed on a small sample of source data, and then all source instances to be projected on target with the learned map. This point is not very clear and we do not know how many source instances are used to compute the mapping - the mapping is incomplete on this point while this is an interesting aspect of the paper: this justifies even more the large scale aspect is the algo need less examples during learning to perform similar or even better classification. +Hyperparameter tuning is another aspect that is not sufficiently precise in the experimental setup: it seems that the parameters are tuned on test (for all methods), which is not fair since target label information will not be available from a practical standpoint. + +The authors claim that they did not want to compete with state of the art DA, but the approach of Perrot et al., 2016 seems to a have a similar objective and could be used as a baseline. + +Experiments on generative optimal transport are interesting and probably generate more discussion/perspectives. + +-- +After rebuttal +-- +Authors have answered to many of my comments, I think this is an interesting paper, I increase my score. +",7,3.0,ICLR2018 +BylXNvH1qr,2,HygHbTVYPB,HygHbTVYPB,Official Blind Review #3,"It is argued in this paper that GANs often suffer from mode collapse, which means they are prone to characterize only a single or a few modes of the data distribution. In order to address this problem, the paper proposed a framework called LDMGAN which constrains the generator to align distribution of generated samples with that of real samples in latent space by introducing a regularized AutoEncoder that maps the data distribution to prior distribution in encoded space. + +The major difference of this paper from many traditional GANs is to constrain the distributions of generated data same as distributions of true data in latent space instead of constrain the ability of discriminator. 
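To make what follows concrete, the kind of latent-space constraint being referred to looks schematically like this (my own sketch with invented names; the paper's actual regularizer may differ): encode both real and generated samples and penalize the gap between the two encoded distributions, here crudely via their first two moments.

# E is the encoder, G the generator; x_real and z_noise are torch tensors.
def latent_alignment_penalty(E, G, x_real, z_noise):
    h_real = E(x_real)           # encoded real data
    h_fake = E(G(z_noise))       # encoded generated data
    mean_gap = (h_real.mean(0) - h_fake.mean(0)).pow(2).sum()
    var_gap = (h_real.var(0) - h_fake.var(0)).pow(2).sum()
    return mean_gap + var_gap
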
The authors detailed their motivation, the algorithm, and also reported a series of evaluation results on several datasets. + +Generally, this paper was well written. However, this paper has the following major concerns: + (1) Though somewhat new, the novelty of this paper may be incremental to me. It looks like a combination of VEEGAN and AAE. Though the authors mentioned that VEEGAN autoencoded the noise vectors rather than data items, and AAE exploited the adversarial learning in the encoded space rather than using an explicit divergence, it appears not significant to me between the proposed model and these two models. At least the authors did not address sufficiently how significant the proposed method would be. + + (2) The paper tested the proposed algorithm with a 2D Synthetic dataset. However, I found a lot of discrepancies in the results presented in Table 1 with other published works. The authors show 1 of mode captured on 2D Grid and 2D Ring using the VEEGAN method. However, the VEEGAN paper shows they get 24.6 and 8 on 2D Ring and 2D Grid respectively. Such discrepancies were also observed in Figure 3. These discrepancies must be explained. + + (3) In Figure 4, the authors showed the distribution of MODE scores for GAN and LDMGAN. From the figure, it seemed that LDMGAN improved the sample quality and diversity compared to GANs, but it is still prone to characterizing only a single or a few modes of the data distribution. In another word, this may alleviate the problem but may not fully solve the problem. Another minor point, the coordinate and legend are too small in this figure. It would be better if they become bigger. + + (4) The results of Table 4 is not convincing because the comparative methods are truly out-of-date. It would be more convincing if more latest methods can be compared with LDMGAN method. Those results reported are far lower than the state-of-the-art performance in these datasets. + + (5) There is a mistake in the second term of equation 11, it should be ⋯〖-D〗_KL (p^* (y)||p(y)) ) +",3,,ICLR2020 +S1gYbn7fqS,3,SJgs8TVtvr,SJgs8TVtvr,Official Blind Review #2," +Summary: + +The paper proposes to expand the VAE architecture with a +mixture-of-experts latent representation, with a +mixture-component-specific decoder that can specialize in a specific +cluster. Importantly, the method can take advantage of a similarity +matrix to help with the clustering. + +Overall, I recommend a weak accept. The method seems reasonable, and +the paper is well-written, but the results are only marginally better +than other methods, and there are several weaknesses with the proposed +architecture and experimental setup. + +Positives: + +* The idea of a more expressive variational distribution seems good, + although it is not novel. + +* The ability to have multiple decoder networks seems reasonable. + +* The ability to incorporate domain knowledge (in the form of a + similarity matrix S) is a plus. + +* The experiments are thorough, although the method is generally only + slightly better than competing methods. + +Negatives: + +* It's not clear if the similarity matrix S is already solving the + clustering problem - in which case, why do we need the rest of the + model? For example, in your experiments you often used UMAP to + cluster data. How does using UMAP by itself work? (Along these + lines, it was not clear if your GMM experiments clustered data in + the original space, or in the UMAP'd space - please clarify this). 
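For concreteness, the baseline I have in mind is nothing more than the following (standard umap-learn and scikit-learn calls; X is the data matrix and K the number of clusters, both placeholders):

import umap
from sklearn.mixture import GaussianMixture

Z = umap.UMAP(n_components=10).fit_transform(X)           # the same kind of projection already used
labels = GaussianMixture(n_components=K).fit_predict(Z)   # plain GMM on that projection
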
+ A good ablation would be to somehow remove the S matrix, to see if + the model can accurately cluster samples. + +* There is little variance in the generated samples. + +* There is not a one-to-one mapping of clusters to labels, so it is + hard to use this method to generate a specific type of data (for + example, it is hard to generate a specific digit). This is a big + difference from, say, a conditional sampler as learned by a GAN. + This also arises in Fig. 3, where it is clear that latent cluster + assignments do not match human-interpretable cluster assignments. I + suppose this is to be expected, but taken with the previous point + (little variance in generated samples) I think it seriously weakens + the paper's claim that this is an ""accurate an efficient data + generation method."" + +* The method does not do well when the number of clusters is large. + Regular GMMs seem to outperform it. + +* I felt that this paper made excessive use of the appendix. The + paper is not self-contained enough, effectively violating the length + restrictions. Please make an effort to move key results back in to + the main body of the paper. + + +Experiments to run: + +An ablation regarding the similarity matrix S. + +Clarification of whether GMM experiments are run in data-space, or +UMAP'd space. + +MIXAE features prominently in your related works, but is not compared +to in your experiments. It sounds like a natural comparison. Please +run this experiment, or explain why it is not a comparable method. + + +",6,,ICLR2020 +BklBblb52Q,2,BJgTZ3C5FX,BJgTZ3C5FX,"Review for ""Generative model based on minimizing exact empirical Wasserstein distance"".","The authors propose to estimate and minimize the empirical Wasserstein distance between batches of samples of real and fake data, then calculate a (sub) gradient of it with respect to the generator's parameters and use it to train generative models. + +This is an approach that has been tried[1,2] (even with the addition of entropy regularization) and studied [1-5] extensively. It doesn't scale, and for extremely well understood reasons[2,3]. The bias of the empirical Wasserstein estimate requires an exponential number of samples as the number of dimensions increases to reach a certain amount of error [2-6]. Indeed, it requires an exponential number of samples to even differentiate between two batches of the same Gaussian[4]. On top of these arguments, the results do not suggest any new finding or that these theoretical limitations would not be relevant in practice. If the authors have results and design choices making this method work in a high dimensional problem such as LSUN, I will revise my review. + +[1]: https://arxiv.org/abs/1706.00292 +[2]: https://arxiv.org/abs/1708.02511 +[3]: https://arxiv.org/abs/1712.07822 +[4]: https://arxiv.org/abs/1703.00573 +[5]: http://www.gatsby.ucl.ac.uk/~gretton/papers/SriFukGreSchetal12.pdf +[6]: https://www.sciencedirect.com/science/article/pii/0377042794900337",2,5.0,ICLR2019 +ByeD9kIN5S,2,HJe_Z04Yvr,HJe_Z04Yvr,Official Blind Review #2,"The paper presents an approach for style transfer with controlable parameters. The controllable parameters correspond to the weights associated to ""style losses"" or ordinary style transfer models (distance between gram matrices of generated vs style image at specific layers of a network). The authors propose to learn a single architecture that takes these parameters as input to generate an image that resembles what would be generated by optimizing directly on these parameters. 
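If I understand the construction, the conditioning amounts to something like the sketch below (my schematic, not the authors' code): the vector of loss weights w is mapped to the affine parameters of instance normalization, so a single network covers the whole family of objectives.

import torch.nn as nn
import torch.nn.functional as F

class ConditionedIN(nn.Module):
    def __init__(self, num_channels, num_weights):
        super().__init__()
        self.to_affine = nn.Linear(num_weights, 2 * num_channels)

    def forward(self, x, w):  # x: (B, C, H, W) features, w: (B, num_weights) loss weights
        gamma, beta = self.to_affine(w).chunk(2, dim=-1)
        return gamma[..., None, None] * F.instance_norm(x) + beta[..., None, None]
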
Examples of transfer and of the effect of these parameters are given. A quantitative evaluation shows that the effect of changing the parameters of the new network has the effect of reducing the loss at the desired layers. + +Overall, I found the idea of learning controllable parameters interesting. The technical contribution is to show that learning the weights of instance normalization as a function of the hyperparameters actually is satisfactory. To be honest, I am not sure to understand why playing on instance normalization weights turns to be effective for learning an equivalent of a reduction of per-layer loss. I would have liked more motivation and intuition behind it. + +The paper seems overall a bit incremental with respect to prior work on style transfer. While adjustable parameters could have application in more general tasks of image generation (for instance, in the line of work on disentangled representations), it is unclear how the specific approach of learning instance normalization weights extends beyond style transfer. As such, my feeling is that the paper is borderline. + +other comments: +- I found Figure 7 important because it guarantees learning has the desired effect. However, there is a bit a lack of baseline/topline: how do the true losses decrease when their weights increase?",6,,ICLR2020 +rJgZ18JcFH,1,rkeZNREFDr,rkeZNREFDr,Official Blind Review #2,"This paper proposes a feature leveling technique to improve the self-explaining of deep fully connected neural networks. The authors propose to learn a gated function for each feature dimension for whether to directly send the feature to the final linear layer. The gated function is trained with L0 regularization technique proposed in (Louizos et al., 2017) to encourage more low-level features passed to the final layer. Experimental results on MNIST, California Housing, CIFAR10 show that the proposed method can achieve comparable performance with existing algorithms on sparse neural network training. + +Quality: + +Overall, the paper is well written with some minor formatting errors. The toy example demonstrates the idea of this paper clearly. However, the novelty of this paper, when compared to NIT, is that the L0 regularization is used to pass the feature to last layer. Considering the self-explaining feature, this work can only explain that some of the input features are suited for final layer, while there is no explanation on the other features since they are used to construct higher level features. + +Claririty: + +Some parts of this paper are not clear: +1. Why l_k and h_k has to be disjoint? A feature suited for final classification does not suggest that it can’t be used to construct higher-level feature. +2. In (4), should not B() be an inverse binary activation function: (1-z)? +3. Is g(.) the Bernoulli distribution? +4. In section 5.2, why compare the gradients of a specific input example while one can directly look at z_k, the gated function? +5. I assume the fully connected layers have bias term. If so, (4) suggests that the gated location will also be added with a learned bias, which is different than what the paper proposes. + +Novelty: + +The novelty of this paper lies in the sparse training objective becomes passing as many lower-level features to final layer as possible instead of zeros out the intermediate weights. However, the key technique, L0 regularization, has been proposed and used as stated in the related work. 
While the authors state the application L0 to a novel context to select features is different from prior work, the novelty is rather incremental. + +Significance: + +This work demonstrates that the L0 regularization technique for sparse neural network training can also be applied to learn a skip-layer connection. However, from both novelty, performance, and self-explaining perspectives, this work does not introduce much to the field. + + +Pros: + +1. The paper is well written. +2. The toy example showcases the issue that this work tries to tackle. +3. The experimental results show the comparable performance to existing works. + +Cons: +1. The novelty is not sufficient considering the prior works on sparse neural network training. +2. There are some clarification issues as mentioned before. +3. The performance is only comparable to existing works. +4. The self-explaining contribution is not clear since only a few input features can be explained if they are passed to the final layers. +5. There is no experiment on how \lambda would affect the resulting network architecture. + + +Minor corrections: +1. First paragraph on sec. 5: three datasets: two datasets (or mention it’s in appendix). +2. 5.2 compare to NIT: the citations are in wrong format. Also the reference for NIT is corrupted. +",3,,ICLR2020 +BkeQQ3N7nQ,2,BJxPk2A9Km,BJxPk2A9Km,An interesting topic but need to think of strategy that is more reasonable to compute the similarity between each memory entry,"This paper attempts to study memory-augmented neural networks when the size of the data is too large. The solution is to maintain a fix-sized episodic memory to remember the important data instances and at the same time erase the unimportant instances. To do so, the authors improve the method called DNTM (Gulcehre et al., 2016) by incorporating the similarity between each memory entry besides the similarity between the current data the each memory entry. Experiments show the effectiveness of the proposed method. + +Here are my detailed comments: +This is an interesting topic where augmented memory is used to improve the performance of neural networks. It is important to put the most important information in the limited external memory and discard the less important contents. In the work DNTM, the similarity of the current data instance and each memory entry is introduced to determine which memory entry should be rewritten. The authors think that this measurement is not enough and consider the relationship between each memory entry. In my opinion, this is a reasonable extra measurement since the information is also important if it has strong connection with other stored information. + +However, a deficiency of this work is that the relationship between each memory entry is not calculated in a reasonable way because the authors only use the bidirectional GRU to do this. From the motivation, we know that the authors want to obtain the relationship between every memory entry. However, as we know RNN models including GRU are suitable for those data that have sequence order. More specifically, bidirectional RNN models are used when we want to obtain not only the impact from beginning to end but also the impact from the end to the beginning. In addition, by using bidirectional RNN, we cannot obtain the relationship between each memory entry. If the authors want to realize that, it is necessary to disrupt the order of the memory entries and input the disordered entries into RNN models for n! 
times where n is the number of the memory entries and this will cost many computations. Although in experiments the proposed method shows its effectiveness and outperforms the baseline methods, the baseline methods are not enough to convince me that the proposed method is effective. I strongly suggest that the authors could incorporate more works that is state-of-the-art as baseline methods and consider strategies that are more reasonable to compute the relationship between each memory entry. + +Besides, there are some grammar mistakes and typos, especially about the usage of article and correctness on singular and plural. The paper needs more careful proofreading. +",4,4.0,ICLR2019 +SJlBSAC3FS,1,rklnA34twH,rklnA34twH,Official Blind Review #1,"In this paper, the authors proposed the Adversarial predictive normalize maximum likelihood (pNML) scheme to achieve adversarial defense and detection. +The proposed method is an application of universal learning, which is compatible with existing adversarial training strategies like FGSM and PGD. +However, the experimental results indicate that the proposed method is more suitable for the models trained under PGD-based attacks. +According to the analysis shown in the paper, the proposed method works best when the adversary finds a local maximum of the error function, which makes it more robust to strong attacks. +It seems that the proposed work is a good attempt that applies universal learning to adversarial training, but more experiments are required to support its usefulness and effectiveness, especially for the weak attack like FGSM. Additionally, I would like to see more discussions about the limitations of the proposed method. + +Minors: +In Figure 2, I would like to see the results related to FGSM.",3,,ICLR2020 +yOfuXt4gI0a,4,mLtPtH2SIHX,mLtPtH2SIHX,Label smoothing in mixup,"This work proposes to change the labels for the mixed examples in mixup. My major concerns are as follows. +1. The motivation of adopting soft label for mixup is not clear. Label smoothing is helpful for generic training but why it can benefit mixup? +2. The proposed method is more like a combination of mixup and label smoothing. The improvement may come from label smoothing as a generic trick rather than mixup itself. +3. The performance of proposed method is very close to mixup, where the improvement is not significant. Additional experiments on ImageNet can make the results more convincing.",4,4.0,ICLR2021 +Bkx48v0Jqr,3,H1lXVJStwB,H1lXVJStwB,Official Blind Review #3,"*Revision after author response* + +I thank the authors for the comments on my questions. + +Unfortunately, I do not feel that these comments addressed my main concerns. For all my experimental analysis questions, the authors promised some analyses for future versions, but I was hoping to see at least a minor preliminary analysis at this point, to see if indeed my concerns are valid or not. + +Moreover, for my question number 1 about the optimization problem, the authors referred me to Corollary 1 from the paper, but that didn't really help me because, as the other reviewers also point out, the writing is quite hard to follow. + +Because of all these, I have decided to revise my score to a weak reject. While I believe the paper has merit, it requires revisions at many points in order for a reader to truly understand the method and trust the experimental results. 
+ +-------------------------------------------------------------------------------------------------------------- +The paper proposes a curriculum learning approach that relies on a new metric, the dynamic instance hardness (DIH). DIH is used to measure the difficulty of each sample while training (in an online fashion), and to decide which samples to train on next. The authors provide extensive experiments on 11 datasets as well as some theoretical motivation for the use of this approach. + +---- Overall opinion ---- +Overall I believe this paper is an interesting take on curriculum learning that is able to achieve good results. I believe this approach is a combination of core ideas from multiple sources, such as boosting, self-paced learning, continual learning and other curriculum learning approaches, but overall it seems different enough from each one of them individually. Because of the resemblance with these many different methods, the method itself does not surprise through the novelty of a new idea, but the authors seemed to have found something that was missing from these methods and that leads to very good results. The experimental results look great, but I believe the paper is missing some ablation studies to assess the importance of certain components (see details below). I also had some trouble understanding certain arguments, which I hope the authors can clarify. + +---- Major issues ---- +1. I find the arguments section 2.1 quite difficult to follow. In particular, under the assumption stated in the paper that r_t(i) = f(i|S_{1:t−1}) = f(e_i + S_{1:t−1}) − f(S_{1:t−1}) , why does it follow that r_t(i) can be used instead of f in the minimization problem (2). + +2. Based on the method itself, it seems to me that the parameter k_t could would have a lot of influence on how well the method doing. The authors mention in the experimental section what values they use, but there is no indication on how one would choose this value. Moreover, it would be good to see an analysis of how sensitive the results are to this choice. + +3. In Figure 1, it is not clear whether the figure on the right shows the actual loss, or the smooth loss using Equation (1) with instantaneous instance (A). If it is the former, then if the loss is so smooth, why do we need DIH? If it is the latter, then what does the instantaneous loss look like? This actually raises the question of how important the smoothing component is -- could we achieve the same results with an instantaneous loss (i.e. set gamma to 1 in Eq. 1)? + +---- Minor issues ---- +1. How do you choose T0, gamma and gamma_k? + +2. In the conclusions, the authors state that “ The reason [why MCL and SPL are less stable] is that, compared to the methods that use DIH, both MCL and SPL deploy instantaneous instance hardness (i.e., current loss) as the score to select sample”. Since there are so many other differences in the way training progresses, I think we don’t have enough evidence to attribute this to merely the “instantaneousness” of the loss. In fact, it would be interesting to see how SPL does if you use DIH as a metric (just smoothing the loss over time), but their approach of scheduling samples (easy to hard, and not the opposite and in DIHCL). + +3. Appendix C shows some interesting results regarding wall time comparison. I was surprised to see that, despite the extra computations, DHCL is comparable to random mini-batches. This makes me wonder what the stop criteria was, because when you stop matters a lot for run time comparisons. 
It would also be interesting to see a more ample discussion on this in the main text. + +4. In Figure 1, the axes are barely readable. + +5. The authors oftentimes reverse the use of \citet and \citep, for example “has been called the “instance hardness” Smith et al. (2014) corresponding to” should have a bracket, whereas “Our paper is also related to (Zhang et al., 2017)” should not have brackets. + +6. This is not an issue, but I just wanted to say I appreciated Appendix B. + +---- Suggestions ---- +1. It would be interesting to make a connection between the DIH and what other papers have discovered about example forgetting (e.g. Toneva et. al, that was mentioned in the paper). + +2. Major issues 3 -> a study on the effect of k and how to choose it. + +3. While I understand that the models chosen in the experiments are expensive to train, it would be good to report standard deviations in Table 1. + +4. Based on Table 1 and Figure 3, there is no concrete winner among the DIHCL methods. It would be good to include some recommendations in your conclusion on which one to choose and when. + +---- Questions ---- +1. “On average, the dynamics on the hard samples is more consistent with the learning rate schedule, which implies that doing well on these samples can only be achieved at a shared sharp local minima.” -> can you please explain why this is so? + +2. See Major issues 3. + +3. In Table 1, on some datasets, the authors apply lazier-than-lazy-greedy, and on some not.Why, and how does one decide this for a new dataset? + +4. How did you choose T0, gamma and gamma_k, as well as the schedules in Appendix C (page 17)? ",3,,ICLR2020 +zB5-eiUBPyH,1,3hGNqpI4WS,3hGNqpI4WS,simple yet effective method for achieving deployment-efficiency (but requiring some prior knowledge?),"The paper explores an under-researched problem, that of minimizing the number of policy updates in an RL setting (or being “deployment efficient”). This is an important aspect of using RL agents in real “production” environments where there may be many reasons why updates are costly and limiting them is an important consideration in the choice of RL method (or whether to even use RL). + +The paper shows that so-called ""off-policy” methods which, by their naming as such, it is implied that they should work in a sparse-deployments environment are, in fact, not suited (and often not evaluated) for this regime. + +By introducing a set of simple and strait-forward steps to the update and deployment process, the paper shows performance that approaches the continuous-deployment performance of comparable un-constrained methods. + +The main ingredients of the proposed method seem to be: +Model based approach to support model-based offline exploration (an ensemble of models to prevent exploitation of model inaccuracies) +Re-estimation of a model by Behaviour Cloning (BC) with data from last deployment (appears to have a large contribution though I don’t understand why) +Conservative off-line policy updates (constrained by KL to BC policy) using offline rollouts (with forward model ensemble) + +The evaluation presented in the paper and extensive (13-page) appendix clearly show the advantage of the proposed method in the sparse-deployments regime and the overall competitive performance with regards to sample efficiency as well. Code was made available as supplementary material. 
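For my own bookkeeping, the update I understand to be at the core of each offline phase is roughly the following (my sketch, not the authors' code; I assume the policies return torch Distributions with scalar per-state log-probabilities): improve the policy on model-generated rollouts while a KL penalty keeps it close to the behaviour-cloned policy from the last deployment.

import torch

def constrained_policy_loss(policy, bc_policy, states, actions, advantages, kl_weight):
    dist = policy(states)
    with torch.no_grad():
        bc_dist = bc_policy(states)              # behaviour-cloned reference, held fixed
    improvement = -(dist.log_prob(actions) * advantages).mean()
    stay_close = torch.distributions.kl_divergence(dist, bc_dist).mean()
    return improvement + kl_weight * stay_close
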
+ +I am unclear on why each training iteration should start with a policy learned from behavioural cloning of the last deployment’s data instead of the model that was deployed which would be available at that time. Figure 4 clearly shows BC to be the better approach In practice but I would appreciate some intuitive reasoning for this (unless this is standard practice). + +Perhaps my main concern with this paper, given the problem it addresses - that of a real deployment setting, is that it seems that some of the parameters that need to be tuned to achieve great performance require fitting to the specific task (i.e. deployments !). As far as I understand it, parameters such as the number of offline policy updates per deployment or the weight of the KL-divergence in the policy update step are crucial to good performance yet the paper does not explain how to choose them without engaging in the true environment. If that is correct than this seems to defeat the aims of the paper and question the overall methodology for deployment-efficiency. +",8,4.0,ICLR2021 +khg73n5Sk0W,3,MbG7JBt0Yvo,MbG7JBt0Yvo,"The authors describe coupling view of the temporal LSTM/GRU style networks, suggesting modifications to the GRU called SGRU that offers improved results on UCI activity recognition against SGRU and RVSML style approaches.","I liked the formulation and motivation of the paper, explaining the sequence metric learning problem and drawing parallel between synchronized trajectories produced by dynamical systems and the distance between similar sequences processed by a siamese style recurrent neural network. The authors propose modification the siamese recurrent network setting called classical Gated Recurrent Unit architecture (CGRU). The premise being two identical sub-networks, two identical dynamical systems which can theoretically achieve complete synchronization if a coupling is introduced between them. The authors describe how this model is able to simultaneously learn a similarity metric and the synchronization of unaligned multi-variate sequences in a weakly supervised way with the coupling demonstrating performance of the siamese Gated Recurrent Unit (SGRU) architecture on UCI activity recognition dataset (mobile data). + +The increase in norm with the epochs shows overall generalization is improving with the increase of coupling strength and an almost linear relationship between the increase of the norm and the decrease of the loss suggesting again that the coupling is helpful. Computing accuracy, F1 score Macro averaged and Mean Average Precision (MAP),similarly to Su & Wu (2019) also shows improvement against SGRU and RVSML. The authors mention a drawback of the proposed architecture is that each pair has to be passed through the network instead of just computing once each representation and then the distance for each pair and that this could be balanced by the use of virtual metric learning during training. 
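For readers less used to synchronization arguments, the construction I picture is roughly the following (my own schematic; the exact placement and form of the coupling term in the paper may differ): the two siamese branches share one GRU cell and each hidden state is nudged toward the other branch's state at every step.

import torch.nn as nn

class CoupledGRUStep(nn.Module):
    def __init__(self, input_size, hidden_size, k):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)  # shared (siamese) cell
        self.k = k                                        # coupling strength

    def forward(self, x1, x2, h1, h2):
        h1_new = self.cell(x1, h1) + self.k * (h2 - h1)   # each branch is pulled toward the other
        h2_new = self.cell(x2, h2) + self.k * (h1 - h2)
        return h1_new, h2_new
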
+ +The improvements points in the paper are +- Comparing against more recent work on the activity recognition such as Learning Discriminative Virtual Sequences for Time Series Classification +- using more comprehensive evaluation (rather than a single data set) +- will also be good to expand on the virtual metric learning during training as passing each pair can increase the training complexity for large datasets +",6,3.0,ICLR2021 +S1xfo2e9tH,2,SyxIterYwS,SyxIterYwS,Official Blind Review #1,"This paper proposes to take advantage of a known result on the channel capacity of a linear-Gaussian channel in order to estimate the empowerment and maximize mutual information between policies (action sequences) and final states (given initial states). The idea is to map the raw action sequences and states to a latent space where learning would force that linear property to be appropriate. + +I like the general idea of the paper (as stated above) along with its objectives but I have several concerns. + +First, I need to be reassured that we are computing the right quantity. Channel capacity is the maximum mutual information (between inputs and outputs) over the input distribution, whereas I had the impression that empowerment would be this mutual information, and that we want to increase it, but not necessarily reach its maximum over all possible policies: it would usually be one of the terms in an objective function (e.g. here we have reconstruction error, and in practice there would be some task to solve in addition to the exploration reward). One way to see this problem in the given formulation is that the C* objective only depends on the matrix A (which encapsulates the conditional density of z_{t+1} given the trajectory) and it does not depend at all on the distribution of the trajectory itself! This is weird since if we are going to use this as reward the objective is to improve the trajectory. What's the catch? So either I misunderstand something (which is quite possible) or there is something seriously wrong here. + +I am assuming that the training objective for the encoder is a sum of the reconstruction error and of C*. But note how this does not give a reward for policies, as such. This is a bit strange if the goal is to construct an exploratory reward! + +A less radical comment is: have you verified that the linear relationship between z_{t+k} and the b actually holds well? In other words, is the encoder able to map the raw state and actions to a space where the linearity assumption is correct, and thus where equation (3) is satisfied. + +Figure 3 has something weird, probably one of the two sequences of -1's should be a sequence of +1's. + +Figure 4 is difficult to interpret, the caption should do a better job. + +The experiment on the safety of the RL agent is weak. I don't see the longer path as safer, here. And the results are not very impressive, since the agent is only doing what it's told, i.e., go in areas with more options (more open areas) but there is no reason to believe that this is advantageous, here. + +Finally, what would make this paper much more appealing is if the whole setup led to learning better high-level representations (how to measure that is another question, but it is a standard kind of question in representation learning papers). + +Related work: + +I don't understand why in the abstract the authors refer to sampling-based methods as requiring exponentially many samples. This is not generally the case for sampling based methods (e.g. think of VAEs). 
I suppose the authors refer to something in particular but it was not clear to me what. + +References: in the intro, you might want to refer to the contrastive methods and variational methods to maximize mutual information between representations of the state and representations of the actions, e.g., + Thomas et al, 2018, 1802.09484 + Kim et al, 2018, arXiv:1810.01176 + Warde-Farley et al, 2018, arXiv:1811.11359 + + +Evaluation: for now I suggest a weak reject but I am ready to modify my score if I am convinced that my main concerns were unfounded. +",3,,ICLR2020 +SkVdaZRNl,1,Bk8BvDqex,Bk8BvDqex,,"Thank you for an interesting read on an approach to choose computational models based on kind of examples given. + +Pros +- As an idea, using a meta controller to decide the computational model and the number of steps to reach the conclusion is keeping in line with solving an important practical issue of increased computational times of a simple example. + +- The approach seems similar to an ensemble learning construct. But instead of random experts and a fixed computational complexity during testing time the architecture is designed to estimate hyper-parameters like number of ponder steps which gives this approach a distinct advantage. + + +Cons +- Even though the metacontroller is designed to choose the best amongst the given experts, its complete capability has not been explored yet. It would be interesting to see the architecture handle more than 2 experts.",8,3.0,ICLR2017 +HkexkySc27,2,SklVEnR5K7,SklVEnR5K7,"A paper with technical details and analysis, but the problem addressed does not seems to be interesting and significant","This paper analyzed on the core factor that make CNNs fail to hold shift-invariance, the naive downsampling in pooling. And based on that the paper proposed the modified pooling operation by introducing a low-pass filter which endows a shift-equivariance in the convolution features and consequently the shift-invariance of CNNs. + +Pros: +1. The paper proposed a simple but novel approach to make CNNs shift-invariant following the traditional signal processing principle. +2. This work gave convincing analysis (from both theoretical illustrations and experimental visualizations) on the problem of original pooling and the effectiveness of the proposed blur kernels. +3. The experiment gave some promising results. Without augmentation, the proposed method shows higher consistency to the random shifts. + +Cons: +1. When cooperating with augmentation, the test accuracy on random shifted images of proposed method did not exceed the baseline. Although the consistency is higher, it is secondary to the test accuracy of random shifted data. And it is confused to do average on consistency and test accuracy, which are in different scales, and then compare the overall performance on the averages. +2. It seems to be more convincing if the ‘random’ test accuracy is acquired by averaging several random shifts on a single image and then do average among images, as well as to show how accuracy various on shifting distance. +3. Some other spatial transforming/shifting adaptive approaches should be taken into consideration to compare the performance. +4. 
There are some minor typos, such as line 3 in Section 3.1 and line 15 in Section 3.2 +",5,4.0,ICLR2019 +BJxoDJtnKS,1,B1lFa3EFwB,B1lFa3EFwB,Official Blind Review #2,"*Summary* + +The paper proposes a new method to learn data-driven representations, being invariant to some specific nuisance factors which are detrimental for the selected (supervised) classification task. +Authors build upon the existing probabilistic framework termed Adversarial Invariant Induction (AII) from (Xie et al., 2017). + +They claim to explore it under a both theoretical and practical point of view, demonstrating the limitations of maximizing a variational upper bound on conditional entropy as a proxy to achieve invariance. + +Leveraging these observation, authors propose a novel method, called “invariance induction by discriminator matching” (IIDM) that is based on a regularized classification loss, penalized by a Kullbach-Leibler divergence between conditional distributions of the nuisance factor. + +Extremely convincing experiments are carried out on a synthetic and a real benchmark in multi-source domain generalization (PACS). + + + +*Pros* +1. The genesis of the proposed IIDM is extremely paced since smoothly derived from the AII framework. +2. Experimental results on a synthetic benchmark (a version of rotated MNIST) and on a popular benchmark for domain generalization (PACS) proved the effectiveness of IIDM + + + + *Cons* +1. The paper is hard to get, if the reader is not familiar with related literature +2. It is not fully clear from the paper which parts are original and which are inherited from prior work. +3. The structure of the paper needs to be improved (check my comments in the section beneath) +Some of the proposed methodologies are not clear (IIDM+) + + + + +*Detailed Comments* + +The problem considered by authors is surely interesting and addressing a popular topic in computer vision and deep learning. + +1. Unfortunately, the paper, as it is is hard to get for scholars which are not expert of the AII formalism, which, in my opinion is not enough detailed. Therefore, in my opinion clarity is something that authors should try to work hard on: for instance, during the rebuttal time, authors can write from scratch an entire new Section in which they explain in plain terms the main outcomes of their paper, without entering too much into technical details. + + +2. Additionally, the structure of the paper needs, in my opinion a major re-styling, still for the sake of better readability: +2.a. A visualization of the proposed method (for instance, using flow-diagrams) in comparison with the existing AII should help in rapidly getting the factors of novelty of the proposed IIDM method. I would also encourage authors to add a pseudo-code +2.b. Since authors claim two major contributions (understanding AII + IIDM), I would like to see those two contributions thoroughly presented and dissected in two separated sections of the paper. I am not fully convinced with the actual writing style in which the two contributions seem to be intertwined together. + + +3. Although already convincing, the experimental part can be improved: +3.a. Instead of a version of rotated-MNIST, authors can test on the “digits-five” setting (MNIST, MNIST-M, SVHN, UPS, SYN) as done in several works like http://openaccess.thecvf.com/content_cvpr_2018/papers/Xu_Deep_Cocktail_Network_CVPR_2018_paper.pdf. +3.b. 
In addition to multi-domain generalization, authors could also have tried more classical unsupervised domain adaptation settings or, even, single-source domain generalization as in https://papers.nips.cc/paper/7779-generalizing-to-unseen-domains-via-adversarial-data-augmentation. + + + +*Final Evaluation* + +I think that the main aspect that authors should face during the rebuttal is to make the paper more easy to read and better separate the two contributions (understanding AII and IIDM). What I am not convinced at all about the writing style of the authors since when reading the paper I am not always capable of understanding what is novel (since proposed by authors) and what is inherited from prior work. But, maybe, the reason for this is that I am not an expert of the specific related field - but, even so, I think that the paper needs to be understood from the broadest audience possible. + +Instead, I am familiar with multi-source/single-source domain generalization (and adaptation) and, after my careful analysis of the experiments, I see a lot of potential in the approach. I would me more than interested in checking the performance of the proposed method over some of the novel benchmarks that I have recommended. It would be nice if authors add more experiments, but I know that this is always a complicated request during a rebuttal period. +Globally, if I were asked to only rate the experimental part, I would have promoted for full acceptance. Unfortunately, the theoretical part of the paper is not fully clear to me and, therefore, I am not confident in calling for a full acceptance only based on the experiments. + +In brief, I would go for a weak reject, looking forward to the authors’ rebuttal and the opinion of the other Fellow Reviewers.",3,,ICLR2020 +Hke2lCb92Q,2,rJgYxn09Fm,rJgYxn09Fm,a promising proposal that exploits the over-parameterization nature of neural nets to reduce the model size,"This work is motivated by the widely recognized issue of over-parameterization in modern neural nets, and proposes a clever template sharing design to reduce the model size. The design is sound, and the experiments are valid and thorough. The writing is clear and fluent. + +The reviewer is not entirely sure of the originality of this work. According to the sparse 'related work' section, the contribution is novel, but I will leave it to the consensus of others who are more versed in this regard. + +The part that I find most interesting is the fact that template sharing helps with the optimization without even reducing the number of parameters, as illustrated in CIFAR from Table 1. The trade-off of accuracy and parameter-efficiency is overall well-studied in CIFAR and ImageNet, although results on ImageNet is not as impressive. + +Regarding the coefficient alpha, I'm not sure how cosine similarity is computed. I have the impression that each layer has its own alpha, which is a scalar. How is cosine similarity computed on scalars? + +In the experiments, there's no mentioning of the regularization terms for alpha, which makes me think it is perhaps not important? What is the generic setup? + +In summary, I find this work interesting, and with sufficient experiments to backup its claim. 
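Returning briefly to the cosine-similarity question about the coefficients raised above: my best guess, an assumption on my part rather than something I could confirm from the text, is that each layer carries a small vector of mixing coefficients, one entry per shared template, so the cosine similarity would be taken between these per-layer vectors rather than between scalars. A toy sketch of that reading:

```python
import numpy as np

def make_layer_weight(templates, alpha):
    """Soft template sharing: a layer's weight is a coefficient-weighted sum of the
    shared templates. templates: (K, out_dim, in_dim); alpha: (K,) per-layer coefficients."""
    return np.tensordot(alpha, templates, axes=1)            # -> (out_dim, in_dim)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, out_dim, in_dim, n_layers = 4, 16, 16, 3
    templates = rng.normal(size=(K, out_dim, in_dim))        # bank shared across layers
    alphas = rng.normal(size=(n_layers, K))                  # one coefficient vector per layer
    weights = [make_layer_weight(templates, a) for a in alphas]
    print([w.shape for w in weights])
    print(cosine(alphas[0], alphas[1]))                      # similarity between two layers' alphas
```

If alpha really is a single scalar per layer, then the reported cosine similarity needs a different explanation, which is exactly why I would like the authors to clarify this point.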
On the other hand, I'm not entirely sure of its novelty/originality, leaving this part open to others.",7,3.0,ICLR2019 +ywrsWrcbTVb,3,Pzj6fzU6wkj,Pzj6fzU6wkj,Official Blind Review #3,"########################################################################## + +Summary: + +This paper proposes a benchmark for high-level mathematical reasoning and study the reasoning capabilities of neural sequence-to-sequence models. This is a non-synthetic dataset from the largest repository of proofs written by human experts in a theorem prover, which has a broad coverage of undergraduate and research-level mathematical and computer science theorems. Based on this dataset, the model need to fill in a missing intermediate proposition given surrounding proofs, named as IsarStep. It's a very interesting task. This task provides a starting point for the long-term goal of having machines generate human-readable proofs automatically. The experiments and analysis also reveal that neural models can capture non-trivial mathematical reasoning. + + +########################################################################## + +Reasons for score: + + +Overall, I strongly vote for accepting. I think this is a very important work and this benchwork would benefit to other fresh ideas and new approaches for mathematical reasoning related research. My only concern is that as a benchmark, do authors need to conduct more experiments and baseline models on their data set to be more convincing? + + +##########################################################################Pros: + +Pros: + ++ 1. The paper mined a large corpus of formal proofs and defined a proposition generation task as a benchmark for testing machine learning models’ mathematical reasoning capabilities. Such beckmark is important and beneficial to the development of the artificial intelligence community. + + ++ 2. The proposed HAT model is novel for better capturing reasoning between source and target propositions. The design of two types of layers is reasonable and interesting. The local layers model the correlation between tokens within a proposition, and the global layers model the correlation between propositions. + + ++ 3. This paper provides comprehensive experiments, including both qualitative analysis and quantitative results, to show the effectiveness of the proposed model. Experiments and analysis reveal that while the IsarStep task is challenging, neural models can capture non-trivial mathematical reasoning. + + ++ 4. The paper is well-written and the design decisions are clearly explained. The comparison of benchmark methods is also interesting to read. In general, I think this is a worthy publication. + + +########################################################################## + +Cons: + +My only concern is that as a benchmark, do authors need to conduct more experiments and baseline models on their data set to be more convincing? The authors only use two baseline models: RNNSearch and transformer, it seems to be insufficient. At Bert era, do those improved models based on Generative Learning tasks could also be applied to IsarStep as baseline, like MASS(Masked Sequence to Sequence Pre-training for Language Generation)/UNILM(Unified Language Model Pre-training for Natural Language Understanding and Generation)?",9,4.0,ICLR2021 +SJl6_ov-G,2,HJ3d2Ax0-,HJ3d2Ax0-,"Interesting theory, could benefit from some experiments","After reading the authors's rebuttal I increased my score from a 7 to a 6. 
I do think the paper would benefit from experimental results, but agree with the authors that the theoretical results are non-trivial and interesting on their own merit. + +------------------------ +The paper presents a theoretical analysis of depth in RNNs (technically a variant called RACs) i.e. stacking RNNs on top of one another, so that h_t^l (i.e. hidden state at time t and layer l is a function of h_t^{l-1} and h_{t-1}^{l}) + +The work is inspired by previous results for feed forward nets and CNNs. However, what is unique to RNNs is their ability to model long term dependencies across time. + +To analyze this specific property, the authors propose a concept called ""start-end rank"" that essentially models the richness of the dependency between two disjoint subsets of inputs. Specifically, let S = {1, . . . , T/2} and E === {T/2 + 1, . . . , T}. sep_{S,E}(y) models the dependence between these two sets of time points. Specifically sep_{S,E}(y) = K means there exists g_s^k and g_e^k for k=1...K such that y(x) = \sum_{k} g_s^k(x_S) g_e^k(x_E). + +Therefore sep_{S,E}(y) is the rank of a particular matricization of y (with respect to the partition S,E). If sep_{S,E}=1 then it is rank 1 (and would correspond to independence if y(x) was a probability distribution). A higher rank would correspond to more dependence across time. + +(Comment: I believe if I understood the above correctly, it would be easier to explain tensors/matricization first and then introduce separation rank, since I think it much makes it clearer to explain. Right now the authors explain separation rank first and then discuss tensors / matricization). + +Using this concept, the authors prove that deep recurrent networks can express functions that have exponentially higher start/end ranks than shallow RNNs. + +I overall like the paper's theoretical results, but I have the following complaints: + +(1) I have the same question as the other reviewer. Why is Theorem 1 not a function of L? Do the papers that prove similar theorems about ConvNets able to handle general L? What makes this more challenging? I feel if comparing L=2 vs L=3 is hard, the authors should be more up front about that in the introduction/abstract. + +(2) I think it would have been stronger if the authors would have provided some empirical results validating their claims. + +",7,3.0,ICLR2018 +w5coYaCSRAH,3,9t0CV2iD5gE,9t0CV2iD5gE,Official review for 1120,"The paper introduces SplitSGD method that detects the stationary phase in the stochastic optimization process and shrinks the learning rate. The SplitSGD is based on the observation that before reaching the stationary phase, two random batches of data will likely to have the gradient aligned as the noise between different batches is dominated by the shared gradient, whereas after reaching the stationary phase, two random batches should have misaligned gradient as the gradient has become mainly noise. This observation is intuitive, and some theoretical results show that (a) at the beginning, the algorithm determines non-stationary with high probability, and (b) more important, the SplitSGD algorithm is guaranteed to converge with probability tending to 1. The experiment reveals the advantage of the SplitSGD method over alternative SGD algorithms for CNN, Resnet, and LSTM. + +Personally, I feel this paper is good enough to be accepted. 
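(To make the splitting diagnostic concrete, here is a toy sketch of my understanding: run two independent half-threads of SGD from the same iterate, compare the signs of their averaged stochastic gradients coordinate by coordinate, and shrink the learning rate once the agreement falls toward chance level. The threshold, the way the next iterate is chosen, and all names below are my own illustration rather than the paper's exact procedure.)

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_grad(w, X, y, batch=32):
    """Stochastic gradient of the least-squares loss on a random mini-batch."""
    idx = rng.integers(0, len(y), size=batch)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch

def splitsgd_like(X, y, lr=0.05, rounds=8, thread_len=200, shrink=0.5):
    w = np.zeros(X.shape[1])
    for r in range(rounds):
        sums = []
        for _ in range(2):                         # two half-threads from the same start
            wt, g_sum = w.copy(), 0.0
            for _ in range(thread_len):
                g = minibatch_grad(wt, X, y)
                g_sum = g_sum + g
                wt = wt - lr * g
            sums.append(g_sum)
        w = wt                                     # continue from the last thread's endpoint
        agree = np.mean(np.sign(sums[0]) == np.sign(sums[1]))
        if agree < 0.6:                            # agreement near chance -> treat as stationary
            lr *= shrink
        print(f"round {r}: sign agreement {agree:.2f}, lr {lr:.4f}")
    return w

if __name__ == "__main__":
    n, d = 2000, 10
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)
    splitsgd_like(X, y)
```

On a toy least-squares problem this agreement statistic typically starts near 1 and decays toward 0.5 as the iterates enter the noise-dominated regime, which is the behaviour the intuition in the summary above predicts.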
The intuition is neat, and the new approach, as supported by both the theoretical and empirical results, should merit significant values to be widely known to other scholars. The paper has made enough contributions and has high clarity in terms of writing. + +Here are some concerns that I would suggest the author to consider: +1. The theoretical results are of course important, however, the proven results could appear expected and not have surprise, despite the many technical challenges. The proven results are either for the initial steps, or for the final state on whether the algorithm converges. To further enhance the significance of the theoretical results, it could seem better to establish results on the convergence process, such as the convergence rate analysis that the author in this paper already states would be “left for future work” and the analysis “appears to be a very challenging problem”. +2. The datasets in the empirical evaluation seem not to have large sizes, especially considering the availability of large datasets nowadays. It would make the paper more convincing if the authors can add larger datasets for comparison of methods. +3. The proposed method has gains over a number of alternatives in the simulation. The gap between the new method and the other methods appears not really large. More comparison could be helpful, although I do not think it is fully critical. +4. For the simulation results, as far as I understand, the SplitSGD has better test metrics and results in less overfitting. It would be very helpful if the authors could provide an intuitive explanation of why this is the case. Also, could early stopping achieve a similar performance? +5. Here is a typo: in Eq (5) and in line 21 of algorithm 1, $\theta^{(k)}_{i\cdot l}$ should have $i\cdot l +1$ rather than $i\cdot l$ in the subscript, to match the definition in Eq (4).",7,3.0,ICLR2021 +HJxUlvBPtr,2,BklXkCNYDB,BklXkCNYDB,Official Blind Review #1,"The authors propose a method to speed-up the time to validation accuracy for a particular class of graph neural networks: Gated graph sequence neural networks (GGSNNs). + +The paper is interesting in that it describes several operations and engineering considerations to speed up the processing of a GGSNN on TPUs. It is essentially a collection of engineering steps that improve the time to validation accuracy. + +While I'm not an expert in (G)NN acceleration on TPUs, I have experience with GNNs and approaches to accelerate CNNs in GPUs. My assessment is that the scope of this work is far too narrow. It is specific to GGSNNs which is a small family of GNNs not widely used. It is also specific to TPUs and lacks evaluations of the proposed approach on other type of hardware. + +It is for these reasons that I think the paper is not appropriate for ICLR. The scope has to be broadened both in terms of the NN models and the hardware types. +",3,,ICLR2020 +HyeXzhdjYS,1,ryevtyHtPr,ryevtyHtPr,Official Blind Review #2,"This submission introduces a new concept, termed insideness, to study semantic segmentation in deep learning era. The authors raise many interesting questions, such as (1) Does deep neural networks (DNN) understand insideness? (2) What representations do DNNs use to address the long-range relationships of insideness? (3) How do architectural choices affect the learning of these representations? This work adopts two popular networks, dilated DNN and ConvLSTM, to implement solutions for insideness problem in isolation. 
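As I understand it, insideness is the classic figure-ground question: for every pixel, decide whether it lies inside or outside a closed curve drawn on the image. For readers unfamiliar with the term, a tiny reference implementation of that ground truth (my own illustration, not the authors' code) is just a flood fill of the background from the image border:

```python
from collections import deque

def insideness(curve):
    """Ground-truth insideness for a binary grid: curve[r][c] == 1 marks the closed curve.
    Flood-fill the background from the border; every non-curve pixel the fill cannot reach
    is inside. Returns a grid with 1 = inside, 0 = outside or on the curve."""
    rows, cols = len(curve), len(curve[0])
    outside = [[False] * cols for _ in range(rows)]
    queue = deque((r, c) for r in range(rows) for c in range(cols)
                  if (r in (0, rows - 1) or c in (0, cols - 1)) and curve[r][c] == 0)
    for r, c in queue:
        outside[r][c] = True
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and not outside[nr][nc] and curve[nr][nc] == 0:
                outside[nr][nc] = True
                queue.append((nr, nc))
    return [[int(curve[r][c] == 0 and not outside[r][c]) for c in range(cols)]
            for r in range(rows)]

if __name__ == "__main__":
    curve = [[0, 0, 0, 0, 0, 0],
             [0, 1, 1, 1, 1, 0],
             [0, 1, 0, 0, 1, 0],
             [0, 1, 1, 1, 1, 0],
             [0, 0, 0, 0, 0, 0]]
    for row in insideness(curve):
        print(row)
```

The point of the paper is of course whether feed-forward and recurrent architectures can learn to reproduce this long-range computation, not the computation itself.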
The results can help future research in semantic segmentation for the models to generalize better. + +I give an initial rating of weak accept because I think (1) This paper is well written and well motivated. (2) The idea is novel, and the proposed ""insideness"" seems like a valid metric. This work is not like other segmentation publications that just propose a network and start training, but perform some deep analysis about the generalization capability of the existing network architectures. (3) The experiments are solid and thorough. Datasets are built appropriately for demonstration purposes. All the implementation details and results can be found in appendix. (4) The results are interesting and useful. It help other researchers to rethink the boundary problem by using the insideness concept. I think this work will have an impact in semantic segmentation field. + +I have one concern though. The authors mention that people will raise the question of whether these findings can be translated to improvements of segmentation methods for natural images. However, their experiments do not answer this question. Fine-tuning DEXTR and Deeplabv3+ on the synthetic datasets can only show the models' weakness, but can't show your findings will help generalize the model to natural images. Adding an experiment on widely adopted benchmark datasets, such as Cityscapes, VOC or ADE20K, will make the submission much stronger. + + + + + +",6,,ICLR2020 +H1xkjO_ptS,1,H1gz_nNYDS,H1gz_nNYDS,Official Blind Review #1,"In this paper, the authors propose a method to perform architecture search on the number of channels in convolutional layers. The proposed method, called AutoSlim, is a one-shot approach based on previous work of Slimmable Networks [2,3]. The authors have tested the proposed methods on a variety of architectures on ImageNet dataset. + +The paper is well-written and easy to follow. I really appreciate the authors for structuring this paper so well. I have the following questions: + +Q1: In figure 4, the authors find that “Compared with default MobileNet v2, our optimized configuration has fewer channels in shallow layers and more channels in deep ones.” This is interesting. Because in network pruning methods, it is found that usually later stages get pruned more [1] (e.g. VGG), indicating that there is more redundancy for deep layers. However, in this case, actually deep layers get more channels than standard models. Is there any justification for this? Is it that more channels in deep layers benefit the accuracy? + +Q2: In “Training Optimized Networks”, the authors mentioned that “By default we search for the network FLOPs at approximately 200M, 300M and 500M, and train a slimmable model.” Does this mean that the authors train the final optimized models from scratch as a slimmable network using “sandwich rule” and “in-place distillation” rule? Or are the authors just training the final model with standard training schedule? If it is the first case, can the authors justify why? + +Q3: In Table 1, “Heavy Models”, what is the difference between “ResNet-50” and “He-ResNet-50”? Also, why the params, memory and CPU Latency of some networks are omitted? + +Q4: In the last paragraph of section 4, the authors tried the transferability of networks learned from ImageNet to CIFAR-10 dataset. I am not sure how the authors transfer the networks from Imagenet to CIFAR-10? Is it the ratio of the number of channels? Can the authors provide the architecture details of MobileNet v2 on CIFAR-10 dataset? 
+ +Q5: What is the estimated time for a typical run of AutoSlim? How does it compare to network pruning methods or neural architecture search methods? + +Q6: Can the methods be used to search for the number of neurons in fully connected layers? Are there any results? + +[1] Rethinking the Value of Network Pruning. Zhuang et al. ICLR 2019 +[2] Slimmable neural networks. Yu et al. ICLR 2019. +[3] Universally Slimmable Networks and Improved Training Techniques. Yu et al. Arxiv. +",6,,ICLR2020 +#NAME?,2,YmA86Zo-P_t,YmA86Zo-P_t,Insightful analysis paper of seq2seq architecture biases,"The paper studies inductive biases that are encoded in three main seq2seq architectures: LSTMs, CNNs, Transformers. Using one existing (fraction of perfect agreement) and one proposed metrics based on description length, and four dichotomy-like tasks, the authors show that, among other things, CNNs and Transformers are strongly biased to memorize training data, while LSTMs behave more nuancedly depending on input length and dropout. Unlike the first metric, the proposed metric takes values in a wider and more fine-grained range; the results are correlated for both of them. I also appreciate the attention to hyperparameter tuning and investigation of their effects in experiments. In general, the manuscript is well written and apart from a few minor questions can be accepted in its present form. + +Questions: +- SGD was found to often generalize better than its adaptive variants, like Adam (e.g. Wilson et al. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems. 2017), yet in your experiments there seems to be an opposite effect of changing the optimizer (Appendix C). Could you comment on why this is the case? +- Regarding the tendency of large dropouts to hurt memorization: to what extend does this help the peer bias in a task? It seems that hindering memorization seem to cause a complementary increase in count or add-mul ability (Table 6). Is there a value for dropout (or a combination with other hyperparameters) when Transformers would start showing a counting bias? + +Minor: +- please use alphabetic literature sorting + +UPDATE: I thank the authors for their detailed replies and running additional experiments. This resolves my questions and I'd keep my recommendation to accept the paper.",7,3.0,ICLR2021 +6xPML3UEa9W,4,wjJ3pR-ZQD,wjJ3pR-ZQD,Novel method for graph matching using deep networks,"Summary: The paper discusses the problem of graph matching (GM), which is the combinatorial (NP-hard) problem of finding a similarity between graphs, and has various applications in machine learning. More specifically, the paper proposes methods to leverage the power of deep networks to come up with an end-to-end framework that jointly learns a latent graph topology and perform GM, which they term as deep latent graph matching (DLGM). + +Strengths: The proposed method seems justified. The authors both explore a novel direction for GM by actively learning latent topology. They further propose both a deterministic optimization-based approach a generative way to learn effective graph topology for matching. Regarding the empirical results of the paper, the authors report that their methods achieve state-of-the-art performance on public benchmarks, in comparison against a few peer methods (in measures of both accuracy and F1-score). + Other than that, their claims appear to be correct, and so is the empirical methodology. 
Relation to prior work and differences are discussed. + +Weaknesses/Comments: The paper was very difficult to follow (maybe it is due to the fact that I am not highly familiar with part of the field. Nevertheless, I think that the organization of the paper can be improved). I think there is a missing word in the second last sentence on page 2: For notational brevity, we assume d1 and d2 keep the same across convolutional +layers. Same dimensions maybe?",8,3.0,ICLR2021 +HyxpuAKVnQ,1,SyMhLo0qKQ,SyMhLo0qKQ,"An interesting analysis of linear interpolations with respect to the prior in the latent space, however the paper needs some improvements especially in the motivation and its implications.","The authors study the problem of when the linear interpolant between two random variables follows the same distribution. This is related to the prior distribution of an implicit generative model. In the paper, the authors show that the Cauchy distribution has such a property, however due to the heavy-tails is not particularly useful. In addition, they propose a non-linear interpolation that naturally has this property. + +Technically the paper in my opinion is solid. Also, the paper is ok-written, but I think it needs improvements (see comments). + +Comments: + +#1) In my opinion the motivation is not very clear and should be improved. In the paper is mentioned that the goal of shortest path interpolation is to get smooth transformations. So, in principle, I am really skeptical when the linear interpolant is utilized as the shortest path. Even then, what is the actual benefit of having the property that the linear interpolants follow the same distribution as the prior? How this is related to smoother transformations? What I understand is that, if we interpolate between several random samples, we will get less samples near the origin, and additionally, these samples will follow the prior? But how this induces smoothness in the overall transformation? I think this should be explained properly in the text i.e. why is it interesting to solve the proposed problem. + +#2) From Observation 2.2. we should realize that the distribution matching property holds if the distribution has infinite mean? I think that this is implicitly mentioned in Section 2.2. paragraph 1, but I believe that it should be explicitly stated. + +#3) Fig.1 does not show something interesting, and if it does it is not explained. In Fig. 2 I think that interpolations between the same images should be provided such that to have a direct comparison. Also, in Fig. 3 the norm of Z can be shown in order to be clear that the Cauchy distribution has the desired property. + +#4) Section 2.2. paragraph 6, first sentence. Here it is stated that the distribution ""must be trivial or heavy-tailed"". This refers only to the Cauchy distribution? Since earlier the condition was the infinite mean. How these are related? Needs clarification in the text. + +#4) In Figure 4, I believe that the norms of the interpolants should be presented as well, such that to show if the desired property is true. Also in Figure 5, what we should see? What are the improvements when using the proposed non-linear interpolation? + + +Minor comments: + +#1) Section 1.2. paragraph 2. For each trained model the latent space usually has different structure e.g. different untrained regions. So I believe that interpolations is not the proper way to compare different models. + +#2) Section 1.3 paragraph 1, in my opinion the term ""pathological"" should be explained precisely here. 
So it makes clear to the reader what he should expect. + +#3) Section 2.2. paragraph 2. The coordinate-wise implies that some Z_i are near zero and some others significantly larger? + +In generally, I like the presented analysis. However, I do not fully understand the motivation. I think that choosing the shortest path guarantees smooth transformations. I do not see why the distribution matching property provides smoother transformations. To my understanding, this is simply a way to generate less samples near the origin, but this does not directly means smoother transformations of the generated images. I believe that the motivation and the actual implications of the discussed property have to be explained better.",5,4.0,ICLR2019 +t6EYOQ2Q-2Y,2,MLSvqIHRidA,MLSvqIHRidA,"This paper is acceptable, and the ICLR community may benefit from the contributions this paper brings to light. ","Summary: +This paper presents a way to view contrastive divergence (CD) learning as an adversarial learning procedure where a discriminator is tasked with classifying whether or not a Markov chain, generated from the model, has been time-reversed. Beginning with the classic derivation of CD and its approximate gradient, noting relevant issues regarding this approximation, the authors present a way to view CD as an extension of the conditional noise contrastive estimation (CNCE) method where the contrastive distribution is continually updated to keep the discrimination task difficult. Specifically, when the contrastive distribution is chosen such that the detailed balance property is satisfied, then the CNCE loss becomes exactly proportional the CD-1 update with the derivation further extended to CD-k. Practical concerns regarding lack of detailed balance are mitigated through the use of Metropolis-Hastings rejection or an adaptive weighting that arises when deriving the gradient of their time-reversal classification loss. A toy example providing empirical support for correcting the lack of detailed balance is included. + +Strengths: +The paper is well written. From ""first principles"" through the CD-CNCE link, the paper was straightforward to follow without technical issues and with appropriate references. The results of this work appear novel, and proofs seem correct. The ability to use the weighting to address detailed balance in practice is neat, and the experiments, though limited, show promise. + +Concerns: +I understand performance comparisons and experiments were not the focus of the paper. However, considering the newly presented link between CNCE and CD, it would have been exciting to see some evaluation metrics. Perhaps even just a simple experiment from the original NCE or CNCE papers. +For the MCMC process, appendix D mentions that 5 steps of Langevin dynamics were used. How was 5 selected? Was there any significant gain or degradation when it varied? More generally, what kind of impact does chain length have on the discriminator’s classification ability? Does chain length affect the behavior of CD with MH correction and adjusted CD similarly? + +Minor: +In Figure (3b), it is said that CD is based on Langevin dynamics MCMC adjusted with the method of Sec. 3.4 yet both MH and the weight adjustment are included there. Which one is in Fig. 
(3.4) +",7,3.0,ICLR2021 +i3cS6nJPPZV,2,tnSo6VRLmT,tnSo6VRLmT,"Solid experimental results, moderate significance/novelty, some presentation improvements suggested","Summary: + +This paper presents two advances in conformal prediction, a field with information retrieval applications in which a set of candidate responses to a query is presented and the objective is to return a small set of responses with at least one of the responses being the correct response. The first contribution is a method in which the possibility of several admissible responses is modeled (rather than there being just one response) with the system being calibrated against the odds that a particular response is the ""most admissible"", i.e. most conforming to the joint query/response distribution being learned from data. The second contribution is a cascaded prediction system in which simpler and less computationally expensive models are used for initial filtering and then more sophisticated and more expensive models are used downstream to further refine the response set. Rigorous statistical adjustments are used to account for multiple-hypothesis-test issues arising from using the cascading system. + +Pros: + +Reasonably thorough experimental results demonstrating performance gains in terms of sensitivity/specificity as well as in terms of computational cost are presented. + +The paper is mostly well written and the theory is presented with a good amount of rigor. + +Cons: + +I feel that the cascading technique is only of moderate significance. It's a good idea and there's some value in promoting it, but it also feels like a solution that might get arrived at by a savvy, practical-minded machine learning engineer in industry rather than a particularly profound ML research concept. The rigorous multiple hypothesis corrections (Bonferroni , etc) are certainly appreciated and are something which a less sophisticated practitioner might not know to apply, so that aspect of it feels more like an academic-paper-level contribution, but the basic idea of cascading feels like an applied, industry systems solution that someone with common sense might arrive at on her/his own. + +Further suggestions: + +I strongly suggest changing the definition of ""predictive efficiency"" on page 6. Defining it such that lower ""efficiency"" is better will confuse many readers. Efficiency has a well-established common-language meaning as something which is desirable. Why not define the efficiency as (for instance) the percentage of the entire candidate Y set which is eliminated, so that higher efficiency is better? + +I also think that it would broaden the appeal and accessibility of the paper to spend more time relating conformal prediction to more widely known information retrieval concepts like precision and recall and maybe also learning-to-rank. I suspect far more readers will be familiar with basic information retrieval, precision, recall, etc than with conformal prediction. I acknowledge that one of the appendices connects conformal prediction back to some of these concepts. I think maybe some of that appendix content belongs in the main paper. ",6,3.0,ICLR2021 +rJx-4Lt6nX,3,ByxmXnA9FQ,ByxmXnA9FQ,Bayesian reasoning about DNN outcome,"Summary +========= +The paper describes a probabilistic approach to quantifying uncertainty in DNN classification tasks. +To this end, the author formulate a DNN with a probabilistic output layer that outputs a multinomial over the +possible classes and is equipped with a Dirichlet prior distribution. 
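To fix notation for the comments that follow (this parameterization is the common one and is my assumption about the details rather than a quote of the paper): the network emits non-negative concentration parameters alpha, the predictive class distribution is the Dirichlet mean alpha/alpha0, the total evidence alpha0 can be read as a confidence score, and a KL term to the uniform Dirichlet plays the role of the prior-matching regularizer I comment on further below.

```python
import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_outputs(logits):
    """Map raw network outputs for one example to Dirichlet concentrations alpha > 1,
    the total evidence alpha0, and the Dirichlet-mean class probabilities."""
    alpha = np.logaddexp(0.0, np.asarray(logits, dtype=float)) + 1.0   # softplus(logits) + 1
    alpha0 = alpha.sum()
    return alpha, alpha0, alpha / alpha0

def kl_to_uniform_dirichlet(alpha):
    """KL( Dir(alpha) || Dir(1,...,1) ): divergence from the flat prior on the simplex."""
    alpha = np.asarray(alpha, dtype=float)
    alpha0, k = alpha.sum(), alpha.size
    return (gammaln(alpha0) - gammaln(alpha).sum() - gammaln(k)
            + ((alpha - 1.0) * (digamma(alpha) - digamma(alpha0))).sum())

if __name__ == "__main__":
    for name, logits in [("peaked", [6.0, -2.0, -2.0]), ("flat", [0.1, 0.0, -0.1])]:
        alpha, alpha0, probs = dirichlet_outputs(logits)
        print(f"{name}: evidence={alpha0:.2f} probs={np.round(probs, 2)} "
              f"KL_to_uniform={kl_to_uniform_dirichlet(alpha):.3f}")
```

This is only a sketch of the standard construction, but it should make the clipping discussion below easier to follow.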
+They show that their approach outperforms other SOTA methods in the task of out-of-distribution detection. + +Review +========= +Overall, I find the idea compelling to treat the network outputs as samples from a probability distribution and +consequently reason about network uncertainty by analyzing it. +As the authors tackle a discrete classification problem, it is natural to view training outcomes as samples from +a multinomial distribution that is then equipped with its conjugate prior, a Dirichlet. + +However, the model definition needs clarification. In the classical NN setting, I find it misleading +to speak of output distributions (here called p(x)). As the authors point out, NNs are deterministic function approximators +and thus produce deterministic output, i.e. rather a function f(x) that is not necessarily a distribution (although can be interpreted as a probability). +One could then go on to define a latent multinomial distribution over classes p(z|phi) instead that is parameterized by a NN, i.e. phi = f_theta(x). +The prior on p(phi) would then be a Dirichlet and consequently the posterior is Dirichlet as well. +The prior distribution should not be dependent on data x (as is defined below Eq. 1). + +The whole model description does not always follow the usual nomenclature, which made it at times hard for me to grasp the idea. +For instance, the space that is modeled by the Dirichlet is called a simplex. The generative procedure, i.e. how does data y constructed from data x and the probabilistic procedure, is missing. +The inference procedure of minimizing the KL between approximation and posterior is just briefly described and could be a hurdle to understand, how the approach works when someone is unfamiliar with variational inference. +This includes a proper definition of prior, likelihood and resulting posterior (e.g. with a full derivation in an appendix). + +Although the authors stress the importance of the approach to clip the Dirichlet parameters, I am still a bit confused on what the implications of this step are. +As I understood it, they clip parameters to a value of one as soon as they are greater than one. +This would always degrade an informative distribution to a uniform distribution on the simplex, regardless whether the parameters favor a dense or sparse multinomial. +I find this an odd behavior and would suggest, the authors comment on what they mean with an ""appropriate prior"". Usually, the parameters of the prior are fixed (e.g. with values lower one if one expects a sparse multinomial). +The prior then gets updated through the data/likelihood (here, a parameterized NN) into the posterior. + +Clipping would also lead to the KL term in Eq. 3 to be 0 often times, as the Dir(z|\alpha_c) often degrades to Dir(z|U). + +The experiments are useful to demonstrate the application and usefulness of the approach. +Outcome in table 3 could maybe be better depicted using bar charts, results from table 4 can be reported as text only, which would free up space for a more thorough model definition.",6,4.0,ICLR2019 +5wc1ZUqVmM,1,I6NRcao1w-X,I6NRcao1w-X,"Simple and nice idea, but experiments are not convincing enough.","Summary: This paper proposes to improve robustness in reinforcement learning via a population of diverse adversaries, where previous works mainly focus on the use a single adversary to mitigate the problem that the trained policy could be highly exploitable by the adversary. 
Specifically, at each iteration, it randomly selects an adversary from the population for rollouts, and it is trained by PPO. Experiments are conducted on 3 MuJoCo environments in comparison with vanilla PPO, domain randomization. + +Strong points: Using a population of adversaries to improve robustness in RL is interesting. The idea is simple, and the writing is clear. + +Concerns: +My major concern is in the experimental evaluation. +a. Results are shown using final performance. I am curious about the learning curve – how does the method compare against other baselines in terms of sample efficiency? A side-effect using a population is that RAP needs to update n adversaries at each training iteration compared with using a single adversary, and will incur more computation overhead. Could authors fairly compare with other baselines in terms of this and show the learning curve? + +b. How *MUCH LONGER* does it take to run RAP compared with other baselines? How much more memory does it take to use n adversaries compared with a single adversary? + +c. Could authors compare with a naive extension of the single adversary case in which the single adversary sample n actions? Is the baseline comparable with RAP using n adversaries? + +d. I am confused why RAP is built upon an on-policy algorithm. A number of works using population-based methods are built upon off-policy algorithms as agents in the population can share the samples and could be beneficial. Could authors build the method upon off-policy algorithms to further improve the applicability of RAP? + +e. For Figure 3, the performance gain over using a single adversary is not significant on HalfCheetah and Ant, and the results is not convincing enough to support the claim. + +As the paper uses the population-based methods, it is also worth discussing its relation with Khadka et al. 2018, etc.",5,4.0,ICLR2021 +SklXNK5Jqr,2,H1lKNp4Fvr,H1lKNp4Fvr,Official Blind Review #3,"This paper presents an algorithm for stereo image matching that attempts to capture improved representations of detailed spatial structure information, in particular, by increasing the size of the receptive field. The paper shows that this leads to a major reduction in the number of model parameters (42% in one case) with comparable performance on the KITTI2015 data set. + +I like the driving principle of the authors' approach (that stereo image matching relies more heavily on low-level features, and that higher level ""semantic"" features are not as critical) compelling. I would have really like to have seen the authors do some analysis of the features that they do extract, so that the reader can get a deeper insight into why their method works. The paper could be improved by providing more this kind of analysis and by adding more motivate for why low-level features are more important for stereo matching. + +I'm concerned that the paper only present results on one, small (200+200 images) data set. The paper would be much stronger if the authors tested on more, and varied data sets. + +Is it simply a network complexity issues or is there something else? + +The related work section appears to be just a laundry list of methods. The paper would be stronger if the authors provided more interpretation of the strengths and weaknesses of these methods, some insight into why they work, and why the proposed method is better. + +The authors' method claims to use a 1x1 convolution layer. Is that correct? Sounds like simple multiplication. Explain what it different. 
+ +The authors' reporting of their results appears muddled. They claim the error rate was ""reduced 3.4% and 1.9%"" in Table 5. I could not figure out which numbers they were talking about. In most cases, the authors' method was not the best. + +Minor point: The authors say ""conclusion"" when I think they mean ""oclusion"".",3,,ICLR2020 +R5uNbQ3CRx,2,kvhzKz-_DMF,kvhzKz-_DMF,"Well written paper with good results, but limited novelty","This paper studies fine-tuning BERT-like pretrained language models (PLMs) on low resource target tasks. The authors hypothesize that the general-purpose knowledge obtained by the PLMs from pre-training might be irrelevant and redundant for a given target task. When fine-tuned onto a low resource target task, overfitting is likely to happen. To this end, a fine-tuning framework based on variational information bottleneck (VIB) is proposed to address these challenges. Specifically, the sentence representation will be mapped to a latent Gaussian variable which compresses information in the sentence and also suppress irrelevant and redundant features, and a reconstructed version of the representation is used for task prediction. Empirical evaluations on sever datasets demonstrates the effectiveness of the method over previous research. + +The paper is presented well, and it's a good read. However, my major concern is on the novelty of the proposed method. As cited by the paper, VIB has been proposed and explored in various different settings, including supervised learning, semi-supervised learning, etc., and in a similar sense, variational encoder decoders have also been thoroughly explored. The proposed method is a direct application of VIB and/or variational encoder decoder. Apart from the competitive experimental results shown on the GLUE benchmark and a set of other tasks over standard baselines including Dropout, mixout and weight decay, I find it hard to justify the novelty of the proposed method. In other words, the VIB framework is general and if additional task/fine-tuning specific insights were identified and shown to necessary when applying to the low-resource fine-tuning, novelty is also justified. However, with the current set up of plainly applying VIB to fine-tune a PLM, I find novelty rather limited. + +A minor question: as hypothesized if the pretrained LM contains many general purpose features, thus those irrelevant and redundant features needs to be suppressed, would one imagine the framework to work even better with large pretrained model pretrained on a much larger corpus (like BERT-large compared to BERT-base)? The main results in the paper seem to suggest otherwise, i.e., with a larger model, VIBERT actually has much less room to improve. How would the VIB framework work with a different PLM, e.g., XLM-Roberta, XLNet or T5?",4,4.0,ICLR2021 +SylpG0Y52Q,3,B1epooR5FX,B1epooR5FX,"Potentially interesting idea, not well explained and justified","This paper proposes using predicted variables(PVars) - variables that learn +their values through reinforcement learning (using observed values and +rewards provided explicitly by the programmer). PVars are meant to replace +variables that are computed using heuristics. 
+ +Pros: +* Interesting/intriguing idea +* Applicability discussed through 3 different examples + +Cons: +* Gaps in explanation +* Exaggerated claims +* Problems inherent to the proposed technique are not properly addressed, brushed off as if unimportant + +The idea of PVars is potentially interesting and worth exploring; that +being said, the paper in its current form is not ready for +publication. + +Some criticism/suggestions for improvement: + +While the idea may be appealing and worth studying, the paper does not address several problems inherent to the technique, such as: + +- overheads (computational cost for inference, not only in + prediction/inference time but also all resources necessary to run + the RL algorithm; what is the memory footprint of running the RL?) + +- reproducibility + +- programming overhead: I personally do not buy that this technique - + at least as presented in this paper - is as easy as ""if statements"" + (as stated in the paper) or will help ML become mainstream in + programming. I think the programmer needs to understand the + underpinnings of the PVars to be able to meaningfully provide + observations and rewards, in addition to the domain specific + knowledge. In fact, as the paper describes, there is a strong + interplay between the problem setting/domain and how the rewards should be + designed. + +- applicability: when and where such a technique makes sense + +The interface for PVars is not entirely clear, in particular the +meaning of ""observations"" and ""rewards"" do not come natural to +programmers unless they are exposed to a RL setting. Section 2 could +provide more details such that it would read as a tutorial on +PVars. If regular programmers read that section, not sure they +understand right away how to use PVars. The intent behind PVars +becomes clearer throughout the examples that follow. + +It was not always clear when PVars use the ""initialization function"" +as a backup solution. In fact, not sure ""initialization"" is the right +term, it behaves almost like an ""alternative"" prediction/safety net. + +The examples would benefit from showing the initialization of the PVars. + +The paper would improve if the claims would be toned down, the +limitations properly addressed and discussed and the implications of +the technique honestly described. I also think discussing the +applicability of the technique beyond the 3 examples presented needs +to be conveyed, specially given the ""performance"" of the technique +(several episodes are needed to achieve good performance). + +While not equivalent, I think papers from approximate computing (and +perhaps even probabilistic programming) could be cited in the related +work. In fact, for an example of how ""non-mainstream"" ideas can be +proposed for programming languages (and explained in a scientific +publication), see the work of Adrian Sampson on approximate computing +https://www.cs.cornell.edu/~asampson/research.html +In particular, the EnerJ paper (PLDI 2011) and Probabilistic Assertions (PLDI 2014). + +Update: I maintain my scores after the rebuttal discussion.",5,3.0,ICLR2019 +EkDilRQvn3r,4,Ip195saXqIX,Ip195saXqIX,"Review of ""Knowledge Distillation By Sparse Representation Matching""","**Paper summary** +This paper proposes a knowledge distillation on the feature maps using sparse representation. 
The proposed method firstly constructs an over-complement dictionary to express the teacher's feature maps and learn sparse representation to express the teacher's feature map using the dictionary. Since directly utilizing the sparse representation is a too strong restriction for the student network, the loss function is designed to find the indices of sparse codes. The proposed distillation method is validated through several experiments. + +**Pros** +1. This paper proposes a way of utilizing sparse representation for knowledge distillation. +2. The algorithm is written in clear formulations. + +**Cons** +1. The main idea of sparse representation matching (SRM) is the combination of sparse representation and knowledge distillation. However, the actual implementation of SRM does not transfer the sparse representation of the teacher to the student. Only the indices or the entire image's sparse code were transferred via knowledge distillation, so the feature map information of the teacher is not transferred. This looks counter-intuitive and weakens the arguments of the paper. Therefore, it is necessary to describe the information transferred by SRM in detail and the rationale for using this kind of information transferring. +2. All experiments were conducted on All-CNN, but this network is not usually used by other knowledge distillation papers, so the experiments should be re-conducted on a more standard and efficient setting. For example, All-CNN got 74.7% accuracy using SRM with 2.2M parameters (in table 1), but a recent paper [1] (CRD) got 75.5% accuracy using WRN16-2 with 0.7M parameters, which uses only 1/3 number of parameters. In other words, All-CNN is not proper to compare with other distillation methods. Use popular network architectures (WRN, ResNet, VGG etc.) to get more reasonable performance. +[1] Contrastive Representation Distillation (ICLR 2020) +3. In the case of transfer learning, it is common to tune the learning rates and weight decays for each dataset and each network. However, the experiments in the paper consistently use the same learning hyper-parameters. As a result, the performance gap between the baseline and the proposed method seems to have greatly inflated. A great gap against baseline (about 10%~20%) seems to be very different from the results of the other distillation papers. In short, to be fair comparisons with other distillation methods, learning parameter tuning is necessary. + +",4,5.0,ICLR2021 +BJg2jRSKhm,1,r1Vx_oA5YQ,r1Vx_oA5YQ,eye-catching image manipulation but lack justification in real use," This paper proposed a so-called ISS-GAN framework for data hiding in images, which integrates steganography and steganalysis processes in GAN. The discriminative model simulate the steganalysis process, and the steganography generative model is to generate stego image, and confuse steganalysis discriminative model. + +Overall the application seems interesting. My concern is its use in real secure information transmission systems: it can fool human eyes but what is its capacity against decoding algorithms; if the intent is to transmit some hidden information, how the receiver is supposed to decode it; is there something similar to the public key in encryption systems? These basic questions/concepts should be made clear to the reader to avoid confusion. + +The evaluation protocol should be clarified and especially on how the PSNR is calculated (i.e., using the reconstructed secret image and real one?) 
",5,2.0,ICLR2019 +BkTXGMKlf,1,ry831QWAb,ry831QWAb,"Clearly written paper, but experiments are not compelling and theoretical result is suboptimal","This paper proposes a family of first-order stochastic optimization schemes based on (1) normalizing (batches of) stochastic gradient descents and (2) choosing from a step size updating scheme. The authors argue that iterative first-order optimization algorithms can be interpreted as a choice of an update direction and a step size, so they suggest that one should always normalize the gradient when computing the direction and then choose a step size using the normalized gradient. + +The presentation in the paper is clear, and the exposition is easy to follow. The authors also do a good job of presenting related work and putting their ideas in the proper context. The authors also test their proposed method on many datasets, which is appreciated. + +However, I didn't find the main idea of the paper to be particularly compelling. The proposed technique is reasonable on its own, but the empirical results do not come with any measure of statistical significance. The authors also do not analyze the sensitivity of the different optimization algorithms to hyperparameter choice, opting to only use the default. Moreover, some algorithms were used as benchmarks on some datasets but not others. For a primarily empirical paper, every state-of-the-art algorithm should be used as a point of comparison on every dataset considered. These factors altogether render the experiments uninformative in comparing the proposed suite of algorithms to state-of-the-art methods. The theoretical result in the convex setting is also not data-dependent, despite the fact that it is the normalized gradient version of AdaGrad, which does come with a data-dependent convergence guarantee. + +Given the suite of optimization algorithms in the literature and in use today, any new optimization framework should either demonstrate improved (or at least matching) guarantees in some common (e.g. convex) settings or definitively outperform state-of-the-art methods on problems that are of widespread interest. Unfortunately, this paper does neither. + +Because of these points, I do not feel the quality, originality, and significance of the work to be high enough to merit acceptance. + +Some specific comments: +p. 2: ""adaptive feature-dependent step size has attracted lots of attention"". When you apply feature-dependent step sizes, you are effectively changing the direction of the gradient, so your meta learning formulation, as posed, doesn't make as much sense. +p. 2: ""we hope the resulting methods can benefit from both techniques"". What reason do you have to hope for this? Why should they be complimentary? Existing optimization techniques are based on careful design and coupling of gradients or surrogate gradients, with specific learning rate schedules. Arbitrarily mixing the two doesn't seem to be theoretically well-motivated. +p. 2: ""numerical results shows that normalized gradient always helps to improve the performance of the original methods when the network structure is deep"". It would be great to provide some intuition for this. +p. 2: ""we also provide a convergence proof under this framework when the problem is convex and the stepsize is adaptive"". The result that you prove guarantees a \theta(\sqrt{T}) convergence rate. On the other hand, the AdaGrad algorithm guarantees a data-dependent bound that is O(\sqrt{T}) but can also be much smaller. 
This suggests that there is no theoretical motivation to use NGD with an adaptive step size over AdaGrad. +p. 2-3: ""NGD can find a \eps-optimal solution....when the objective function is quasi-convex. ....extended NGD for upper semi-continuous quasi-convex objective functions..."". This seems like a typo. How are results that go from quasi-convex to upper semi-continuous quasi-convex an extension? +p. 3: There should be a reference for RMSProp. +p. 3: ""where each block of parameters x^i can be viewed as parameters associated to the ith layer in the network"". Why is layer parametrization (and later on normalization) a good way idea? There should be either a reference or an explanation. +p. 4: ""x=(x_1, x_2, \ldots, x_B)"". Should these subscripts be superscripts? +p. 4: ""For all the algorithms, we use their default settings."" This seems insufficient for an empirical paper, since most problems often involve some amount of hyperparameter tuning. How sensitive is each method to the choice of hyperparameters? What about the impact of initialization? +p. 4-8: None of the experimental results have error bars or any measure of statistical significance. +p. 5: ""NG... is a variant of the NG_{UNIT} method"". This method is never motivated. +p. 5-6: Why are SGD and Adam used for MNIST but not on CIFAR? +p. 5: ""we chose the best heyper-paerameter from the 56 layer residual network."" Apart from the typos, are these parameters chosen from the training set or the test set? +p. 6: Why isn't Adam tested on ImageNet? + + +POST AUTHOR RESPONSE: After reading the author response and taking into account the fact that the authors have spent the time to add more experiments and clarify their theoretical result, I have decided to upgrade my score from a 3 to a 4. However, I still do not feel that the paper is up to the standards of the conference. + + + + + + ",4,5.0,ICLR2018 +H1l8W_ndtS,1,SkloDJSFPH,SkloDJSFPH,Official Blind Review #2,"The paper presents a technique for approximately sampling from autoregressive models using something like a a proposal distribution and a critic. The idea is to chunk the output into blocks and, for each block, predict each element in the block independently from a proposal network, ask a critic network whether the block looks sensible and, if not, resampling the block using the autoregressive model itself. + +In broad strokes the approach makes sense. It assumes, essentially, that parts of the sequence are hard to predict and parts are easy and, if there are enough easy parts, this procedure should lead to faster inference. + +The paper's writing is not ideal. There are some grammatical mistakes that harm reading (for example, the second paragraph of the introduction says ""However, these models must infer each element of the data x ∈ RN step by step in a serial manner, requiring O(N) times more than other non-sequential estimators"", where it is unclear what is O(N) more than what, how is this measured, etc). That said I was mostly able to follow all key points. + +The authors do not point out the obvious connection to GANs, which also rely on a critic network to decide whether a sample looks like it comes from the correct distribution, except in GANs the critic is jointly trained with the generator (as opposed to here where it's trained after) and in GANs the critic is only used at training time, while here the critic is used to accelerate sampling (the better the critic the faster this method can sample). 
+ +I wish the experimental results were a little more explicit about the time vs quality tradeoff; I expected to see more plots with pareto curves, since as-is it's hard to judge the magnitude of the tradeoffs involved. I'd also like a more thorough analysis on why there is a non-monotonic tradeoff in some experiments (table 1, figure 2(b)) between the amount of approximation and the sample quality; this makes me think something else is going on here as this approximate inference method should just decrease quality, never increase it. + +Overall I lean towards accepting the paper, but I encourage the authors to revise the writing and to add a few plots explicitly showing the time vs quality tradeoff both in likelihood (wrt the full model) and in downstream metrics like FID.",6,,ICLR2020 +AHtUA8ck3oL,2,Y-Wl1l0Va-,Y-Wl1l0Va-,Interesting work,"This paper proposes the k-Shortest-Path (k-SP) constraint to restrict the agent’s trajectory to avoid redundant exploration and thus improves sample efficiency in sparse-reward MDPs. Specifically, k-SP constraint is applied to a trajectory rolled out by a policy where all of its sub-path of length k is required to be a shortest-path under the π-distance metric. Instead of a hard constraint, a cost function-based formulation is proposed to implement the constraint. The method can improve the sample efficiency in sparse reward tasks and also preserve the optimality of given MDP. Numerical results in the paper also demonstrate the effectiveness of k-SP compared with existing methods on two domains (1) Mini-Grid and (2) DeepMind Lab in sparse reward settings. + +Overall, the paper is well written and clearly conveys the main idea and the main results of the work. The idea and motivation of the paper are very intuitive and very reasonable. The theoretical results are immediately following the ideas. The algorithm proposed has a clear structure and is easy to understand and implement. For experiments, the new algorithm consistently outperforms existing studies on a set of MDPs where there exists a goal state (as both the unique reward state and the terminal state). Some important discussions are highlighted to introduce the algorithm,. Moreover, the proposed mechanism seems to be an inspiration for future work considering the state space exploration related to sample efficiency. + +However, I wasn't fully convinced by the paper about relevance. Reinforcement learning aims to learn in an environment by trial and error without prior knowledge about the environment. As the problems considered by the paper are with episodic rewards (in both theory and experiments), *the problem themselves are shortest path problems*. Using the shortest path constraint to solve shortest path problems seems not fair to be placed among a set of learning algorithms. Armed with this prior knowledge, the algorithm outperforms marginally (though consistently) compared with pure learning-based algorithms, only with its best choice of k. I believe it would fall short if placed among search algorithms. Some strong justifications are needed for the work to be relevant. + +Pros: +1. The paper considers a practical problem in reinforcement learning: sample efficiency in sparse reward tasks. The RL algorithm tends to fail if the collected trajectory does not contain enough evaluative feedback. The idea of using a constrained-RL framework and cost function to tackle the problem is natural and has been well motivated given some drawbacks in existing work mentioned in section 5. +2. 
The relaxation from the shortest path to k-SP is well explained. The novel cost function introduced to penalizes the policy-violating SP constraint can tackle some limitations of existing methods. For example, it can preserve the convergence and optimality of the policy. +3. This paper provides convincing numerical experiments to show the effectiveness of the proposed framework. The ablation studies are also helpful to show the effects of hyperparameters. + +Cons: +1. The choice of k is not clear.",6,5.0,ICLR2021 +HJlEGDCkqS,3,S1lEX04tPr,S1lEX04tPr,Official Blind Review #1,"This paper presents a method for adding credit assignment +to multi-agent RL and proposes a way of adding a curriculum +to the training process. +The best heuristics and structures to incorporate in the +modeling, learning, and exploration parts of multi-agent +RL are still largely unknown and this paper explores some +reasonable new ones. +In most of the tasks this is evaluated on in Figure 5 +this approach adds a slight improvement to the SOTA +and I think an exciting direction of future work is +to continue pushing on multi-agent RL in even more +complex envirnoments where the SOTA break down in +even worse ways. +",6,,ICLR2020 +H1lX6OD6Kr,1,BkxXe0Etwr,BkxXe0Etwr,Official Blind Review #2,"The paper proposes a novel value-based continuous control algorithm by formulating the problem as mixed-integer programming. With this formulation, the optimal action (corresponding to the maximum action value) can be found by solving the optimization problem at each time step. To reduce the time complexity of the optimization, the author proposes several variants to approximately solve the problem. Results on robotics control are presented. The proposed looks interesting and could be useful in practice. + +1. Section 4 of the paper can be improved. Although the author proposes several methods for approximating the optimal solution, it is unclear what message the author wants to convey. How to decide which approximation to use? Is there any situation where one of the approximations should be preferred? + +2. In the experiments, the standard deviation is very large, so it is hard to claim the proposed method is better.",6,,ICLR2020 +rJfWQ9OeG,2,S1GDXzb0b,S1GDXzb0b,Interesting argument for model-based imitation learning,"Model-Based Imitation Learning from State Trajectories + +SIGNIFICANCE AND ORIGINALITY: + +The authors propose a model-based method for accelerating the learning of a policy +by observing only the state transitions of an expert trace. +This is an important problem in many fields such as robotics where +finding a feasible policy is hard using pure RL methods. + +The authors propose a unique two step method to find a high-quality model-based policy. + +First: To create the environment model for the model-based learner, + they need a source of state transitions with actions ( St, At,xa St+1 ). +To generate these samples, they first employ a model-free algorithm. +The model-free algorithm is trained to try to duplicate the expert state at each trajectory. +In continuous domains, the state is not unique … so they build a soft next state predictor +that gives a probability over next states favoring those demonstrated by the expert. +Since the transitions were generated by the agent acting in the environment, +these transitions have both states and actions ( St, At, St+1 ). +These are added to a pool. 
+ +The authors argue that the policy found by this model-free learner is +not highly accurate or guaranteed to converge, but presumably is good at +generating transitions relevant to the expert’s policy. +(Perhaps slowly reducing the \sigma in the reward would improve accuracy?) +I guess if expert trace data is sparse, the model-free learner can generate a lot +of transitions which enable it to create accurate dynamics models which in turn +allow it to extract more information out of sparse expert traces? + +Second: They then train a model based agent using the collected transitions ( St, At, St+1 ). +They formulate the problem as a maximum likelihood problem with two terms: +an action dynamics model which is learned from local exploration using the learner’s own actions and outcomes +and expert policy model in terms of the actions learned above +that maximizes the probability of the observed expert’s trajectory. +This is a nice clean formulation that integrates the two processes. +I thought the comparison to an encoder - decoder network was interesting. + +The authors do a good job of positioning the work in the context of recent work in IML. + +It looks like the authors extract position information from flappy bird frames, +so the algorithm is only using images for obstacle reasoning? + + +QUALITY + +The propose model is described fairly completely and evaluated on +a “reaching"" problem and the ""flappy bird” game domain. +The evaluation framework is described in enough detail to replicate the results. + +Interestingly, the assisted method starts off much higher in the “reacher” task. +Presumably this task is easy to observe the correct actions. + +The flappy bird test shows off the difference between unassisted learning (DQN), +model free learning with the heuristic reward (DQN+reward prediction) +and model based learning. + +Interestingly, DQN + heuristic reward approaches expert performance +while behavioral cloning never achieves expert performance level even though it has actions. + +Why does the model-based method only run to 600 steps and stopped before convergence?? +Does it not converge to expert level?? If so, this would be useful to know. + +There are minor grammatical mistakes that can be corrected. + +After equation 5, the authors suggest categorical loss for discrete problems, +but cross-entropy loss might work better. Maybe this is what they meant. + + +CLARITY + +The overall approach and algorithms are described fairly clearly. Some minor typos here and there. + +Algorithm 1 does not make clear the relationship between the model learned in step 2 and the algorithms in steps 4 to 6. + +I would reverse the order of a few things to align with a right to left ordering principle. +In Figure 1, put the model free transition generator on the left and the model-based sample consumer on the right. +In Figure 3, put the “reacher” test on the left and the “flappy bird” on the right. + + +PROS AND CONS + +Interesting idea for learning quickly from small numbers of samples of expert state trajectories. + +Not clear that method converges on all problems. + +Not clear that the method is able to extract the state from video — authors had to extract position manually +(this point is more about their deep architecture than the imitation framework they describe - +though perhaps a key argument for the authors is the ability to work with small numbers of +expert samples and still be able to train deep methods ) ?? 
+ + +POST REVIEW SUBMISSION: + +The authors make a number of clarifying comments to improve the text and add the reference suggested by another reviewer. ",7,3.0,ICLR2018 +S117txRef,3,B16yEqkCZ,B16yEqkCZ,DQN and catastrophic forgetting,"The paper studies catastrophic forgetting, which is an important aspect of deep reinforcement learning (RL). The problem formulation is connected to safe RL, but the emphasis is on tasks where a DQN is able to learn to avoid catastrophic events as long as it avoids forgetting. The proposed method is novel, but perhaps the most interesting aspect of this paper is that they demonstrate that “DQNs are susceptible to periodically repeating mistakes”. I believe this observation, though not entirely novel, will inspire many researchers to study catastrophic forgetting and propose improved strategies for handling these issues. + +The paper is accurate, very well written (apart from a small number of grammatical mistakes) and contains appealing motivations to its key contributions. In particular, I find the basic of idea of introducing a component that represents fear natural, promising and novel. + +Still, many of the design choices appear quite arbitrary and can most likely be improved upon. In fact, it is not difficult to design examples for which the proposed algorithm would be far from optimal. Instead I view the proposed techniques mostly as useful inspiration for future papers to build on. As a source of inspiration, I believe that this paper will be of considerable importance and I think many people in our community will read it with great interest. The theoretical results regarding the properties of the proposed algorithm are also relevant, and points out some of its benefits, though I do not view the results as particularly strong. + +To conclude, the submitted manuscript contains novel observations and results and is likely to draw additional attention to an important aspect of deep reinforcement learning. A potential weakness with the paper is that the proposed strategies appear to be simple to improve upon and that they have not convinced me that they would yield good performance on a wider set of problems. +",7,3.0,ICLR2018 +ByeLxCR1qB,2,HkgB2TNYPS,HkgB2TNYPS,Official Blind Review #3,"Summary: +The author performs theoretical analysis of the number-of-shot problem in the case study of prototypical network which is a typical method in few-shot learning. To facilitate analysis, the paper assumes 2-way classification (binary classification) and equal covariance for all classes in the embedding space, and finally derives the lower bound of the expected accuracy with respect to the shot number k. The final formula of the lower bounding indicates that increasing k will decrease the sensitivity of this lower bound to Σc (expected intra-class variance), and increase its sensitivity to Σ (variance of class means). To reduce the meta-overfitting (when training are test shot are the same, the performance becomes better), the author designed Embedding Space Transformation (EST) to minimize Σc and maximize Σ through a transformation that lies in the space of non-dominant eigenvectors of Σc while also being aligned to the dominant eigenvectors of Σ. The experimental results on 3 commonly used datasets for few-shot learning, i.e. Omniglot, Mini-ImageNet and Tiered-ImageNet demonstrate promising results and desired properties of the method. + ++Strengths: +1. 
The paper focuses on an important problem: number of shots in few-shot learning, and chooses prototypical network which is a very famous and widely used method for detailed analysis. +2. The paper provides relatively solid theoretical analysis with careful derivation. The final upperbound of expected accuracy matches our intuition to certain degree. Although some of the assumptions (such as 2-way classification and equal covariance for all classes) are not so realistic, the work is meaningful and very inspiring. +3. The proposed modification over prototypical network inspired by the formula is reasonable and the experimental results demonstrate its effectiveness. + +-Weaknesses: +1. The first observation says that ""diminishing returns in expected accuracy when more support data is added without altering \phi"". Does it mean that the accuracy of prototypical network deteriorates with more support data? Will the accuracy saturate and no longer diminish from certain k? +2. Some of the accuracy improvements are not so significant from the results (even for Mini-ImageNet and Tiered-ImageNet). I was wondering if it is due to the prototypical network itself (the intrinsic property of the prototypical network limits its improvement upperbound) or something else? Please clarify. +3. Some unclear descriptions. How is the formulation derived between Eq. 3 and Eq. 4? More details should be given here. The descriptions about EST (Embedding Space Transformation) is insufficient, which makes it hard to understand why such operations are conducted. Moreover, it seems that the proposed approach need to compute the covariance mean and mean covariance of each class. Would it be computed in each iteration? If so, it seems to be very slow.",6,,ICLR2020 +K6oGTndTwG,3,le9LIliDOG,le9LIliDOG,This paper needs to make changes in some aspects.,"This paper concentrated on exploring how to efficiently extract the interaction information from the point clouds. A key point lies in utilizing the non-uniform Fourier transform, rather than the regular Fourier transform. However, there exist some issues that need to be solved. + ++ves: ++ The exploration of long-range interactions for point clouds is interesting. + ++ The paper is well written. The related work makes a clear description of many fields about point cloud. + +Concerns: +1. In the introduction part, the authors describe many tasks that rely on point-cloud presentation. However, the authors ignore pointing out the existing issues. The authors should clearly present them. + +2. In the algorithmic section, the authors claim that ""NUFFT serves as the corner-stone of the LRC-layer"". So ""NUFFT"" is the unique solution? Whether or not some operations can replace FFT? + +3. Point-cloud is indeed important for many tasks as described in the introduction part, but the authors just explore the effects of the proposed algorithm in a ""synthetic"" experiment. The experimental results are not convincing for readers, the authors should conduct more real-world tasks to verify the effectiveness of the proposed method. + +4. The presentation of this work needs to improve, if possible, the authors should provide an intuitive schematic diagram to present the procedure of proposing this idea. + +5. In the experimental section, the author should replot Figure2-3 to ensure clear enough for a better read. 
+ +######################################################################### + +Minor Comments: + +(1) “N_sample” and ""N"" maybe exist the inclusion relation,it will be better to replace one of them with the other form; + +(2) The formula system is a little vague, if possible, the authors can simplify them to clearly describe. + ",5,4.0,ICLR2021 +Hye2e3RpFB,2,H1gz_nNYDS,H1gz_nNYDS,Official Blind Review #2,"This paper proposes a simple and one-shot approach on neural architecture search for the number of channels to achieve better accuracy. Rather than training a lot of network samples, the proposed method trains a single slimmable network to approximate the network accuracy of different channel configurations. The experimental results show that the proposed method achieves better performance than the existing baseline methods. + +- It would be better to provide the search cost of the proposed method and the other baseline methods because that is the important metric for neural architecture search methods. As this paper points out that NAS methods are computationally expensive, it would be better to make the efficiency of the proposed method clear. + +- According to this paper, the notable difference between the proposed method and the existing pruning methods is that the pruning methods are grounded on the importance of trained weights, but the proposed method focuses more on the importance of channel numbers. It is unclear to me why such a difference is caused by the proposed method, that is, which part of the proposed method causes the difference? And how does the difference affect the final performance?",3,,ICLR2020 +Thjf-hg76E,1,tc5qisoB-C,tc5qisoB-C,"Good idea, but confusing manuscripts.","The authors propose a new algorithm, called C-learning, which tackles goal-conditioned reinforcement learning problems. Specifically, the algorithm converts the future density estimation problem, which goal-conditioned Q learning is inherently performing, to a classification problem. The experiments showed that the modification allows a more precise density estimation than Q-learning, and in turn, a good final policy. + +Overall, I like the general idea to use classification as a tool for estimating the future density function. Especially, the idea is valuable in that it allows a better understanding of prior Q-learning based approaches in choosing a sensitive hyperparameter. However, the manuscript can be enhanced much by adapting more precise notations and adding more explanations on equations: +* $p$ is highly overloaded; it is used to represent future conditional state density function and marginal state density function for both on-policy and off-policy, and transition dynamics. +* Also, it would be important to notate $\pi$ in most of the parts, including $p$, $Q$, and $C$ (unless it is very obvious). Especially, the current notation is very confusing when the off-policy algorithm is introduced. +* Related to this concern, I am not fully convinced of the off-line algorithm due to the marginal future state distribution $p(s_{t+})$. Doesn’t it supposed to be $p^{\pi_{\text{eval}}}(s_{t+})$, and therefore, does the marginal distribution also need to be adjusted as $p(s_{t+}|s_{t+1},a_{t+1})$? +* It would be also helpful if the full derivation for Equation (6) is included in the main manuscript. + +-- After rebuttal + +I've read the authors' feedbacks and other reviewers' comments. 
My major concern was the clarity of the manuscript as other reviewers mentioned, and I believe the concern has been resolved during the rebuttal period. I adjusted my ratings representing that.",6,2.0,ICLR2021 +By5gfegEg,1,SJNDWNOlg,SJNDWNOlg,An outdated method with misleading claims.,"This paper explores different strategies for instance-level image retrieval with deep CNNs. The approach consists of extracting features from a network pre-trained for image classification (e.g. VGG), and post-process them for image retrieval. In other words, the network is off-the-shelf and solely acts as a feature extractor. The post-processing strategies are borrowed from traditional retrieval pipelines relying on hand-crafted features (e.g. SIFT + Fisher Vectors), denoted by the authors as ""traditional wisdom"". + +Specifically, the authors examine where to extract features in the network (i.e. features are neurons activations of a convolution layer), which type of feature aggregation and normalization performs best, whether resizing images helps, whether combining multiple scales helps, and so on. + +While this type of experimental study is reasonable and well motivated, it suffers from a huge problem. Namely it ""ignores"" 2 major recent works that are in direct contradictions with many claims of the paper ([a] ""End-to-end Learning of Deep Visual Representations for Image Retrieval"" by Gordo et al. and [b] ""CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples"" by Radenović et al., both ECCV'16 papers). These works have shown that training for retrieval can be achieved with a siamese architectures and have demonstrated outstanding performance. As a result, many claims and findings of the paper are either outdated, questionable or just wrong. + +Here are some of the misleading claims: + + - ""Features aggregated from these feature maps have been exploited for image retrieval tasks and achieved state-of-the-art performances in recent years."" + Until [a] (not cited), the state-of-the-art was still largely dominated by sparse invariant features based methods (see last Table in [a]). + + - ""the proposed method [...] outperforms the state-of-the-art methods on four typical datasets"" + That is not true, for the same reasons than above, and also because the state-of-the-art is now dominated by [a] and [b]. + + - ""Also in situations where a large numbers of training samples are not available, instance retrieval using unsupervised method is still preferable and may be the only option."". + This is a questionable opinion. The method exposed in ""End-to-end Learning of Deep Visual Representations for Image Retrieval"" by Gordo et al. outperforms the state-of-the-art on the UKB dataset (3.84 without QE or DBA) whereas it was trained for landmarks retrieval and not objects, i.e. in a different retrieval context. This demonstrates that in spite of insufficient training data, training is still possible and beneficial. + + - Finally, most findings are not even new or surprising (e.g. aggregate several regions in a multi-scale manner was already achieved by Tolias at al, etc.). So the interest of the paper is limited overall. + +In addition, there are some problems in the experiments. 
For instance, the tuning experiments are only conducted on the Oxford dataset and using a single network (VGG-19), whereas it is not clear whether these conditions are well representative of all datasets and all networks (it is well known that the Oxford dataset behaves very differently than the Holidays dataset, for instance). In addition, tuning is performed very aggressively, making it look like the authors are tuning on the test set (e.g. see Table 3). + +To conclude, the paper is one year too late with respect to recent developments in the state of the art.",3,5.0,ICLR2017 +vCAikM3KaKF,4,#NAME?,#NAME?,"The idea of learning to re-align input spaces in a common feature space has merit, but the experimental protocol is unusual and results not convincing","Summary +-------------- +The paper proposes a trainable way to re-order or recover the ordering of features from sets of examples, and use it as a way to build a common feature space (or embedding) for a neural net, the (initial) parameters of which can be trained by Reptile. +Experiments show that such initial parameters enable faster training (inside of an episode) than untrained weights. + +Pros +------ +- The paper shows it is possible to recover information about the identity of coordinates in the input space, through a learned transformation, on several unstructured datasets. The similarity between such representations of individual coordinates can help identify similar features, either in a given dataset or across datasets. + + +Cons +-------- +The paper is overall really hard to follow, statements are often confusing or misleading. For instance: +- The introduction suggests a multi-modal learning paradigm, where different tasks could have access to data in different input spaces, some of them common. However, the paper then seems to consider individual coordinates in the input space only, and focuses on mapping shuffled subsets of these coordinates back to their initial position. +- There is confusion about the ""tasks"", which sometimes correspond to one of the OpenML datasets, and sometimes to individual few-shot episodes from one of these datasets. +- Concepts like ""schema"" and ""predictors"" are never properly introduced or defined. +- The description of the ""chameleon"" (alignment) component mentions ""order-invariant"" and ""permutation invariant"" several times, but it is quite unclear whether it refers to the the order of the examples within the data set (or episode) or the order in which the features are represented. + +The paper uses few-shot learning vocabulary and techniques, including Reptile, but the methodology seems completely different from the few-shot learning literature. In particular: +- There does not appear to be a split between meta-training and meta-test classes within a dataset, or meta-training datasets and meta-testing ones, except for the EMNIST experiment. Even then, the pre-training of the ""chameleon"" alignment module seems to involve using examples of the meta-test classes. +- The reported evaluation metric is really unusual: they report the improvement (and sometimes accuracy) after 3 steps of gradient descent from within an episode, which is somewhat related to the quality of the meta-learned weights, but no other metric that would be comparable to existing literature, which makes it especially hard to assess the results. 
+ +The principle of the alignment module seems similar to (soft) attention mechanisms, in that there is a softmax trained to highlight which parts of an input vector should be emphasized (or selected) at a given point in the processing (here, in the aligned feature space). However, the literature on attention is not reviewed. + +Many design choices are not addressed clearly, neither in how they were made, or the impact of these choices, especially regarding the architecture of the alignment module: +- It is a linear transformation (before the softmax), though parameterized by 3 matrices. An alternative would have been a 3-layer neural net, similar to attention networks. +- The parameterization of the first matrix makes the number of parameters depend on N, the number of examples in a given task. This could be quite limiting to be restrained to tasks of exactly N examples, especially if both the support (mini-train) and query (mini-test or valid) parts of an episode need to have exactly N examples. +- There is also no discussion of the value or impact of or K, the size of the chosen embedding space). + +Recommendation +-------------------------- +I recommend to reject this submission. + +Arguments +------------------ +The main idea in the paper, learning alignments of various input spaces into a common embedding space through an attention mechanism, has merit and may work reasonably. +However, both the algorithm and the experimental set up are described in a quite confused way, and not well justified or grounded. The reported results are not comparable with few-shot learning literature, nor multi-modal training or feature imputation, and do not make a convincing case. + +Questions +--------------- +As I understand it, the ""Chameleon"" architecture itself simply consists in 3 matrix multiplications (Nx8, 8x16, 16xK), which would be equivalent to the length-1 1D convolutions, is that correct? It may be more straightforward to explain that way, as $enc(X) = X M_1 M_2 M_3 X^T$. +Also, should the 2nd and 3rd convolutions be labeled ""8x16x1"" and ""16xKx1"" respectively? As far as I can tell, only the first Conv1D should have a dependency on N. + +Additional feedback +--------------------------- +In Figure 2, the ""reshape"" operation should be ""transpose"" instead. +",3,5.0,ICLR2021 +HklZo4iptr,2,r1eowANFvr,r1eowANFvr,Official Blind Review #2,"In this paper the author propose a combination of the neural architecture search method DARTS and the meta-learning method MAML. DARTS relaxed the search space and jointly learns the structural parameters and the model parameters. T-NAS, the method proposed by the authors, applies MAML to the DARTS model and learns both the meta-parameters and meta-architecture. The method is evaluated for the few-shot learning and supervised classification. +The idea is an incremental extension of MAML and DARTS. However, the idea is creative and potentially important. The paper and method are well described. The experimental results indicate a clear improvement over MAML and MAML++. A search time improvement at the cost of a small drop in accuracy is showns in Table 5. The experiments conducted follow closely the setups from the original MAML paper. Some more focus to the standard benchmarks for NAS would have been great. 
A comparison to the state-of-the-art in NAS on CIFAR-10 or ImageNet is missing.",6,,ICLR2020 +0lkQHux4YQ5,1,8_7yhptEWD,8_7yhptEWD,Review,"This paper studies the neural tangent kernel (NTK) of fully-connected neural networks with input injection (defined in the first set of display in Section 3.1), and the infinite depth limit of the NTK. The calculations are further carried out for the convolution neural networks with input injection (defined at the beginning of page 6). Those kernels are empirically evaluated on MNIST and CIFAR-10 datasets, and are compared with the usual NTKs without input injection. + +The theorems derived in this paper are incremental given existing studies on NTKs. The calculations are very similar to existing ones except the network structures considered in the current paper are slightly different. The infinite-depth limit of the kernels now indeed depends on the input, but the result is not surprising as the input is injected in each layer. + +This paper also lacks a proper introduction to many concepts. For example, it is hard to understand what does DEQ-NTK really means in the introduction. Also the term NTK at first refers to the general concept in (1), but later seems to specifically refer to the NTK of fully-connected neural networks without input injection. Section 2 presents some background, but it gives many pieces of related works without a clear structure. For instance, I don't see how many concepts like ""weakly-trained"", ""fully-trained"", ""edge-of-chaos"" are relevant to the current paper. + +In the experiments, the choices of parameters seem to be very arbitrary. The authors do not provide systemic guidance on how they tune those parameters, while the performance improvement is minor. Since one major motivation for studying NTKs is the relation to the actual neural networks, some proper comparisons or comments should be included. The experiment section in the current form is not very convincing. + +Given the above concerns, I don't think this paper is suitable for publication in ICLR. ",4,3.0,ICLR2021 +r1xncHxi2X,3,BylBr3C9K7,BylBr3C9K7,"Interesting idea for energy-constrained compression, but some improvements still possible","This paper describes a procedure for training neural networks via an explicit constraint on the energy budget, as opposed to pruning the model size as commonly done with standard compression methods. Comparative results are shown on a few data sets where the proposed method outperforms multiple different approaches. Overall, the concept is interesting and certainly could prove valuable in resource-constrained environments. Still I retain some reservations as detailed below. + +My first concern is that this paper exceeds the recommended 8 page limit for reasons that are seemingly quite unnecessary. There are no large, essential figures/tables, and nearly the first 6 pages is just introduction and background material. Likewise the paper consumes a considerable amount of space presenting technical results related to knapsack problems and various epsilon-accurate solutions, but this theoretical content seems somewhat irrelevant and distracting since it is not directly related to the greedy approximation strategy actually used for practical deployment. Much of this material could have been moved to the supplementary so as to adhere to the 8 page soft limit. Per the ICLR reviewer instructions, papers deemed unnecessarily long relative to this length should be judged more critically. 
+ +Another issue relates to the use of a mask for controlling the sparsity of network inputs. Although not acknowledged, similar techniques are already used to prune the activations of deep networks for compression. In particular, various forms of variational dropout essentially use multiplicative weights to remove the influence of activations and/or other network components similar to the mask M used is this work. Representative examples include Neklyudov et al., ""Structured Bayesian Pruning via Log-Normal Multiplicative Noise,"" NIPS 2017 and Louizos et al., ""Bayesian Compression for Deep Learning,"" NIPS 2017, but there are many other related alternatives using some form of trainable gate or mask, possibly stochastic, to affect pruning (the major ML and CV conferences over the past year have numerous related compression papers). So I don't consider this aspect of the paper to be new in any significant way. + +Moreover, for the empirical comparisons it would be better to compare against state-of-the-art compression methods as opposed to just the stated MP and SSL methods from 2015 and 2016 respectively. Despite claims to the contrary on page 9, I would not consider these to be state-of-the-art methods at this point. + +Another comment I have regarding the experiments is that hyperparameters and the use of knowledge distillation were potentially tuned for the proposed method and then simultaneously applied to the competing algorithms for the sake of head-to-head comparison. But to me, if these enhancements are to be included at all, tuning must be done carefully and independently for each algorithm. Was this actually done? Moreover it would have been nice to see results without the confounding influence of distillation to isolate sources of improvement, but no ablation studies were presented. + +Finally, regarding the content in Section 5, the paper carefully presents an explicit bound on energy that ultimately leads to a constraint that is NP-hard just to project on to, although approximate solutions exist that depend on some error tolerance. However, even this requires an algorithm that is dismissed as ""complicated."" Instead a greedy alternative is derived in the Appendix which presumably serves as the final endorsed approach. But at this point it is no longer clear to me exactly what performance guarantees remain with respect to the energy bound. Theorem 3 presents a fairly inscrutable bound, and it is not at all transparent how to interpret this in any practical sense. Note that after Theorem 3, conditions are described whereby an optimal projection can be obtained, but these seem highly nuanced, and unlikely to apply in most cases. + +Additionally, it would appear that crude bounds on the energy could also be introduced by simply penalizing/constraining the sparsity on each layer, which leads to a much simpler projection step. For example, a simple affine function of the L0 norm would be much easier to optimize and could serve as a loose bound on the energy, given that the latter should be a non-decreasing function of the L0 norm. Any idea how such a bound compares to those presented given all the approximations and greedy steps that must be included? + + +Other comments: +- As an implementation heuristic, the proposed Algorithm 1 gradually decays the parameter q, which controls the sparsity of the mask M. But this will certainly alter the energy budget, and I wonder how important it is to employ a complex energy constraint if minimization requires this type of heuristic. 
+ +- I did not see where the quantity L(M,W) embedded in eq. (17) was formally defined, although I can guess what it is. + +- In general it is somewhat troublesome that, on top of a complex, non-convex deep network energy function, just the small subproblem required for projecting onto the energy constraint is NP-hard. Even if approximations are possible, I wonder if this extra complexity is always worth it relative so simple sparsity-based compression methods which can be efficiently implemented with exactly closed-form projections. + +- In Table 1, the proposed method is highlighted as having the smallest accuracy drop on SqueezeNet. But this is not true, EAP is lower. Likewise on AlexNet, NetAdapt has an equally optimal energy.",7,4.0,ICLR2019 +B1xV5ifgT7,2,HylDpoActX,HylDpoActX,in-depth analysis is needed for this paper,"This paper is about CNN model compression and inference acceleration using quantization. The main idea is to use 'nest' clustering for weight quantization, more specifically, it partitions the weight values by recurring partitioning the weights by arithmetic means and negative of that of that weight clustering. + +I have several questions for this paper: + +1) the main algorithm is mainly based on the hypothesis that the weights are with Gaussian distribution. What happens if the weights are not Gaussian, such as skewed distribution? Seems the outliners will bring lots of issues for this nest clustering for partitioning the weight values. + +2) Since the paper is on inference acceleration, there is no real inference time result. I think having some real inference time on these quantized models and showing how their inference time speedup is will be interesting. + +3) Activation quantization in Section 4 is a standard way for quantization, but I am curious how to filter out the outliner, and how to set the clipping interval? + +4) I am not sure what does the 'sparsity' mean in Table 2? Does this quantization scheme introduce many zeros? Or sparsity is corresponding to the compression ratio? If that is the case, then many quantization algorithms can actually achieve better compression ratios with 2 bits quantization.",4,4.0,ICLR2019 +OcGDojF6myE,1,NqWY3s0SILo,NqWY3s0SILo,Novel but should compare with NAO and evaluate on ImageNet or CIFAR10.,"#### Summary: +This work propose Graph Optimized Neural Architecture Learning, that uses a differentiable surrogate model to directly optimize the graph structures. More specifically, the surrogate model takes a graph structure as the neural architecture embedding and predicts a relative ranking, then applies gradient descent on the input graph structure to optimize the neural architecture. GOAL demonstrates superior performance compared to SoTAs. + +#### Weakness: + +-First, the method is quite similar to NAO, where an encoder and decoder approach the maps neural architectures into a continuous space and builds a predictor based on the latent representation. The only difference is that NAO is use a decoder to decode the optimized latent representation back to architecture representation while here GOAL applies gradient descent on a graph neural network. Also, how to decode the parameters back to the neural architecture discrete representation is not clearly explained in the paper. + +-Second, this method can work well for small models and small search spaces, but can be hardly applied to larger models. Training the surrogate function for a larger search space or larger models can take more samples and more training time (e.g. 
large models takes much longer time to train, thus even a proxy accuracy should take more time to evaluate). A parameter sharing scheme can be very inaccurate in the beginning therefore results in suboptimal architecture selection. The inaccuracy is compounded when using the same model for both a surrogate function and neural architecture search. The evaluation on only NAS-bench partially verifies the reviewers concerns. + +#### Detailed feedback: +The reviewer would like to suggest several fixes to this paper: +-First, try to compare to better baselines (e.g. NAO, more recent differentiable search work) on not only NASBench, but on real CIFAR10 or ImageNet workloads. + +-Second, compare a non graph neural network based approach with GOAL and show the necessity of using a graph neural network. + +-Third, be more clear about decoding back the graph neural network parameterizations to the neural architecture representation. Include more details on the number of samples used to train the surrogate function and hyperparameteers used in algorithm 1. + + +[1] ""Neural Architecture Optimization"", https://arxiv.org/abs/1808.07233",5,4.0,ICLR2021 +HJlRfyx_aQ,3,HkNGYjR9FX,HkNGYjR9FX,Achieving binary/ternary LSTMs using batch normalization within recurrent layers,"The paper proposes a method to achieve binary and ternary quantization for recurrent networks. The key contribution is applying batch normalization to both input matrix vector and hidden matrix vector products within recurrent layers in order to preserve accuracy. The authors demonstrate accuracy benefits on a variety of datasets including language modeling (character and word level), MNIST sequence, and question answering. A hardware implementation based on DaDianNao is provided as well. + +Strengths +- The authors propose a relatively simple and easy to understand methodology for achieving aggressive binary and ternary quantization. +- The authors present compelling accuracy benefits on a range of datasets. + +Weaknesses / Questions +- While the application of batch normalization demonstrates good results, having more compelling results on why covariate shift is such a problem in LSTMs would be helpful. Is this methodology applicable to other recurrent layers like RNNs and GRUs? +- Does applying batch normalization across layer boundaries or at the end of each time-step help? This may incur lower overhead during inference and training time compared to applying batch normalization to the output of each matrix vector product (inputs and hidden-states). +- Does training with batch-normalization add additional complexity to the training process? I imagine current DL framework do not efficiently parallelize applying batch normalization on both input and hidden matrix vector products. +- It would be nice to have more intuition on what execution time overheads batch-normalization applies during inference on a CPU or GPU. That is, without a hardware accelerator what are the run-time costs, if any. +- The hardware implementation could have much more detail. First, where are the area and power savings coming from. It would be nice to have a breakdown of on-chip SRAM for weights and activations vs. required DRAM memory. Similarly having a breakdown of power in terms of on-chip memory, off-chip memory, and compute would be helpful. +- The hardware accelerator baseline assumes a 12-bit weight and activation quantization. Is this the best that can be achieved without sacrificing accuracy compared to floating point representation? 
Does adding batch normalization to intermediate matrix-vector products increase the required bit width for activations to preserve accuracy? + +Other comments +- Preceding section 3.2 there no real discussion on batch normalization and covariate shift which are central to the work’s contribution. It would be nice to include this in the introduction to guide the reader. +- It is unclear why DaDianNao was chosen as the baseline hardware implementation as opposed to other hardware accelerator implementations such as TPU like dataflows or the open-source NVDLA. +",6,3.0,ICLR2019 +B1x9LHd6tr,2,SJlpy64tvB,SJlpy64tvB,Official Blind Review #2,"The paper proposed a novel approach for robust continual learning model from the adversarial attack. The authors start from one of the state-of-the-art episodic memory based continual learning method, A-GEM. The proposed method, Gradient Reversion (GREV), is specialized on A-GEM. The method perturbed the episodic memory examples on A-GEM, thus modifies the direction of reference gradient. While conventional attack like Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) hardly show the their influence on A-GEM, GREV significantly attacks the performance. + + +The paper is well written, and easy to follow. Also, attack technique on episodic memory based continual learning is interesting and would be valuable. + +But, I feel that some of the analysis are obvious which are not much meaningful to analyze the model, and the overall contributions are suggested under A-GEM model, while not to cover generic other episodic-based continual learning. + +So, I hesitate to give the high score even the approach is interesting. + +Additional one question. +I didn't get the concrete reasons that A-GEM is already robust for famous attack methods. What’s the reason that A-GEM is robust for them? +",3,,ICLR2020 +BhmdttMeRJJ,1,lfJpQn3xPV-,lfJpQn3xPV-,Interesting problem; technical contribution is limited given prior work,"Temporal graphs can naturally model many real-world networks, and many graph neural network (GNN)-based methods have been proposed recently. Existing temporal GNNs can handle vertices and edges appearing / disappearing over time, but not vertex classes. This paper precisely considers this problem, and +1) compiles three vertex classification datasets for future research, +2) proposes an experimental procedure for evaluating performance under this setting, +3) explores 5 existing GNNs, and concludes that incremental training for limited periods is as good as that over full timelines. + + + +## Pros +1) (Motivation) It is reasonable to assume that new classes can appear over time in real-world networks. It is also worth investigating whether the full temporal graph (seen so far) is actually required for GNN neighbourhood aggregation in the current timestep. +2) (Relevance) Learning representations on temporal graphs is a challenging, fast-growing topic, and relevant to the ICLR community. + + + +## Cons +1) (Soundness) Tables 2, 3, and 4 compare accuracies of different static GNNs with varying window sizes (proposed idea) and with full graph (existing idea) which is informative. However, to increase the impact of the paper, the proposed idea (with static GNNs) should also be compared against state-of-the-art temporal GNNs on full graphs (in all these tables). 
As already cited by the authors, recent temporal GNNs include (but are not limited to) +(a) EvolveGCN: Evolving Graph Convolutional Networks for Dynamic Graphs, In AAAI'20, +(b) Inductive Representation Learning on Temporal Graphs, In ICLR'20. +2) (Significance) The experiments in the paper are restricted to multi-class vertex classification with new classes appearing over time (in just one dataset domain based on scientific publications). The authors should clarify what challenges one would face for multi-label classification commonly seen with some datasets (e.g. social networks). It would be more convincing if experiments were also conducted on link prediction (e.g. social network link prediction with new classes i.e. communities appearing over time). +3) (Originality) Although the assumptions (classes appearing/disappearing over time), evaluation procedure, and datasets have not been considered / proposed before, the novelty of the paper is quite limited. As also acknowledged by the authors, the paper explores well-known existing static GNNs for temporal graphs. From this point of view, the paper is of limited originality since it explores well-known algorithms in an unexplored setting. + + + +To summarise, the paper has strong arguments along the axis of motivation but the major weaknesses outweigh the strengths.",5,4.0,ICLR2021 +mTIm92kcKn_,2,4AWko4A35ss,4AWko4A35ss,Performance is nice but lack insights,"The paper presents a novel pretext task for self-supervised video representation learning (SSVRL). The authors design several surrogate tasks for tackling intentionally constructed constrained spatiotemporal jigsaw puzzles. The learned representations during training to solve the surrogate tasks can be transferred to other video tasks. The proposed method shows superior performances than state-of-the-art SSVRL approaches on action recognition and video retrieval benchmarks. + +## Strengths: + ++. Good performances on two benchmarks. + ++. Carefully designed surrogate tasks. + +## Weaknesses: + +-. Lack insightful analysis of how the idea is inspired, why it works. It seems the intuition of the paper is to make the 3D jigsaw problem easier to solve and it will just work. But why the easier problem could help learn better representations? Each of the two steps making the problem easier need to be analyzed more thoroughly: first, making the unconstrained jigsaw problem constrained; second, solving the surrogate tasks instead of solving the constrained jigsaw problem. Actually, the carefully designed surrogate tasks are quite different from the constrained jigsaw problem. They seems more ad-hoc but not a principled way to tackle the jigsaw problem. All these questions need more indepth clarification. + +-. Experimental analysis is not thorough. In case the proposed method is not a principled method, but a carefully designed method. Extensive experiments of different variations of the proposed method could help better understand why the method works. A good performance on well-established benchmarks might be impressive, but analysis of why the performance can be achieved is more important. + +-. Writing needs improvements. In the exposition of the proposed method section, some sentences are casual and misleading. For example, the third paragraph of sec 3.2. Besides, section 3.3 is a little bit difficult to follow. It could be possibly revised more concisely. + +## Summary + +Overall, the paper presents yet another method to design the pretext task for SSVRL. 
But my major concern is it lacks enough insights for inspiring future research for this topic. It might not be good enough for ICLR. +",5,4.0,ICLR2021 +HklQyb9637,2,SyVU6s05K7,SyVU6s05K7,The proposed DFW lacks of sufficient novelty and the presented performance improvement needs more theoretical justification.,"This paper proposes a Frank-Wolfe based method, called DFW, for training Deep Network. The DFW method linearizes the loss function into a smooth one, and also adopts Nesterov Momentum to accelerate the training. Both techniques have been widely used in the literature for similar settings. This paper mainly focuses on the algorithm part, but only empirically demonstrate the convergence results. + +After reading the authors’ feedback and the paper again, I think overall this is a good paper and should be of broader interest to the broader audience in machine learning community. + +In Section 6.1, the authors mention the good generalization is due to large number of steps at a high learning rate. Can we possibly get any theoretical justification on this? + +This paper uses multi class hinge loss as an example for illustration. Can this approach be applied for structure prediction, for example, various ranking loss? ",7,4.0,ICLR2019 +S1xrV0aqFB,2,rJx0Q6EFPB,rJx0Q6EFPB,Official Blind Review #1,"This paper proposes a new knowledge distillation method for BERT models. A number of modifications to the vanilla knowledge distillation method of Hinton et al (2015) are proposed. First, authors suggest adding L2 loss functions between alignment matrices, embedding layer values and prediction layer values. Second, authors propose run knowledge-distillation twice, once with the original pre-trained BERT model as teacher, and then again with task specific fine-tuned BERT as a new teacher. Third, authors emphasize the use of data augmentation for successful knowledge distillation. In Table 2, authors claim a significant lift across GLUE benchmarks with respect to other baseline methods with comparable model size. + +While the main contribution of this paper is the proposal of empirically useful techniques than theoretical development, the empirical results reported in this paper are somewhat puzzling. + +First of all, GLUE benchmark scores reported in Table 2 don't seem to be consistent with Table 1 of Sun et al (2019) for BERT-PKD ( https://arxiv.org/pdf/1908.09355.pdf ) or DistilBERT ( https://medium.com/huggingface/distilbert-8cf3380435b5 ). Indeed, BERT-PKD in Sun et al seems to significantly outperform TinyBERT on QNLI (89.0 vs 87.7) and RTE (65.5 vs 62.9), and the gap between BERT-PKD and TinyBERT on other tasks are much smaller if we take numbers reported in the original paper. + +In Table 6, ablation studies with different distillation objectives are reported. Quite surprisingly, without Transformer-layer distillation (No Trm) the performance drops quite significantly. This is unexpected, because baselines such as Sun et al and DistilBERT do not use the Transformer-layer distillation but much more competitive to full TinyBERT than TinyBERT without Transformer-layer distillation. Would there be a reason why TinyBERT is so critically dependent on Transformer-layer distillation? Similarly, the removal of data augmentation (Table 5, No DA) is so detrimental to the performance of the model that it makes me to suspect whether the most of gain is from successful data augmentation. 
Indeed, 'No DA' row of Table 5 is very close to the performance BERT-PKD in Table 4, although the number of layers is different (4 vs 6). + +In order for the research community to understand the contribution of proposed techniques more thoroughly, I suggest authors to conduct ablation studies with the simplest baseline. That is, rather than starting with the full TinyBERT model, start with a simple but competitive baseline like BERT-PKD, and only add one technique (DA, GD, Transformer-layer distillation) at a time so that readers shall understand what technique is the most important to be added to the baseline, and also whether some of the proposed techniques should always be used in combination. + +--- +After Author Rebuttal: authors have addressed all of my concerns quite clearly. Additional experiments which targeted a specific design choice at a time made me much more convinced that the techniques proposed in this paper are useful not only for this particular context but also more broadly applicable.",8,,ICLR2020 +SJgEnknh2m,2,BJgYl205tQ,BJgYl205tQ,"Authors coupled a local intrinsic dimensionality measure to assess GAN frameworks concerning their ability to generate realistic data. The proposal is straightforward and would be applied in different GAN-based approaches, mainly, being sensitive to mode collapse.","The paper is clear regarding motivation, related work, and mathematical foundations. The introduced cross-local intrinsic dimensionality- (CLID) seems to be naive but practical for GAN assessment. In general, the experimental results seem to be convincing and illustrative. + +Pros: +- Clear mathematical foundations and fair experimental results. +- CLID can be applied to favor GAN-based training, which is an up-to-date research topic. +- Robustness against mode collapse (typical discrimination issue). + +Cons: +-The CLID highly depends on the predefined neighborhood size, which is not studied properly during the paper. Authors suggest some experimentally fixed values, but a proper analysis (at least empirically), would be useful for the readers. +- The robustness against input noise is studied only for small values, which is not completely realistic.",6,4.0,ICLR2019 +SyqiSbomf,4,S1vuO-bCW,S1vuO-bCW,Great idea but the write up needs to be made clearer,"This paper proposes the idea of having an agent learning a policy that resets the agent's state to one of the states drawn from the distribution of starting states. The agent learns such policy while also learning how to solve the actual task. This approach generates more autonomous agents that require fewer human interventions in the learning process. This is a very elegant and general idea, where the value function learned in the reset task also encodes some measure of safety in the environment. + +All that being said, I gave this paper a score of 6 because two aspects that seem fundamental to me are not clear in the paper. If clarified, I'd happily increase my score. + +1) *Defining state visitation/equality in the function approximation setting:* The main idea behind the proposed algorithm is to ensure that ""when the reset policy is executed from any state, the distribution over final states matches the initial state distribution p_0"". This is formally described, for example, in line 13 of Algorithm 1. 
+The authors ""define a set of safe states S_{reset} \subseteq S, and say that we are in an irreversible state if the set of states visited by the reset policy over the past N episodes is disjoint from S_{reset}."" However, it is not clear to me how one can uniquely identify a state in the function approximation case. Obviously, it is straightforward to apply such definition in the tabular case, where counting state visitation is easy. However, how do we count state visitation in continuous domains? Did the authors manually define the range of each joint/torque/angle that characterizes the start state? In a control task from pixels, for example, would the exact configuration of pixels seen at the beginning be the start state? Defining state visitation in the function approximation setting is not trivial and it seems to me the authors just glossed over it, despite being essential to your work. + +2) *Experimental design for Figure 5*: This setup is not clear to me at all and in fact, my first reaction is to say it is wrong. An episodic task is generally defined as: the agent starts in a state drawn from the distribution of starting states and at the moment it reaches the goal state, the task is reset and the agent starts again. It doesn't seem to be what the authors did, is that right? The sentence: ""our method learns to solve this task by automatically resetting the environment after each episode, so the forward policy can practice catching the ball when initialized below the cup"" is confusion. When is the task reset to the ""status quo"" approach? Also, let's say an agent takes 50 time steps to reach the goal and then it decides to do a soft-reset. Are the time steps it is spending on its soft-reset being taken into account when generating the reported results? + + +Some other minor points are: + +- The authors should standardize their use of citations in the paper. Sometimes there are way too many parentheses in a reference. For example: ""manual resets are necessary when the robot or environment breaks (e.g. Gandhi et al. (2017))"", or ""Our methods can also be used directly with any other Q-learning methods ((Watkins & Dayan, 1992; Mnih et al., 2013; Gu et al., 2017; Amos et al., 2016; Metz et al., 2017))"" + +- There is a whole line of work in safe RL that is not acknowledged in the related work section. Representative papers are: + [1] Philip S. Thomas, Georgios Theocharous, Mohammad Ghavamzadeh: High-Confidence Off-Policy Evaluation. AAAI 2015: 3000-3006 + [2] Philip S. Thomas, Georgios Theocharous, Mohammad Ghavamzadeh: High Confidence Policy Improvement. ICML 2015: 2380-2388 + +- In the Preliminaries Section the next state is said to be drawn from s_{t+1} ~ P(s'| s, a). However, this hides the fact the next state is dependent on the environment dynamics and on the policy being followed. I think it would be clearer if written: s_{t+1} ~ P(s'| s, \pi(a|s)). + +- It seems to me that, in Algorithm 1, the name 'Act' is misleading. Shouldn't it be 'ChooseAction' or 'EpsilonGreedy'? If I understand correctly, the function 'Act' just returns the action to be executed, while the function 'Step' is the one that actually executes the action. + +- It is absolutely essential to depict the confidence intervals in the plots in Figure 3. Ideally we should have confidence intervals in all the plots in the paper.",7,4.0,ICLR2018 +qsa7qxh0vUa,1,nQxCYIFk7Rz,nQxCYIFk7Rz,My decision hinges on a very convincing explanation on the model settings and motivation. 
Currently I consider a reject.,"This paper studies the double/multiple descents of the prediction error curve for linear regression in both the under-parameterized and over-parameterized regime. + +The strength: while many papers have studied the double descent for linear regression estimate or the minimum $\ell_2$ norm solution, this paper shows multiple descents when d=O($\sqrt{n}$) a setting is barely studied by others. Further, while multiple descents have been numerically discovered by other concurrent works, they have theoretically proved that such multiple descents exist. + +The weakness: The major weakness of the paper is the model settings. Specifically, 1) it is unclear why the prediction error is normalized by the number of features, and 2) the bias term is left out in the prediction error due to the true coefficients being zero and only the variance term is considered. First, for normalization, the authors claim that this normalization is necessary for comparison. Indeed, the entire results are hinged on this normalization, i.e., without the normalization, the proof can NOT show the existence of the multiple descents neither in under-parameterized regime nor overparameterized regime. The reasons I found this normalization is weird are the following: + +i) Normal linear regression problem does not have such normalization on the prediction error. It is unclear why we want to divide a one-dimensional error by the feature size. + +ii) Other double descent works mainly deliver two messages: + +a) Given a fixed sample size, what is the best model gives the best estimate of the response. The answer is a larger model, i,e, adding more features, may help. (e.g. Xu's PCR paper) + +b) Given a fixed feature size, what is the best sample size that gives the best estimate of the response. The answer is using a smaller sample size may help. (e.g. Hastie's double descent paper) + +For both cases, I do not see any reason to normalize the prediction error of response by feature size. If this normalization is for the purpose of the model selection penalty, it is unclear why we should encourage a larger model instead of penalizing it. + +A reasonable quantity for such normalization is the MSE of the coefficient, i.e., $\|\hat{\beta}-\beta^*\|^2$. There are many applications where people are more interested in the coefficients rather than the response. Maybe the authors should consider this quantity instead of the prediction error. + +For the second weakness of the model settings, the bias term has been left out of the prediction error when the true coefficients are assumed to be all zero. Because of this setting, all features are just pure noise, irrelevant to the response. Then, we can check that 0 is the best estimate when all features are just pure noise, and it seems that there is no motivation for us to learn anything from the random noise. If the main purpose of this paper is to deliver a message that using only irrelevant features and adding more of them can help to improve the prediction error, this effect is known already in those double descent paper in the overparameterized regime. Showing multiple descents does not add much value because it never beats the trivial estimate 0 in this setting. + +Because of these major weakness, I recommend rejection for this paper. But I will possibly change my evaluation if the authors can provide a very convincing explanation of the model settings and motivation. 
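To make my first concern concrete, here is the comparison I have in mind, written in my own notation rather than the paper's (so the symbols below are my assumptions about the setup, not quotes from it): the quantity I would normally call the prediction risk of an estimate \hat{\beta} under y = x^T \beta^* + \epsilon is R(\hat{\beta}) = E_{x_0}[(x_0^T \hat{\beta} - x_0^T \beta^*)^2], whereas, as I read the setup, the paper analyzes R(\hat{\beta})/d with d the number of features. Note that for isotropic features, E[x_0 x_0^T] = I_d, the unnormalized risk reduces to R(\hat{\beta}) = \|\hat{\beta} - \beta^*\|^2, i.e., exactly the coefficient MSE suggested above, so either of these two quantities could be reported without dividing by the feature count that the multiple-descent shape appears to rely on.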
+ +Besides these, another suggestion for the paper is that the proof of the Theorems and the statement of Lemmas takes a lot of places. I think they can be replaced by more detailed discussions of the model settings and messages or conclusions from the main theorems. For example, is there any intuition about what kind of multiple descents curve is more favorable? Also, despite the attractive title, I think it is still hard to design the generalization curve without taking the bias term into consideration. The room can be left for the analysis of the bias term. + +After response: +Thanks for addressing the concern about normalization. It appears that other reviewers have a concern about such normalization as well. I suggest the authors remove the results with normalization entirely from the main paper and only have it in the appendix for anyone that is interested in such normalization. + +On the other hand, without normalization, the results have changed for the under-parameterized regime (which makes more sense to me) and the proof looks quite different in the over-parameterized regime as well. I did not have time to check the proof and I believe it is better to resubmit the paper as new because of the major changes. + +Finally, I still have concerns about the fact that only variance is discussed. I suggest the authors state their results in a setting where both bias and variance exists and the features added to the model are related to the response. Otherwise, it is a weird message that it is good to add pure noise as features. It feels like although we can design multiple descents in the overparameterized regime when noise is large, it is very likely that the 0 estimate achieves the best prediction risk. So there is no point to go into overparameterization and multiple descents at all. + +In summary, I have raised the score to 5. I believe it can be 6 or 7 if all issues are addressed, but I am afraid that the paper looks basically new after these changes and thus I am not sure whether it should be still considered for this conference. +",5,4.0,ICLR2021 +r1xUKwB0FH,1,B1xu6yStPH,B1xu6yStPH,Official Blind Review #1,"This paper suggests a method for detecting adversarial attacks known as EXAID, which leverages deep learning explainability techniques to detect adversarial examples. The method works by looking at the prediction made by the classifier as well as the output of the explainability method, and labelling the input as an adversarial example if the predicted class is inconsistent with the model explanation. EXAID uses Shapley values as the explanation technique, and is shown to successfully detect many standard first-order attacks. + +Though method is well-presented and the evaluation is substantial, the threat model of the oblivious adversary is unconvincing. The paper makes the argument that oblivious adversaries are more prevalent in the real world, but several works [1,2,3,etc.] have shown that with only query access to input-label pairs from a deep learning-based system, it is possible to construct black-box adversarial attacks. Thus, it is unclear why an attacker cannot just treat the detection mechanism as part of this black box, and mount a successful query-based attack. + +Though I recognize that the task of detection is separate from the task of robust classification, in both cases the defender should at least operate in the case where the attacker has input-output access to the end-to-end system (including whatever detection mechanisms are present). 
In particular, it seems impossible to ""hide"" a detector from an end user (when the method detects an adversarial example, it will alert the user somehow that the input was rejected), and so the user will be able to use this information to fool the system. The authors should investigate the white-box accuracy of their detection system, or at the very least try black-box attacks against the detector. For this reason I do not recommend acceptance for the paper at this time. + +[1] https://arxiv.org/abs/1804.08598 +[2] https://arxiv.org/abs/1807.04457 +[3] https://arxiv.org/abs/1712.04248",3,,ICLR2020 +ryT2f8KgM,2,SJx9GQb0-,SJx9GQb0-,"The paper continues a line of improvement to Wasserstein GANs, and suggests an approach based a double perturbation of each data point, penalizing deviations from Lipshitz-ness. Empirical results demonstrate the effectiveness of the proposal. ","This paper continues a trend of incremental improvements to Wasserstein GANs (WGAN), where the latter were proposed in order to alleviate the difficulties encountered in training GANs. Originally, Arjovsky et al. [1] argued that the Wasserstein distance was superior to many others typically used for GANs. An important feature of WGANs is the requirement for the discriminator to be 1-Lipschitz, which [1] achieved simply by clipping the network weights. Recently, Gulrajani et al. [2] proposed a gradient penalty ""encouraging"" the discriminator to be 1-Lipschitz. However, their approach estimated continuity on points between the generated and the real samples, and thus could fail to guarantee Lipschitz-ness at the early training stages. The paper under review overcomes this drawback by estimating the continuity on perturbations of the real samples. Together with various technical improvements, this leads to state-of-the-art practical performance both in terms of generated images and in semi-supervised learning. + +In terms of novelty, the paper provides one core conceptual idea followed by several tweaks aimed at improving the practical performance of GANs. The key conceptual idea is to perturb each data point twice and use a Lipschitz constant to bound the difference in the discriminator’s response on the perturbed points. The proposed method is used in eq. (6) together with the gradient penalty from [2]. The authors found that directly perturbing the data with Gaussian noise led to inferior results and therefore propose to perturb the hidden layers using dropout. For supervised learning they demonstrate less overfitting for both MNIST and CIFAR 10. They also extend their framework to the semi-supervised setting of Salismans et al 2016 and report improved image generation. + +The authors do an excellent comparative job in presenting their experiments. They compare numerous techniques (e.g., Gaussian noise, dropout) and demonstrates the applicability of the approach for a wide range of tasks. They use several criteria to evaluate their performance (images, inception score, semi-supervised learning, overfitting, weight histogram) and compare against a wide range of competing papers. + +Where the paper could perhaps be slightly improved is writing clarity. In particular, the discussion of M and M' is vital to the point of the paper, but could be written in a more transparent manner. The same goes for the semi-supervised experiment details and the CIFAR-10 augmentation process. Finally, the title seems uninformative. 
Almost all progress is incremental, and the authors modestly give credit to both [1] and [2], but the title is neither memorable nor useful in expressing the novel idea. +[1] Martin Arjovsky, Soumith Chintala, and Leon Bottou. Wasserstein gan. + +[2] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of wasserstein gans. + +",7,4.0,ICLR2018 +HMLeBTQRP5e,3,qpsl2dR9twy,qpsl2dR9twy, well-written and technically sounds,"Summary: This work proposes to use 'intention' of each agent to enhance the message sharing scheme of MARL. For training, MADDPG is used as a backbone MARL and implement ITGM and AM on top of it. The proposed method is compared with several baseline approaches from three different environments. + +Strengths: ++ The paper is well-written and technically sounds. ++ The motivation is clear. + +Weaknesses: +- This work wants to see the effectiveness of the use of intention as a communication scheme under MARL. However, in general, this is not the first to use intention. I suggest to review some relevant works and show how 'intention' has been explored in a similar/different way in the literature. +- Although visualization of imagined trajectory is given in Figure 4, it does not fully demonstrate the idea of this approach. More qualitative evaluation seems required to validate from various aspects. +- It seems that attention is used to capture which waypoint is more important in an imagined trajectory rather than whose imagined trajectory is more important. If so, is it really important? Where is the ablative study of using attention? +- Intuitively, when two agents set the same goal (catching same prey), what is the rationale of determining who maintain the goal or who change the plan? ",6,3.0,ICLR2021 +xysqyfLEoyN,3,PUkhWz65dy5,PUkhWz65dy5,"Strong and non-trivial theoretical contributions, interesting empirical insight that connects directly to the theory","Summary: the authors propose to solve a family of related tasks with shared features and rewards that are linear in the features and equivalent up to scaling factor. The main contributions are as follows: +- a novel framework for analyzing a broad family of generalized policies (policies that are generalized to arbitrary rewards in the task space), including the concept of a set improving policy (SIP), and providing two practical examples that fit this definition, namely the worst case set max policy (SMP) and the well known and studied generalized policy iteration (GPI). It is shown that it is always better to use GPI over SMP, making it an instance of SIP. +- a novel iterative method for building a policy library for solving the worst-case reward, formulated as a convex optimization problem, along with policy improvement guarantees, an informed method for stopping the algorithm, and the ability to remove redundant policies (termed inactive policies) +- an empirical evaluation that connects the proposed method to learning a policy library with a diverse set of skills. The theoretical results are also validated experimentally, on a grid world example and control problems from Deepmind. 
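(To keep my reading of the two baseline constructions explicit — this is my own shorthand and an assumption about the setup, not the authors' notation: the set max policy, as I understand it, commits to one library member per task, \pi^{SMP}_w \in \arg\max_{\pi \in \Pi} v^{\pi}(w), whereas GPI in the sense of Barreto et al., 2017 re-selects among the library at every state, \pi^{GPI}_w(s) \in \arg\max_{a} \max_{\pi \in \Pi} Q^{\pi}_w(s, a). Under that reading, GPI matches or improves on SMP for every task w, which is consistent with the paper's statement that it is always better to use GPI over SMP.)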
+ +Pros: +- the work is of very high quality, all motivations seem sound and the theoretical results seem correct +- the idea of active task selection for building the policy library is very interesting, and it is surprising that this has not been considered within the framework of Barreto et al., 2017 so far +- the work could be of significance in the apprenticeship/curriculum/meta-RL community, and it is nice to see a more theoretical treatment of this topic + +Questions: +- If my understanding is correct, the authors use the orthogonal and random basis to propose w at each iteration, but evaluate the resulting SMP policies with respect to the optimized rewards from (8). I am wondering if this is a fair evaluation for the baselines, given that the policies are always evaluated on $w_t^{SMP}$, or whether a new set of tasks (a proper ""test"" set) sampled from B (the standard ball) should be used to fairly compare (8) with the baselines? This would really test the generalization of the method on new instances as well, and is also often standard in the literature for evaluating the performance of a learning policy set. In other words, how robust is the resulting policy library to solving new task instances not previously seen before? +- Also, one thing that could explain the poor performance of the orthogonal baseline is that the reward seems to be quite sparse when most of the basis elements are set to zero (in the one-hot phi case, wouldn't they be almost always uninformative?) In this case, a more suitable baseline that directly targets diversity could be defined as finding the $w_1, w_2 \dots w_T$ such that their coverage of the task space is maximized under some prior belief over w (e.g. the standard ball). If I am not mistaken, this problem is similar to the maximum coverage or voronoi tessellation problem, which could be solved in advance and then deployed. (e.g. Arslan, 2016) +- Performing well relative to the worst-case performance seems reasonable so that the agent does not do poorly on any one task, but it could also be overly conservative. That is, could there be situations where optimizing the worst case leads to the agent not successfully completing the desired objective (e.g getting stuck on locally optimal solution)? +- at each iteration when the new optimal policy is learned with respect to $w_\Pi^{SMP}$, is the idea of SMP or GPI and previously learned policies used to help learn this new policy, or is it learned entirely from scratch (e.g. by simple epsilon-greedy)? + +Minor comments: +- the legends in Figure 1a/b and the axis font in Figure 1c could be increased, same with Figure 2 +- is the $\max_i$ necessary in equation (8)? + +Overall, this works proposes a coherent theory for policy improvement, that also leads to useful implementation and interesting empirical insight (and cool visualizations). It can often be hard to obtain all of these at once. + +Arslan, Omur, and Daniel E. Koditschek. ""Voronoi-based coverage control of heterogeneous disk-shaped robots."" 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2016.",7,4.0,ICLR2021 +SJeO5obc3X,2,HJMCcjAcYX,HJMCcjAcYX,Reinventing the (methodological) wheel?,"Update: From the perspective of a ""broader ML"" audience, I cannot recommend acceptance of this paper. The paper does not provide even a clear and concrete problem statement due to which it is difficult for me to appreciate the results. This is the only paper out of all ICLR2019 papers that I have reviewed / read which has such an issue. 
Of course for the conference, the area chair / program chairs can choose how to weigh the acceptance decisions between interest to the broader ML audience and the audience in the area of the paper. + +---------------------------------------------------------------------------------------------------------------------------------- + + This paper addresses the problem that often features are obtained as a set, whereas certain orders of these features are known to allow for easier learning. With this motivation the goal of this paper is to learn a permutation of the features. This paper makes the following three main contributions: +1. The idea of using pairwise comparison costs instead of position-based costs +2. The methodological crux of how to go from the pairwise comparison costs to the permutation (that is, solving Eqn. (2) using Eqn. (1) ) +3. An empirical evaluation + +I like the idea and the empirical evaluations are promising. However, I have a major concern about the second contribution on the method. There is a massive amount of literature on this very problem and a number of algorithms are proposed in the literature. This literature takes various forms including rank aggregation and most popularly the (weighted) minimum feedback arc set problem. The submitted paper is oblivious to this enormous literature both in the related work section as well as the empirical evaluations. I have listed below a few papers pertaining to various versions of the problem (this list is by no means exhaustive). With this issue, I cannot give a positive evaluation of this submitted paper since it is not clear whether the paper is just re-solving a solved problem. That said, I am happy to reconsider if the related work and the empirical evaluations are augmented with comparisons to the past literature on the methodological crux of the submitted paper (e.g., why off-the-shelf use of previously proposed algorithms may or may not suffice here.) + + +Unweighted feedback arc set: + +A fast and effective heuristic for the feedback arc set problem, Eades et al. + +Efficient Computation of Feedback Arc Set at Web-Scale, Simpson et al. + +How to rank with few errors, Kenyon-Mathieu et al. + +Aggregating Inconsistent Information: Ranking and Clustering, Ailon et al. + + +Hardness results: + +The Minimum Feedback Arc Set Problem is NP-hard for Tournaments, Charbit et al. + + +Weighted feedback arc set: + +A branch-and-bound algorithm to solve the linear ordering problem for weighted tournaments, Charon et al. + +Exact and heuristic algorithms for the weighted feedback arc set problem: A special case of the skew‐symmetric quadratic assignment problem, Flood + +Approximating Minimum Feedback Sets and Multicuts in Directed Graphs, Even et al. + + +Random inputs: + +Noisy sorting without resampling, Braverman et al. + +Stochastically transitive models for pairwise comparisons: Statistical and computational issues, Shah et al. + +On estimation in tournaments and graphs under monotonicity constraints, Chatterjee et al. + + +Survey (slightly dated): + +An updated survey on the linear ordering problem for weighted or unweighted tournaments, Charon et al. + + +Convex relaxation of permutation matrices: + +On convex relaxation of graph isomorphism, Afalo et al. 
+ +Facets of the linear ordering polytope, Grotschel + +",3,2.0,ICLR2019 +HkebXr-r9r,3,SylVJTNKDr,SylVJTNKDr,Official Blind Review #4,"This paper sets up a couple discrete communication games in which agents must communicate a discrete message in order to solve a classification task. The paper claims that networks trained to perform this case show a tendency to use as little information as necessary to solve the task. + +I vote to reject this paper. + +The experiments are not very interesting and I don't at all agree with the assertion of the paper. The paper claims the networks use only the entropy necessary to solve the task, but there are two main problems with this assertion. (1) their own experiments don't support this all that strongly, as in the limit of few hidden bits (left half of the x axis in Figure 1), the networks all had noticeable excess information, and (2) and perhaps most damning the paper applies entropy regularization on the sender during training? Could it perhaps instead be the fact that the entropy of the sender was penalized as an explicit regularization term that the entropy of the senders messages tended to be small? + +I also find the experimental design puzzling. Why both reinforce and the 'stochastic computation graph' approach? Treating the receiver's output as binary and stochastic without using the log loss of the bernoulli observation model is just giving up on a good gradient as far as the receiver is concerned. + +The experiments done are much to simple and the protocol flawed. + +The second set of experiments in Figure 3 were not left to converge, so I'm not sure how we can derive a great deal of insight. Additionally, relaxing the gumbel softmax channel to being continuous rather than discrete technically ruins any argument that there is an entropy bottleneck present anymore, as theoretically even a single dimensional continuous signal could store an arbitrary amount of information. If the paper wanted to, it could have upper bounded the mutual information between the input and the message using a variational upper bound. + +--------------- Response to Response --------------------------------- + +I'm editing here in light of continuing to look at the paper and the responses from the author below. + +I have to still argue for a rejection of this paper. + +I thank the authors for addressing my comments and I admit that at first I thought the paper was minimizing the entropy during training which would have been particularly bad. While I was mistaken on that point, I still believe the paper is deeply flawed. + +In particular, the paper makes a very bold claim, namely that ""We find that, under common +training procedures, the emergent languages are subject to an entropy minimization pressure that has also been detected in human language, whereby the mutual +information between the communicating agent’s inputs and the messages is minimized, within the range afforded by the need for successful communication."" But if we are being honest here, the experiments are very lacking to support such a bold claim. + +In particular there was one thing I was worried about upon reading the paper again, and is similar to the point raised by the other reviewers. In Figure 1, we are shown the entropy only of those networks that have succeeded. Naturally to succeed, the entropy i the message must be large enough to accomodate the size of the remaining bits we are trying to reconstruct. 
That is why Figure 1 includes the dotted line, since the networks must be above that line to have good performance. And the main evidence for the main claim of the paper is that the trained networks are above that line and arguably close to it. + +Now, we know that there are clearly solutions to these tasks (in particular the Guess Number task) which could achieve good performance at noticeably higher entropy. For instance we could take any minimal solution and simply split up each message into 8x different buckets, each of which had exactly the same behavior from the decoder. This would give us a +3 in the entropy of our message space while having no effect whatsoever on the loss. The claim of the paper is that under normal training procedures it seems like we don't find those solutions and instead seem to find minimal ones. + +But after implementing a simplified version of the experiment in the paper (Notebook available here: https://nbviewer.jupyter.org/urls/pastebin.com/raw/ZF7g34GN ) I suspect something much simpler is going on. The reason the solutions look minimal in Figure 1 is probably because the initialization chosen for the encoder they used in the paper tended to start at low entropies. Imagine if all of the networks started out with an initial message entropy of around 3 bits. Then Figure 1 could be explained by the problems with hidden bits ~< 3 bits simply preserved their entropy, which in order to solve the task with higher numbers of digits hidden we know requires some minimal budget, so they get sort of pushed up. This could explain the figure, but we wouldn't claim this explains why we observe small entropy for the high number of hidden digits case. + +In particular, if we initialized the encoders with higher entropy, we might expect that we fail to see this phenomenon. That is exactly what I was able to show for myself in that notebook. If you simply initialize the encoder to have high entropy, all of the solutions have high entropy and the observed effect goes away. + +Overall, the paper as I said is low quality. Several choices were made that don't make a lot of sense. With the experiments being as small scale as they were, why not explicitly marginalize out the message (as I did in the notebook)? Why use single layer neural networks to predict 256 x 1024 parameters? Why not just learn them directly? If the paper aimed to mimic more standard setups and show that under those setups we observe this kind of minimal message entropy, then it would have to much better tease out the effects of all of these choices. + +Why does the decoder use a mean field sigmoid bernoulli observation model to try to predict something is in one of ~32 states? The missing digits are not independent given the message, why model them as so? Is that part of the purported reason why these models show minimal entropy (cause it isn't discussed). + +For such a simple problem, you could presumably analytically compute the gradient with respect to the loss and study whether that correlates with the gradient of the entropy. There are several things I could imagine checking, none of which are checked in the paper. + +The primary question the paper addresses is an interesting one. But this paper does very little to carefully investigate that question. I maintain my vote to reject.",1,,ICLR2020 +AiQuud80Zd,1,KubHAaKdSr7,KubHAaKdSr7,Review,"The paper proposes a very interesting yet practical problem of modifying the memory encoded in transformer weight to unlearn some facts and update with new facts. 
This paper seems to be a follow-up work from FAE, which encodes symbolic facts in memory for retrieval. Generally speaking, I like the basic idea of this paper and it might have a broad impact on the whole community. However, there are still a lot of questions about the paper. +1) the paper seems to be written in a rush without refining, there are numerous serious typos and spellings errors, which affect my understanding a lot. For example, in 4.5.1, what is ""RT"", is it supposed to be ""RI""? In Figure 3, why is the 32, the figures are a mess. Why is the left showing ""32->512"" while the right showing ""32->128""? Why do you say it's sharp degrading, it's not that sharp reflected from Figure 3. I'm not sure if I misunderstand something. +2) The results are also quite messy. The algorithm without constrained optimization has its results reported in the table, while the algorithm with constrained optimization is reported in figures. The results with FAE is yet in another table far away. It's hard for me to compare them and draw a consistent conclusion. Is it possible to aggregate all the main results in one table and demonstrate all the ablation studies using Figures? Currently, the figures involved in 4.5.2 are distributed from page 6 - page 8, is it possible to aggregate them in a concentrated place? +3) Besides these details, I think the proposed method is somewhat ""not novel"". In lifelong learning or meta-learning community, such constrained optimization algorithms have been explored for a few years to prevent the mode from catastrophic forgetting. I don't think the paper makes any significant contribution to this aspect. +4) Overall, I still quite like the scope of this paper. I would like to see a more structured and clear version of the paper with more fundamental algorithm innovation. ",5,4.0,ICLR2021 +B1TCDcOxG,2,rkGZuJb0b,rkGZuJb0b,Interesting idea but more experimental validation required,"The paper presents a new parameterization of linear maps for use in neural networks, based on the Multiscale Entanglement Renormalization Ansatz (MERA). The basic idea is to use a hierarchical factorization of the linear map, that greatly reduces the number of parameters while still allowing for relatively complex interactions between variables to be modelled. A limited number of experiments on CIFAR10 suggests that the method may work a bit better than related factorizations. + +The paper contains interesting new ideas and is generally well written. However, a few things are not fully explained, and the experiments are too limited to be convincing. + + +Exposition +On a first reading, it is initially unclear why we are talking about higher order tensors at all. Usually, fully connected layers are written as matrix-vector multiplications. It is only on the bottom of page 3 that it is explained that we will reshape the input to a rank-k (k=12) tensor before applying the MERA factored map. It would be helpful to state this sooner. It would also be nice to state that (in the absense of any factorization of the weight tensor) a linear contraction of such a high-rank tensor is no less general than a matrix-vector multiplication. + +Most ML researchers will not know Haar measure. It would be more reader friendly to say something like ""uniform distribution over orthogonal matrices (i.e. Haar measure)"" or something like that. Explaining how to sample orthogonal matrices / tensors (e.g. by SVD) would be helpful as well. + +The article does not explain what ""disentanglers"" are. 
It is very important to explain this, because it will not be generally known by the machine learning audience, and is the main thing that distinguishes this work form earlier tree-based factorizations. + +On page 5 it is explained that the computational complexity of the proposed method is N^{log_2 D}. For D=2, this is better than a fully connected layer. Although this theoretical speedup may not currently have been realized, it perhaps could be achieved by a custom GPU kernel. It would be nice to highlight this potential benefit in the introduction. + + +Theoretical motivation +Although I find the theoretical motivation for the method somewhat compelling, some questions remain that the authors may want to address. For one thing, the paper talks about exploiting ""hierarchical / multiscale structure"", but this does not refer to the spatial multi-scale structure that is naturally present in images. Instead, the dimensions of a hidden activation vector are arbitrarily ordered, partitioned into pairs, and reshaped into a (2, 2, ..., 2) shape tensor. The pairing of dimensions determines the kinds of interactions the MERA layer can express. Although the earlier layers could learn to produce a representation that can be effectively analyzed by the MERA layer, one is left to wonder if the method could be made to exploit the spatial multi-scale structure that we know is actually present in image data. + +Another point is that although from a classical statistics perspective it would seem that reducing the number of parameters should be generally beneficial, it has been observed many times that in deep learning, highly overparameterized models are easier to optimize and do not necessarily overfit. Thus at this point it is not clear whether starting with a highly constrained parameterization would allow us to obtain state of the art accuracy levels, or whether it is better to start with an overparameterized model and gradually constrain it or perform a post-training compression step. + + +Experiments +In the introduction it is claimed that the method of Liu et al. cannot capture correlations on different length scales because it lacks disentanglers. Although this may be theoretically correct, the paper does not experimentally verify that the proposed factorization with disentanglers outperforms a similar approach without disentanglers. In my opinion this is a critical omission, because the addition of disentanglers seems to be the main or perhaps only difference to previous work. + +The experiments show that MERA can drastically reduce the number of parameters of fully connected layers with only a modest drop in accuracy, for a particular ConvNet trained on CIFAR10. Unfortunately this ConvNet is far from state of the art, so it is not clear if the method would also work for better architectures. Furthermore, training deep nets can be tricky, and so the poor performance makes it impossible to tell if the baseline is (unintentionally) crippled. + +Comparing MERA-2 to TT-3 or MERA-3 to TT-5 (which have an approximately equal number of parameters), the difference in accuracy appears to be less than 1 percentage point. Since only a handful of specific MERA / TT architectures were compared on a single dataset, it is not at all clear that we can expect MERA to outperform TT in many situations. In fact, it is not even clear that the small difference observed is stable under random retraining. + + +Summary +An interesting paper with novel theoretical ideas, but insufficient experimental validation. 
Some expository issues need to be fixed.",5,4.0,ICLR2018 +Syee2fmjYH,1,HJgfDREKDB,HJgfDREKDB,Official Blind Review #3,"This paper presents a method for single image 3D reconstruction. It is inspired by implicit shape models, like presented in Park et al. and Mescheder et al., that given a latent code project 3D positions to signed distance, or occupancy values, respectively. However, instead of a latent vector, the proposed method directly outputs the network parameters of a second (mapping) network that displaces 3D points from a given canonical object, i.e., a unit sphere. As the second network maps 3D points to 3D points it is composable, which can be used to interpolate between different shapes. Evaluations are conducted on the standard ShapeNet dataset and the yields results close to the state-of-the-art, but using significantly less parameters. + +Overall, I am in favour of accepting this paper given some clarifications and improving the evaluations. + +The core contribution of the paper is to estimate the network parameters conditioned on the input (i.e., the RGB image). As noted in the related work section this is not a completely new idea (cf. Schmidhuber, Ha et al.). There are a few more references that had similar ideas and might be worth adding: Brabandere et al. ""Dynamic Filter Networks"", Klein et al. ""A dynamic convolutional layer for short range weather prediction"", Riegler et al. ""Conditioned regression models for non-blind single image super-resolution"", and maybe newer works along the line of Su et al. ""Pixel-Adaptive Convolutional Neural Networks"". + +The input 3D points are sampled from a unit sphere. Does this imply any topological constraints? Is this the most suitable shape to sample from? How do you draw samples from the sphere (Similarly, how are the points sampled for the training objects)? What happens if you instead densely sample from a 3D box (similar to the implicit shape models)? + +On page 4 the mapping network is described as a function that maps c-dimensional points to 3D points. What is c? Isn't it always 3, or how else is it possible to composite the mapping network? + +Regarding the main evaluation: The paper follows the ""standard"" protocol on ShapeNet. Recently, Tatarchenko et al. showed in ""What Do Single-view 3D Reconstruction Networks Learn?"" shortcomings of this evaluation scheme and proposed alternatives. It would be great if this paper could follow those recommendations to get better insights in the results. +Further, I could not find what k was set to in the evaluation of Tab. 1. It did also not match any numbers in Tab. 4 of the appendix. Tab. 4 shows to some extend the influence of k, but I would like to see a more extensive evaluation. How does performance change for larger k, and what happens if k is larger at testing then on at training, etc.? + +Things to improve the paper that did not impact the score: +- The tables will look a lot nicer if booktab is used in LaTeX +",6,,ICLR2020 +H1_DwlWEl,1,B1ewdt9xe,B1ewdt9xe,"Good paper, nice example of using the idea of feeding forward error signals.","Paper Summary +This paper proposes an unsupervised learning model in which the network +predicts what its state would look like at the next time step (at input layer +and potentially other layers). When these states are observed, an error signal +is computed by comparing the predictions and the observations. This error +signal is fed back into the model. 
The authors show that this model is able to +make good predictions on a toy dataset of rotating 3D faces as well as on +natural videos. They also show that these features help perform supervised +tasks. + +Strengths +- The model is an interesting embodiment of the idea of predictive coding + implemented using a end-to-end backpropable recurrent neural network architecture. +- The idea of feeding forward an error signal is perhaps not used as widely as it could + be, and this work shows a compelling example of using it. +- Strong empirical results and relevant comparisons show that the model works well. +- The authors present a detailed ablative analysis of the proposed model. + +Weaknesses +- The model (esp. in Fig 1) is presented as a generalized predictive model + where next step predictions are made at each layer. However, as discovered by +running the experiments, only the predictions at the input layer are the ones +that actually matter and the optimal choice seems to be to turn off the error +signal from the higher layers. While the authors intend to address this in future +work, I think this point merits some more discussion in the current work, given +the way this model is presented. +- The network currently lacks stochasticity and does not model the future as a + multimodal distribution (However, this is mentioned as potential future work). + +Quality +The experiments are well-designed and a detailed analysis is provided +in the appendix. + +Clarity +The paper is well-written and easy to follow. + +Originality +Some deep models have previously been proposed that use predictive coding. +However, the proposed model is most probably novel in the way it feds back the +error signal and implements the entire model as a single differentiable +network. + +Significance +This paper will be of wide interest to the growing set of researchers working +in unsupervised learning of time series. This helps draw attention to +predictive coding as an important learning paradigm. + +Overall +Good paper with detailed and well-designed experiments. The idea of feeding +forward the error signal is not being used as much as it could be in our +community. This work helps to draw the community's attention to this idea.",8,5.0,ICLR2017 +yZJpW-ET_fV,3,IpsTSvfIB6,IpsTSvfIB6,"In this paper the authors propose an optimization formulation for the decomposition of doubly stochastic matrices in terms of permutation matrices, where the cost function is differentiable. The optimization is carried out in the optimization over manifolds setting, using a recent result that gives a Riemannian structure to the set of doubly stochastic matrices.","In this paper the authors propose an optimization formulation for the decomposition of doubly stochastic matrices in terms of permutation matrices, where the cost function is differentiable. The optimization is carried out in the optimization over manifolds setting, using a recent result that gives a Riemannian structure to the set of doubly stochastic matrices. + +The writing of the paper is correct in general, although there are some points commented below. + +The term ""differentiable"" in this setting is somehow confusing. I would add ""differentiable cost function"" or something that leaves no room for misunderstandings. I was reading the paper thinking that I would find something different regarding the differentiable part. + +The proposed formulation and optimization method seem correct, and to me it is a promising starting point. 
From here, I would expect to see some theoretical results regarding convergence (at least partially, in the spirit of [Z]), or applications where its full potential can be seen. + +Section 4, which is an application of the proposed method, constitutes a great part of the paper (maybe more than half of it). Of course, the application of a given method is extremely important, but in this case it shifts the focus of the paper. Besides that, the writing of this large section seems to be quite different from the previous parts. While the first part was relatively well written and easy to follow, this half is rough. + +To me, the interesting part of the paper is the method, and it falls short, given the comments above. + +I would usually write this under ""minor comments"", but in this case it shows a lack of care and attention. +I appreciate the notation paragraph, but it seems to correspond to another paper. The notation for cardinality is never used as such, but the same notation is used for the absolute value several times. The upper index in $X^i$ is never defined (it could index a set of matrices, or indicate rows/columns). The Hadamard operations and pseudo-inverse are never used. On the other hand, in Algorithm 1 there are undefined operators, such as $\Pi$ and $R_X$ (which I assume are projection and retraction). + +Minor comments: + - Please add punctuation to the equations, since they are part of the text (in Definition 1 for instance) + - The last sentence in page 2 is weird. + - In the last sentence of the first paragraph, Section 3.1, there is an extra ""f"" (of f O(n^4) ) + - typo: geootp -> geoopt + - The bibliography is not consistent. Some authors with full name, some with initials, some in all caps + + +[Z] Zavlanos, M. M., & Pappas, G. J. (2008). A dynamical systems approach to weighted graph matching. Automatica, 44(11), 2817-2824.",5,3.0,ICLR2021 +J0-RYnWo04H,1,UwOMufsTqCy,UwOMufsTqCy,"Multi-layer rule learner with missing discussions on scalability, interpretability and implementation details","The authors propose a classifier consisting of multiple layers. The inner layers construct rules in conjunctive normal form. The last layer is used to assign weights to the constructed rules. To train the overall model and obtain discrete solutions, the authors use a simple rounding mechanism leading to a method that they call gradient grafting. The paper concludes with a computational study, where the authors report the classification performance as well as the model complexity on a set of problems. + +This is an interesting paper proposing a new method to tackle the trade-off between interpretability and accuracy. Indeed, rule-based learners are considered to be more interpretable. The authors also claim that the proposed rule-learner is also scalable. Overall, introduction of a scalable, interpretable and accurate method could have been considered as a big achievement. However, I have several questions and comments about this achievement as I list below: + +- Are the results given for test set? If so, what is the train-test percentage? + +- What are the computation times? Without these figures, it is hard to assess the scalability of the proposed method. + +- The numbers of edges that you report range from 50 to 1000. These values seem quite large. How does this affect the interpretability? + +- In Figure 4, only five clear rules are reported. What is the distribution of the weights of the remaining rules? 
+ +- How do you decide various design parameters; such as, number of layers (n_l), number of bins for feature discretization (k), number of layers (L)? + +- As you need a binarization layer to divide the continuous features into bins, can't we just say that the method works only with discrete features? This is how it would be presented by other rule-learning methods. + +- I guess Section 3.2 is superficial as it is straightforward to split a continuous feature into bins. Am I missing something here? + +- How do your results compare against the results obtained with other rule/tree learning methods based on (integer) linear optimization? Some names from that field are Bertsimas, Rudin, Gunluk. + +- Can you guarantee that the resulting set of rules covers the entire feature space? In other words, is it possible that a test sample is not classified with the output set of rules?",6,5.0,ICLR2021 +h4g2A3zgEa2,1,KCzRX9N8BIH,KCzRX9N8BIH,An interesting idea but writing and presentation should be improved.,"# Summary: +The paper proposes the use of complete parametrized likelihoods for providing supervision in place of the commonly used loss functions. The normal distribution, the categorical distribution defined by softmax and the likelihood of the robust rho-estimator are considered. The main idea is that by including the parameters of these likelihoods and optimizing over them, one can increase robustness, detect outliers and achieve re-calibration. In addition, by considering parametric priors and tuning their parameters one can obtain more flexible regularizers over the trainable parameters of a model. + +# Strengths: +The idea of the paper is quite interesting as it lifts some commonly made and often overlooked assumptions regarding the data distribution. By lifting these assumptions one can improve the performance of the trained model by considering likelihoods that better capture the data distribution. For example, data is usually affected by heteroskedasticity as well as outliers, and if the likelihood considered in its full form covers these aspects the resulting models will be better calibrated. +The proposed methods consider different aspects of conditioning and dimensionality of the likelihoods employed, varying from global to data-specific modeling. + +The use of likelihoods instead of common loss functions leads to competitive new methods and variants for robust modeling, outlier detection, adaptive regularization and model re-calibration. + + +# Weaknesses: +Although the use of likelihoods instead of loss functions is not a common practice in deep learning, its advantages have been thoroughly studied in statistics, econometrics and other disciplines, as also discussed in the related work of the paper. Hence, the novelty mainly lies in the application of these ideas in deep learning and the employment of some likelihoods better suited for the respective problems (i.e. softmax and rho-estimators). + +The paper is interesting however I found it somewhat difficult to read. In my view it tries to pack many different aspects and applications of the main idea (use of likelihood) into a very limited space. In fact, there are too many cross-references to the supplemental material, to the point that it seems that most of the paper is described in the supplemental material. +On a similar note, due to the fact that four different application domains are considered, there numerous methods, metrics and datasets involved in each one of them which are not sufficiently covered in the text. 
Additionally, many of the proposed methods/improvements/variants on each domain are not explained in sufficient detail (e.g. AE+S and PCA+S in Sec. 5.2). I would expect some more principled and thorough guidance on how to use the likelihood functions and, regarding the conditioning and dimensionality, strategies on how to choose among the various options. + +Also some editing is required, for example the likelihood of the softmax is not provided as the respective sentence after eq. 4 is suddenly interrupted (see also the comments below). + +## Minor comments +* the text in the figure is very small, making it very difficult to read in typical zoom levels (~100%) +* Figure 3: the text does not correspond to the figure for the intermediate case +* Figure 4, caption: include reference to left, middle and right panel +* Table 1: there is no reference of this table in the text. Also, the three dots should be replaced with the actual setting. + + +# Rating Justification: +I think that the overall idea of the paper is interesting and provides improved data modeling which leads to important advantages of the estimated models. However, possibly due to space limitations, the paper does not explain in sufficient detail important aspect of applying the proposed idea in the domains considered. + +# Rating and comments after the rebuttal +I think that in the revised version the paper has addressed many of the weaknesses pointed out in our reviews, hence I increase my rating to 6. Nevertheless, the paper still packs too much information which makes it difficult to read and appreciate. +Regarding novelty, although I agree with other reviews that the core idea is not novel I think that it is important that the paper stresses the applicability and usefulness of considering likelihoods in deep learning models, as it appears to be not fully appreciated currently. +Overall, I think that the paper would shine as a journal paper while it is only a borderline submission in its current form.",6,4.0,ICLR2021 +uNiWR-V7oXo,2,tv8n52XbO4p,tv8n52XbO4p,Thorough Multi-attack Robustness Evaluation and Clever Adversarial Training ,"This paper addresses a timely issue in adversarial robustness - efficient training of robust models against multiple adversarial perturbations. The authors propose a combination of three techniques: stochastic adversarial training (SAT), meta noise generator (MNG), and adversarial consistency (AC) loss for efficient training, and evaluate the robustness using multiple L1, L2, and Linf norm-bounded attacks and three datasets (CIFAR-10, SVHN, and Tiny Imagenet). The results show improved multi-attack robustness over several baselines (including single-attack and multiple-attack models) and reduced training time. Ablation studies are also performed to illustrate the utility of each component of the proposed model. Overall, this paper provides very detailed evaluations involving multiple datasets, attacks, baselines, and robustness metrics. I find the results convincing and important, and also find sufficient novelty in the proposed training method. + +The strengths (S) and weaknesses (W) of this submission are summarized below. + +S1. The proposal of MNG and AC is effective and novel. +S2. The evaluation is thorough and convincing. +S3. The proposal improves both robustness and training efficiency in most cases. + +W1. The adversarial consistency (AC) loss is never defined explicitly. 
Based on equation (5), it is hard to understand how AC ""represents the Jensen-Shannon Divergence (JSD) among the posterior distributions"" when considering three distributions, P_clean, P_adv, and P_aug. More clarification is needed. + +W2. Although the results show improved multi-attack robustness, it will be great if the authors can add more intuition on why the proposed training method leads to performance improvement. Based on the ablation study, it seems that the role of SAT and MNG is to reduce overfitting in robustness to encourage generalization, rather than optimization over the worst-case scenarios. + +W3. The considered multi-attack setting is still limited to different Lp norm perturbation constraints. Although the authors showed improved robustness over unforeseen attacks, the authors should also discuss how the proposed method can generalize to different attacks beyond Lp norms. + + +",6,5.0,ICLR2021 +rygDVQg2FS,1,B1gXR3NtwS,B1gXR3NtwS,Official Blind Review #2,"This paper proposed deep Bayesian structure networks (DBSN) to model weights, \alpha, of the redundant operations in cell-based differentiable NAS. The authors claim that DBSN can achieve better performance (accuracy) than the state of the art. + +One of my concerns is the Bayesian formulation introduced in Eq. (4) seems problematic. It is not clear what priors are placed on alpha. In the case of Bayes by BP (BBB), which is cited as Blundell et al. 2015 in the paper, a Gaussian prior (with zero mean) is used. Therefore there is a KL term between the variational distribution q(w) and the prior distribution p(w) to regularize q(w). In DBSN, q(\alpha) is parameterized by \theta and \epsilon, and so is p(\alpha), meaning that the KL term is effectively zero. This is very different from what is done in BBB. + +The second major concern is on the experiments. (1) The authors use DARTS as a main baseline and show that DBSN significantly outperforms DARTS. However, looking at the DARTS paper, the test error on CIFAR-10 is around 3% for both the first-order and second-order versions. The test error in Table 1 is around 9%, which is a lot lower. I notice that the DARTS paper has a parameter number of 3.3M, while in the current paper it set to 1M. Given that DARTS is the main baseline method and the same dataset (CIFAR-10) is used, it would make much more sense to use exactly the same architecture for comparison. The current results is hardly convincing. (2) Besides, note that in the DARTS paper, DenseNet-BC has test error of 3.46%, much higher than DARTS (~3%). In Table 2 of this paper however, DARTS is significantly worse than DenseNet-BC (8.91% versus 4.51%). These results are highly inconsistent with previous work. + +As mentioned in the paper, Dikov & Bayer 2019 has a very similar idea to perform NAS from a Bayesian perspective. It would be best (and would definitely make the paper stronger) to include some comparison. Even if Dikov & Bayer 2019 is not very scalable, it is at least possible to compare them in smaller network size. Otherwise it is hard to evaluate the contribution of DBSN given this highly similar work. + +The authors mentioned in the introduction that DBSN ‘yields more diverse prediction’ and therefore brings more calibrated uncertainty comparing to ensembling different architectures. This is not verified in the experiment section. Table 3 only reports the ECE for one instance of trained networks. 
For example, it would be interesting to sample different architecture from the alpha learned in DARTS and DBSN, train several networks, ensemble them, and use the variance of the ensemble to compute ECE. This would verify the claim mentioned above. + +Do you retrain the network from scratch after the architecture search (which is done in DARTS) for DARTS and DBSN? + +I am not convinced by the claim that BNN usually achieve compromising performance. Essentially, BNN, if trained well, is a generalization of deterministic NN. If very flat priors and highly confident variational distributions are used, BNN essentially reduces to deterministic NN. + +Missing references on Bayesian deep learning and BNN: + +Bayesian Dark Knowledge +Towards Bayesian Deep Learning: A Survey +Natural-Parameter Networks: A Class of Probabilistic Neural Networks",3,,ICLR2020 +SJe4yM7lcr,3,rkgyS0VFvr,rkgyS0VFvr,Official Blind Review #3,"This paper proposes a distributed backdoor attack strategy, framed differently from the previous two main approches (1) the centralised backdoor approach and (2) the (less discussed in the paper) distributed fault tolerance approach (often named ""Byzantine""). + +The authors show through experiments how their attack is more persistent than centralised backdoor attack. +The authors also compare two aggregation rules for federated learning schemes, (Fung et al 2018 & Pillutla et al 2019), suggesting that both rules are bypassed by the proposed distributed backdoor attack. + +Strength: + +what I found most interesting in the paper is Section 3.4, presenting an appreciable attempt to ""interpret"" poisoning. Together with Section 4. +This kind of fine-grained analysis of poisoning is highly needed. + +Weakness: + +in section 3.3, the authors compare against RFA and take what is claimed in Pillulata et al as granted (that RFA detects more nuanced outliers than the wort-case of the Byzantine setting (Blanchard et al 2017) ). In fact, there is more to the Byzantine setting than that, see e.g. Draco (Chen et al 2018 SysML), Bulyan (El Mhamdi et al 2018 ICML) and SignSGD (Bernstein et al 2019 ICLR) which have proposed more sophisticated approches to distributed robustness. +Since this paper is about distributed robustness and distributed attacks, it would be very informative to the community to illustrate DBA attack on these methods to have a more compelling message. + +post rebuttal: thank your for your detailed reply, I acknowledge your new comparisons with the distributed robustness mechanisms of Krum and Bulyan, too bad time was short to compare with the other measures such as Draco and SignSGD.",6,,ICLR2020 +rJeplPkCtB,1,S1gyl6Vtvr,S1gyl6Vtvr,Official Blind Review #3,"This paper proposes a framework for training time filter pruning for convolutional neural networks. The main idea is to use a trainable soft binary mask to zero out convolutional filters and corresponding batch norm parameters. + +Pros: ++ The proposed method seems relatively easy to implement. ++ The quantitative results in the paper indicate that MaskConvNet achieves performance competitive with previously proposed pruning methods. + +Cons: +- Writing of the paper could be significantly improved. See some examples below. +- The main thing that bothered me about the method was the usage of hard sigmoid. If a mask component ever gets into one of the flat regions it won’t be able to escape. The authors propose a workaround which they call “mask decay update”. 
This approach looks quite hacky and I’m not sure how easy it is to make it work in practice. + +Notes/questions: +* Abstract: “elegant support” -> “support” +* Everywhere in the text: Back-to-back citations should have the form (citation1; citation2; …) +* Section 1, paragraph 3: “suffer one” -> “suffer from one” +* Section 1, paragraph 4: “above mentioned” -> “above-mentioned” +* Figure 1: The figure would greatly benefit from a detailed description. What’s IF, OF and OF? The reader shouldn’t be guessing. +* Section 2, paragraph 3: “corresponded” -> “corresponding” +* Section 3.1, paragraph 2: “W \in R” – W is probably not a scalar value therefore it’s in R^n. The same goes for the mask. +* Section 3.1, paragraph 2: “It’s easy to know ...” – this sentence needs to be rewritten, e.g., “One can see that …” +* Section 3.1, paragraph 2: “sparser” -> “more sparse” +* Section 3.2, “Extension to Multi-metrics”: “FLOPs” are never defined in the paper. How is this quantity computed exactly? I’m also not entirely sure how useful it is to introduce multiple lambdas – it seems that this case corresponds to a single lambda which is a sum \lambda_i. +* Section 3.3, paragraph 1: “undersparsed”, “oversparsed” – not sure if these words exist. Maybe rephrase instead of introducing new terms? +* Section 3.3, paragraph 1: “very laborious” -> “laborious” +* Figure 3: Why not show points all the way to 0 sparsity? +* Section 4.2, CIFAR-10: The authors mention that (Lemaire et al., 19) achieve better FLOP sparsity due to usage of Knowledge Distillation. From this sentence alone it’s not clear how exactly KD helps. Why can’t KD be applied in the proposed framework? I’d appreciate if the authors could elaborate on this. + +I must admit that I’m not an expert in the field of NN pruning but I’m surprised that training-time masking of filters has not been tried before. Even if it’s really the case I’m not entirely confident that the paper should be accepted: the “unsparsification” looks more like a hack than a principled approach and the overall quality of writing needs to be improved. I’m giving a borderline score for now but I’m willing to increase it provided that the rebuttal addresses my concerns.",3,,ICLR2020 +S1eoNpUQaQ,4,BylBfnRqFm,BylBfnRqFm,"incremental idea, weak experimental evidence","Summary +CAML is a gradient-based meta-learning method closely related to MAML. It divides model parameters into disjoint sets of task-specific parameters $\phi$ which are adapted to each task and task-independent parameters $\theta$ with are meta-learned across tasks. $\phi$ are then interpreted as an embedding and fed as input to the model (parameterized by $\theta$). Experiments demonstrate that this approach performs on par with MAML while adapting far fewer parameters. An additional benefit is that this approach is less sensitive to the adaptation learning rate and is easier to implement and faster to compute. + +Strengths +While not really explained in the paper, this work connects gradient-based to embedding-based meta-learning approaches. Adaptation is via gradient descent, but the adapted parameters are then re-interpreted as an embedding. +The method has the potential to perform on par with MAML while being simpler and faster. +The paper is well-written. + +Weaknesses +The field of meta-learning variants is crowded, and this paper struggles to carve out its novelty. +Rusu et al (LEO) optimize a context vector, which is used to generate model parameters. 
Reducing the generative model to a point estimate, how is this different from generating the FiLM parameters as a function of context as done in CAML? +Lee and Choi (MT-nets) propose a general formulation for learning which model parameters to adapt. CAML is simpler in that the model parameters to adapt are chosen beforehand to be inputs. +Snell et al. / Oreshkin et al. are prototype-based methods infer context via a neural network rather than optimizing for it. + +In this context, CAML appears to be yet another point drawn from the convex hull of choices already explored in episodic meta-learning (these choices can be broadly grouped into task encoding and conditional inference). The paper must then rest on its experimental results, which are at present unconvincing. + +On the whole, the experimental results seem weak and analysis results largely uninformative. The method is benchmarked on the toy tasks of sinusoid regression and a 2-D point mass, as well as mini-ImageNet few-shot classification. The sinusoid and point mass navigation are toy and compared only to MAML, so it is hard to draw conclusions from those experiments. For mini-ImageNet, while CAML outperforms MAML, it seems that the pertinent comparison is with MT-NET (which CAML does not outperform) and LEO (missing fair comparison?). + +Questions regarding experiments + - CAML is robust to the adaptation learning rate, but isn’t this true of any scheme that separates meta-learned and adapted parameters into disjoint sets? (e.g. also true of Lee and Choi?) + - The visualizations of the context parameters are nice, but interpreting much higher dimensional context vectors (which would be necessary for harder tasks) is more difficult, so I’m not sure what to take away from this? It’s very unsurprising that the 2-D context vector encodes x and y position in the point mass experiment, for example. + - I am confused by the comparison between adapting input parameters versus subsets of nodes at each layer or entire layers for the sinusoid regression task. Adapting subsets of nodes at each layer roughly corresponds to Lee and Choi, yet the reported numbers are quite different? + - In Table 3, which CAML is a fair comparison (in terms of network size and architecture) to MT-NET? + +Editorial Notes +Intro paragraph 3: fine-tuning image classification features for a semantic segmentation task is not a good example of task independent parameters, since fine-tuning end-to-end gives significant improvements. +Related work paragraph 2: Initializing context parameters to zero is not the only difference with Rei et al (2015), and seems a strange thing to highlight? +Tables 1 and 2: state what the task is in the caption +",4,5.0,ICLR2019 +ryx9MuhU6Q,3,r1g4E3C9t7,r1g4E3C9t7,Interesting findings but hard to fully understand the experiments.,"This paper proposed a study on audio adversarial examples and conclude the input transformation-based defenses do not work very well on the audio domain, especially for adaptive attacks. They also point out the importance of temporal dependency in designing defenses which is specific for the audio domain. This observation is very interesting and inspiring as temporal dependency is an important character that should be paid attention to in the field of audio adversarial examples. They also design some adaptive attacks to the defense based on temporal dependency but either fail to attack the system or can be detected by the defense. 
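To make sure I follow the temporal-dependency defense, the consistency check I have in mind is roughly the sketch below. This is my own pseudocode, not the authors' implementation: asr() stands for whatever recognizer is being defended, k is the fraction of the waveform used as the prefix, and the threshold tau is a placeholder value I made up.

```python
def edit_distance(a, b):
    # Plain Levenshtein distance, used here as a cheap character error measure.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[len(b)]

def td_consistency_score(audio, asr, k=0.5):
    # Transcribe the first k-portion of the waveform on its own, then compare it
    # with the beginning of the transcription of the whole waveform.
    n = int(len(audio) * k)
    prefix_text = asr(audio[:n])
    head_of_full = asr(audio)[:len(prefix_text)]
    return edit_distance(prefix_text, head_of_full) / max(len(prefix_text), 1)

def flag_adversarial(audio, asr, k=0.5, tau=0.3):
    # Benign speech should be temporally consistent (low score);
    # adversarial perturbations tend to break this consistency.
    return td_consistency_score(audio, asr, k) > tau
```

If this reading is correct, it also explains why the choice of k matters so much in the adaptive-attack experiments.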
+ +Based on the results in Table S7, it seems like being aware of the parameter k when designing attacks are very helpful for reducing the AUC score. My question is if the attacker uses the random sample K_A to generate the adversarial examples, then how the performance would be. Another limitation of this work is that the proposed defense can differentiate the adversarial examples to some extent, but the ASR is not able to make a right prediction for adversarial examples. In addition, the writing of Section 4 is not very clear and easy to follow. + +In all, this paper proposed some interesting findings and point out a very important direction for audio adversarial examples. If the author can improve the writing in experiments and answer the above questions, I would support for the acceptance. + + +",6,3.0,ICLR2019 +H1gxFLNtYS,1,HkenPn4KPH,HkenPn4KPH,Official Blind Review #2,"Summary: +There are three main contributions of the paper: +i) The authors present an empirical study of different self-supervised learning (SSL) methods in the context of self-supervised learning. +ii) They point out how SSL helps more when the dataset is harder. +iii) They point out how domain matters while using SSL for training and present a method to choose samples from an unlabeled dataset. + +Strengths: +1) They confirm the results of [7] and provide additional evidence of the benefit of the self-supervised learning in the few-shot setting. They also showcase an interesting new result that self-supervised learning helps more in case of harder problems. +2) The authors have a done a commendable job of coming up with a meaningful set of experiments by varying base-models, self-supervised methods, datasets, and few-shot learning methods in Section 4.1. This is quite a comprehensive study. +3) The paper is well-written and well-motivated. + +Weaknesses: +1) The main weakness of the paper is Section 4.2's experimental setup. + +i) The definition of domain distance in not quite meaningful. Since the chosen datasets have very different classes (airplanes vs dogs etc), the average embeddings for different datasets/classes will be far from each other. In Figure 4d, it is misleading to show a trendline that includes the same domain as that will always be 0. If that datapoint is removed the trend line is mostly flat. The authors want to present a quantifiable way to show how domain distribution affect performance on self-supervised learning methods. But this definition of domain distance is more meaningful in the domain adaptation setting (like Amazon-Office dataset used for domain adaptation (https://people.eecs.berkeley.edu/~jhoffman/domainadapt/)) as in that case the requirement is to get the embeddings close to each other for the same class but from different domains. + +ii) The authors go on to create an ""unlabeled pool"" by combining images from many domains. Then they train a domain classifier by labeling in-domain images as positive and labeling the images in pool as negative. Considering all images in the pool as negative is not correct as there can be images of same class in unlabeled pool. + +iii) Then they choose the samples which the classifier predicts comes from the same domain. This probably succeeds as it is done on top of ResNet-101 features (that has seen all of ImageNet). This technique probably works possibly due to the ResNet being pre-trained with so many labeled classes. 
One baseline might have just been to choose the k-nearest neighbors (from unlabled pool) to the average embedding of all the images of the chosen dataset. Since, it is quite easy for classifiers to detect from which dataset an image came from [6] it might be easy to choose an image from similar domain by using nearest neighbor in the embedding space. + +I am not sure how well their heuristic will work for an unlabeled pool that the classifier has never seen. Additionally, a portion of the dataset has been created by combining existing datasets. Since statistics of different datasets vary a lot artificially, the creation of an unlabeled pool by combining different datasets might work in the favor of the proposed heuristic of looking at ""domain distance"" to choose the samples. + +2) Effect of domain shift in SSL (Figure 4b) is studying an extreme case where the domain shift results in almost no common classes in the SSL training phase. It would make more sense to include datasets where at least some of the classes are shared so that self-supervised learning methods get to see some relevant classes. In practice, a self-supervised learning method would be applied on a large unlabeled pool of images. Hopefully with increasing diversity and number of images, there might be some images on which ding SSL helps the downstream few-shot task. Hence comparison with such a dataset like ImageNet/iNatrualist is important here. + +Decision: +The paper presents an well thought-out empirical study on self-supervised learning for few-shot learning. But there are major concerns with the empirical setup and methods presented in the section where they propose a method to choose samples for SSL from an unlabeled pool of images. + + +Minor Comments: +1) “In contrast, we humans can quickly learn new concepts from limited training data” - Remove we. +2) “Despite recent advances, these techniques have only been applied to a few domains (e.g., entry-level classes on internet imagery), and under the assumption that large amounts of unlabeled images are available.” This is not true. Self-supervised learning methods have been used for continuous control in reinforcement learning[1, 2], cross-modal learning[3], navigation [5], action recognition[4] etc. Later on in related work the authors list many papers that use self-supervised learning in different contexts. This line should be modified to reflect how common self-supervised learning methods are in other fields as well and with less data. +3) “making the rotation task too hard or too trivial to benefit main task” - add respectively. +4) ""With random sampling, the extra unlabeled data often hurts the performance, while those sampled using the “domain weighs” improves performance on most datasets."" replace weighs with weights +5) [8] points out how self-supervised learning on a large dataset but with a domain shift (YFCC100M) is not as effective for pre-training as it is doing self-supervised learning on the downstream task's dataset (ImageNet). While it is a different setting it is inline with one of the main conclusions of the paper. + +References: +[1] “PVEs: Position-Velocity Encoders for Unsupervised Learning of Structured State Representations” Rico Jonschkowski, Roland Hafner, Jonathan Scholz, and Martin Riedmiller. 
+[2] “Learning Actionable Representations from Visual Observations” Debidatta Dwibedi, Jonathan Tompson, Corey Lynch, Pierre Sermanet +[3] “Look, Listen and Learn” Relja Arandjelović, Andrew Zisserman +[4] “Self-supervised Spatiotemporal Learning via Video Clip Order Prediction” Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, Yueting Zhuang. +[5] “Scaling and Benchmarking Self-Supervised Visual Representation Learning” Priya Goyal, Dhruv Mahajan, Abhinav Gupta, Ishan Misra +[6] ""Unbiased Look at Dataset Bias"" Antonio Torralba and Alexei A. Efros. +[7] ""Boosting ´few-shot visual learning with self-supervision."" Spyros Gidaris, Andrei Bursuc, Nikos Komodakis, Patrick Perez, and Matthieu Cord. +[8] ""Deep clustering for unsupervised learning of visual features"" Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze +",3,,ICLR2020 +f-ovP7fU1O7,2,hPWj1qduVw8,hPWj1qduVw8,Moderate contribution but lacking some details concerning the implementation,"The paper studies the problem of video-grounded multi-turn QA and adopts reasoning paths to exploit dialogue information. + +Sequential: fail to exploit long turn dependencies +Graphical: fixed structure, fail to factor temporal dependencies +The proposed reasoning path method: balanced between sequential and graphical + +It first constructs a turn-level semantic graph based on overlapping lexical span: +- Extract lexical spans from each turn ( pair) using a (Stanford) parser +- Two turns are connected if one of their corresponding lexical spans are similar (in terms of word2vec embedding). + +Then it trains a path generator to predict paths from each turn to its preceding turns: +- It starts from the current turn and auto-regressively finds the most dependent preceding turn with Transformers +- The turn-level semantic graph is used to mask the dependencies. +- It is trained with supervised loss where the target paths are constructed by running BFS on the semantic graph + +Finally, the proposed paths are used to employ multimodal reasoning: +- Visual features are combined with turn level attention +- Multi-model turn-level embeddings are propagated using GCN +- Then use SOTA decoder to generate language response + +The author conducts experiments on a benchmark, and the proposed method achieves better QA performance than SOTA without a pre-trained language model and achieves comparable performance when the pre-trained language model is involved. + +The author further studies different variations of graph structures and show that using graphs constructed based on lexical spans is better than fully connected graphs or graphs based on whole sentence embedding. And it also shows that including bidirectional edges does not necessarily improve the performance. + +A nice feature of the method is that the generated reasoning path can serve as extra explanations for the answer. + +Some concerns: + +- The model is graph-based thus is restricted to scenarios with a small number of turns, and becomes computationally expensive for long-conversation scenarios. +- Need a more detailed explanation of how the message passing part (section 3.4) is trained. +- Each pair of turns may share multiple pairs of lexical spans that are identical, e.g. Figure 3-A, “she” in turn 10, but there are 2 “she”s in turn 9. Does the frequency influence the similarity? +- It would be more convincing if it gives an analysis of failure cases. +- Section 3.3: Eq.(4), what is the initial $D_0$ correspond to $Z_0$? 
- The reasoning path generator uses $C_t$ as input; does it include $A_t$ during inference?

Minor concerns:

- Many symbols are used before their definition:
 > The explanation of $\mathcal{V}$ is first given in Algorithm 1 (section 3.2), but the symbol is first used in Eq. 1 (section 3.1).
 > Section 3.3, 2nd paragraph, 4th line: undefined symbols $\hat{r}_1,\dots,\hat{r}_{m-1}$. They are later mentioned as turn indices in section 3.4, last line of page 5.
- Page 5, line 2: ""incorporate"" —> ""incorporates"".
- Index m is used as a word position in Eq.(1) but becomes a decoding step from section 3.3.",6,5.0,ICLR2021
SJelH-XZcH,2,SkgODpVFDr,SkgODpVFDr,Official Blind Review #1,"[Overview]

In this paper, the authors proposed a shuffle strategy for convolution layers in convolutional neural networks (CNNs). Specifically, the authors argued that the receptive field (RF) of each convolutional filter should not be constrained to a small local patch. Instead, it should also cover other locations beyond the local patch and beyond a single channel. Based on this motivation, the authors proposed a spatial shuffling layer which is aimed at shuffling the original feature responses. In the experimental results, the authors evaluated the proposed ss convolutional layer on CIFAR-10 and ImageNet-1k and compared it with various baseline architectures. Besides, the authors further provided some ablation analysis and visualizations for the proposed ss convolutional layer.

[Pros]

1. The authors proposed a new strategy for convolutional layers. The idea is borrowed from the biological domain and then transformed into a spatial shuffling layer which can shuffle the feature responses at each convolutional layer.

2. The authors performed experiments on both a small-scale dataset (CIFAR-10) and a large-scale dataset (CIFAR-100) for evaluation.

3. The authors further added some ablation analysis of the proposed model. Specifically, the authors visualize the receptive field of the ss layer compared with the original convolutional layer, which indicates that the ss layer can incorporate global context from the very beginning.

[Cons]

1. The motivation behind the proposed ss layer is not explained very well. Though the authors mentioned that it is biologically inspired, I do not buy that, since it is still an unclear phenomenon, and even if it is true, randomized shuffling does not seem to align with those observations.

2. The paper is poorly written in general. The motivation behind the proposed method and the presentation of the method section are cluttered. In the model analysis in the experiment section, the presentation and explanations are also vague and not clear to me.

3. The proposed model seems to increase the baseline models' performance only very marginally on all architectures. It is hard to say that this is because the shuffling layer enables the neurons to incorporate global context information. Instead, it might just be because of the randomization, which would increase the generalization ability of the trained model.

4. Finally, the comparison with previous models, such as SENet, ShuffleNet, etc., is not systematic. I would like to see a more comprehensive summary of the differences between the proposed ss layer and other architectures, because all of them are trying to incorporate more contextual information from other channels or locations.

[Summary]

Overall, I think the proposed ss layer is still a reasonable way to incorporate contextual information in CNNs. 
However, the poor presentation and the weak experimental results and analysis place the paper below the bar of the venue overall. I would suggest the authors revise the paper with a better-motivated formulation and more solid experiments and analysis for the next submission.",3,,ICLR2020
HJfCu9Amg,1,BJjn-Yixl,BJjn-Yixl,The paper needs more improvements to be accepted,"This paper describes a method that estimates the similarity between a set of images by alternately attending to each image in a recurrent manner. The idea of the paper is interesting, as it mimics human behavior. However, there are several cons of the paper:

1. The paper is not well written. There are too many 'TODO' and 'CITE' placeholders in the final version of the paper, which indicates that the paper was submitted in a rush or that the authors did not take much care with it. I think the paper is not suitable to be published in its current version.

2. Missing experimental results. 
The paper mentioned the LFW dataset. However, the paper did not provide results on the LFW dataset. (At least I did not find them in the version of Dec. 13th.)

3. The experiments on the Omniglot dataset are not sufficient. I suggest that the paper provide some illustrations of how the model attends to two images (e.g., the trajectory of attention).",3,5.0,ICLR2017
9Z5s9QlxIBq,4,nXSDybDWV3,nXSDybDWV3,Review for Einstein VI,"In this paper, the authors developed a probabilistic programming framework for Stein variational gradient descent (SVGD) and its variants using different kinds of kernels, i.e. nonlinear kernels or matrix kernels. Simple experiments are included showing that the repository is effective and scalable for various problems.

The following are a few of my questions and comments:

1. How does the new implementation compare with other frameworks using black-box variational inference? For example, what is the training speed compared with previous frameworks such as Edward on large-scale dataset tasks? Also, the report does not give a thorough guide to the performance of each kernel on different tasks.

2. The authors mentioned that the framework can be extended to use other objective functions such as the Rényi ELBO, Tail-adaptive f-divergence, or Wasserstein pseudo-divergence. I am extremely confused about this part: since there is actually no objective function for SVGD-based methods (unless you design a new loss based on KSD or related quantities), how is it possible to combine other objective functions with SVGD? It would be great if the authors could write down the derivations and provide a detailed discussion.

3. Does the current framework implement amortized SVGD and other related Stein papers that can be utilized to train neural-network-based applications such as Stein-VAE, Stein-GAN or kernel Stein generative modeling [1, 2, 3]? This implementation can be important since it can be quite helpful for many other applications such as meta learning.

Also, the authors give the public code link of their implementation in the paper, which may expose their identity; I am not sure whether this violates the anonymity requirement of ICLR submissions.

[1] Feng, Yihao, Dilin Wang, and Qiang Liu. ""Learning to draw samples with amortized stein variational gradient descent."" arXiv preprint arXiv:1707.06626 (2017).

[2] Wang, Dilin, and Qiang Liu. 
""Learning to draw samples: With application to amortized mle for generative adversarial learning."" arXiv preprint arXiv:1611.01722 (2016). + +[3] Chang, Wei-Cheng, et al. ""Kernel Stein Generative Modeling."" arXiv preprint arXiv:2007.03074 (2020).",5,3.0,ICLR2021 +tzQPcaK895p,1,6BWY3yDdDi,6BWY3yDdDi,Simple yet effective approach for efficient negative sampling,"When the number of classes is very large, calculating softmax for classification (e.g., in backpropagation) is computationally costly. Approaches based on negative sampling have been used in literature to alleviate this problem. However, most of existing approaches are (argued to be) either inaccurate or computationally costly. This paper proposes to use the well-known LSH (locality sensitive hashing) method to address this problem. In particular, two variants, LSH label and LSH Embedding are showed to speed up the training in terms of time needed to converge compared with a number of baseline methods over three large scale datasets. + ++ The suggested approach is simple, making it potentially useful in practice. ++ The methodological contribution of the paper is more or less an off-the-shelf use of LSH for negative sampling. This being said, the application of LSH in this context is (seemingly) new, and the two basic similarity measures are interesting. ++ I would be cautious about calling the method ""A truly constant-time"" method. If we assume K and L are constant then yes; but theoretically in order to get good result we need to have larger values especially for L. Please elaborate on this. ++ The two theorems offered in the paper are the theorems from the LSH literature. Since a proxy is being used for similarity (e.g., label or embedding) these do not translate into the final classification result. The authors can perhaps elaborate on this and also give clues on how large K and L should be set depending on the parameters of the dataset, etc. ++ In intro we have ""We show that our technique is not only provable but ..""; it is not clear what is exactly provable, and how it relates to the classification result. ",5,4.0,ICLR2021 +N3Nq1xKQy3,1,VD_ozqvBy4W,VD_ozqvBy4W,"Useful method with promising results, but evaluation could be better","**Thanks to the authors for the response. The addition of a human study and CoCon+ has made the paper substantially stronger, as it resolves most of my concerns. The authors provided plausible explanations for the remaining questions. The paper should now be considered a clear accept.** + +The paper proposes a method for controlled text generation with pretrained (unconditional) language models. The method trains a relatively small network (CoCon) that is injected between any two successive layers of a pretrained Transformer language model. Given a prefix, CoCon is trained to output an input to the next layer such that the remainder of the text is generated at the output layer. CoCon is a function not only of the prefix but also of some desired 'content' sequence, which allows to control the content of the output text at inference time. Several auxiliary loss terms are employed to improve generalization. The model is evaluated on its ability to generate output with desired content, to generate text of a desired topic, and to generate text of specific sentiment. 
+ +Compared to previously proposed Plug and Play Language Models, the novelty of the CoCon method lies in its ability to condition on entire input sequences (instead of only bag-of-words), and the fact that it does not require style labels, which are both important properties. The paper is well written, the method is intuitive and all components are well motivated. The experimental section could be more thorough. For example, several aspects of the model are only evaluated qualitatively, and I don't find the examples very convincing. Moreover, some of the results are difficult to interprete or non-conclusive. The paper could benefit from a human evaluation. + +In the papers current state I would already slightly lean towards acceptance because the method itself will be useful to the community, and some of the results are promising. I am willing to strengthen my recommendation if my questions below are answered positively. + +* In the _CoCon Setup_ you report to split $x$ into $x^a$ and $x^b$ somewhere between the 8th and 12th BPE position. Why is this sufficient? Wouldn't we expect the model to perform poorly on prefixes that are not between 8 and 12 BPE tokens long? + +* Table 2 suggests that CoCon without the adversarial loss achieves the best performance, drastically improving on content similarity while retaining comparable text quality and diversity. This makes me wonder why the adversarial term was introduced in the first place, and why it is apparently used in the other two experiments. + +* Why is the perplexity of CoCon (and PPLM) consistently lower than the perplexity of the baseline LM GPT-2? Shouldn't we expect a trade-off between controllability and text quality? In the PPLM paper, the perplexity of the baseline is consistently (slightly) lower than that of PPLM. + +* Why does training on (human-written) Webtext instead of (machine-written) outputs of GPT-2 _decrease_ the text quality? Wouldn't we expect the opposite? + +* The above three questions lead me to believe that using perplexity on GPT-1 might not be a suitable metric to judge text quality in these scenarios. Could you please provide more arguments why you believe a human study is not needed here? + +Suggestions: + +The model extensions listed under 4.4 are very interesting. The paper would be even stronger if you had quantitative experiments for these, as the examples that are given are not very convincing. For example, couldn't you apply your model to (gradual) sentiment transfer by conditioning CoCon on the input text as well as the target sentiment (""is perfect""), weighted by $\tau_{content}$? Even if the results were not very good compared to the state-of-the-art in sentiment transfer, such an experiment could show off the versatility of CoCon compared to PPLM. Moreover, if PPLM and CoCon complement each other as you claim, why not add another row ""CoCon + PPLM"" to Table 3, 4, and 5? + +Minor suggestions: +* In the results section of 4.1, you say that L_adv marginally reduces CoCon's perplexity, but the table shows that _removing_ L_adv reduces it.",8,4.0,ICLR2021 +AIaCOKT9z13,4,lbc44k2jgnX,lbc44k2jgnX,"Review of ""Random Coordinate Langevin Monte Carlo""","This paper generalizes the Langevin within Gibbs sampler to be able to put different frequencies over different coordinates. The idea is cute. The result, however, is not convincingly better than the vanilla Langevin algorithm. 
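For the cost comparison below, the per-iteration update I have in mind is roughly the following sketch (my own code with a uniform coordinate distribution and a naive noise scaling; the paper's coordinate frequencies and step-size choices may well differ):

```python
import numpy as np

def rc_langevin_step(x, grad_i, probs, eta, rng):
    # Pick a single coordinate according to the per-coordinate frequencies,
    # then take a Langevin-type step along that coordinate only.
    i = rng.choice(x.shape[0], p=probs)
    x = x.copy()
    x[i] -= eta * grad_i(x, i)                        # one partial derivative
    x[i] += np.sqrt(2.0 * eta) * rng.standard_normal()
    return x

# Toy target: potential f(x) = 0.5 * x^T A x with one stiff direction.
A = np.diag([1.0, 10.0])
grad_i = lambda x, i: (A @ x)[i]
rng = np.random.default_rng(0)
x = np.ones(2)
for _ in range(10000):
    x = rc_langevin_step(x, grad_i, np.array([0.5, 0.5]), 0.01, rng)
```

Written this way, each iteration costs one partial derivative, which is the accounting used in the next paragraph.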
+ +The convergence rate for the current method for strongly convex and Lipschitz smooth case scales as O(d^2/\epsilon^2) in Wasserstein 2 distance, where every step requires partial derivative in one coordinate. This is to be compared to O(d/\epsilon^2) gradient computations required for the vanilla Langevin algorithm to converge. Although in terms of number of partial derivatives, they are comparable to each other, current computation infrastructure has made gradient computation a lot cheaper than d number of sequential partial derivative computations. + +One possible use case, as the authors mentioned, might be that certain dimensions are more stiff than the others, calling for more careful exploration. In practice, however, the stiff dimensions change with the state and it is challenging to detect these stiff directions on the fly. + +################################################################################# +TLDR + +Pros: +Generalizes Langevin within Gibbs and achieves convergence guarantees. + +Cons: +Not outperforming (sometimes even underperforming) current methods. + +Related work: +Please also check the related work: MALA-within-Gibbs samplers for high-dimensional distributions with sparse conditional structure, X. T. Tong, M. Morzfeld, Y. M. Marzouk, 2019.",4,4.0,ICLR2021 +zMx4bJLK-EW,5,a2rFihIU7i,a2rFihIU7i,"Thorough experiments suggests an effective approach, but overall limited insight and novelty with respect to previous work","Summary +------- + +Algorithms for hyperparameter optimization either do not maintain a +model over configurations or are synchronous. This work proposes an +extension of the asynchronous successive halving algorithm by +introducing a Gaussian Process model which maintains a distribution over +configurations and resource levels (referred to as rungs). Their +algorithm, MOBSTER, is evaluated thoroughly against many baselines and +datasets. + +Decision +-------- + +Overall, I am leaning towards marginally below acceptance threshold for +this paper. I do not have much confidence, due to my unfamiliarity with +this area. Your experiments section seems comprehensive and MOBSTER +generally outperforms commonly used algorithms for hyperparameter +optimization. Despite connections to ASHA and the median rule, there is +limited novelty and extension of prior work. + +Originality +----------- + +The way the work is presented suggests that the proposed approach is not +very novel. Your paper does not note novelty of the proposed approach as +a contribution (i.e. in Section 1.1). For someone unfamiliar with this +area, it would be good to discuss why this approach has not been +explored before despite its lack of novelty. + +Quality and Clarity +------------------- + +The paper is generally well-written. Some statements are vague or have +typos, and these can be found in the detailed/minor comments. + +Strengths +--------- + +- As someone not familiar with this literature, I think you do a good + job at explaining previous work and your proposed approach. Section + 2 in particular, helps situate your work in the literature discussed + in Section 1.2. +- The experimental results are thorough and convincing. Although some + choices remain unexplained, such as reporting regret and truncating + performance, you include many baselines and evaluate performance on + a wide variety of tasks. Some of the experiments on the larger + models (LSTM, ResNet) seem inconclusive, but it's good to include + them. 
+ +Weaknesses +---------- + +- There is limited novelty in the proposed approach. This is not an + issue on its own, but I think that the immediate challenges or + drawbacks of the proposed approach are not adequately addressed. + Perhaps it is not immediately straightforward to use a GP in + combination with successive halving, but the challenges are not + clear in Section 3. If it is straightforward, then it would be good + to discuss the reasons why it wasn't done before. +- Although it is noted that a main contribution is clarifying + differences between ASHA and median rule, there is overall limited + new insight. For example, much of Section 3 (on page 5) include + statements on design choices for the experiments. These do not seem + to be motivated from the proposed method, and generate little new + insight and seem to be better suited to the experiment section. + +Detailed Comments +----------------- + +- Abstract: ""simultaneously reason across hyperparameters and resource + levels, and supports decision-making in the presence of pending + evaluations."" + + This statement is confusing and doesn't convey what your approach + is, or how it differs from existing methods. + +- Section 1: "" \$**x****\*** ∈ arg + minx∈\ mathcal{X} f(x)"" + + You're discussing architecture and hyperparameters here, but this + mathematical expression seems out of place. It is unclear if + $\mathbf{x}$ is the hyperparameter or the input, because $f$ is + defined later as a mapping for inputs $x$ + ($y_i = f(x_i) + \epsilon_i$) not hyperparameters. + +- Figure 2: I find the dashed lines fairly difficult to see. I + appreciate the effort to label the difference between stopping and + promotion, but I think the figure is actually not informative at all + of what is going on. All of my understanding from this figure is + from the caption. + +- Section 4: ""We report the immediate regret…"" + + The reasoning for this makes sense intuitively. However, can the + results in the paper be directly compared to results of other + papers? For example, under what conditions will the best performing + algorithm with respect to immediate regret also correspond to the + best performing algorithm with respect to validation error? + +- ""For all classification datasets we cap the regret at a lower bound + of 10−3, which corresponds to a 0.1% difference in classification + accuracy to the global optimum."" + + This seems like a bold choice. Are there similar decisions in other + literature that can further justify this? + +- Figure 6: ""After around 7000 seconds, MOBSTER’s model \[approaches + the optimum\] much faster than ASHA"" + + Is this accurate? Both methods seem to reach the optimum around the + same time. + +Minor Comments +-------------- + +- Section 2, footnote: "" Supporting pause-and-resume by checkpointing + is difficult in the multi-node distributed context. It is simpler to + start training from scratch whenever a configuration is promoted. + This is done in our implementation"" + + You should be clear here that pause-and-resume is done in your + implementation. ""This is done in our implementation"" suggests that + you do in fact start training from scratch whenever a configuration + is promoted. + +- Section 4: ""(for which the complexity considerable), which is why + omitted them here."" + + There seems to be some missing words here. + +- Scrolling and zooming in on the figures on pages 6/7/8 is very + taxing on my relatively powerful desktop. 
I am not sure how the + figures are embedded, but maybe these can be rasterized. + +- Figure 3: I assume RS is random search, but this is not said + anywhere in the paper. + +- Figure 6: Why are the symbols (circle, diamond, squared) on the line + plot missing in the figures here? + +Post Discussion +-------------- +After discussion, i have raised my score from 5 to 6.",6,2.0,ICLR2021 +eN-Ak1uomP,3,TQt98Ya7UMP,TQt98Ya7UMP,well written paper - could have a large impact,"The paper presents two soft-constrained rl approaches built on top of D4PG. Specifically, they use meta gradients for the lagrange multiplier learning rate (MetaL), and use meta gradients for reward shaping (MeSH). + +I found the paper to very clearly written, the main algorithm MetaL is clearly presented, and the results are fairly conclusive: the meta gradient approach proposed in the paper works better than the tested baselines. The introduction motivates the different real-world problems very nicely, e.g., hard vs. soft constraints. As someone with a lot of deep RL experience, but not a lot of constrained RL experience, I found the authors did a very good job at explaining all the relevant background. I also really appreciated the detailed experimental analysis of the approach at the end of 6.1 – it highlights exactly why the method works well. + +One critique I do have, is that it would be great to have the intuition for the MeSH update in the main paper. Otherwise, since the results indicate that it performs worse than MetaL across the board – perhaps it would be best to relegate the method to the appendix and change the presentation of the paper, e.g., change “we propose two meta gradient methods” etc… I feel like the paper would be stronger without MeSH as it takes away from the overall message. + +Another issue I have is that 3 seeds might not be enough to make any meaningful conclusions. More seeds would be needed to make any statistical comparison between metal and Rs-0.1 in table 1. Nevertheless, as highlighted RS-0.1 fails at humanoid – which makes sense – you would need to tune the penalty parameter for each domain (although 0.1 actually works quite well on ¾ domains). Which is why meta gradient approaches make sense. + +Overall, assuming the authors add extra seeds and perform the statistical significance testing – I think this paper would have a large impact at ICLR. + +Small notes: + +On the presentation side of the results, I find figures 2 and 3 to be hard to interpret at a glance. It would be better to compare against the best baseline – instead of ~6 of them. + +One final note, \bar \lambda = 1000 corresponds to the upper bound of the reward. Unless the algos are consistently achieving this upper bound, \bar \lambda seems very high. Ideally, a bunch of lambdas should be tested. This feels arbitrary. +",7,4.0,ICLR2021 +BJenAkApFH,3,SJem8lSFwB,SJem8lSFwB,Official Blind Review #2317," +Main contribution of the paper +- The paper proposes a new pruning method that dynamically updates the sparse mask and the network weight. +- Different from the other works, the proposed method does not require post-tuning. +- A theoretical explanation of the method is provided. + +Methods +- In this method, the weight of the baseline network is updated not by the gradient from the original weight but pruned weight. +- Here, pruning can be conducted by (arbitrary) a pruning technique given the network weight (Here, the author uses the magnitude-of-the-weight given method from Han.et.al). 
+ + +Questions +- See the Concerns + +Strongpoints +- The author provides the simple and effective pruning method and verifies the performance with a sufficient amount of experiments. +- The author argues that the method is applicable to various pruning techniques. + +Concerns +- It seems that the paper omits the existing work (You.et.al - https://arxiv.org/pdf/1909.08174.pdf), which seems to share some contribution. The reviewer wants the author to clarify the differences and the strongpoints compared to the work. +- The main pruning&update equation (DPF) does not seem to force the original network w to become sparse, such as by l1-regularization. So, the reviewer worried that the method might not produce sparsity if the initial weights are not that sparse. +If the reviewer missed the explanation about this, clarify this. +- Regarding the above concern, what if we add regularization term in training the original network w? +- As far as the reviewer knows, the proposed method improves the sparsity of the network, but most works choosing the strategy actually cannot meaningfuly enhance the operation time and just enhances the sparsity. Does the author think that the proposed method can enhance the latency? If so, a detailed explanation or experiment will be required. + +Conclusion +- The author proposes a simple but effective dynamic pruning method. +- The reviewer has some concerns regarding the novelty, real speed up, and guarantee of the sparsity. +However, the reviewer thinks that this work has meaningful observations for this field with a sufficient amount of verification, assuming that the author's answers for the concerns do not have much problem. + +Inquiries +- See the Concerns parts.",6,,ICLR2020 +BygGVkvh3B,4,HJghoa4YDB,HJghoa4YDB,Official Blind Review #4,"The paper discusses the policy evaluation problem using temporal-difference (TD) learning with nonlinear function approximation. The authors show that in the “lazy training” regime both over- and under-parametrized approximators converge exponentially fast, the former to the global minimum of the projected TD error and the latter to a local minimizer of the same error surface. Simply put, the lazy regime refers to the approximator behaving as if it had been linearized around its initialization point. This can happen if the approximation is rescaled, but can also occur as a side-effect of its initialization. The authors present simple numerical examples illustrating their claims. + +Although I did not carefully check the math, this seems like a solid contribution on the technical side. My main concern about the paper is that it falls short in providing intuition and contextualizing its technical content. Regarding the presentation, I believe it is possible to have a less dry prose without sacrificing mathematical rigor. If some of the technical material is moved to the appendix --like auxiliary results, discussion on proof techniques, etc--, the additional space could be used to discuss the implications of the theoretical results in more accessible terms. + +For example, a subject that ought to be discussed more clearly is the nature of the approximation induced by the lazy training regime. As far as I understand, this regime can be thought of as a sort of regularization that severely limits the capacity of the approximator. 
Although the authors mention in the conclusion that “...convergence of lazy models may come at the expense of their expressivity”, after reading the paper I do not have a clear sense of how expressive such models actually are. In their experiments, Chizat et al. (2018) observed that the performance of commonly used neural networks degrades when trained in the lazy regime --to a point that they consider it unlikely that the successes of deep learning can be credited to this regime of training. It seems to me that this subject should be more explicitly discussed in a paper that sets out to provide theoretical support for deep reinforcement learning. + +Still regarding the behavior of lazy approximators, my intuitive understanding is that they work as a linear model using random features. If this interpretation is correct, this makes the theoretical results a bit less surprising. They are still interesting, though, for they can be seen as relying on a “smoother” version of the linearity assumption often made in the related literature. Maybe this is also something worth discussing? Still on this subject, it seems to me that one potential disadvantage of lazy models with respect to their linear counterparts is that it is less clear how to enforce the lazy regime in practice. In Section 4.2 the authors discuss how this can naturally happen as a side-effect of the initialization, but it is unclear how applicable the particular strategy used to illustrate this phenomenon, with the “doubling trick”, is in practice. This is another example of a less technical discussion that would make the paper a stronger contribution.",6,,ICLR2020 +w1vrzpTRsSn,3,OBI5QuStBz3,OBI5QuStBz3,Degree of advance or surprise over prior work is not clear.,"This paper studies the minimum number of bits that need to be communicated between N machines that jointly seek to minimize \sum_{i=1}^N f_i(x) where each function f_i is held by one of the machines, and the domain D of each f_i is a subset of R^d. The motivation for this is large optimization tasks that have to be solved often in machine-learning problems: the hope is to learn the limits to which the number of bits communicated in popular machine learning tasks can be optimized. + +While the problem studied in the rest of the paper is cast purely in theoretical terms in the classical Message-Passing setting in distributed optimization, and therefore a conference with more theoretical bent seems more appropriate, given the many prior works that have appeared in machine learning conferences I am not too worried on this count. + +As the main result, the paper shows that for quadratic optimization (i.e., even if all the f_i's are guaranteed to be quadratic functions), to obtain an additive epsilon approximation (to the minimum of \sum_i f_i(x)) deterministically requires \Omega(Nd*log(beta*d/epsilon)) bits and the same for randomized approximation is \Omega(Nd*log(beta*d (N*epsilon))) bits. Here beta is the smoothness parameter of \sum_i f_i. The closest related earlier work showed that in the 2-node setting one requires \Omega(d*log(beta*d/epsilon)) bits. The generalization to N machines of this earlier result is a natural extension to study. + +A couple of concerns: + +1) Given the result for the two machine setting, what would one expect to be the lower bound in the N machine setting? Perhaps the proof maybe involved, but are there reasons to expect the lower bound to take any other form? If there are, they don't seem to be present in the paper. 
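For my own bookkeeping, I read the stated bounds as follows; the randomized expression above appears to have lost a division sign, so the placement of $N\epsilon$ in the denominator is my guess from context rather than a claim about the paper:

$$\Omega\!\Big(Nd\,\log\tfrac{\beta d}{\epsilon}\Big) \text{ bits (deterministic)}, \qquad \Omega\!\Big(Nd\,\log\tfrac{\beta d}{N\epsilon}\Big) \text{ bits (randomized)},$$

to be compared with the earlier two-machine bound of $\Omega\big(d\,\log\tfrac{\beta d}{\epsilon}\big)$ bits.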
+ +2) Even if the result is not unexpected, proving it could well be complicated. So if there are clear technical innovations compared to prior work that would be a good plus. However the degree of technical advance is not quite clear to me --- I have not worked in this area to say for sure. The authors point out that the innovation is here is the connection built to communication complexity in the context of real-valued optimization-tasks, but I am not so sure if there is enough to clear the ICLR bar. + +Overall this is a problem whose answer is good to know. The paper is written quite well, and is situated well in the landscape of related work. I am not sure if it clears the ICLR bar, perhaps accept if room.",5,2.0,ICLR2021 +S1gObe4q2X,2,rJlDnoA5Y7,rJlDnoA5Y7,Neat idea backed by a solid technical contribution,"This paper describes a technique for replacing the softmax layer in sequence-to-sequence models with one that attempts to predict a continuous word embedding, which will then be mapped into a (potentially huge) pre-trained embedding vector via nearest neighbor search. The obvious choice for building a loss around such a prediction (squared error) is shown to be inappropriate empirically, and instead a von Mises-Fisher loss is proposed. Experiments conducted on small-data, small-model, greedy-search German->English, French->English and English->French scenarios demonstrate translation quality on par with BPE, and superior performance to a number of other continuous vector losses. They also provide convincing arguments that this new objective is more efficient in terms of both time and number of learned parameters. + +This is a nice innovation for sequence-to-sequence modeling. The technical contribution required to make it work is non-trivial, and the authors have demonstrated promising results on a small system. I’m not sure whether this has any chance of supplanting BPE as the go-to solution for large vocabulary models, but I think it’s very healthy to add this method to the discussion. + +Other than the aforementioned small baseline systems, this paper has few issues, so I’ll take some of my usual ‘problems with the paper’ space to discuss some downsides with this method. First: the need to use pre-trained word embeddings may be a step backward. It’s always a little scary to introduce more steps into the pipeline, and it’s uncomfortable to hear the authors state that they may be able to improve performance by changing the word embedding objective. As we move to large training sets, having pre-trained embeddings is likely to stop being an advantage and start being a hindrance. Second: though this can drastically increase vocabulary sizes, it is still a closed vocabulary model, which is a weakness when compared to BPE (though I suppose you could do both). + +Smaller issues: + +First paragraph after equation (1): “the hidden state … t, h.” -> “the hidden state h … t.” + +Equation (2): it might help your readers to spell out how setting \kappa to ||\hat{e}|| allows you to ignore the unit-norm assumption of \mu. + +“the negative log-likelihood of the vMF…” - missing capital + +Unnumbered equation immediately before “Regularization of NLLvMF”: C_m||\hat{e}|| is missing round brackets around ||\hat{e}|| to make it an argument of the C_m function. + +Is predicting the word vector whose target embedding has the highest value of vMF probability any more expensive than nearest neighbor search? Does it preclude the use of very fast nearest neighbor searches? 
+
+It might be a good idea to make it clear in 4.3 that you see an extension to beam search for your method to be non-trivial (and that you aren’t simply leaving out beam search for comparability to the various empirical loss functions). This didn’t become clear to me until the Future Work section.
+
+In Table 5, I don’t fully understand F1 in terms of word-level translation accuracy. Recall is easy to understand (does the reference word appear in the system output?) but precision is harder to conceptualize. It might help to define the metric more carefully.",7,4.0,ICLR2019
+B1lgRLesoX,1,Byf5-30qFX,Byf5-30qFX,"Simple and nice idea, but very unclear description and some serious flaws","In this paper, the authors extend the HER framework to deal with dynamical goals, i.e. goals that change over time. In order to do so, they first need to learn a model of the dynamics of the goal, and then to select in the replay buffer experience reaching the expected value of the goal at the expected time. Empirical results are based on three (or four, see the appendix) experiments with a Mujoco UR10 simulated environment, and one experiment is successfully transferred to a real robot.
+
+Overall, the addressed problem is relevant (the question being: how can you efficiently replay experience when the goal is dynamical?), the idea is original and the approach looks sound, but it seems to suffer from a fundamental flaw (see below).
+
+Despite some merits, the paper mainly suffers from the fact that the implementation of the approach described above is not explained clearly at all.
+Among other things, after reading the paper twice, it is still unclear to me:
+- how the agent learns the goal motion (what substrate for such learning, what architecture, how many repetitions of the goal trajectory, how accurate the learned model is...)
+- how the output of this model is taken as input to infer the desired values of the goal in the future: should the agent address the goal at the next time step or later in time, how does it search in practice in its replay buffer, etc.
+
+These unclarities are partly due to insufficient structuring of the ""methodology"" section of the paper, but also to insufficient mastery of scientific English. At many points it is not easy to get what the authors mean, and the paper would definitely benefit from the help of an experienced scientific writer.
+
+Note that Figure 1 helps in getting the overall idea, but another figure showing an architecture diagram with the main model variables would help further.
+
+In Figures 3a and 5, we can see that performance decreases. The explanation of the authors just before 4.3.1 seems to imply that there is a fundamental flaw in the algorithm, as this may happen with any other experiment. This is an important weakness of the approach.
+
+To me, Section 4.5 about transfer to a real robot does not bring much, as the authors did nothing specific to favor this transfer. They just tried and it happens that it works, but I would like to see a discussion of why it works, or to see the authors show with an ablation study that if they change something in their approach, it does not work any more.
+
+In Section 4.6, the fact that DHER can outperform HER+ is weird: how can a learned model do better than a model given by hand, unless that given model is wrong? This needs further investigation and discussion.
+
+In more detail, a few further remarks:
+
+In related work, twice: you should not replace an accurate enumeration of papers with ""and so on"".
+ +p3: In contrary, => By contrast, + +which is the same to => same as + +compare the above with the static goals => please rephrase + +In Algorithm 1, line 26: this is not the algorithm A that you optimize, this is its critic network. + +line 15: you search for a trajectory that matches the desired goal. Do you take the first that matches? Do you take all that match, and select the ""best"" one? If yes, what is the criterion for being the best? + +p5: we find such two failed => two such failed + +that borrows from the Ej => please rephrase + +we assign certain rules to the goals so that they accordingly move => very unclear. What rules? Specified how? Please give a formal description. + +For defining the reward, you use s_{t+1} and g_{t+1}, why not s_t and g_t? + +p6: the same cell as the food at a certain time step. Which time step? How do you choose? + +The caption of Fig. 6 needs to be improved to be contratsed with Fig. 7. + +p8: the performance of DQN and DHER is closed => close? + +DHER quickly acheive(s) + +Because the law...environment. => This is not a sentence. + +Mentioning in the appendix a further experiment (dy-sliding) which is not described in the paper is of little use. +",6,3.0,ICLR2019 +vvUx6UGCABv,2,UOOmHiXetC,UOOmHiXetC,A modification of Monte Carlo tree search that produces marginal improvements that may not be present with tuning of the Monte Carlo tree search exploration parameter,"The authors present a method that combines Monte Carlo tree search (MCTS) and random rollouts. The authors their relate this to the bias-variance tradeoff observed in n-step temporal difference methods. The authors evaluate their method on Sokoban and the Google Football League environment. The results show that the authors' method leads to marginal improvements on these domains. + +I do not think what the authors are doing is very novel as MCTS combined with rollouts was already used in AlphaGo. Furthermore, I believe the small difference in results can be made up by using only MCTS with a different exploration parameter (i.e. like the one that was used in the AlphaGo paper). + +I would like to know what benefits this method brings that cannot be obtained from combining MCTS with rollouts as in AlphaGo or from a hyperaparameter search with MCTS. Is there an anaylsis of the bias variance tradeoff of this method?",3,4.0,ICLR2021 +B1RZJ1cxG,2,SyrGJYlRZ,SyrGJYlRZ,Misleading and shaky theoretical motivation/approach,"The paper explores momentum SGD and an adaptive version of momentum SGD which the authors name YF (Yellow Fin). They compare YF to hand tuned momentumSGD and to Adam in several deep learning applications. + + +I found the first part which discusses the theoretical motivation behind YF to be very confusing and misleading: +Based on the analysis of 1-dimensional problems, the authors design a framework and an algorithm that supposedly ensures accelerated convergence. There are two major problems with this approach: + +-First: Exploring 1-dim functions is indeed a nice way to get some intuition. Yet, algorithms that work in the 1-dim case do not trivially generalize to high dimensions, and such reasoning might lead to very bad solutions. + +-Second: Accelerated GD does not benefit over GD in the 1-dim case. And therefore, this is not an appropriate setting to explore acceleration. +Concretely, the definition of the generalized condition number $\nu$, and relating it to the standard definition of the condition number $\kappa$, is very misleading. 
This is since $\kappa =1$ for 1-dim problems, and therefore accelerated GD does not have any benefits over non accelerated GD in this case. +However, $\nu$ might be much larger than 1 even in the 1-dim case. + + +Regarding the algorithm itself: there are too many hyper-parameters (which depend on each other) that are tuned (per-dimension). +And as I have mentioned, the design of the algorithm is inspired by the analysis of 1-dim quadratic functions. +Thus, it is very hard for me to believe that this algorithm works in practice unless very careful fine tuning is employed. +The authors mention that their experiments were done without tuning or with very little tuning, which is very mysterious for me. + +In contrast to the theoretical part, the experiments seems very encouraging. Showing YF to perform very well on several deep learning tasks without (or with very little) tuning. Again, this seems a bit magical or even too good to be truth. I suggest the authors to perform a experiment with say a qaudratic high dimensional function, which is not aligned with the axes in order to illustrate how their method behaves and try to give intuition. +",4,5.0,ICLR2018 +ByxGiJd5sX,1,B1e4wo09K7,B1e4wo09K7,CoVAE,"The paper presents a VAE that uses labels to separate the learned representation into an invariant and a covariant part. The method is validated using experiments on the MNIST dataset. + +The writing in this paper is somewhat problematic. Although it is hard to put the finger on a particularly severe instance, the paper is filled with vague and hyperbolic statements. Words like ""efficiently"", ""meaningful"", ""natural"", etc. are sprinkled throughout to confer a positive connotation, often without having a specific meaning in their context or adding any information. Where the meaning is somewhat clear, the claims are often not supported by evidence. Sometimes the claims are so broad that it is not clear what kind of evidence could support such a claim. + +A relatively large amount of space is used to explain the general concept of invariant/covariant learning, which, as a general concept, is widely understood and not novel. There are other instances of overclaiming, such as ""The goal of CoVAE is to provide an approach to probabilistic modelling that enables meaningful representations [...]"". In fact, CoVAE is a rather specific model(class), rather than an approach to probabilistic modelling. + +The paper is at times meandering. For instance, the benefits of and motivation for the proposed approach are not simply stated in the introduction and then demonstrated in the rest of the paper, but instead the paper states some benefits and motivations, explains some technical content, mentions some more benefits, repeats some motivations stated before, etc. + +Many researchers working on representation learning hope to discover the underlying learning principles that lead to representations that seem natural to a human being. In this paper, labels are used to guide the representation into the ""right"" representation. It is in my opinion not very surprising that one can use labels to induce certain qualities deemed desirable in the representation. 
+ +To conclude, because of the writing, limited novelty, and limited experiments, I think this paper currently does not pass the bar for ICLR.",4,3.0,ICLR2019 +SJCbAbugf,1,HyydRMZC-,HyydRMZC-,An interesting way of creating better adversarial examples.,"This paper explores a new way of generating adversarial examples by slightly morphing the image to get misclassified by the model. Most other adversarial example generation methods tend to rely on generating high frequency noise patterns based by optimizing the perturbation on an individual pixel level. The new approach relies on gently changing the overall image by computing a flow an spatially transforming the image according to that flow. An important advantage of that approach is that the new attack is harder to protect against than to previous attacks according to the pixel based optimization methods. + +The paper describes a novel model method that might become a new important line of attack. And the paper clearly demonstrates the advantages of this attack on three different data sets. + +A minor nitpick: the ""optimization based attack (Opt)"" was first employed in the original ""Intriguing Properties..."" 2013 paper using box-LBFGS as the method of choice predating FGSM.",7,4.0,ICLR2018 +H1xPl92pKB,2,HkxWXkStDB,HkxWXkStDB,Official Blind Review #2,"This paper proposes a data augmentation method that interpolates between two existing methods (Cutout and Gaussian), for training robust models towards Gaussian and naturally occurring corruptions. The method is shown to improve robustness without sacrificing accuracy on clean data. +Pros: +The proposed method, despite being simple, seems to empirically work well in terms of the mCE criterion evaluated in the experiments. This does support the authors’ claim that current methods haven’t reached the robustness/accuracy tradeoff boundary yet. +Cons: +I’m a bit concerned about the significance of the work though. The method is a straight-forward combination of existing methods, so methodologically the novelty is kind of limited. Hence, I’m expecting more insights from the analysis of the results, to gain more understanding of why it works so well. However, the presentation of the experiments just seems to aim for the best numbers one can get (I’m not certain how significant the numbers are to this field though). A few examples/pictures of success cases (when the method works) and failure cases (when the method doesn’t work), may help readers (I’m not an expert) to better understand the approach and get more intuitions? The frequency analysis seems quite intuitive. It’s obvious that Gaussian filter blocks high-frequency components, and Cutout keeps some original parts of the image which allow high-freq details to be captured. But, considering CIFAR image size is only 32x32, a patch of size 25 is quite large, how much is the method different from plain whole image Gaussian then? +",3,,ICLR2020 +my3h4BMpsb,1,yOkSW62hqq2,yOkSW62hqq2,Official Blind Review #3 ,"Overall, I vote for marginally below the acceptance threshold. I think the idea is somewhat novel for the Explicit Connection Distillation, especially for cross-network layer-to-layer gradients propagation. This paper proposes a new strategy of Knowledge distillation called Explicit Connection Distillation (ECD), and the ECD achieving knowledge transfer via cross-network layer-to-layer gradients propagation without considering conventional knowledge distillation losses. 
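+
+(For concreteness, the conventional distillation objective referred to here is, in its usual form and in my own notation rather than the paper's,
+$$ \mathcal{L}_{KD} = (1-\lambda)\,\mathrm{CE}\big(y, \sigma(z_s)\big) + \lambda\, T^2\, \mathrm{KL}\big(\sigma(z_t/T)\,\|\,\sigma(z_s/T)\big), $$
+where the temperature $T$ and the balancing weight $\lambda$ are the two extra hyper-parameters mentioned below.)
+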
Experiments on popular image classification tasks show that the proposed method is effective.
+However, several concerns, including the clarity of the paper and the need for some additional ablation studies (see cons below), result in this decision.
+
+##########################################################################
+Pros:
+1) The knowledge distillation by cross-network layer-to-layer gradients propagation is somewhat novel to me.
+2) This paper is easy to follow, and the methodology part is clear to me.
+3) The experiments part shows a detailed ablation study of each component, and the supplementary material lists almost all details of the experiments, which helps the community to reproduce the proposed method.
+
+##########################################################################
+Cons:
+1) The first concern is about motivation. (1): The authors claim that conventional KD methods lead to complex optimization objectives since they introduce two extra hyper-parameters. To the best of my knowledge, these two parameters do not have too much search space, e.g., the temperature is typically 3-5 and the weight is T^2, following Hinton's paper and subsequent works. (2): The drawback of one-stage KD methods is a little bit overclaimed; both ONE and DML can be applied to a variety of architectures. In my opinion, the teacher design of ECD follows a similar strategy to ONE and its variants, in which the teacher is wider than the student. Overall, I think the motivation of this paper needs to be clarified very carefully.
+2) The fairness of the comparison. Is the dynamic additive convolution component used in the student network? Does this influence the comparison in Table 2? Do ONE and DML also use it?
+3) Why are the automatically generated teachers of ECD much lower than the other methods in Table 2 in terms of performance, yet still result in higher student performance? Is there any explanation for this, such as in [1][2]?
+4) Could you provide a computation cost comparison of the proposed method and the other one-stage methods in Table 2?
+5) Some recent SOTA works are missed [3][4]; although I know the performance of this paper is outperformed, I think they need to be discussed.
+
+
+Reference:
+[1]: Improved Knowledge Distillation via Teacher Assistant
+
+[2]: Search to distill: pearls are everywhere but not the eyes
+
+[3]: Online Knowledge Distillation via Collaborative Learning
+
+[4]: Peer Collaborative Learning for Online Knowledge Distillation
+",5,5.0,ICLR2021
+CWomj3cJi_2,2,Q1aiM7sCi1,Q1aiM7sCi1,A good contribution that finally enables clustering of topological descriptors,"# Synopsis of the paper
+
+This paper develops a novel algorithm for performing fuzzy clustering
+(i.e. non-hard assignment of points to cluster centres) on persistence
+diagrams, i.e. topological data descriptors. This is a highly relevant
+contribution because persistence diagrams 'live' in a space that makes
+metric analyses somewhat cumbersome. By contrast, the proposed method,
+even though this is not strictly highlighted in the paper, can be used
+as a principled way to obtain 'representatives' of a data set.
+
+The critical algorithmic insight of the paper lies in developing a new
+way to calculate Fréchet means; this makes it possible to adapt FCM to
+this domain.
+
+A set of experiments demonstrates the utility of the proposed approach.
+
+# Summary of the review
+
+This is a well-written paper with a highly relevant contribution for
+the TDA community. 
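+
+(For readers less familiar with fuzzy c-means: the scheme being adapted is, roughly and in my own notation, the alternation of the membership update
+$$ u_{ij} = \Big( \sum_{k} \big( d(x_i, c_j) / d(x_i, c_k) \big)^{2/(m-1)} \Big)^{-1} $$
+with the recomputation of each centre $c_j$ as the weighted mean minimising $\sum_i u_{ij}^m \, d(x_i, c_j)^2$; it is this second step, the Fréchet mean, that is non-trivial for persistence diagrams.)
+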
I am excited to see such a clustering algorithm +finally emerge for persistence diagrams, and I envision that this paper +will be a very useful contribution to the field. + +That being said, there are some issues in the current write-up that +prevent me from fully endorsing this work for now, namely: + +1. Presentation: the paper will be somewhat confusing for non-experts in + TDA. While this is to be expected to some extent, there are several + places in which the paper could be improved to provide some more + intuition, making it possible that even non-experts can appreciate + the contribution. + +2. The experiments appear to be somewhat preliminary. The experiment + with synthetic data, for example, only comprises few samples that + are clustered; a large portion of this section is spent on discussing + an application in materials science instead (which is of course + important, but I feel that it is hard to both appreciate the + application domain and the algorithm at the same time). In addition, + some details about the empirical behaviour of the method are not + discussed. + +If these two points were to be rectified in a revision of the paper, +it would help the contribution to shine more. Please see below for more +detailed comments. + +# Detailed comments + +- In terms of exposition, I would suggest to cite the Vietoris--Rips + complex construction (and also refer to the complex by this + alternative name; I think VR is much more common that just 'Rips') + +- Would it not be easier to show a Vietoris--Rips filtration instead of + the Čech one? I would suggest updating Figure 1 to account for this. + To provide a more intuitive view to TDA, the caption of the figure + could also be extended to describe the creation of cycles in the + point cloud, for example. This is the first figure that readers will + see, so any updates are well worth the effort, in my opinion. + +- Footnote 2 needs some clarification: is the paper suggesting to + compute the full VR complex (until the full simplex of $n$ vertices + has been reached)? If not, there *will* be multiple points at + infinity. + +- Would it not be possible to sidestep the issue of infinite points + entirely by using, say, extended persistence? It is my understanding + that the method does not 'care' about the way the persistence diagrams + are calculated, right? Hence, there are no structural assumptions + being made here. + +- The Fréchet mean is not necessary unique. Does this pose any problems + for the algorithm? I would assume that the calculated clustering might + also not be unique (or one of multiple equivalent solutions), but + I lack the intuition here. This should be briefly discussed. + +- Before equation 2, it should be 'Fréchet mean $\widehat{\mathbb{D}}$', + i.e. the variable indicating the mean should be used here. + +- How is the convergence behaviour of the algorithm? How many + iterations does it usually take until clusters start to stabilise? + +- The section on 'Computing the Fréchet mean' could be improved in terms + of the flow. Would it be possible to provide an overview algorithm as + well? + +- In terms of the limitations of the method, it is my understanding that + it heavily depends on the Wasserstein distance. Is this correct? I am + asking because this distance is known to be computationally + challenging to compute, so I was wondering while reading the paper + whether a similar algorithm could be derived for *kernels* between + persistence diagrams. 
While the experiments discuss runtime already in + the supplements, I am not convinced about the overall scalability of + the algorithm. (this is not a fault of the paper, I merely want the + limitations to be discussed in more details) + +- As already mentioned above, I would suggest updating the discussion of + the lattice structure data if possible. I feel that this is an + interesting topic, but I would rather see more experiments in the + paper and a shortened description of the background (it could be put + into the supplemental section). + +- As an additional suggestion for improving the experiments, I would + suggest running the synthetic test data experiments with more + diagrams. This *can* be an excellent introductory experiment to + showcase the capabilities of the algorithm, but at present, it falls + slightly short of that. + +- To me, the 'decision boundaries' section could be extended. This is + a really exciting application; the ability to find representatives + of diagrams opens up all kinds of new avenues! Is it possible to link + this more to previous results, i.e. Ramamurthy et al.? + +- As for the discussion on generalisation capabilities, I would suggest + citing prior work (Rieck et al., 'Neural Persistence: A Complexity + Measure for Deep Neural Networks Using Algebraic Topology'), which + mentions relationships between topology-based measures and + generalisation capabilities. + +All in all, I am convinced that this has the potential to be a strong +addition to the TDA community! + +# Style & clarity + +The paper is well-written; there are a few sentences that I failed to +parse correctly, though: + +- 'invariance to the basis symmetries': should this be 'basic + symmetries' instead? Moreover, why is there an ellipsis (...) after + 'physics'? Should this be '[...]' to indicate that parts of the + quotation were left out? + +- As a matter of personal style preference, I would prefer to say 'The + algorithm by Turner et al.' rather than ""Turner et al.'s algorithm"". + The latter strikes me as somewhat confusing. + +- I would suggest to use no contractions in formal writing, hence + ""cannot"" instead of ""can't"" etc.; this is a minor point, but it since + the remainder of the paper is written so neatly, I cannot help but + point out ways to improve it even more. + +- 'so does not' --> 'so it does not' + +# Update after rebuttal & discussions + +I thank the authors for their thorough rebuttal. While the technical details are acknowledged and addressed for the most part, the experimental setup could still be improved. R1 mentioned that the work by Lacombe et al. might also be applicable as a comparison partner. Investing in a more thorough scenario would strengthen the paper by a lot. + +# Further update after discussions + +The primary subject of our discussions concerned the experimental setup. While I still see this paper favourably, it would be strengthened by a more in-depth comparison with the work by Lacombe et al. The core of the paper would be more convincing if the utility of the fuzzy clustering could be highlighted better in a set of scenarios that are more comparable with existing TDA literature.",6,5.0,ICLR2021 +Skgk1ABTKH,2,HJl8AaVFwB,HJl8AaVFwB,Official Blind Review #3,"~The authors propose the addition of multiple instance learning mechanism to existing deep learning models to predict taxonomic labels from metagenomic sequences.~ + +I appreciate the focus area and importance of the problem the authors have outlined. 
However, I do not think the authors have achieved the conclusions they mention on page 2, as well as other issues throughout the work. I also think the inclusion of the multiple instance learning framework is incremental and does not provide sufficient benefit. + +“A new method to generate synthetic read sets with realistic co-occurence patterns from collections of reference genomes”. I do not think there was any systematic analysis of the parameterization of their generative framework. I would appreciate empirical comparison of previous synthetic read generation techniques to the current proposed framework. Also, there is no comparison of the generative framework to real data. How are the parameters chosen in section 3.1.1? Finally, how the authors propose to alleviate bias of composition of databases? Rare species that may be present in abundance in metagenomic data may be swamped out by more common species sequenced again and again in databases. + +“A thorough empirical assessment of our proposed model, showing superior performance in prediction the distributions of higher level taxa from read sets.” A comparison to existing alignment-based methods is absolutely required for this work. The authors of GeNet compare to state-of-art Kraken and Centrifuge, and when reading Rojas-Carulla et al., these models have still performed worse than Kraken and Centrifuge, and that should be reported in your assessment. + +A few minor points: + +-More description of your neural network architecture is needed. I’m not sure what a “ResNet-like neural network” actually means, and how something built for images deals with sequences. There are also different ResNets with different numbers of parameters. + +-Why isn’t a 1D convolutional neural network used to process the input sequences? This would make the sequences translation invariant, and would have a similar effect to working with kmers, where k = convolution width. + +-It may be useful to interpret the attention mechanism to understand which reads are likely influencing the decision of taxonomic assignment. +",1,,ICLR2020 +bqExpiExlE,2,Z_3x5eFk1l-,Z_3x5eFk1l-,A simple yet effective approach to extend MAML against adversarial attacks,"This paper presents ADML (ADversarial Meta-Learner), a method for general meta-learning when adversarial samples are present. In a sense, ADML extends MAML to deal with adversarial/contaminated samples in training. Building on top of MAML, the key insight is to validate or ground the model's updates from both angles: 1) to ground the parameter update from clean examples onto an adversarial set; 2) to ground the parameter update from the adversarial examples onto a clean set. Experimental evaluations on MiniImageNet and CIFAR 100 demonstrates the good performances of ADML over a series of meta-learning baselines. In addition, it's also shown that simply creating a mixed dataset with both clean and adversarial examples (i.e., MAML-AD) does not fully address the adversarial signals, thus validating the key insights of the two-way ""cross-grounding"" technique. + +The paper is well written and the extension of MAML onto the adversarial setting is simple yet effective. One minor comment regarding the evaluation though is that there could a couple more ablation studies for ADML. For example, how sensitive is the ADML to the learning rates for the outer objective? 
Also for the clean-clean case, since MAML still performs slightly better than ADML, is it possible to modify/extend ADML such that it will still match MAML's performance when there's no adversarial examples in the training set?",6,4.0,ICLR2021 +B1l5wezE9H,3,Bklrea4KwS,Bklrea4KwS,Official Blind Review #2,"This paper described an approach of performing multiple instance learning (MIL) by using a network branch to weight instances and then using a Gaussian normalization layer on top of it, where the weightings are predicted based on in-bag variances. The instance weighting scheme, a classic in MIL, has been proposed in deep networks by [3]. Hence, I don't see much novelty here except that there is a Gaussian normalization layer after the instance weighting in the MIL framework. + +I'm a bit worried that sigma seems to be unnormalized in the first case -- what would happen if the bag score distribution is non-Gaussian? GP1T1 seems more sensible by learning all these weights. + +However, I see significant issues in terms of evaluation which makes hard to accept this paper. + +I firmly believe that 22 years after the (Dietterich 1997) paper, it's no longer enough to only use the original MIL datasets and classification datasets such as the CIFAR-10 bags to evaluate MIL. I may be alone here which is open to discussions, but the original motivation for MIL is for the problem to be weakly-supervised where we only know a high-level label but no low-level labels. There exist many realistic image problems that are similar to this (e.g. weakly-supervised detection and semantic segmentation) and have received a plethora of work, hence it's unclear to me whether still using these arbitrarily generated CIFAR-10 bags and the 5 old datasets would still apply to MIL as an approach. After all, MIL is almost dying and ""weakly-supervised learning"" has risen in popularity with almost being the same problem as MIL. I think for MIL to work its way back, it should first start by using the right datasets to test (e.g. the datasets in [3] would be a great starting point) and comparing with other weakly-supervised detection approaches that do not use additional information. + +Besides the philosophical point, a practical issue is that the numbers seem to be bold arbitrarily according to the authors' whim. In table 3, MI-Net DS is better than every other method in the last 2 columns, but not bold, and also it seems that no t-test was performed to determine the significance of differences. This also happens in Table 1, where I think WSDN should be equivalent with the proposed method in the last 2 rows. + +Finally, the choice of excluding MI-Net in the CIFAR experiments is dubious as well, given its performance in the simple datasets. Why is MI-Net not tested on the CIFAR experiments?",3,,ICLR2020 +S1l7y8esh7,2,BJx0sjC5FX,BJx0sjC5FX,"An interesting work offers first step in inspecting RNN representations, the experimental results does not fully support the claim","The work proposes Tensor Product Decomposition Networks (TRDN) as a way to uncover the representation learned in recurrent neural networks (RNNs). TRDN trains a Tensor Product Representation, which additively combine tensor products of role (e.g., sequence position) embeddings and filler (e.g., word) embeddings to approximate the encoding produced by RNNs. TRDN as a result shed light into inspecting and interpreting representation learned through RNNs. 
The authors suggest that the structures captured in RNNs are largely compositional and can be well captured by TPRs without recurrence and nonlinearity. + +pros: +1. The paper is mostly clearly written and easy to follow. The diagrams shown in Figure 2 are illustrative; +2. TRDN offers a headway to look into and interpret the representations learned in RNNs, which remained largely incomprehensible; +3. The analysis and insight provided in section 4 is interesting and insightful. In particular, how does the training task influence the kinds of structural representation learned. + + +cons: +1. The method relies heavily on these manually crafted role schemes as shown in section 2.1; It is unclear the gap in the approximation of TPRs to the encodings learned in RNNs are due to inaccurate role definition or in fact RNNs learn more complex structural dependencies which TPRs cannot capture; +2. The MSE of approximation error shown in Table 1 are not informative. How should these numbers be interpreted? Why normalizing by dividing by the MSE from training TPDN on random vectors? +3. The alignment between prediction using RNN representations and TPDN approximations shown in Table 2 are far from perfect, which would contradict with the claim that RNNs only learn tensor-product representation. ",6,4.0,ICLR2019 +rkgVy1xs2m,2,BJeOioA9Y7,BJeOioA9Y7,"Intriguing idea, strong performance, but missing empirical results to validate intuition","This paper proposes to feed the representations of various external ""teacher"" neural networks of a particular example as inputs to various layers of a student network. +The idea is quite intriguing and performs very well empirically, and the paper is also well written. While I view the performance experiments as extremely thorough, I believe the paper could possibly use some additional ablation-style experiments just to verify the method actually operates as one intuitively thinks it should. + +Other Comments: + +- Did you verify that in Table 3, the p_w values for the teachers trained on the more-relevant C10/C100 dataset are higher than the p_w value for the teacher trained on the SVHN data? It would be interesting to see the plots of these p_w over the course of training (similar to Fig 1c) to verify this method actually operates as one intuitively believes it should. + +- Integrating the teacher-network representations into various hidden layers of the student network might also be considered some form of neural architecture search (NAS) (by including parts of the teacher network into the student architecture). +See for example the DARTS paper: https://arxiv.org/abs/1806.09055 +which similarly employs mixtures of potential connections. +Under this NAS perspective, the dependence loss subsequently distills the optimal architecture network back into the student network architecture. + +Have you verified that this method is not just doing NAS, by for example, providing a small student network with a few teacher networks that haven't been trained at all? (i.e. should not permit any knowledge flow) + +- Have the authors considered training the teacher networks jointly with the student? This could be viewed as teachers learning how to improve their knowledge flow (although might require large amounts of memory depending on the size of the teacher networks). + +- Suppose we have an L-layer student network and T M-layer teacher networks. +Does this imply we have to consider O(L*M*T) additional weight matrices Q? +Can you comment on the memory requirements? 
+ +- The teacher-student setup should be made more clear in Tables 1 and 2 captions (took me some time to comprehend). + +- The second and third paragraphs are redundant given the Related Work section that appears later on. I would like to see these redundancies minimized and the freed up space used to include more results from the Appendix in the main text. +",8,5.0,ICLR2019 +6gteQiaJymJ,3,Oq79NOiZB1H,Oq79NOiZB1H,Experiments on discussing the version without full-batch computation are needed.,"Node sampling is a crucial point in making GCNs efficient. While several sampling methods have been proposed previously, the theorectical convergence analysis is still lacking. This paper finds that the convergence speed is related to not only the function approximation error but also the layer-gradient error. Based on this finding, the authors suggest to take historical hidden features and historical gradients to do doubly variance reduction. Experiments are done on 5 datasets for 7 baseline sampling-based GCNs. + +Pros: + +1. The core contribution of this paper lies in Theorem 1, which reveals the relationship between convergence speed and the function approximation error and the layer-gradient error. + +2. The idea of doubly variance reduction is reasonable. + +Cons: + +The biggest weakness of the proposed method is that it requires to compute snapshot features and gradients over all nodes (Line 5, Alg. 1 & Line 5 Alg. 2) before the sampling process. As this paper aims at enhancing sampling based GCNs, we should assumes no computation access/memory of performing full GCN. The authors have provided the related analyses in the appendix. It will be better if the experiments without full-batch shapshot are added in Table 1, as such we can check how it influences the final performance and if certain approximation will work well. + +",7,4.0,ICLR2021 +rkeZ_Y8F3m,1,B1fbosCcYm,B1fbosCcYm,"Review for ""A Biologically Inspired Visual Working Memory for Deep Networks"" (Strong Accept after Revision)","Revision: + +The authors have took in the feedback from myself and the other reviewers wholeheartedly, and have clearly worked hard to improve the results, and the paper during the revision process. In addition, their code release encourages easy reproducibility of their model, which imo is needed for this work given the non-conventional nature of the model (that being said, the paper itself is well written and the authors have done sufficiently well in explaining their approach, and also the motivation behind it, as per my original review). The code is relatively clear and self-contained demonstrating their experiments on MNIST, CelebA demonstrating the use of the visual sketch model. + +I believe the improvements, especially given the compute resources available to the authors, warrant a strong accept of this work, so I revised my score to 9. I also believe this work will be of value to the ICLR community as it offers alternate, less explored approaches compared to methods that are typically used in this domain. I'm excited to see more in the community explore biologically inspired approaches to generative models, and I think this work along with the code base will be an important base for other researchers to use as a reference point for future work. + +Original Review below: + +Summary: They propose a biologically motivated short term attentive working memory (STAWM) generative model for images. The architecture is based on Hebbian Learning (i.e. 
associative memories are represented in the weight matrices that are dynamically updated during inference by a modified version of Hebbian learning rule). These memories are sampled from glimpses on an input image (using attention on contextual states, similar to [1]), in addition to a latent, query state. This model learns a representation of images that can be used for sequential reconstruction (via a sequence of updates, like a sketchpad, like DRAW [1], trained in an unsupervised manner). These memories produced by drawing can also be used for semi-supervised classification (achieves very respectable and competitive results for MNIST and CIFAR-10). + +This paper is beautifully written, and the biological inspiration, motivation behind this work, and links to neuroscience literature as well as relation to existing ML work (even recent papers) is well stated. The main strength of this paper is that the author went from a biologically inspired idea to a complete realization of the idea in algorithmic form. The semi-supervised classification results are competitive to SOTA, and although the CIFAR-10 reconstruction results are not great (especially compared to generative adversarial models or recent variation models [2]), I think the approach is coming from a very different angle that is different enough compared to the literature to warrant some attention, or at least a glimpse, so to speak, from the broader community. The method may offer new ways to interpret ML models that is current models lack, which in itself is an important contribution. That being said, the fact that most adversarial generative models achieved a far better performance raises concern on the generalization ability of these memory-inspired learned representations, and I look forward to seeing future work investigate this area in more detail. + +The authors also took great care in writing details for important parts of the experiments in the Appendix section, and open sourced the implementation to reproduce all their experiments. Given the complex nature of this model, they did a great job in writing a clear explanation, and provided enough details for the community to build biologically inspired models for deep networks. Even without the code, I felt I might have been able to implement most of the model given the detail and clarity of the writing, so having both available is a great contribution. + +I highly recommend this paper for acceptance, with a score of 8 (edit: revised to 9 after rebuttal period). The paper might warrant a score of 9 if they had also achieved higher quality results for image generation, on Celeb-A or demonstrated results on ImageNet, and provided more detailed analysis about drawbacks of their approach vs conventional generative models. + +[1] https://arxiv.org/abs/1502.04623 +[2] https://arxiv.org/abs/1807.03039 + +",9,5.0,ICLR2019 +Hk2HjIfxG,1,SyhcXjy0Z,SyhcXjy0Z,clear application paper / class project,"The paper is relatively clear to follow, and implement. + +The main concern is that this looks like a class project rather than a scientific paper. For a class project this could get an A in a ML class! + +In particular, the authors take an already existing dataset, design a trivial convolutional neural network, and report results on it. 
There is absolutely nothing of interest to ICLR except for the fact that now we know that a trivial network is capable of obtaining 90% accuracy on this dataset.",1,5.0,ICLR2018 +SJgnuHj6KB,1,HJezF3VYPB,HJezF3VYPB,Official Blind Review #2,"The paper proposed and studied the unsupervised federated domain adaption problem, which aims to transfer knowledge from source nodes to a new node with different data distribution. To address the problem, a federated adversarial domain adaption (FADA) algorithm is introduced in the paper. The key idea of the algorithm is to update the target model by aggregating the gradients from source nodes, and also leverage adversarial adaption techniques to reduce the discrepancy between source features and target features. Overall, the problem studied in the paper is interesting, theoretical analysis on the error bound is provided in the paper, and the effectiveness of the proposed method has been validated in various datasets. Although the technical contributions of the paper are solid, I still have several concerns about it. +1. The proposed algorithm is not described very clearly in section 4. According to the paper, DI is used to identify the domain from the output of Gi and Gt and align the features from those domains, then how is it related to the disentanglement in Eq 6. Also in Eq 6, symbol C_s was not introduced in the previous context, which makes it confusing to understand this objective. + +2. It would be better if the author(s) can provide some complexity analysis of the proposed algorithm. + +3. The paper still contains some typos and unresolved reference issues. ",6,,ICLR2020 +BJ9p-XmEx,2,HksioDcxl,HksioDcxl,,"This paper proposed a joint model for rate prediction and text generation. The author compared the methods on a more realistic time based split setting, which requires “predict into the future.” +One major flaw of the paper is that it does not address the impact of BOW vs the RNN based text model, specifically RRN(rating+text) already uses RNN for text modeling, so it is unclear whether the improvement comes from RNN(as opposed to BOW) or application of text information. A more clear study on impact of each component could make it more clear and benefit the readers. +Another potential improvement direction of the paper is to support ranking objectives, as opposed to rate prediction, which is more realistic for recommendation settings. +The overall technique is intuitive and novel, but can be improved to give more insights to the reader,. + + + +",6,3.0,ICLR2017 +4JZ9-Doot9A,1,t4hNn7IvNZX,t4hNn7IvNZX,A randomized smoothing distributional robust certificate,"This paper studies the problem of certified robustness in adversarial learning. In a nutshell, they apply the randomized smoothing technique to the distributional robustness certificate proposed by Sinha et al. (2018), thereby relaxing the smoothness assumption required therein so that the ReLU network can be applied. Based on this new formulation, they derive the upper bound on the worst-case population loss and develop an algorithm with convergence guarantees. The results on tested on MNIST, CIFAR-10 and Tiny ImageNet. + +The topic is definitely important and the authors did a good job of explaining their framework. Nevertheless, unfortunately, I think the proposed methodology seems rather straightforward and does not provide many new insights into this area. Besides, there are numerous careless flaws in the paper. + +1. 
Theorem 1 appears to be a standard result in distributional robust optimization and it is unfortunate that the authors did not recognize it. See, for example, Kuhn et al. (2019) (https://arxiv.org/abs/1908.08729). + +2. Theorem 2 and Corollary 1 appear to be a standard result in randomized smoothing. Besides, the proof is not rigorous -- it takes for granted that the $\hat\ell$ is twice differentiable. + +3. The statement of Theorem 3 contains an obvious typo. The left side and the right side are no different from each other -- the only difference is the integration variable: one is $\hat{x}$ and the other is $x'$. + +That is to say, all three major theoretical results are either straightforward corollary from existing results or contain flaws. + +4. The algorithm proposed in Section 3.2 is problematic. In particular, in equation (6) and Alg.1 line 3-6, when $\cal X$ is a normed vector space, the value of the inner supremum does not depend on $z_{ij}$ simply via a change of variable from $x+z_{ij}$ to $x$. I think the correct version should be +$$ + \frac{1}{n} \sum_{i=1}^n \sup_{x\in\cal X} [ \frac{1}{s}\sum_{j=1}^s \ell(\theta;x+z_{ij}) - \gamma c(x,x_0^i) ]. +$$ + +5. As for the numerical experiment, it is unclear to me how $\gamma$ is chosen, and in particular, does such choice make the inner supremum of (6) a strongly concave problem? + +Given these many issues that are not easy to fix, I would encourage the authors to carefully revise their manuscript and resubmit to another conference. +",2,4.0,ICLR2021 +H1xhh7JecS,2,B1ecVlrtDr,B1ecVlrtDr,Official Blind Review #1,"In this paper, a new activation function, i.e. S-APL is proposed for deep neural networks. It is an extension of the APL activation, but is symmetric w.r.t. x-axis. It also has more linear pieces (actually S pieces, where S can be arbitrarily large) than the existing activation functions like ReLU. Experimental results show that S-APL can be on par with or slightly better than the existing activation functions on MNIST/CIFAR-10/CIFAR-100 datasets with various networks. The authors also show that neural networks with the proposed activation can be more robust to adversarial attacks. + +First of all, the activation function is much more complicated than the existing ones, as it has to determine the parameter S and the hinge positions. However, the gain is marginal as shown in Table 1. Besides, the authors never tell how to choose S and the hinge positions. + +Secondly, the neural networks used in the experiments are quite outdated. And the error rates shown in Table 1 are far away from state-of-the-art. Why don't you choose a latest network such as ResNet/DenseNet/EfficientNet and replace the activation with S-APL? The results could be more convincing. + +I am not an expert in adversarial attack. But is there any intuition why a complicated activation function is more robust to adversarial attack? Again, most of the models used in Table 2 are quite old (Lenet5, Net in Net, CNN). + +In a word, the proposed activation function is unnecessarily complicated and the gain is not justified with the latest models and not significant enough to convince people to adopt it. +",3,,ICLR2020 +Hye1Aka8qB,3,B1ecVlrtDr,B1ecVlrtDr,Official Blind Review #2,"This paper proposes a learnable piece-wise linear activation unit whose hinges are placed symmetrically. It gives a proof on the universality of the proposed unit on a certain condition. The superiority of the method is empirically shown. 
The change of the activation during training is analyzed and insight on the behavior is provided. The robustness to adversarial attacks is also empirically examined. + +This paper discusses a very basic component of neural network models: activation function. Thus, it should be of interest to many researchers. The proposed method is simple and seems easy to use in real settings. A number of experiments are conducted to validate the method and the results look promising. The experiments in Section 5 is particularly interesting. It might give some hints for the following studies. + +However, there are several things to be addressed for acceptance. + +1) What is actually proposed is not very clear. + +S-APL is formulated in Equation 2. However, there are some discussion after that which changes or restricts the equation. For example, it seems that b_i^s^+ = b_i^s^- is assumed throughout the paper. In that case, it should be just reflected in Equation 2. In the third paragraph of Section 3.2, it is mentioned that h_i^s(x) = h_i^s(-x) with b_i^s^+ = b_i^s^-. However, it should also assume that a^s^+ = a^s^-. From the experiments. apparently, a^s^+ = a^s^- is not assumed. It seems that the method has symmetry only for the hinge locations. + +In the first paragraph of Section 3.2, it is implied that parameters are shared across layers. It is not very clear what is shared and what is not. Please make that part clear. It will make it easier to understand the experimental settings, too. + +2) Theorem 3.1 does not seem to prove the approximation ability of S-APL. + +It is clear that g(x, S) can represent arbitrary h(x, S), but I am not sure if it is clear that h(x, S) can represent arbitrary g(x, S). It should also depend on the conditions on a^s^+, a^s^-, b_i^s^+, b_i^s^-. I think it needs to prove that h(x, S) can approximate arbitrary piecewise linear function (i.e., g(x, S)) if you want to prove the approximation ability of h(x, S). + +Equation 4 seems to assume that all intervals are the same (i.e., ∀i, B_i - A_i = (B-A) / S). It should be stated explicitly. This relates to the problem 1). + +I may not understand some important aspect. I am happy to be corrected. + +3) Experimental conditions are not clear. + +Please cite the papers which describe the architecture of the models used in the experiments. The effectiveness of the proposed method should depend on the network architecture and it is importable to be able to see the details of the models. + +4) On the sensitivity of optimization on the initial value. + +It is interesting to see that ""fixed trained S-APL"" is not comparable with ""S-APL positive"". If the hypothesis in the paper is correct, it is natural to assume that ""fixed trained S-APL"" also has some issue on training. It would be interesting to see experimental results with ""initialized with trained-S-APL"" and ""S-APL positive with non-zero initial value"". It is a bit weird to observe that ""S-APL positive"" never becomes non-zero for x < 0. + +5) Comparison results with other activation units in Section 6. + +The proposed method is compared only with ReLU. It is important to see comparisons with other activations such as the plain APL. + + +Some other minor comments: + +It is quite interesting that objects are actually modified for adversarial attack for the proposed method in Figure 5. 
It would be interesting to have some consideration on it.",6,,ICLR2020 +r1lkFUpCKr,2,SJeQi1HKDH,SJeQi1HKDH,Official Blind Review #2,"This paper proposes a new way to incentivize diverse policy learning in RL agents: the key idea is that each agent receives an implicit negative reward (in the form of an early episode termination signal) from previous agents when an episode begins to resemble prior agents too much (as measured by the total variational distance measure between the two policy outputs). + +Results on three Mujoco tasks are mixed: when PPO is combined with the proposed objective for training diverse policies, it results in very strong performance boosts on Hopper and HalfCheetah, but falls significantly short of standard PPO on Walker 2D. I would have liked to see a deeper analysis of what makes the approach work in some environments and not in others. + +Experimental comparisons in the paper are only against alternative approaches to optimize the same diversity objective as the proposed approach (with weighted sum of rewards (WSR) or task novel bisection(TNB)). Given that this notion of diversity is itself being claimed as a contribution, I would expect to see comparisons against prior methods, such as in DIAYN. There are other methods that have been proposed before in similar spirit to induce diversity in the policies learned. Aside from the evolutionary approaches covered in related work, within RL too, there have been methods such as the max-entropy method proposed in Eysenbach et al, ""Diversity is All You Need..."". These methods, evolutionary and RL, could be compared against to make a more convincing experimental case for the proposed approach. + +The experimental setting is also not fully clear to me: throughout experiments, are the diversity methods being evaluated for the average performance over all the policies learned in sequence to be different from prior policies? Or only the performance of the last policy? Related, I would be curious to know, if K policies are trained, the reward vs the training order k of the K policies. This is close to, but not identical to the study in Fig 4, to my understanding. + +Aside from the above points being unclear, the paper in general could overall be better presented. While I am not an expert in this area, I would still expect to be able to understand and evaluate the paper better than I did. +- Sec 3.1 makes a big deal of metric distance, but never quite explains how this is key to the method. +- The exact baselines used in experiments are unhelpfully labeled ""TNB"" (with no nearby expansion) and ""weighted sum of rewards (WSR)"", with further description moved to appendix. In general, there are a few too many references to appendices. +- The results in Fig 2 are difficult to assess for diversity, and this is also true for the video in the authors' comment. +- There is an odd leap in the paper above Eq 7, where it claims that ""social uniqueness motivates people in passive ways"", which therefore suggests that ""it plays more like a constraint than an additional target"". +- Sec 5.1 at one point points to Table 1 for ""detailed comparison on task related rewards"" but says nothing about any important conclusions from the table. +- There are grammar errors throughout. ",3,,ICLR2020 +BJlB2ry9cB,2,rkl_Ch4YwS,rkl_Ch4YwS,Official Blind Review #3,"In this paper the authors propose a two stage pipeline that aims to solve for mathematical expression recognition. 
The main approach uses the following stages, a detection stage that is based on YoloV3 and a sequence to sequence approach. The authors compare their method against Image2Latex approach (2016) that is an end to end pipeline and show that there is significant improvement compared to this approach. + +However, this problem has been a standard task and solved both in the handwritten math expression problem (CROHME challenge of ICDAR and typeset formula detection and recognition. There have been much progress through these challenges with various teams competing. A variety of approaches have been tried for this task and unfortunately the present work has not compared nor evaluated against these approaches. + +Im2Latex work is quite old benchmark and there have been numerous works as have been cited by the authors and more as can be available from the challenge. The methods presented are also not novel. Using Yolov3 for detection and sequence2sequence for parsing expressions are more or less standard approaches. Hence, the proposed work does not add a significant insight in solving the problem. ",3,,ICLR2020 +jHGVFM8A1KZ,3,Ef1nNHQHZ20,Ef1nNHQHZ20,Interesting ODE interpretations of layer-wise adversarial training ,"The authors proposed a new adversarial training scheme based on ODE techniques. The standard adversarial training seeks for small perturbations in the input space. But the perturbations can be found in the feature space. From the observation and the similarity between the layer-wise adversarial attacks and ODE formulation, the optimization scheme was developed. + +Clarity: +The paper is written well. Overall, it reads well. If the pseudocode of the two proposed methods is provided, it would be easier to understand the proposed schemes. The authors may save some space leaving out the proof of Theorem 3.3. + +Strengths/Quality/Significance (pros): +The authors studied an interesting construction. The proposed methods show that the adversarial training can be generalized to hidden representations. The generalization can be viewed as ODE formulations. + +To optimize the model parameters efficiently, using the operator-splitting theory from numerical analysis, +efficient numerical schemes can be developed using ODE theories. Discussion about Lie-Trotter Splitting Scheme and Strang-Marchuck splitting scheme was interesting. + +In addition, the resulting algorithms achieved competitive performance compared to state-of-the-art methods. + +The effectiveness of the proposed method with natural training methods is significant compared to vanilla natural training. + +Weaknesses (cons) & Questions: +Authors claimed that Eq 1. is inefficient to optimize and derive their numerical schemes based on techniques in the literature of ODE/Numerical analysis. The authors did not show the efficiency of the proposed methods. Even with a small dataset and small neural networks, if authors provide the comparison with Naïve approaches in terms of elapsed time/memory consumption and time/space complexity, it will be easier to evaluate the value of the proposed methods. + +The effect of approximated schemes in terms of numerical stability and adversarial robustness needs to be discussed. + +Since the bound for perturbation is omitted, the starting point (the optimization formulation) is unbounded. If authors tighten the loose end and bridge between the optimization problem and ODE-inspired numerical schemes, it will be more useful. 
Unlike the input space (image space), it is much trickier to define small perturbations to preserve the labels of samples. If the authors can fill this gap, it might be possible to provide a new definition of adversarial attacks. + +It is confusing. In Figure 2, the smallest number (ratio) looks smaller than 10^5, unlike the discussion. Clarify this. + +The performance gain against TRADES is table 3 is marginal. + +No experimental results are available to verify the theoretical analysis of the second-order dynamics. Is it possible to design experiments to check whether the models/numerical schemes show the behavior?",5,3.0,ICLR2021 +SJeRQb-oFH,2,S1lOTC4tDS,S1lOTC4tDS,Official Blind Review #4,"Paper summary. +The paper proposes Dreamer, a model-based RL method for high-dimensional inputs such as images. The main novelty in Dreamer is to learn a policy function from latent representation-and-transition models in an end-to-end manner. Specifically, Dreamer is an actor-critic method that learns an optimal policy by backpropagating re-parameterized gradients through a value function, a latent transition model, and a latent representation model. This is unlike existing methods which use model-free or planning methods on simulated trajectories to learn the optimal policy. Meanwhile, Dreamer learns the remaining components, namely a value function, a latent transition model, and a latent representation model, based on existing methods (the world models and PlaNet). Experiments on a large set of continuous control tasks show that Dreamer outperforms existing model-based and model-free methods. + +Comments. +Efficiently learning a policy from visual inputs is an important research direction in RL. This paper takes a step in this direction by improving existing model-based methods (the world models and PlaNet) using the actor-critic approach. I am leaning towards weak accepting the paper. + +I am reluctant to give a higher score due to its incremental contribution. Specifically, the policy update in Dreamer resembles that of SVG (Heess et al., 2015), which also backpropagates re-parameterized gradients through a value function and a transition model. The main difference between Dreamer and SVG is that Dreamer incorporates a latent representation model. From this viewpoint, the actor-critic component in Dreamer is an incremental contribution. Since the latent models are learned based on existing techniques, the paper presents an incremental contribution. + +Besides the above comments, I have these additional comments. +- Effectiveness on very long horizon trajectories: +Simulating long-horizon trajectories with a probabilistic model is known to be unsuitable for model-based RL due to accumulated errors. This is an open issue in model-based RL. The paper attempts to solve this issue by backpropagating policy gradients through the transition model, which is known to be more robust against model errors (see e.g., PILCO (Deisenroth et al., 2011)). However, the issue still exists in Dreamer, since there seems to be an upper limit of effective horizon length (perhaps around 40, according to Figure 4). This horizon length is still short compared to the entire horizon length of many MDPs (e.g., 1000). I think this point should be discussed in the paper. That is, the issue still exists, and Dreamer is less effective with very long horizon. + +- Inapplicability to discrete controls: +One restriction of re-parameterized gradients is that the technique is not applicable to discrete random variables. 
This restriction exists in Dreamer, and the method cannot be applied to discrete control tasks unless approximation techniques such as Gumbel-softmax are used. Still, such approximations would make learning more challenging, especially with long-horizon backpropagation. This restriction should be noted in the paper. + +- There is no mention about variance of policy gradient estimates. Dreamer does not use any variance reduction technique, so the gradient estimates could have very large variance. + +- q_theta was introduced in Eq. (8) before it is defined in Eq. (11). Also, I suggest moving Section 4 to be right after Section 2, since Section 4 presents existing techniques similarly to Section 2, while Section 3 presents the main contribution. + + +Update after authors' response. +I read the response. The paper is more clear after authors' clarification. Though, I still think the contribution is incremental, since back-propagating gradients through values and dynamics has been studied in prior works (albeit with less empirical successes compared to Dreamer). Nonetheless, I am keen to acceptance. I would increase the rating from 6 to 7, but I will keep the rating of 6 since the rating of 7 is not possible.",6,,ICLR2020 +H1gBBwcGKr,1,HJghoa4YDB,HJghoa4YDB,Official Blind Review #1,"This paper discussed the nonlinear value function approximation for temporal different learning on on-policy policy evaluation tasks in the lazy training regime. The authors proved that for the both over-parameterized case (state number is fixed finite number) and the under-parameterized case (state number is sufficiently large or infinite), the value function can converge to some stationary point with exponential convergence rate. Moreover, the authors also characterized the error when the value function is under-parameterized. + +Overall, this is a good paper. But the paper organization is awful. There are many places that are ambiguous or with notations that not formally defined. It may not due to the page limit as the authors currently use only 8 pages. I think the authors should polish the whole paper and make it more readable. Below are some main clarity issues I find, but the authors should not only solve the issues I mentioned. + +1. For better presentation, I suggest the authors include a notation paragraph in the main text, which will be very helpful for the readers. +2. I think it would be better to mention (7) returns w that V_w = V^* / \alpha in Sec. 2 for better reading. +3. In Theorem 3.5, it is unclear that the estimation \tilde{V}^* is from \alpha V_{w}. +4. The WP1 in the paragraph after Theorem 3.5 means with probability 1? +5. In Equation (11), what is the definition of the measure \pi? If I understand correctly it is still \mu as the invariant measure should be fixed for a given policy? +6. The last paragraph in the proof of Theorem 3.5 is hard to follow. It can be better to introduce the result from the textbook and list the condition that need to verify, then give the lemmas show that the conditions can be verified. +What is the functional X in (12)? Should mention it in the main text, not in the appendix. +7. Figure 2 is somehow hard to understand. Maybe better show how the projection of TD error vector field outwards along the spiral in (a) and inwards in (b) in the figure. + +From my point of view, the proof is almost correct and most of the lemmas are standard. 
This result is meaningful as it shows how and when nonlinear function approximation will converge in temporal-difference learning (in the context of lazy training, and I think it is also correct in the context of NTK). The perspective of viewing TD learning as a linear dynamical system in function space can inspire several new lines of research in this field. My main concern is about the paper organization. I feel it can be hard for potential readers to go through the whole paper. If the authors improve the quality of the writing during the rebuttal period, I will consider improving my score.",6,,ICLR2020
+",7,4.0,ICLR2017 +Syl5iFwiKB,2,rkeIq2VYPr,rkeIq2VYPr,Official Blind Review #3,"Summary: the authors introduce a method to learn a deep-learning model whose loss function is augmented with a DPP-like regularization term to enforce diversity within the feature embeddings. + +Decision: I recommend that this paper be rejected. At a high level, this paper is experimentally focused, but I am not convinced that the experiments are sufficient for acceptance. + +**************************** +My main concerns are as follows: + +- Many mathematical claims should be more carefully stated. For example, the authors extend the discrete DPP formulation to continuous space. It is not clear to me, based on the choice of the kernel function embedding, that the resulting P_k(X) is a probability (Eq. 1). If it is not (using a DPP-based formulation as a regularizer does not require a distribution), the authors should clarify that fact; more generally, the authors should be more careful throughout the paper (for example, det=0 if features are proportional, not necessarily equal; the authors inconsistently switch between DPP kernel L and marginal kernel K throughout computations.) + +- The authors do not describe their baselines for several experiments. In tables 1, 2, 3, the baseline is never described (I assume it's the same setup without regularization); I did not find a description of DCH (Tab 4) in the paper (Deep Cauchy Hashing?). The mAP-k metric should also be defined. Furthermore, the authors do not report standard deviations for their experiments. + +- A key consideration when using DPPs is their compulational cost: most operations involving them require SVD (which seems to be used in this work), matrix inversion, and often both. This, unsurprisingly, limits the applications of DPPs, and has driven a lot of research focused on improving DPP overhead. I would like to see more discussion in this paper focused on to which extent the DPP's computational overhead remains tractable, and which methods were used (if any) to alleviate the computational burden. + +- Finally, the paper itself appears to be somewhat incomplete: sentences are missing or incomplete (Section 4), and numbers are missing in some tables (Table 5). + + +*********************** +Questions and comments for the authors: + +- When computing the proper sub-gradient, are you computing the subgradient as inv(L + \hat L)? + +- You state that by avoiding matrix inversion, your method is more feasible. However, it seems like your method requires SVD, which is also O(N^3); could you please provide more detail for this? + +- Could you report number of trials and standard deviations for your experiments? + +- Do you have any insight into why DPPs do more poorly than the DCH baseline in Table 4 for mAP-k metrics? + +- You might be able to save space by decreasing the space allocated to the exponentiated quadratic kernel. +",3,,ICLR2020 +HJx7Jukc9H,4,HklFUlBKPB,HklFUlBKPB,Official Blind Review #4,"In this paper, the authors showed that in many cases it is possible to reconstruct the architecture, weights, and biases of a deep ReLU network given the ability to query the network. The studied problem is very interesting. I have the following questions about this paper: + +1. Can the authors provide detailed explanation of Figure 1? For instance start from input (x_1, x_2), and the weight in layer 1 and layer 2, what is the exact form of the function plotted in the middle panel? Also, how the input space is partitioned? 
I appreciate the authors provide this simple example, but detailed math will help readers to understand this easily. + +2. How about the efficiency of the proposed method? Is it NP-hard? I would like to see some analysis of the computational complexity and also some related experimental results. + +3. If the ReLU network can be reconstructed, can the input also be reconstructed based on the output? It would be very interesting to show a few example on reconstructing the input. Also, is that possible to even reconstruct the training data based on the released model?",6,,ICLR2020 +134RZfNhEQQ,2,jpDaS6jQvcr,jpDaS6jQvcr,Paper has poor presentation of results and is a clear reject.,"The proposed approach differs from autoencoder based anomaly detection approach in the following ways +(a) Autoencoders are trained using only selected data points with small reconstruction errors. These are selected using a sampling scheme with theoretical guarantees on convergence. The selected points are then shuffled between two autoencoders. + +(b) During the testing phase, each autoencoder applies dropout to generate multiple predictions. The averaged ensemble output is used as the final anomaly score. + +Some of the issues with this paper + +(a) One key issue is why just two autoencoders (which the authors delegate for future work). However, it is key to understanding utility of such an ensemble based shuffling framework. + +(b) Poor presentation of results + 1) Figure 2 legend issue + 2) Figure 3 (a) is better presented as a table (Table 3 in Appendix should be here instead). Very hard to interpret it in the current form. Similar comments for Figure 3(b) and Table 1. Also for anomaly detection benchmarking AUC is not sufficient and the authors have to present AUPR or F-1 scores also. I suggest looking at these recent papers for presentation of experimental results + +http://proceedings.mlr.press/v108/kim20c/kim20c.pdf + + +https://proceedings.icml.cc/paper/2020/file/0d59701b3474225fca5563e015965886-Paper.pdf (Goyal et al. ICML 2020) + +(c) Theorem 3 might have a connection with the notion of r-robustness presented in https://arxiv.org/abs/2007.07365 + so authors would want to make it clear how they differ.",3,4.0,ICLR2021 +GTehZmLCIc1,2,iQQK02mxVIT,iQQK02mxVIT,Why resampling outperforms reweighting for correcting sampling bias," +Summary: +When training data comes from a sampling distribution that is different from the target test distribution, there are two commonly-used machine learning techniques to correct the distribution difference --- re-sampling and re-weighting. This paper investigates why re-sampling outperforms re-weighting when using stochastic gradient descent for optimization. The paper provides two explanations. +(1) Resampling with SGD is more stable. By stability, the authors mean that the expected L2 distance between the parameter during optimization and the true parameter is small. +(2) Reweighting with SGD can converge to a worse local minimum with larger probability. + + +They also conduct experiments to show that re-sampling outperforms re-weighting. + +Strengths: + +1. The problem to compare re-sampling and re-weighting during optimization with stochastic gradient descent is very interesting and important. Machine learning from biased data (e.g. selection bias) is very prevalent and stochastic gradient descent is a popular optimizer. + +2. The paper is clearly written and easy to follow. + +3. 
The stability analysis and SDE analysis provide two alternatives to explain the difference between re-sampling and re-weighting. + +Weakness + +1 The theoretical analysis does not provide strong evidence that re-sampling is better than re-weighting. + +1.1 It would be great if the authors could explain more why Lemma 1 and Lemma 2 show that resampling is better. To me, it just shows that, if we want to achieve stability, we have to use smaller learning rate for reweighting. It makes no sense to me to compare resampling and reweighting with a fixed learning rate. + +1.2 SDE analysis shows that re-weighting sometimes tends to converge to a worse local minimum. But it did not sate re-sampling always tends to converge to a good local minimum. It would be great if authors explain when re-weighting and re-sampling tend to converge to good and bad local minimum. + + + + +2. The experiments details are not clear and I did not see how these experiments reflect the theoretical analysis. + + +2.1 It would be great to mention whether the results are performed on the training set. I assume the results presented are for the training set since this paper investigates how the optimization with SGD works instead of focusing on generalization. + +2.2 It would be great to report the final training objective. Since the main theoretical result of the paper is that optimization with SGD using re-sampling tends to be more stable around a better local minimum. + +2.3 In the classification experiments, the ROC-AUC of reweighting method is only around 0.53, that is just 0.03 better than random guess. A large neural network typically can overfit a classification dataset with just two classes. I am wondering why the ROC-AUC is so low. It would be great if the authors could provide the experiment details, e.g. the learning rate. + +2.4 In the nonlinear regression experiment, logistic regression is used. Logistic regression objective is a convex objective. There is no local minimum. So the experiments do not validate the SDE analysis. The authors claimed that they do this experiment to show that ``performance of SGD deteriorates when reweighting is used''. I did not see any previous analysis indicating this. It would be great if the authors could explain more about why they conduct this experiment and how this relates to their theoretical analysis. + + + + + + +Additional feedback + +In the abstract ''reweighting outperforms resampling'' -> ''resampling outperforms reweighting'' + +Why in equation (3), there is an approximate equal? It seems to me that it is just equal. Same for equation (4). + +learning rate \eta first appears in Lemma 1 without explanation. Also, Lemma 1 uses C= 1 which is not explained in the main paper. + + +In SDE analysis: The statement ""all iterates will stay in this region'' confuses me. If the learning rate is large enough, I think it can get out of the the region. Say when \hat{\theta}_t is at -2 and we encounter an example from V_1, then the gradient is positive. If the learning rate is large enough, it can get to (0,\infty). + + + + +",5,4.0,ICLR2021 +km7j-UZkjHj,2,85d8bg9RvDT,85d8bg9RvDT,"Interesting algorithm, but sub-par experimentation protocol","########################################################################## +Summary: + +This paper presents an end-to-end deep retrieval method for recommendation. The model encodes all candidates into a discrete latent space, and learns the latent space parameters alongside the other neural network parameters. 
Recommendation is performed through beam search. The paper compares the method on two public dataset against several methods (DNN, CF, TDM) and concludes that it can achieve the same result as a brute-force solution in sub-linear time. + +########################################################################## +Reasons for score:  + +The paper presents an interesting end-to-end deep retrieval approach.  However, the paper suffers from several key limitations: + +First, it makes a very strong assumption (that vector-based approaches have fundamental limitations).  Because this assertion is very strong, it should be backed by a more thorough analysis than what is done on the paper (more details below). + +Second, it fails to take into account several key state-of-the-art methods (such as VAE).  The method proposed in the paper might perform significantly worse than this SOTA based on their reported results. + +Finally, it brings confusion between two problems: the one of choosing an algorithm (vector-based versus deep end-to-end) and the one of choosing a brute-force vs approximate nearest-neighbor.  Yet it is well-known that approximate nearest neighbor search is often almost as good as brute-force nearest neighbor, as can be seen here: http://ann-benchmarks.com/ + +The method presented in the paper is interesting, though.  I believe this work could be published, but with significantly more research work. + +########################################################################## +Pros: + +- Novelty: this paper present a method that, as far as I know, is novel. + +- Comparison to tree-based methods: the paper presents an interesting comparison with tree-based approaches  + +########################################################################## +Cons:  + +- Lack of validation: first and foremost, the work presented in this paper lacks validation experiments.  For a paper presenting a new algorithm and making a strong claim regarding vector-based methods, we would expect at least 4-5 datasets as is commonly done in the literature, e.g. with MSD, Netflix, Medium, Amazon, Yahoo datasets.  We would also expect more metrics, and in particular, the right recall values as is commonly done in the field.  In particular, other papers use recall@20 and recall@50 (as can be seen in paperswithcode.com) instead of recall@10. + +- Missing state-of-the-art: the paper misses on significant portions of state of the art regarding the evaluation.  Several key methods should be included in the evaluation, such as VAEs [1], EASE [2], RACT [3], SLIM [4] and CML [5].  While these methods are not end-to-end, the paper should compare its performance against these methods to conclude whether end-to-end deep retrieval yields better (or even similar) performance compared to them.  It turns out that some of these methods perform well on recall@20 and maybe better than the method presented in this paper.  Note that some of these methods are also sub-linear in the number of items, such as CML. + +- The paper is also missing a reference on solving the vector-based limitations, with Off-Policy Learning in Two-Stage Recommender Systems [6]. + +- The paper claims to address ""large-scale"" recommender systems (at several places in the paper) but does not address this aspect.  There exist a significant body of literature on the topic of recommender systems operating at the scale of billions of users and items now, e.g. [7], [8].  Working at the scale of MovieLens and AmazonBooks is not large-scale.  
In addition, a complexity analysis of the method would be very welcome. + +- Lack of clarity: The clarity of the paper could be greatly improved by putting the description of the algorithm in a single place.  At the moment, it is spread between Section 1 and Section 2.1.  In particular, Section 1 introduces D and K but does not explain what they are. Some aspects of the algorithms are described in a single line (a GRU is used to project the behavior sequence, but nothing is explained about it).   + +- Lack of code: the paper does not provide the code, which does not help for reproducibility and sharing with the community.  Providing code is paramount when proposing a new algorithm. + +[1] Daeryong Kim and Bongwon Suh. 2019. Enhancing VAEs for Collaborative Filtering: Flexible Priors Gating Mechanisms. In Proceedings ofthe 13th ACM Conference on Recommender Systems (RecSys ’19). Association for Computing Machinery, New York, NY, USA, 403–407. https://doi.org/10.1145/3298689.3347015 + +[2] Harald Steck. 2019. Embarrassingly Shallow Autoencoders for Sparse Data. In The World Wide Web Conference (WWW ’19). Association forComputing Machinery, New York, NY, USA, 3251–3257. https://doi.org/10.1145/3308558.3313710 + +[3] Sam Lobel, Chunyuan Li, Jianfeng Gao, and Lawrence Carin. 2020. RaCT: Toward Amortized Ranking-Critical Training For Collaborative Filtering. InEighth International Conference on Learning Representations (ICLR). https://www.microsoft.com/en-us/research/publication/ract-toward-amortizedranking-critical-training-for-collaborative-filtering/ + +[4] Xia Ning and George Karypis. 2011. SLIM: Sparse Linear Methods for Top-N Recommender Systems. In Proceedings of the 2011 IEEE 11th InternationalConference on Data Mining (ICDM ’11). IEEE Computer Society, USA, 497–506. https://doi.org/10.1109/ICDM.2011.134 + +[5] Cheng-Kang Hsieh, Longqi Yang, Yin Cui, Tsung-Yi Lin, Serge Belongie, and Deborah Estrin. 2017. Collaborative Metric Learning. In Proceedings ofthe 26th International Conference on World Wide Web (WWW ’17). International World Wide Web Conferences Steering Committee, Republic andCanton of Geneva, CHE, 193–201. https://doi.org/10.1145/3038912.3052639 + +[6] Jiaqi Ma, Zhe Zhao, Xinyang Yi, Ji Yang, Minmin Chen, Jiaxi Tang, Lichan Hong, and Ed H. Chi. 2020. Off-Policy Learning in Two-Stage Recommender Systems. In Proceedings of The Web Conference 2020 (WWW ’20). Association for Computing Machinery, New York, NY, USA, 463–473. https://doi.org/10.1145/3366423. 3380130 + +[7] Chantat Eksombatchai, Pranav Jindal, Jerry Zitao Liu, Yuchen Liu, Rahul Sharma, Charles Sugnet, Mark Ulrich, and Jure Leskovec. 2018. Pixie: A System for Recommending 3+ Billion Items to 200+Million Users in Real-Time. In Proceedings of the 2018 World Wide Web Conference (WWW ’18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 1775–1784. https://doi.org/10.1145/3178876.3186183 + +[8] JizheWang, Pipei Huang, Huan Zhao, Zhibo Zhang, Binqiang Zhao, and Dik Lun Lee. 2018. Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’18). ACM, New York, NY, USA, 839–848. https://doi.org/10.1145/3219819.3219869  + +######################################################################### +Some typos:  + +""grow"" on page 1 + +""from the all successors"" on page 4 + +upper-case are missing in the references (e.g. 
ALSH) +",3,5.0,ICLR2021 +caC1m8LsR8P,1,zFM0Uo_GnYE,zFM0Uo_GnYE,The paper studies the importance of utilising manifold/topology information in prediction tasks. The method is not novel enough and the comparison seems problematic.,"The paper focuses on studying the importance of utilising manifold/topology information for machine learning tasks. To this end, the authors benchmark four different approaches, including VAE, GR-VAE (using graph distances to regularise embedding distances (as shown in Eq. 1)). The paper performs experiments on four tasks, including synthetic data, MNIST, text representation, and chemical reactions. As conclusion, the paper demonstrates that in some cases, adding relational information is beneficial, while in other cases, the effect is subtle. Thus, the paper aims to provide a metric for understanding when and how manifold/topology information is needed. + +Pros: + +1. Instead explicitly learning a graph, this paper proposes an implicit method with graph regularisation. + +2. The related work is well-explained. The paper did a well summarization of previous methods. + +3. The paper performs extensive experiments to study the importance of manifold for prediction tasks. + +Cons: + +1. The latent graph method is not novel enough, since the method can be categorised as graph regularisation, which is a widely used method in recommendation and information retrieval. Could the author explain why this regularisation is picked from a spectrum of graph regularisation algorithms? + +2. The comparison of the paper is problematic. First, the methods (DGI, node2vec, GR-VAE, VAE) compared are quite different methods. Can the authors confirm that the comparison is fair and meaningful (e.g. eliminating other confounding factors like controlling the number of parameters)? Second, I am not sure whether this comparison is optimal. In particular, to study the importance of relational information, other method can be used. For example, we can control adjacency matrix received by graph neural networks. We can totally ignore the edge information (like VAE in the paper) or use a predefined graph (e.g. a fully connected graph like in Transformer). In between, we can corrupt input graphs (e.g. randomly adding or deleting some edges) before feeding it to graph neural networks. This approach seems more reasonable to me for studying the importance of manifold. It is difficult to control these in this paper because the methods used in this paper are totally different (e.g. DGI and GR-VAE differs in both loss function and input format). So, the conclusion of the paper is skeptical. It mainly justifies which method can perform better in downstream tasks instead of justifying the importance of manifold. + +3. The introduction is lengthy and should be more focused on the contribution of this paper. Similarly, the other sections need a major revision to highlight the contribution, as the main contribution of the paper lies in the implicit graph regularisation and a comparison of a series of methods with/without relational information. + +4. Some baseline methods are not considered, for example the methods learning latent graphs: Semi-supervised classification with graph convolutional networks and Glomo: Unsupervised learning of transferable relational graphs. + +5. The acknowledgement of the paper reveals location information, which may be a violation of anonymity. + +Based on these cons, I think a more rigorous comparison is needed. 
",4,4.0,ICLR2021 +r1x6sUQ6nm,2,ByxkijC5FQ,ByxkijC5FQ,"An interesting idea, but insufficient.","This paper proposes to analyze the complexity of a neural network using its zero-th persistent homology. Each layer is considered a bipartite graph with edge weights. As edges are being added in a monotonically decreasing order, each time a connected component is merged with others will be recorded as a new topological feature. The persistence of each topological feature is measured as the weight difference between the new edge and the maximal weight (properly normalized). Experiments show that by monitoring the p-norm of these persistence values one can stop the training a few epochs earlier than the validation-error-based early stopping strategy, with only slightly worse test accuracy. + +The proposed idea is interesting and novel. However, it is needs a lot of improvement for the following reasons. + +1) The proposed idea can be explored much deeper. Taking a closer look, these zero-th persistence are really the weights of the maximum spanning tree (with some linear transformation). So the proposed complexity measure is really the p-norm of the MST. This raises other related questions: what if you just take all the weights of all edges? What if you take the optimal matching of the bipartite graph? How about the top K edges? I am worried that the p-norms of these edge sets might have the same effect; they converge as the training converges. These different measurements should be at least experimentally compared in order to show that the proposed idea is crucial. + +Note also that most theoretical proofs are straightforward based on the MST observation. + +2) The experiment is not quite convincing. For example, what if we stop the training as soon as the improvement of validation accuracy slows down (converges with a much looser threshold)? Wouldn’t this have the same effect (stop slightly earlier with only slightly worse testing accuracy)? Also shouldn’t the aforementioned various alternative norms be compared with? + +3) Some other ideas/experiments might worth exploring: taking the persistence over the whole network rather than layer-by-layer, what happens with networks with batch-normalization or dropout? + + + +",4,4.0,ICLR2019 +84bc8kSq6D,4,c5klJN-Bpq1,c5klJN-Bpq1,Review,"Update: I have read the author's response and decided to keep my review, confidence, and score. + +---- + +Summary: this paper generalizes existing decision trees to some neural-style model. The most critical argument is that the model generalizes decision tree while maintaining interpretability. Since this is a new model, interpretability should at least be justified with the visualizations that can be judged as interpretable (let alone an objective measurement or human experiments). However, the demonstration of interpretability is far from satisfaction, so I recommend a clear rejection. See below for further details. + +1. One of the major motivation is using a linear number of parameters w.r.t. tree depth to construct a decision tree-style model. However, this problem has been investigated in literature but not compared theoretically or empirically at all in this paper. See the classic paper [1] and the recent paper [2]. + +2. The current proposition 1 does not seem useful. Any classifier $p(y | x)$ can be written as a decision transformer $F \pi_0$ by letting $\pi_0 = [1]$ and $F = p(y | x)$. The factorization only makes intuitive sense when each $T_i$ is restricted / interpretable. + +3. 
Interpreting a stochastically routing decision tree would inevitably involve inspecting the whole tree, since every prediction involves the whole tree. Hence, visualizing the whole tree is necessary to claim interpretability, especially for a new architecture. The only visualization in Fig. 3 is far from visualizing an interpretable model. + +4. The theoretical statement is not clear and rigorous. E.g., what do the authors mean by ""explainable""? + +5. the proposed architectures seems to highly relevant to hierarchical mixture of experts [3], which can be trained via EM algorithms efficiently. Can the authors show similar things here? + +[1] Langley, Pat, and Stephanie Sage. ""Oblivious decision trees and abstract cases."" Working notes of the AAAI-94 workshop on case-based reasoning. 1994. + +[2] Lee, Guang-He, and Tommi S. Jaakkola. ""Oblique Decision Trees from Derivatives of ReLU Networks."" International Conference on Learning Representations. 2019. + +[3] Jordan, Michael I., and Robert A. Jacobs. ""Hierarchical mixtures of experts and the EM algorithm."" Neural computation 6.2 (1994): 181-214.",3,4.0,ICLR2021 +BylpBu5BcS,3,S1gR2ANFvB,S1gR2ANFvB,Official Blind Review #1,"This paper violates the conference's double-blind reviewing policy by explicitly identifying their research institute and team name. The violations occur in the 4th line of the abstract, the last paragraph of Section 1 (Introduction), and the first paragraph of Section 2 (System Description). For this reason, I am not providing a review of this paper. + +.........................................................................................................................................................................................................",1,,ICLR2020 +S1DUyihgz,3,HkuGJ3kCb,HkuGJ3kCb,Simple post-processing technique with theoretical motivations,"This paper proposes a simple post-processing technique for word representations designed to improve representational quality and performance on downstream tasks. The procedure involves mean subtraction followed by projecting out the first D principle directions and is motivated by improving isotropy of the partition function. Extensive empirical analysis supports the efficacy of the approach. + +The idea of post-processing word embeddings to improve their performance is not new, but I believe the specific procedure and its connection to the concept of isotropy has not been investigated previously. Relative to other post-processing techniques, this method has a fair amount of theoretical justification, particularly as described in Appendix A. I think the experiments are reasonably comprehensive. All told, I think this is a good paper, but I do have some comments and questions that I think should be addressed before publication. + +1) I think it is useful to analyze the distribution of singular values of the matrix of word vectors. However, I did not find the heuristic analysis based on the visual appearance of these distributions to be convincing. For example, in Fig. 1, it is not clear to me that there exists a separation between regimes of exponential decay and rough constancy. It would be ideal if a more quantitative metric is established that captures the main qualitative behavior alluded to here. + +Furthermore, the vocabulary size is likely to have a strong effect on the shape of the distributions. Are the plots in Fig. 4 for the same vocabulary size? 
Related to this, the dimensionality of the representation will have a strong effect on the shape, and this should be controlled for in Fig. 8. One way to do this would be to instead plot the density of singular values. Finally, for the Gaussian matrix simulations, in the asymptotic limit, the density of singular values depends only on the ratio of dimensions, i.e. the vector dimension to the vocabulary size. Fig. 4/8 might be more revealing if this ratio were controlled for. + +2) It would be useful to describe why isotropy of the partition function is the goal, as opposed to isotropy of the vectors themselves. This may be argued in Arora et al. (2016), but summarizing that argument in this paper would be helpful. In fact, an additional experiment that would be very valuable would be to investigate empirically which form of isotropy is more effective in governing performance. One way to do this would be to enforce approximate isotropy of the partition function without also enforcing isotropy of the vectors themselves. Practically speaking, one might imagine doing this by requiring I = 1 to second order without also requiring that the mean vanish. I think this would allow for \sigma_max > \sigma_min while still satisfying I = 1 to second order. (But this is just off the top of my head -- there may be better ways to conduct this experiment). + +It is not clear to me why the experiment leading to Table 2 is a good proxy for the exact computation of I. It would be great if there were some mathematical justification for this approximation. + +Why does Fig. 3 use D=10, 20 when much smaller D are considered elsewhere? Also I think a log scale on the x-axis might be more informative. + +3) It would be good to mention other forms of post-processing, especially in the context of word similarity. For example, in the original paper, GloVe advocates averaging the target and context vector representations, and normalizing across the feature dimension before computing cosine similarity. + +4) I think it's likely that there is a strong connection between the optimal value of D and the frequency distribution of words in the evaluation dataset. While the paper does mention that D may depend on specifics of the dataset, etc., I would expect frequency-dependence to be the main factor, and it might be worth exploring this effect explicitly. +",7,4.0,ICLR2018 +BJiNow9gG,2,Sk2u1g-0-,Sk2u1g-0-,Review for Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments,"This paper proposed a gradient-based meta-learning approach for continuous adaptation in nonstationary and adversarial environment. The idea is to treat a nonstationary task as a sequence of stationary tasks and train agents to exploit the dependencies between consecutive tasks such that they can deal with nonstationarities at test time. The proposed method was evaluated based on a nonstationary locomotion and within a competitive multi agent setting. For the later, this paper specifically designed the RomoSumo environment and defined iterated adaptation games to test various aspect of adaptation strategies. The empirical results in both cases demonstrate the efficacy of the proposed meta-learned adaptation rules over the baselines in the few-short regime. The superiority of meta-learners is further justified on a population level. + +The paper addressed a very important problem for general AI and it is well-written. Careful experiment designs, and thorough comparisons make the results conniving. I + +Further comments: + +1. 
In the experiment the trajectory number seems very small, I wonder if directly using importance weight as shown in (9) will cause high variance in the performance? + +2. One of the assumption in this work is that trajectories from T_i contain some information about T_{i+1}, I wonder what will happen if the mutually information is very small between them (The extreme case is that two tasks are independent), will current method still perform well? + +P7, For the RL^2 policy, the authors mentioned that “…with a given environment (or an opponent), reset the state once the latter changes” How does the agent know when an environment (or opponent) changes? + +P10, “This suggests that it meta-learned a particular…” This sentence need to be rewritten. + +P10, ELO is undefined +",7,4.0,ICLR2018 +aBNDGVAVq0o,2,Iz3zU3M316D,Iz3zU3M316D,Review for AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights,"################################################################### + +Summary: + +This paper shows that momentum-based gradient descent optimizers reduce the effective step size in training scale-invariant models including deep neural networks normalized by batch normalization, layer normaliztion, instance normalization and group normalization. The authors then propose a solution that projects the update at each step in gradient descent onto the tangent space of the model parameters. Theoretical results are provided to show that this projection operator only adjusts the effective learning rate but does not change the effective update directions. Empirical results on various tasks are provided to justify the advantage of the proposed method over the baseline momentum-based (stochastic) gradient descent and Adam. + +################################################################### + +Reason for the Score: + +Overall, this paper could be an interesting algorithmic contribution. However, there are relevant points needed to be clarified on the theory and experiments. My first main concern is that theoretically it is hard to justify that the proposed projection-based update yields smaller parameter norms than the baseline momentum-based update. My second main concern is some baseline results in the experiments do not match those in existing literature, and no error bars are provided in the empirical results even though the improvements of the proposed methods over the baseline methods are small. + +Currently, I am leaning toward rejecting the paper. However, given additional clarifications on the two main concerns above in an author response, I would be willing to increase the score. + +################################################################### + +Strong points: + +1. The paper points out a relevant issue in using normalization techniques such as batch normalization together with momentum-based optimization algorithms in training deep neural networks. + +2. The paper provides experimental results on various tasks and datasets to demonstrate the advantage of the proposed method. + +3. The paper is well-written with illustrative figures. + +################################################################### + +Weak points: + +1. It is not clear to me that the proposed update in equation (12) yields smaller norms ||w_{t+1}|| than the momentum-based update in equation (8). The parameters of the model evolve differently under these two update rules. Throughout the training, the update p_t in equation (11) is different from the update p_t in equation (8). 
As a result, it is hard to compare ||q_t|| in equation (12) and ||p_t|| plus all the terms ||p_k|| in equation (8). + +2. The improvements of the proposed SGDP and AdamP over the baseline SGD and Adam are small across experiments, and thus error bars are needed to validate that these improvements are not due to randomness. However, no error bars are provided for the empirical results in the paper. + +3. The reported baseline results for audio classification are worse than those reported in (Won et al., 2019). + +4. The baseline results for adversarial robustness seems to be much higher than reported results in (Madry et al., 2018). Also why are the values of epsilon used in the paper quite small (80/255 and 4/255 vs. 8 in (Madry et al., 2018)). + +################################################################### + +Additional Concerns and Questions for the Authors: + +1. Adam normalizes the gradient by its cumulative norm. This can help eliminate the small step size issue since the norms of the gradients become smaller during training. Can you provide a similar simulation as in Figure 3 but using Adam, AdamW, and AdamP? + +2. What are the baseline results, reported in existing literature, on ImageNet for ResNet18 and ResNet50 trained with the cosine learning rate schedule in 100 epochs? Can you please link me to the previous papers that report those results? The paper you cite, (Loshchilov & Hutter, 2016), does not report those results. + +3. In section 4.1, the authors say “For ResNet, we employ the training hyperparameters in (He, 2016)”. However, the training hyperparameters for ResNet used in the paper are not from (He, 2016). In (He, 2016), the models are trained for only 90 epochs without using cosine learning rate. + +4. The proposed update is more expensive than the baseline momentum-update. The paper also reports that the proposed update incurs 8% extra training time on top of the baselines for ResNet18 on ImageNet classification while resulting in only small improvements over the baselines. It is needed to compare the proposed optimizer with the baseline momentum-based optimizer trained with more epochs and potentially with an additional learning rate decay. + +################################################################### + +Minor Comments that did not Impact the Score: + +1. The paper proposes not only AdamP, but also SGDP. It is better if the authors remove AdamP in the title. + +2. In figure 4, is the y-axis the test or train accuracy? + +################################################################### + +References: + +Minz Won, Sanghyuk Chun, and Xavier Serra. Toward interpretable music tagging with self- attention. arXiv preprint arXiv:1906.04972, 2019. + +Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR), 2018. URL https://openreview.net/forum?id= rJzIBfZAb. + +Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016. + +Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 
+ + +################################################################### + +Post Discussion Score: + +After reading the rebuttal from the author and the comments from other reviewers, I am still not clear if the proposed update in equation (12) yields smaller norms ||w_{t+1}|| than the momentum-based update in equation (8). However, the authors have addressed all of my other concerns. I decided to increase my score for this paper from 4 to 5. + +",5,4.0,ICLR2021 +SyxDYjcq2m,3,S1EERs09YQ,S1EERs09YQ,Solid paper with interesting insights - left with some questions,"This paper describes a method for identifying linguistic components (""concepts"") to which individual units of convolutional networks are sensitive, by selecting the sentences that most activate the given unit and then quantifying the activation of those units in response to subparts of those sentences that have been isolated and repeated. The paper reports analyses of the sensitivities of different units as well as the evolution of sensitivity across network layers, finding interesting patterns of sensitivity to specific words as well as higher-level categories. + +I think this paper provides some useful insights into the specialization of hidden layer units in these networks. There are some places where I think the analysis could go deeper / some questions that I'm left with (see comments below), but on the whole I think that the paper sheds useful light on the finer-grained picture of what these models learn internally. I like the fact that the analysis is able to identify a lack of substantial change between middle and deeper layers of the translation model, which inspires a prediction - subsequently borne out - that decreasing the number of layers will not substantially reduce task performance. + +The paper is overall written pretty clearly (though some of the questions below could likely be attributed to sub-optimal clarity), and to my knowledge the analyses and insights that it contributes are original. Overall, I think this is a solid paper with some interesting contributions to neural network interpretability. + +Comments/questions: + +-I'm wondering about the importance of repeating the “concepts” to reach the average sentence length. Do the units not respond adequately with just one instance of the concept (eg ""the ball"" rather than ""the ball the ball the ball"")? What is the contribution of repetition alone? + +-Did you experiment with any other values for M (number of aligned candidate concepts per unit)? It seems that this is a non-trivial modeling decision, as it has bearing on the interesting question of how broadly selective a unit is. + +-You give examples of units that have interpretable sensitivity patterns - can you give a sense of what proportion of units do *not* respond in an interpretable way, based on your analysis? + +-What exactly is plotted on the y-axis of Figure 5? Is it number of units, or number of concepts? How does it pool over different instances of a category (different morphemes, different words, etc)? What is the relationship between that measure and the number of distinct words/morphemes etc that produce sensitivity? + +-I'm interested in the units that cluster members of certain syntactic and semantic categories, and it would be nice to be able to get a broader sense of the scope of these sensitivities. What examples of these categories are captured? Is it clear why certain categories are selected over others? Are they obviously the most optimal categories for task performance? 
+ +-p7 typo: ""morhpeme""",6,4.0,ICLR2019 +SyrOMN9eM,3,HJGXzmspb,HJGXzmspb,a thorough and flexible approach towards discretizing neural networks,"The authors propose WAGE, which discretized weights, activations, gradients, and errors at both training and testing time. By quantization and shifting, SGD training without momentum, and removing the softmax at output layer as well, the model managed to remove all cumbersome computations from every aspect of the model, thus eliminating the need for a floating point unit completely. Moreover, by keeping up to 8-bit accuracy, the model performs even better than previously proposed models. I am eager to see a hardware realization for this method because of its promising results. + +The model makes a unified discretization scheme for 4 different kinds of components, and the accuracy for each of the kind becomes independently adjustable. This makes the method quite flexible and has the potential to extend to more complicated networks, such as attention or memory. + +One caveat is that there seem to be some conflictions in the results shown in Table 1, especially ImageNet. Given the number of bits each of the WAGE components asked for, a 28.5% top 5 error rate seems even lower than XNOR. I suspect it is due to the fact that gradients and errors need higher accuracy for real-valued input, but if that is the case, accuracies on SVHN and CIFAR-10 should also reflect that. Or, maybe it is due to hyperparameter setting or insufficient training time? + +Also, dropout seems not conflicting with the discretization. If there are no other reasons, it would make sense to preserve the dropout in the network as well. + +In general, the paper was written in good quality and in detail, I would recommend a clear accept. +",8,4.0,ICLR2018 +rygAU3qTtB,2,SJgwNerKvB,SJgwNerKvB,Official Blind Review #3,"Review of “Continual learning with hypernetworks” + +This paper investigates the use of conditioning “Hypernetworks” (networks where weights are compressed via an embedding) for continual learning. They use “chunked” version of the hypernetwork (used in Ha2017, Pawlowski2017) to learn task-specific embeddings to generate (or map) tasks to weights-space of the target network. + +There is a list of noteworthy contributions of this work: + +1) They demonstrate that their approach achieves SOTA on various (well-chosen) standard CL benchmarks (notably P-MNIST for CL, Split MNIST) and also does reasonably well on Split CIFAR-10/100 benchmark. The authors have also spent some effort to replicate previous work so that their results can be compared (and more importantly analyzed) fairly to the literature, and I want to see more of this in current ML papers. (one note is that the results for CIFAR-10/100 is in the Appendix, but I think if the paper gets accepted, let's bring it back to the main text and use 9 pages, since the results for CIFAR10/100 are substantial). + +2) In addition to demonstrating good results on standard CL benchmarks, they also conduct analysis of task-conditioned hypernetworks with experiments involving long task sequences to show that they have very large capacities to retain large memories. They provide a treatment (also with visualization) into the structure of low-dim task embeddings to show potential for transfer learning. + +3) The authors will release code to reproduce all experiments, which I think is important to push the field forward. Future work can not only reproduce this work, but also the cited works. 
+ +The work seems to be well written, and the motivation of using hypernetworks as a natural solution to avoid catastrophic loss is clearly described. Overall, I think this work is worthy of acceptance and should encourage more investigation into hypernetworks for CL and transfer learning going forward in the community. +",8,,ICLR2020 +v6P3aCQe0Iz,3,Tq_H_EDK-wa,Tq_H_EDK-wa,Predicting infection on structured data with missing testing ,"This paper formulates the contagious disease into a missing label problem with dependence between each data point. The paper targets an important problem, especially in this pandemic, and the effort is greatly appreciated. However, the writing of this paper is confusing and it makes it hard to catch the main contribution of this paper. There are some concerns: + +1. The formulated problem sounds like a node label missing problem in a graph, where the node is patient (x, y) and the edge is whether they are contacted (e). In so, the paper is actually predicting the label of each node. If my understanding is correct, I am not sure why the authors choose the current formulation rather than graph one. + +2. The notation and wording are sometimes confusing, especially when I only have limited knowledge in healthcare. For example, even after reading section 4.1, it is still confusing what kind of special data structure the author is indicating. I recommend the author give a more intuitive or even graphic explanation in the paper. + +3. All the experiments are self-compared and make the result less convincing. For example, it is not clear whether different NN network would result in different performance because only NN is fixed here. + + +",5,2.0,ICLR2021 +Hygnc4E0dH,1,HJgKYlSKvr,HJgKYlSKvr,Official Blind Review #1,"The paper tries to solve the problem of recovering the 3D structure from 2D images. To this end, it describes a GAN-type model for generating realistic images, where the generator disentangles shape, texture, and background. Most notably, the shape is represented in three dimensions as a mesh made of triangles. The final image is rendered by a fixed differentiable renderer from a randomly-sampled viewpoint. This allows the model to learn to generate realistic 3D shapes, even though it is trained only using 2D images. + +The authors introduce a novel renderer based on the Lambertian image model, which is differentiable not only with respect to the texture but also with respect to position on mesh vertices, which allows better shape learning compared to prior art. Authors also identify some learning ambiguities: objects can be represented by the background layers, and sometimes object surface contains high-frequency errors. These are addressed by generating a bigger background and randomly selecting its part as the actual background, and by averaging shapes generated at different scales to smoothen the surface of generated objects, respectively. Authors do mention pitfalls of the model in the conclusion: fixed-topology mesh, the background is not modelled as a mesh, the model works only with images containing a single centred object, the image model is Lambertian. + +I think that the approach is extremely interesting, addresses an important problem, and shows promising results. However, I vote to REJECT this paper, because the evaluation is insufficient, and the paper lacks clarity. + +The approach is evaluated only on a single dataset and is not compared to any baselines. 
While results from ablations of the model are provided, they are only qualitative, consisting of a single example per ablation, and are hard to read and interpret---in particular, the provided description of the ablations and corresponding results is unclear. There are no quantitative results in the paper, and it is difficult for me to judge how good the method is given only qualitative examples from a single dataset. + +As for clarity, I think that the distinction between supervised, unsupervised and weakly supervised learning in section 2 in unnecessary, does not add value to the paper, and can confuse the reader. Section 4 contains some unnecessary assumptions and incorrect claims. For example, the renderer R doesn't need to be able to generate perfect images for the approach to work; I think Theorem 1 is also incorrect since it does not take e.g. mode collapse into account, which prevents the learned distribution from being the same as the data distribution. Section 5 is very unclear, with practically no explanation for equations (6-11), which makes them very difficult to decipher. + +The related works section is quite thorough, but the authors missed two extremely relevant papers: [1] and [2], which do a very similar thing and contain some of the ideas used in this paper. + +I think the paper would be very valuable if the differentiable renderer was clearly explained and more evaluation and comparisons with baselines were provided. + +[1] Rezende et. al., ""Unsupervised Learning of 3D Structure from Images"", NIPS 2016. +[2] Nugyen-Phuoc et. al., ""HoloGAN: Unsupervised learning of 3D representations from natural images"", ICCV 2019. + + +=== UPDATE === +I appreciate adding more examples for ablations, FID scores and the LSUN dataset experiments. However, I still think that the exposition could be significantly improved, as eg. the description of the differentiable renderer is difficult to follow -- the equations should be better explained. Also, Figure 3. is difficult to understand; and the caption saying that ""one rectangle overlaps another"" is not helpful. + +I think this is really cool work, but due to the lack of clarity, I think it shouldn't be accepted at this conference. Having said that, I am increasing my score to ""weak reject"" because of the improvements. +",3,,ICLR2020 +SylozPPPnQ,2,BJx9f305t7,BJx9f305t7,Original contribution unclear,"The paper W2GAN describes a method for training GAN and computing an optimal transport (OT) map +between distributions. As far as I can tell, it is difficult to identify the original contributions +of the paper. Most results are known from the OT community. The differences with the work of Seguy, 2018 +is also not obvious. I encourage the authors to establish more clearly the differences of their work +with this last reference. Most of the theoretical considerations of Section 3 is either based on +unrealistic assumptions (case 1) or make vague assumptions 'if we ignore the possibly significant effect ...' +that seem unjustified so far. Experimental results do not show evidences of superiority wrt. existing works. +All in all I would recommend the authors to better focus on the original contribution of their works wrt. +state-of-the-art and explain why the theoretical analysis on convergence following a geodesic path in a +Wasserstein space is valuable from a practical view. 
Finally, I did not understand the final claim of the +Abstract : 'Perhaps surprisingly, we also provide empirical evidence that other GANs also approximately following +the Optimal Transport.'. What are those empirical evidences ? It seems that this claim is not supported somewhere +else in the paper. + +Minor remarks: + - regarding the penalization in eq. (5), the expectation is not for all x and y \in R^2, but for x drawn from \mu and y from \nu. + Same for L_2 regularization + - Proposition 1 is mainly due to Brenier +Brenier, Y. (1991). Polar factorization and monotone rearrangement of vector‐valued functions. Communications on pure and applied mathematics, 44(4), 375-417. + - from Eq (7), you should give precisely over what the expectations are taken. + - Eq (10) : how do you inverse sup and inf ? + - when comparing to Seguy 2018, are you using an entropic or a L_2 regularization ? How do you set the regularization strength ? + - where is Figure 2.a described in section 4.2 ? + +Related works : + - what is reference (Alexandre, 2018) ? + - regarding applications of OT to domain adaptation, there are several references on the subject. + See for instance +Courty, N., Flamary, R., Tuia, D., & Rakotomamonjy, A. (2017). Optimal transport for domain adaptation. IEEE transactions on pattern analysis and machine intelligence, 39(9), 1853-1865. +or +Damodaran, B. B., Kellenberger, B., Flamary, R., Tuia, D., & Courty, N. (2018). DeepJDOT: Deep Joint distribution optimal transport for unsupervised domain adaptation. ECCV +for a deep variant. + - Reference Seguy 2017 and 2018 are the same and should be fused. The corresponding paper + was published at ICLR 2018 + Regarding this last reference, the claim 'As far as we know, it is the first demonstration of a GAN achieving reasonable generative modeling results and an approximation of the optimal transport map between two continuous distributions.' should maybe be lowered ? ",3,4.0,ICLR2019 +NUphSR-bqni,1,hPWj1qduVw8,hPWj1qduVw8,"Interesting engineering contribution, but the underlying principle seems not really new and lack of discussion with relevant related works","Summary: + +This paper addresses the visual question answering in a multi-turn or conversational setting. Given a video (series of frames or images), a model has to reason across space and time to arrive at a correct answer for a given question. This task involves understanding the content and context of dialogue turns, i.e., given a question and N dialogue turns, only M< ""syntactical"" dependency parsers (this is the correct term) + - linguistically, the term ""lexical span"" is weird. A span is a series of continuous lexicons (in the text surface). I suggest using a better term, as the ""lexical span"" in this paper might be discontinuous (do I misunderstand?).",7,4.0,ICLR2021 +Ij5nD_wwWPy,2,vK9WrZ0QYQ,vK9WrZ0QYQ,Review on the theoretical analysis of the relationship between Laplace kernel and NTK,"This paper proves that the reproducing kernel Hilbert spaces of a deep neural tangent kernel and the Laplace kernel have the same set of functions when they restricted to the sphere $S^{d-1}$, which improves the results established in Geifman et al., 2020. Moreover, the paper proves that more non-smooth of the exponential power kernel leads to a larger RKHS with restriction on the sphere $S^{d-1}$ and the entire $R^d$. Furthermore, the authors conduct numerical experiments to verify the asymptotics of the Maclaurin coefficients of the Laplace kernel and NTKs kernel. 
In summary, the paper is well-written and organized logically. The proof of theoretical results of this paper seems to be correct and reasonable, resulting from the full details of the proof provided in the appendices. +The contribution of this paper includes two parts. Firstly, it aims to explain why the Laplace kernel and NTK have similar performance in experiments from theoretical point of view by showing that the space of the Laplace kernel and NTK are the same when limited to a sphere. On the other hand, the author reveals the relationship between the smoothness of the exponential power kernel and the corresponding RKHS to explain the better performance of the exponential kernel with a smaller power in the experiments. +Now I would like to give some comments, 1. Do Laplace kernel and NTK have a similar learning dynamic when we perform kernelized gradient decent in real-world dataset? 2. It is necessary to study the behavior of the NTK and the Laplace kernel outside of $S^{d-1}$. I wonder whether the theoretical results proposed are helpful to improve the performance of Laplace kernel or NTK. 3. The author demonstrates that a non-smooth exponential power kernel leads to a larger RKHS, but whether this indicates that the model obtained by adopting a non-smooth kernel has greater generalization capability. I think the author needs to further theoretically illustrate the relationship between them. +",7,4.0,ICLR2021 +HygOpABTFS,3,BJx040EFvH,BJx040EFvH,Official Blind Review #3,"The authors claimed a classic adversarial training method, FGSM with random start, can indeed train a model that is robust to strong PGD attacks. Moreover, when it is combined with some fast training methods, such as cyclic learning rate scheduling and mixed precision, the adversarial training time can be significantly decreased. The experiment verifies the authors' claim convincingly. +Overall, the paper provides a novel finding that could significantly change the adversarial training strategy. The paper is clearly written and easy to follow. I recommend the acceptance. +",8,,ICLR2020 +H1ake7INx,2,S1oWlN9ll,S1oWlN9ll,Novel 2nd order loss-aware binarization method for neural networks. Optimization performance is evaluated through the test error proxy.,"The paper presents a second-order method for training a neural networks while ensuring at the same time that weights (and activations) are binary. Through binarization, the method aims to achieve model compression for subsequent deployment on low-memory systems. The method is abbreviated BPN for ""binarization using proximal Newton algorithm"". + +The method incorporates the supervised loss function directly in the binarization procedure, which is an important and desirable property. (Authors mention that existing weight binarization methods ignore the effect of binarization to the loss.) The method is clearly described and related analytically to the previously proposed weight binarization methods. + +The experiments are extensive with multiple datasets and architectures, and demonstrate the generally higher performance of the proposed approach. + +A minor issue with the feed-forward network experiments is that only test errors are reported. Such information does not really give evidence for the higher optimization performance. (see also comment ""RE: AnonReviewer3's questions"" stating that all baselines achieve near perfect training accuracy.) Making the optimization problem harder (e.g. 
by including an explicit regularizer into the training objective, or by using a data extension scheme), and monitoring the training objective instead of the test error could be a more direct way of demonstrating superior optimization performance. + +The superiority of BPN is however becoming more clearly apparent in the subsequent LSTM experiments.",7,4.0,ICLR2017 +H1XSQ8BVe,2,SJkXfE5xx,SJkXfE5xx,A generally useful and interesting approach to 2-sample tests,"The submission considers the setting of 2-sample testing from the perspective of evaluating a classifier. For a classifier between two samples from the same distribution, the distribution of the classification accuracy follows a simple form under the null hypothesis. As such, a straightforward threshold can be derived for any classifier. Finding a more powerful test then amounts to training a better classifier. One may then focus efforts, e.g. on deep neural networks, for which statistics such as the MMD may be very difficult to characterize. + ++ The approach is sound and very general ++ The paper is timely in that deep learning has had huge impacts in classification and other prediction settings, but has not had as big an impact on statistical hypothesis testing as kernel methods have + +- The discussion of the relationship to kernel-MMD has not always been as realistic as it could have been. For example, the kernel-MMD can also be seen as a classifier based approach, so a more fair discussion could be provided. Also, the form of kernel-MMD used in the comparisons is a bit contradictory to the discussion as well + * The linear kernel-MMD is used which is less powerful than the quadradic kernel-MMD (the authors have justified this from the perspective of computation time) + * The kernel-MMD is argued against due to its unwieldy distribution under the null, but the linear time kernel-MMD (see also Zaremba et al., NIPS 2013) has a Gaussian distribution under the null. + +Arthur Gretton's comment from Dec 14 during the discussion period was very insightful and helpful. If these insights and additional experiments comparing the kernel-MMD to the classifier threshold on the blobs dataset could be included, that would be very helpful for understanding the paper. The open review format gives an excellent opportunity to assign proper credit for these experiments and insights by citing the comment.",8,5.0,ICLR2017 +hm5yZhRcIRL,4,Oos98K9Lv-k,Oos98K9Lv-k,"Generally interesting, but comparison is not persuasive enough","This paper proposes a new variant of neural topic model leveraging optimal +transport in order to incorporate information from pre-trained word vectors. +Specifically, the authors replaced the KL-divergence reguralization term with +an optimal transport between topic distribution and empirical word distribution +in each text. +Experimental evaluation yields generally better topic coherence and high +precision of K-means clustering on the induced topic distributions. + +Basically this paper is interesting, but still leaves some questions. + +- First of all, evaluations are based only on high NPMI topics (section 5.2). +Therefore, it is trivial that the induced topic-word distribution of these +topics are good. In my experience, original vanilla LDA yields mostly +interpretable topic-word distributions; however, even for the high-NPMI topics, +each topic in Figure 4 is somewhat noisy, and unshown topics might be worse. +Therefore, I would like to know the comparison between the proposed method and +the original LDA too. 
+ +- The paper first says that ""good document representation and coherent/diverse +topics"" is difficult. Then why not including perplexity evaluation in the +experiments? K-means results are only auxiliary evaluation of the former, thus +the reader would like to know whether the proposed model could yield better +perplexity on documents or not. + +- It seems that the choice of dimensionality and the number of topics seems +too low and arbitrary. Why only the 50-dimensional word vectors are used? +Is there any difference over the competitors when that dimensionality is +changed? +Also, I could not know why only the results with K=100 is shown in main text. +Appendix E shows that K=500 consistently yields better results; why are they +not included? + +- Finally, I cannot understand what kind of optimization w.r.t M is conducted +in this paper. To make the paper as self-contained as possible, I strongly +recommend to show what kind of optimization is actually done. + +The proposed OT regularization seem to work better, but I cannot see why the +baseline of word-vector based topic models like (Dieng+ 2020) is inferior. +Is there any intuition or explanation over these trivial competitors? +",6,4.0,ICLR2021 +rJxy1Mc0YH,2,S1e-0kBYPB,S1e-0kBYPB,Official Blind Review #1,"-------------------- AFTER +The original rating of ""Weak Reject"" still holds as the authors failed to provide proper justification for the raise concerns and support their claims through additional experiments. + +""We do not introduce an explanation generation framework, as explainers do. "" - The proposed evaluation requires the explainer of the NLP model to agree with the RCNN in-terms of relevant or irrelevant words, to be considered a good explainer. The RCNN model which is defining the relevant and irrelevant tokens for a prediction task is in fact stating that we can explain the decision of an NLP model in terms of relevant and irrelevant tokens. Hence, the proposed RCNN can also be considered as an explainer. The evaluation task is demonstrating if the other explainers are providing explanations consistent with this new explainer based on RCNN. + + +""The RCNN is not meant to explain any other models except itself."" - Unclear + +""Regarding the request for more experiments:"" - The authors don't provide enough justification to ""why they didn't perform more experiments?"" + +"" Hence, with our current instantiations, any domain-agnostic explainer can be evaluated"" - The experiment to validate this claim are missing. + +""The novelty of our paper consists in the fact that, to our knowledge, it is the first to (1) shed light over a fundamental difference"" - This is not a technical novelty. This is an exploratory analysis based observation + +""and (2) propose a methodology for evaluating explainers that ...and without human intervention (unlike evaluation type 4)."" - In Section 5 Qualitative Analysis, the authors are also doing human evaluation like other methods in evaluation type-4 of their related works. Also, doing human evaluation is a strong way to justify an explainer. Though expensive, whenever possible it should be done and is in no way a limitation of current evaluation metrics. + +Your model needs labelled data for training RCNN. This adds a constraint on the usability and scalability of your proposed evaluation method. Since RCNN is also black-box, one will required another explainer to explain the RCNN. 
+ +In the worst-case scenario, if RCNN is trained with data such that it considers all relevant words as irrelevant, the evaluation made by RCNN will be incorrect. Hence, "" Success depends on the ability of the RCNN to extract correct subsets of tokens."" + + + +------------------- BEFORE +The paper proposed a verification framework to evaluate the performance of different explanatory methods in interpreting a given target model. Specifically, the authors evaluated three explanatory methods namely, LIME, SHAP and L2X for a target model trained to perform sentiment analysis on text data. Authors assume for each input text, there is a subset of tokens that are most relevant and that are completely irrelevant to the final prediction task. The proposed framework uses a recurrent convolutional neural network (RCNN) to find these subsets. The performance of an explainer is evaluated in terms of overlap between the RCNN most relevant tokens and the most relevant tokens provided by the explainer as an explanation. + +Major +• The paper lack technical novelty. +• The proposed architecture uses a RCNN to find the most relevant subset of tokens. Firstly, RCNN is also a black box that provides no intuition behind its selection decision. Secondly, in the absence of the ground truth labels for true relevance and irrelevance of a token in input sentence, this explainer method can also suffer from “assuming a reasonable behavior” assumption. The method assumes that the RCNN is performing reasonably in identifying relevant subsets. +• The success of the method depends on the ability of the RCNN to extract correct subsets of tokens. The data used for training the RCNN, might have some underlying bias. In that case, the evaluation is not accurate. +• In related work, for “Interpretable target models” the authors mentioned LIME as an example of explainer functions that explains target models that are “very simple models may not be representative for the large and intricate neural networks used in practice”. LIME locally explains the decision of a complex function for a given data point using simpler models like linear regression. But LIME itself can be used for generating explanation for prediction of complex neural network like Inception Net. +• The example used to explain the difference between feature additive and feature selection-based explainer methods, is confusing. Its not clear how in health diagnostics, one will prefer feature-selection perspective. Although the most relevant features used for the instance are important to understand the decision, but in clinical settings sometimes low rank features can also be useful to understand the target model. +• For text, the relevant features are the individual tokens of the input sentence. Similarly, for images relevance can be important regions of the image. The authors did not have any experiments on images or tabular data. +• In the experiment section, the comparison is made with only 3 explainer models and for just one task. The experiments are inadequate. +• In Figure 4, the colormap is not readable. +",3,,ICLR2020 +rklvQRB1TX,3,B1l9qsA5KQ,B1l9qsA5KQ,Review of mental fatigue monitoring using brain dynamics preferences,"The mental fatigue is an important factor in road accidents. Finding a direct mapping between EEG features and reaction time is difficult and error-prone, combining the noise measurement of EEG and individual variation of RT. The authors introduce a measure called BDrank based on partial ordering instead of regression. 
Formulating the measure as a MAP problem, the authors propose a generalized EM algorithm for prediction. An online extension, relying on iterative L-BFGS optimization over mini-batches. + +Figure 3 shows the indegree sequence for 4 selected subjects. What is the criterion to select these subjects? These cases seem interesting, but is it representative for the best/worst case? It could provide some information to show some of the few cases where SVR is more accurate than BDrank. +Regarding the identification of noisy channels, the 33rd channel is indicated as a non-EEG one. What is it? + +Some minor questions and suggestions: +- It could be interesting to mention the performance of this measure using only a limited set of EEG channels to evaluate its robustness. +- The introduction indicates that de Naurois et al ., 2017 rely on EEG to estimate the RT, but it is not the case. +- The formulation of the assumption (2) on page 3 is unclear, as sensors are not supposed to make any emission and there is a high correlation between channels. +- The model do not consider transition between type-1 and -2 preference, could it be a problem with confidence interval",7,3.0,ICLR2019 +H6ZRjZPWYdP,3,_TM6rT7tXke,_TM6rT7tXke,"Review of ""Return-Based Contrastive Representation Learning for Reinforcement Learning"""," +Summary: + +The authors present a contrastive auxiliary loss based upon state-action returns. + +They introduce an abstraction over state-action pairs and divide the space of state-action returns into K bins over which the Z function is defined where Z(s,a) is distributed over K-dimensional vectors. Given an encoding function, phi, and an input x, Z-irrelevance is defined as phi(x_1) = phi(x_2) when Z(x_1) = Z(x_2) which motivates the objective for Z-Learning: to classify state-action pairs with similar returns (within bounds) to be similar. From this a contrastive loss can be defined (Return-based Contrastive RL, RCRL) where class labels are determined by Z-irrelevance sets encouraging state-action encodings to be similar when the returns are. In the limit Z becomes the RL state-action value function Q. + +The authors evaluate their approach on Atari (discrete actions) and the DeepMind Control Suite (continuous actions) across both model-free and model based RL algorithms against and in combination with other auxilliary losses including CURL (Srinivas et al. 2020). + +Strengths & Weaknesses: + +Auxiliary losses have become an important component in RL for developing stable agents that can generalize well and form good representations. In particular, contrastive losses have come into increasing use with growing literature around these methods and so I believe the domain area of this paper is relevant and of interest. The authors do a good job of covering the recent developments of background literature in their related work section and grounding their approach with recent efforts undertaken in RL auxiliary losses, contrastive learning approaches and state abstraction/representation learning literature. + +The approach is overall novel as many contrastive learning methods are defined against input data or downstream representations, whereas this work derives it's data from RL returns and creates a link between the representational landscape of the observations and actions and broad outcomes as they are valuable to an agent. As the author's have framed the problem, I believe this approach is more powerful and also more tractable than something like reward prediction. 
Intuitively the formulation seems solid to me since we often would like to understand not only when we're in a good state and taking a useful action but also, in general, what kind of properties state-action pairs with similar returns should have. The authors do note that this may be learnable by temporal difference updates alone however, this approach aims to directly encourage the learning of this relationship and decouple it from the RL algorithm (where perhaps other things may be focused on such as planning etc.). + +One shortfall of this approach could be the available data itself as you'd rely on the policy to provide you with good samples for RCRL. The authors indicate that they segment trajectories to ensure better quality positive and negative samples for learning however, it could be made clearer how much of a problem this can be. This approach could possibly be combined with a self-supervised approach to alleviate these types of concerns. It would also be nice to know the additional computational burden of RCRL and how this compares to other auxiliary losses. + +The experiments on Atari & Control look solid and demonstrate that this method attained a stronger score both alone and when combined with CURL and good top performance on DeepMind control suite tasks when compared again to CURL and pixelSAC. It might have been nice to see more comparisons or combinations with other contrastive methods that have had some success in learning visual representations (SimCLR: Chen et al. 2020, BYOL: Grill et al. 2020). The similarity analysis also provided some nice insight into the inductive bias induced by RCRL. + +Overall, the paper is well written and has a clear layout. The authors provide clear algorithms and figures and the content flows well from section to section. + + +Recommendation: + +I believe that this is a promising and very active area of research and that this work makes the case for a solid new approach and a set of encouraging results to back it up. ",7,4.0,ICLR2021 +DkON7Q4s4_o,3,#NAME?,#NAME?,"This paper discusses a method to make spiking networks more relevant to latency-sensitive applications on the edge. I believe the authors' method is relevant to this problem, but doesn't feel like it uncovers anything fundamentally new. It is an engineering solution to a specific problem; and spiking networks are not a widely used method at the edge currently. ","The scheme proposed breaks down the information in a block of an image into orthogonal basis functions (DCT is used) to make a progressively better reconstruction of the original image block with the addition of more basis functions used (like an nth order Taylor expansion). The increasing spatial frequency components are known to be perceptually less sensitive (they need to include this) in images, so the low freq components can be presented first. Each freq component is encoded into spikes sequentially, thereby staging the more perceptually important information first, with less important info coming later. This reorders the presentation of information to allow a tradeoff of image quality with time/latency. + +I think the solution proposed is well-founded and will indeed mitigate the latency problem for spiking neural networks. However, this feels a bit more like an engineering solution to a specific problem rather than a new concept. 
I do like the injection of methods from other fields like image/video compression; it often feels that the deep learning field rediscovers things that have been uncovered years ago in other fields. I see that as the main value of the paper in addition to helping to make spiking neural networks a POSSIBLE viable solution to edge deployment. + +Section 1: I don’t think it’s a strongly supported claim that deep learning architectures are unsuitable for edge deployment. There are plenty in deployment and there are new processors (Movidius, Mythic, etc) that can handle these computations for real-time applications. I’d suggest a softer language there. This does weaken the motivation for the paper though. + +Section 1, second paragraph: typo: Thy -> The + +Section 3.2: On constraints for the transforms. Did the authors consider Integer Transform (IT)? This is used in MPEG/AVC. It is a reversible transform that is an integer simplification of the DCT. Given that the point of the paper is to decrease latency and computing requirements for edge deployments, this could help. + +Section 3.2: The authors do a good job of sweeping performance for different block sizes. + +Figure 5: Isn’t it an obvious result that more time steps are required for Poisson vs DCT? There simply aren’t enough bins to sum over to have a result until a certain point. +",6,5.0,ICLR2021 +BJlacioyqr,2,HJg_ECEKDr,HJg_ECEKDr,Official Blind Review #1,"Summary: +The paper proposes Generative Teaching Networks, which aims to generate synthetic training data +for a given prediction problem. The authors demonstrate its use in an MNIST prediction task +and a neural architecture search task on Cifar10. +I do not find the idea compelling nor the empirical idea convincing enough to warrant acceptance at +ICLR. + + +Detailed Comments: + +At a high level, the motivation for data generation in order to improve a given prediction problem +is not clear. From a statistical perspective, one can only do so well given a certain amount of +training data, and being able to generate new data would suggest that one can do arbitrarily better +by simply creating more data -- this is not true. + +While data augmentation techniques have improved accuracy in many cases, they have also relied +heavily on domain knowledge about the problem, such as mirroring, cropping for images. The proposed +GTN model does not seem to incorporate such priors and I would be surprised that one can do better +with such synthetically generated data. +Indeed, the proposed approach does not do better than the best performing models on MNIST. + +The authors use GTNs in a NAS problem where they use the accuracy on the generated images as a proxy +for the validation accuracy. As figure 4c illustrates there actually does not seem to be much +correlation between the accuracies on the synthetic and real datasets. +While Table 1 indicates that they outperform some baselines, I do not find them compelling. This +could simply be because random search is a coarse optimization method (and hence the proposed metric +may not do well on more sophisticated search techniques). + - On a side note, why is evaluating on the synthetic images cheaper than evaluating on the + original images? + - What is the rank-correlation metric used? Did you try more standard correlation metrics such + as Pearson's coefficient? + + +================= +Post rebuttal + +Having read the rebuttal, the comments from other reviewers, and the updated manuscript, I am more positive about the paper now. 
I agree that with reviewer 2 that the proposed approach is interesting and could be a method to speed up NAS in new domains. I have upgraded my score to reflect this. + +My only remaining issue is that the authors should have demonstrated this on new datasets (by running other methods on these datasets) instead of sticking to the same old datasets. However, this is the standard practice in the NAS literature today.",6,,ICLR2020 +fivm5N4X6p7,2,x9C7Nlwgydy,x9C7Nlwgydy,A good attempt at combining ensemble learning approaches with deep clustering,"This paper studies the effect of combining ensemble learning approaches with deep clustering. The paper wants to show that ensemble learning methods, in particular consensus clustering, can improve the clustering accuracy when combined with general representation learning/clustering blocks. However, I am not sure that the results presented in the paper are enough to support the claims. + +The paper's pros are: +(+) It Is the first to combine ensemble methods with deep clustering models. Although ensemble methods have been widely applied, studying consensus clustering in the current problem setting is novel. +(+) It Is the first to be able to have an ensemble deep clustering algorithm that gains empirically over other state-of-the-art models, showing the ideas to be potentially effective. +(+) The writing is in general clear and undestandable. + +The paper's cons are: +(-) The wording in the abstract is a bit confusing in the sense that after reading it one might think the algorithm does consensus clustering first and uses the clustering to learn better representations of the input data. Although this is clarified later in the main body. +(-) The description of the main algorithm seems to be more intuitive than innovative. Some algorimic design choices are not very convincing. For example, the choice of using random projections on embedding to produce different clusterings, although an interesting idea, makes me wonder why it is necessarily a good way to introduce randomness into the whole framework. The authors can expand their discussion on this. I also remain dubious about why different representation instead of different clustering methods +Also, since the authors meant the idea to be applicable to general representation learning/deep clustering blocks, I'm not sure the current experimental data in the paper can lead to that conclusion. +(-) Using the performance metrics provided by the authors, I find it a bit hard to conclude that the proposed algorithm has a significant advantage over state of the art methods, especially PICA. Also, the fluctuation in performance metrics caused by different parameter settings seems to be, in magnitude, at least comparable to the margin of ConCURL over the baselines. + +Overall, I think this paper contains interesting seed ideas such as combining consensus clustering with representation learning and making use of the learned representation to generate multiple clusterings. These seed ideas could be good for this venue. However, the work is still premature and flawed by crude algorithm/experiment design. 
The quality can be significantly improved if the authors can give a more general algorithmic framework (since the authors meant the ideas to be applicable to general representation learning/clustering algorithms), equipped with more thorough experimental investigation to support the applicability and superiority of the current approach.",5,3.0,ICLR2021 +WRpG-HhiXXX,3,T3kmOP_cMFB,T3kmOP_cMFB,Replacing one of the two function samples in zeroth-order online learning algorithms with an old sample collected at a previous iteration: potential and limitations of the technique.,"This paper proposes a zeroth-order (derivative-free) algorithm for online stochastic optimization problems. The objective is to find a sequence of actions $x_0,\dots,x_{T-1}$ minimizing the expected regret +$$\mathbb{E} [ \sum_{t=0}^{T-1} f_t(x_t) - \min_{x\in\mathcal{X}} \sum_{t=0}^{T-1} f_t(x)],$$ where the (sub-)gradients of unknown cost functions $f_t$ are not available, and only measurements $f_t(x_t)$ of the values of the functions at tests points $x_t$ can be obtained. + +The submission builds on the zeroth-order techniques developed by Nesterov & Spokoiny in [1] for derivative-free, non-smooth, convex and non-convex optimization, where similar gradient estimation techniques based on sampling and Gaussian smoothing are used, with the difference that two values of an identical noisy instance of the cost function are needed in [1] at each iteration. By requiring only one noisy function value per iteration and recalling the function value collected during the previous iteration (in the submission this technique is called ""residual feedback""), the proposed algorithm extends convergence results of the two-point approach [1] to regret bounds in stochastic/bandit settings where the function is changing after every new value observed, on condition that the differences between two consecutive instances of the cost function are bounded in variance.   + +The regret bounds derived in the paper match those obtained for recent 'one-point' zeroth-order methods for online optimization (e.g. [2]). A specificity of the algorithm proposed in the paper, compared to other 'one-point' methods, is that the algorithm does not depend on the absolute function levels, only on differences between two function instances, which may improve the performance in practice, as shown in the numerical experiments. This property was also shared by the approach of Bach and Perchet [4], which serves as a benchmark algorithm in the numerical experiments of the submission. These experiments are carried out on a nonstationary LQR control algorithm, and on a nonstationary resource allocation problem. + +The paper is technically sound and the developments are clear. The regret bounds derived for online non-convex optimization are interesting. The contributions to the online convex optimization framework are less obvious, due to the abundant literature on the topic. See my concerns below and my questions to the authors. + +I look forward to the authors' answers. My recommendation will be amended after their rebuttal. + + + +Pros: + +The paper is technically sound and well written. + +The regret bounds derived in the non-convex online optimization framework are of particular interest. + +Since the proposed algorithm does not depend on the function levels, it may perform better than the basic 'one-point' methods in practice. + + + +Concerns and questions: + +The presentation of the results leaves a mixed impression. 
I agree that regret bounds in non-convex online learning are a contribution to the field. The claims of novelty made by the authors for the convex case, on the other hand, look somewhat overstated. They write, for instance: ""it is also the first time that a one-point gradient estimator demonstrates comparable performance to that of the two-point method"". This sounds optimistic to me, in the sense that the authors' argument is mostly empirical (numerical experiments for a particular problem), whereas the regret bounds derived in the paper do not compare with the regret bounds that two-point methods would achieve. Moreover, there exist more recent approaches to convex zeroth-order online learning which claim the conjectured $\Omega(\sqrt{T})$ regret bound [3,5]. These new trends in zeroth-order online learning are not discussed in the submission. + +I don't see a clear distinction between the settings 'online bandit optimization' (Sections 3 and 4) and 'online stochastic optimization' (Section 5), because the regret criterion (6), the assumptions of the cost function sequences (3.1, 4.1 / 5.1, 5.2), the algorithms, and the regret bounds are apparently the same for the two settings. The only specificity of the Section 5 model seems to be the existence of a mean cost function $\mathbb{E}[f_t]$, if we set $f_t(\cdot)\equiv F(\cdot;\xi_t)$ — assumption which is not exploited. Also, I found Section 5 slightly redundant. I thought it could easily be replaced by a discussion on all the frameworks covered by Assumptions 3.1 and 4.1 and on the possible interpretations given to the model. + +In the submission, an algorithm developed by Bach and Perchet in [4], was classified by the authors as a two-point zeroth-order optimization algorithm and used in the numerical experiments as a benchmark for comparison. In my recollection of [4], the algorithm relies on a gradient estimator which considers the difference between two noisy functions values affected by two independent noises, with the assumption that the noises are uniformly bounded in variance or satisfy a martingale property. To me, these assumptions are similar (if not identical) to those made in Section 5 and in Sections 3,4,6, respectively. Could the authors clarify the differences between the noise model of [4] and the one they use, and why the algorithm [4] is impractical and cannot be used in online settings? Why was it treated differently in the numerical experiments? + +Another feature that was not discussed in the paper is the feasibility of the algorithm in terms of the availability of the function queries. The problem stated in Equation (P) is a constrained online optimization problem over a convex set. However, since the test points are sampled over the entire state space from Gaussian distributions, the proposed algorithm will query function values outside the feasible set, and these function values are not available in many learning applications. Note that it is possible to combine Gaussian sampling with constrained online optimization [5], and that feasible zeroth-order optimization algorithms based on residual feedback have been developed [6]. + + + +Typos : +p.2 such an one-point derivative-free setting => such a +p.7 nonstatinoary => nonstationary + +[1] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017. + +[2] Alexander V Gasnikov, Ekaterina A Krymova, Anastasia A Lagunovskaya, Ilnura N Usmanova, and Fedor A Fedorenko. 
Stochastic online optimization. single-point and multipoint non-linear multi-armed bandits. convex and strongly-convex case. Automation and remote control, 78(2):224–234, 2017. + +[3] https://arxiv.org/abs/1603.04350 + +[4] Francis Bach and Vianney Perchet. Highly-smooth zero-th order online optimization. In Conference on Learning Theory, pp. 257–283, 2016. + +[5] https://arxiv.org/abs/1607.03084 + +[6] https://arxiv.org/abs/2006.05445 + + +__________ + + + +Update after the discussions: + +I would like to thank the author(s) for all their comments. Although most of my concerns have been addressed, some questions remain topics of contention. Before discussing these topics, I will first append to this review my answer to the author(s)' last comments, as it was their wish to keep hearing from me after closing of the discussions: + +$\ \ \ $ The assumptions (3.1, 4.1, 5.1, 5.3) made on the function sequence $f_t$ for convergence of the proposed algorithm are unconventional as they require that the expected absolute variations of two function values at the points visited by the algorithm be bounded, or that the squared variations of two function values obtained by Gaussian sampling from points visited by the algorithm be bounded. So formulated, the conditions for convergence involve the algorithm's trajectory $x_t$ as much as the function sequence $f_t$, and they are difficult to verify. In an attempt to identify sufficient conditions for these assumptions to hold true, I made three suggestions: (i) and (ii) were concerned with the boundedness of the sequence of points generated by the algorithm, and (iii) was the case of bounded incremental variations of the sequence $f_t$, e.g. martingales. In their reply, the author(s) were right to rule out (i) and (ii), which indeed were unrelated. This leaves us with (iii) as a possible setting for the proposed algorithm. + +$\ \ \ $ In my last comment I argued that the case (iii), where the sequence $f_t$ undergoes incremental variations uniformly bounded in expectation, was covered by the approach taken in Bach & Perchet (2016), where two function queries obtained from perturbations around the same iterate are processed at each step. The Bach/Perchet approach is cited in the paper for comparison, but it is called impractical as it would not apply when $f_t$ varies over time $-$ argument I disagree with and that I attempted to refute in a brief discussion involving martingale-like variations for $f_t$. When the author(s) of the submission object to my regret analysis in the case of martingale-like noise on the basis that the assumptions they make also cover non-zero-mean variations with similar uniform upper bounds on the moments, they do not address the main point of my comment. My intention was to show that it does not take much effort to consider the approach used in Bach & Perchet (2016) in settings where the cost function is changing over time, for as long as the cost variations are incremental with bounded moments. This can be seen by noting that the convergence result derived in the revised version of the supplementary material for the residual-feedback algorithm with unit-sphere sampling can be reproduced for the Bach/Perchet approach under the considered assumptions. I take it that the author(s), who excel at deriving the convergence rates for such algorithms, will not disagree. 
Although the assumptions used in Bach & Perchet (2016) (uniform zero-mean increments) may look somewhat stricter, they have the merit of being clear and simple, as opposed to Assumptions 3.1, 4.1, 5.1, 5.3, which involve the trajectory of the algorithm and can't be verified easily. They are also sufficient to improve the convergence rates for higher degrees of smoothness compared to the early algorithm by Flaxman et al. (2005), which was the objective of that paper. Higher degrees of smoothness failing which it is difficult to improve the convergence rates, as confirmed by the convergence rates given in the submission. In my sense, one important message conveyed by the submission is that the approaches proposed in the submission and in Bach & Perchet (2016) can both handle bounded additive noise, and both fail in the more general framework of adversarial learning. By calling the Bach/Perchet algorithm impractical for their setting, I believe the author(s) of the submission missed to chance to compare the two approaches from a fair perspective and to answer the simple question that comes to mind when reading their paper: is the residual feedback technique really useful in the stochastic learning framework, or isn't convergence just as fast when the function queries are processed by pairs as in Bach & Perchet (2016), or in the reference paper by Nesterov & Spokoiny (2017) ? + +------- + +That being said, the following issues remain in this submission: + +$\bullet$ The assumptions (3.1, 4.1, 5.1, 5.3) made on the function sequence $f_t$ for convergence of the proposed algorithm are unconventional and difficult to verify, because they consist in properties of the iterates of the algorithm. + +$\bullet$ In our discussions, only incremental sequences $f_t$ with variations uniformly bounded in expectation have been identified to meet those assumptions. In my sense, this particular setting is also covered by the approach taken in Bach & Perchet (2016), where the function queries are handled by pairs obtained from perturbations around the same iterate. Also, I still find it unfair to call the latter approach impractical for the considered setting. + +$\bullet$ Since the convergence rates derived in the paper show no clear improvement, compared to the early approach of Flaxman et al. (2005), the arguments of the submission lie in the experimental results, where I don't think the algorithm by Bach & Perchet (2016) is given a fair treatment (for the reason explained in the previous paragraph). Besides, the application considered in Section 6.1 reduces to the unconstrained minimization of a polynomial of high degree that is neither Lipschitz nor smooth, which is a basic requirement for the convergence algorithms. This makes the convergence of the algorithms highly dependent on the initial point, unless optimization is done over a compact set, but I don't think the projection step was implemented for the algorithms. + +$\bullet$ In constrained optimization, the problem that the proposed algorithm samples function values outside the feasible set has been partly addressed by the author(s), who provided a variant of the algorithm based no longer on Gaussian sampling, but on sampling over a sphere. Partly because only one convergence result for a particular setting was derived, and it remains unclear (as pointed out by Reviewer 4) if all the benefits of Gaussian smoothing and all convergence results would also extend to spheric smoothing. This discussion is missing. 
In my opinion, the extension to settings where the functions can't be sampled outside the feasible set is not absolutely imperative in all frameworks (the author(s) have provided counter-examples), but it would be useful to know the limits of the proposed technique. + +All things considered, I would not recommend the submission for presentation at the conference. Independently of the final decision, I hope the author(s) will make the most from the discussions with all the reviewers. + +I would like to make a last comment about the submission and the discussions that followed. It is natural that the author(s) give the best picture of the algorithm they propose. Yet in the paper the contrast is particularly strong between, on the one hand, the haziness surrounding the assumptions made on the function sequence $f_t$, or the negligence with which the algorithms were applied in Section 6.1 to a problem not actually meeting the conditions for convergence, and on the other hand the severity with which the Bach/Perchet approach was disqualified as a possible method of solution. This contrast gives the reader an overall feeling of partiality, which makes the reviewing task an intricate, contradictory, and unappreciative one.",4,4.0,ICLR2021 +jiKkjHqyeaf,4,JkfYjnOEo6M,JkfYjnOEo6M,This paper studies group equivariance properties of self-attention networks. Permutation equivariance follows from the self-attention definition while group equivariance depends on the definition of the positional encoding.,"The authors describe their contributions in the introduction: the analysis of equivariance of self-attention, and how group invariance in the relative positional encoding enables group equivariance of the self-attention. + + +The authors have a very condensed related work section without going into any detail but with a lot of citations on (non attention) papers about equivariance. Works missed include Equivariant transformer networks by Tai et al., Equivariant multi-view networks (Esteves et al.), SO(3)-equivariant representations (Esteves et al. ), and early work on equivariant function spaces by Hel-Or and Teo (Canonical Decomposition of Steerable Functions). + +But probably most relevant is the missing discussion on equivariance on set networks (incl point cloud networks) and graph networks (Maron and Lipman). There is no positional encoding in point clouds but the value at each element are the coordinates and some parallels can be drawn to positional encodings (for example first few layers of pointnet). + +The authors clearly define self-attention, first with matrices, and then with a functional formulation. The functional formulation is elegant but difficult to follow. The concatenation in the multi-head case is an example of where the vector space formulation allows replacing concatenation with a union. The authors might want to explain the benefits compared to tensor formulation. + +Authors first prove permutation equivariance of global self-attention without positional encoding. +The proof in G.4.1 is kind of convoluted and might be clearer using only matrices (if $\Pi$ is permutation than $\sigma(\Pi Q K^T \Pi^T)\Pi V = \Pi \sigma(Q K^T) V$. + +Then they prove translational equivariance for relative positional encoding. It is easy to see that relative positional encoding is translation invariant. The step to equivariance has to be followed in the appendix and it would be easier for the reader to provide at least a sketch in the main text. 
+ +The observation about translation paves the ground for generalizing to any group if the positional encoding is invariant to this group. Unfortunately, at this point the discussion becomes very confusing compared to the original ""lifting"" by Cohen and Welling (2016). While this paper makes the lifting appear as a trick, the main idea of Cohen and Welling is that when one applies group convolution on a quotient space, for example SE(2)-convolution on R^2, the result is automatically invariant in SO(2) since R^2=SE(2)/SO(2). Instead, one performs a group correlation where the output is a function of the group action (not of the quotient space) and after this step one applies a group convolution (also appearing in the spherical CNNs (Cohen)as well as in the icosahedral multi-view networks (Esteves)). + +The paper concludes with the claim that linear mappings whose positional encoding is G-invariant are G-equivariant. This is easy to see in the definition of convolution and one can imagine this for linear mappings but it is difficult to see that self-attention is a linear mapping if one looks at eq. 5 and possible definitions of the encoding function (12). I see it through the equivalence proof in Cordonnier but not through the self-attention definition. + +Experiments do not show any advantage of equivariant self-attention in z2 or r4 CNNS. + +To summarize the paper is interesting but quite difficult to follow. I wish the paper would follow the tensor notation like in the SE(3)-transformers. Authors should justify the superiority of their formalism vs tensors. + +Related work should not be condensed with mere listing of citations like \cite{*). + +Last: Steerability is claimed in the abstract and the introduction but never mentioned again in the paper. One can somehow see how the positional encoding implies it but a section would be worth to be dedicated to it. + +",7,4.0,ICLR2021 +Skx5Nrw5hm,2,BJgEjiRqYX,BJgEjiRqYX,Interesting idea but not novel and ultimately unconvincing,"This paper explores compositional image generation. Specifically, from a set of latent noises, the relationship between the objects is modelled using an attention mechanism to generate a new set of latent representations encoding the relationship. A generator then creates objects separately from each of these (including alpha channels). A separate generator creates the background. The objects and background are finally combined in a final image using alpha composition. An independent setting is also explored, where the objects are directly sampled from a set of random latent noises. + +My main concern is that the ideas, while interesting, are not novel, the method not clearly motivated, and the paper fails to convince. + +It is interesting to see that the model was able to somewhat disentangle the objects from the background. However, overall, the experimental setting is not fully convincing. The generators seem to generate more than one object, or backgrounds that do contain objects. The datasets, in particular, seem overly simplistic, with background easily distinguishable from the objects. A positive point is that all experimented are ran with 5 different seeds. The expensive human evaluation used does not provide full understanding and do not seem to establish the superiority of the proposed method. + +The very related work by Azadi et al on compositional GAN, while mentioned, is not sufficiently critiqued or adequately compared to within the context of this work. 
+ +The choice of an attention mechanism to model relationship seems arbitrary and perhaps overly complicated for simply creating a set of latent noises. What happens if a simple MLP is used? Is there any prior imposed on the scene created? Or on the way the objects should interact? +On the implementation side, what MLP is used, how are its parameters validated? + +What is the observed distribution of the final latent vectors? How does this affect the generation process? Does the generator use all the latent variables or only those with highest magnitude? +The attention mechanism has a gate, effectively adding in the original noise to the output — is this a weighted sum? If so, how are the coefficient determined, if not, have the authors tried? + +The paper goes over the recommended length (still within the limit) but still fails to include some important details —mainly about the implementation— while some of the content could be shortened or moved to the appendix. Vague, unsubstantiated claims, such as that structure of deep generative models of images is determined by the inductive bias of the neural network are not really explained and do not bring much to the paper.",4,5.0,ICLR2019 +sYMmZtYtOh,1,CaCHjsqCBJV,CaCHjsqCBJV,Interesting ideas but not polished enough,"Summary +------- + +The paper makes the observation that various non-decomposable losses in machine learning can be rewritten as linear programs, whose constraints depends on the model output. This is the case for AUC, multi-class AUC, F-score, and to some extend NMF. + +The authors review these losses, and recall how they may be rewritten as LPs. The LP formulation as known for AUC and NMF, but as far as the reviewer understand, they are new for multi-class AUC and F-score. + +Then, the authors propose to directly backpropagate through the LP resolution to minimize non-decomposable losses, applied on top of deep architectures. For this, they propose to solve an approximate solution to the LP problem (a quadratic penalization of the constraint violations) using a modified Newton method. They propose either to backpropagate by unrolling the Newton steps, or by using the computed minimizer directly. + +Review +------ + +The endeavor of writing non-decomposable losses as LPs, to see these losses as pluggable LP-layer in deep architecture is interesting, albeit not original. + +Using a penalized approximation of the LP to be able to solve them efficiently using a Newton method is also interesting. + +The experiment section shows that it is indeed beneficial to directly optimize over a certain decomposable loss when we measure performance in term of this loss: in particular, it outperform using a simple logistic loss. This was completely expected, but it is good to verify it experimentally. + +On the other hand, the manuscript suffer from many unclear parts, and from a theoretical analysis that is not polished enough. In particular: + + - Phi is not introduced beforehand p. 4, and the F-score part is very hard to understand. + + - the NMF section is very unclear, in particular as the authors use vague terms in their construction, such as ""zero padding ensures a sxs matrix"". I do not understand the role of tilde p in (6). + + - Lemma 1 is not stated properly, as there is no f in equation (7). The authors state that ""each y has a neighborhood in which the Hessian is quadratic"", which does not mean anything. 
The proof sketch of Theorem 2 is very vague, in particular when the authors state that ""the possible choice of Hessian is finite"". + + - I do not understand whether rho is chosen at every iteration, and what is its importance. + +I have trouble understanding why the authors went to such lengths in their +theoretical analysis. They modify a LP by making it a ""smooth almost everywhere"" +problem, which can then be solved using any methods, and backpropagated through +using either unrolling, or the computed minimizer (by virtue of Danskin +theorem), or the implicit function theorem. There is therefore not need to backpropagate throught tilde A^{-1} b. + +The fact the the problem is only smooth almost everywhere may be a problem, +which is not addressed by using a Newton method. It implies that the gradient +becomes a subgradient, and may hinder optimization performance. Remark 4 +dismisses this problem as unimportant, yet it is, as local convergence rates for +non-convex gradient descent requires smoothness. + +Relating to experiments: + + - The reported performance does not show std errors across splits, which makes it impossible to compare in between similar methods (PPD-SG, PPD-AdaGrad and Ours). It appears that all three methods are within statistical variations. + + - NMF is a long studied problem, with many powerful methods to handle large inputs. I do not understand the choice of using the input of a deep learning network for the experiment. As it it, the experiment proposed in this manuscript is not polished enough to be valuable.",3,5.0,ICLR2021 +S138CtnWf,3,SJJySbbAZ,SJJySbbAZ,-,"This paper proposes a simple modification of standard gradient descent -- called “Optimistic Mirror Descent” -- which is claimed to improve the convergence of GANs and other minimax optimization problems. It includes experiments in toy settings which build intuition for the proposed algorithm, as well as in a practical GAN setting demonstrating the potential real-world benefits of the method. + + +Pros + +Section 3 directly compares the learning dynamics of GD vs. OMD for a WGAN in a simple toy setting, showing that the default GD algorithm oscillates around the optimum in the limit while OMD’s converges to the optimum. + +Section 4 demonstrates the convergence of OMD for a linear minimax optimization problem. (I did not thoroughly verify the proof’s correctness.) + +Section 6 proposes an OMD-like modification of Adam which achieves better results than standard Adam in a practical GAN setting (WGANs trained on CIFAR10) . + + +Cons/Suggestions + +The paper could use a good deal of proofreading/revision for clarity and correctness. A couple examples from section 2: +- “If the discriminator is very powerful and learns to accurately classify all samples, then the problem of the generator amounts to solving the Jensen-Shannon divergence between the true distribution and the generators distribution.” -> It would be clearer to say “minimizing” (rather than “solving”) the JS divergence. (“Solving” sounds more like what the discriminator does.) +- “Wasserstein GANs (WGANs) Arjovsky et al. (2017), where the discriminator rather than being treated as a classifier is instead trying to simulate the Wasserstein−1 or earth-mover metric” -> Instead of “simulate”, “estimate” or “approximate” would be better word choices. And although the standard GAN discriminator is a binary classifier, when optimized to convergence, it’s also estimating a divergence -- the JS divergence (or a shifted and scaled version of it). 
Even though the previous paragraph mentions this, it feels a bit misleading to characterize WGANs as doing something fundamentally different. + +Sec 2.1: There are several non-trivial but uncited mathematical claims hidden behind “well-known” or similar descriptors. These results could indeed be well-known in certain circles, but I’m not familiar with them, and I suspect most readers won’t be either. Please add citations. A few examples: +- “If the loss function L(θ, w) ..., then standard results in game theory and no-regret learning imply that…” +- “In particular, it is well known that GD is equivalent to the Follow-the-Regularized-Leader algorithm with an L2 regularizer...” +- “It is known that if the learner knew in advance the gradient at the next iteration...” + +Section 4: vectors “b” and “c” are included in the objective written in (14), but are later dropped without explanation. (The constant “d” is also dropped but clearly has no effect on the optimization.) + + +Overall, the paper could use revision but the proposed approach is simple and seems to be theoretically well-motivated with solid analysis and benefits demonstrated in real-world settings.",6,4.0,ICLR2018 +r1xv8YTatr,3,HklliySFDS,HklliySFDS,Official Blind Review #1,"The paper proposed an interesting continual learning approach for sequential data processing with recurrent neural network architecture. +The authors provide a general application on sequential data for continual learning, and show their proposed model outperforms baseline. + +It is natural that their naive baseline shows poor performance since they do not consider any continual learning issues like the catastrophic forgetting problem. Then, I hesitate to evaluate the model in terms of performance. In that sense, it would be much crucial to show more meaningful ablation studies and analysis for proposed model. However, there is a few of thing about them. + +Then, I decide to give a lower score that even the authors suggest that the main contribution is a definition of problem setting. It requires more detailed and sophisticated analysis. + +",3,,ICLR2020 +H1eJfdke5r,3,r1g6ogrtDr,r1g6ogrtDr,Official Blind Review #1,"This paper describes an approach to applying attention in equivariant image classification CNNs so that the same transformation (rotation+mirroring) is selected for each kernel. For example, if the image is of an upright face, the upright eyes will be selected along with the upright nose, as opposed to allowing the rotation of each to be independent. Applying this approach to several different models on rotated MNIST and CIFAR-10 lead to smaller test errors in all cases. + +Overall, this is a good idea that appears to be well implemented and well evaluated. It includes an extensive and detailed bibliography of relevant work. The approach seems to be widely applicable. It could be applied to any deep learning-based image classification system. It can be applied to additional transformations beyond rotation and mirroring. + +The one shortcoming of the paper is that it takes a simple idea and makes it somewhat difficult to follow through cumbersome notation and over-mathmaticization. The ideas presented would be much clearer as an algorithm or more code-like representation as opposed to as equations. Even verbal descriptions could suffice. The paper is also relatively long, going onto the 10th page. In order to save space, some of the mathematical exposition can be condensed. 
+ +In addition, as another issue with clarity, the algorithm has one main additional hyperparameter, r_max, but the description of the experiments does not appear to mention the value of this hyperparameter. It also states that the rotated MNIST dataset is rotated on the entire circle, but not how many fractions of the circle are allowed, which is equivalent to r_max.",6,,ICLR2020 +SkeChVXt27,2,S1E3Ko09F7,S1E3Ko09F7,"Novel methods for Shapley value estimation seem theoretically sound, could benefit from slightly more extensive evaluation","This paper provides new methods for estimating Shapley values for feature importance that include notions of locality and connectedness. The methods proposed here could be very useful for model explainability purposes, specifically in the model-agnostic case. The results seem promising, and it seems like a reasonable and theoretically sound methodology. In addition to the theoretical properties of the proposed algorithms, they do show a few quantitative and qualitative improvements over other black-box methods. They might strengthen their paper with a more thorough quantitative evaluation. + +I think the KernelSHAP paper you compare against (Lundberg & Lee 2017) does more quantitative evaluation than what’s presented here, including human judgement comparisons. Is there a way to compare against KernelSHAP using the same evaluation methods from the original paper? + +Also, you mention throughout the paper that the L-shapley and C-shapley methods can easily complement other sampling/regression-based methods. It's a little ambiguous to me whether this was actually something you tried in your experiments or not. Can you please clarify?",7,2.0,ICLR2019 +7vCSfEs3_Hj,4,HxzSxSxLOJZ,HxzSxSxLOJZ,"Review of ""RESNET AFTER ALL: NEURAL ODES AND THEIR NUMERICAL SOLUTION""","Paper summary: + +The paper demonstrates how neural ODE models generating features for downstream tasks (or simply modelling trajectories) may rely on the discreteness of integration methods to generate features and thus fail in the exact ODE limit of integration step-size going to zero. The paper highlights particular failure modes, such as the discreteness of integration methods allowing for qualitative differences like overlapping trajectories (impossible for the exact solution of an autonomous ODE) compared to exact solutions, or quantitative differences like the accumulated error of a numerically integrated ODE resulting in useful features for downstream tasks. The paper empirically demonstrates the phenomenon that low training losses can be achieved for a range of integration methods and integration step-sizes, but that, of these models, the ones robust to changes in integration method and decreases in integration step-sizes at test time are those trained below a certain (empirically determined) integration step-size threshold. This is attributable to models trained with lower integration step-sizes maintaining features that are qualitatively the same as or quantitatively close to those features produced by the same model with smaller integration step-sizes. The paper proposes an algorithm for adapting integration step-size during training so that the resulting neural ODE model is robust to changes in integration method and integration step-size at test time. The algorithm is empirically demonstrated to achieve the same performance as grid search (for similar numbers of function evaluations). 
+ +------------------------------------------ +Strengths and weaknesses: + +I liked the paper as it raised an important question of whether and when we should interpret neural ODEs as having continuous semantics and gave a few examples of failure cases. The results of the step-size adaptive algorithm were also promising (it matched grid search but with less work). Further, the paper was clearly written and easy to understand. + +However, as it stands, I’m assigning a score of 5. I like the paper and think that it would be a good workshop paper but is not ready for the main conference. The reason for this is that the theoretical part of the paper is mostly qualitative, whilst the experiments are not extensive enough to make up for the qualitative theoretical justification. If one of these two areas were to be improved, I would be happy to increase my score. To be concrete, here are examples of theoretical and empirical questions whose answers (just one would do) would increase the paper’s score for me: + +1) How can we mathematically describe when numerically integrated trajectories cross over in terms of the time over which the ODE is integrated and on the initial separation of the trajectories? + +2) Suppose we are integrating an ODE for which we have the analytic form. Are there additional behaviours we need to watch out for? For example, after passing below a step-size where we transition from crossing trajectories to non-crossing trajectories, is it possible to transition back to crossing trajectories as we continue to decrease the step-size? Or can we rule out this case, for example, in the case of a f being continuous in the equation z’(t) = f(z)? + +3) For Lady Windermere’s Fan with the true dynamics, at what step-size does trajectory overlap cease to occur (assuming a minimum initial separation of trajectories and fixed time period)? And if we instead attempt to learn Lady Windermere’s Fan with a neural ODE, at what step-size does the neural ODE start to be robust against test-time decreases in step-size? How does this latter step-size compare to the former step-size? +------------------------------------------ +Questions and clarification requests: + +1) What was the true underlying model for figures 1 and 2? + +2) Why are the classifier decision boundaries different in figures 2a and 2b? I thought that you trained a neural ODE with h_train = 1/2 and then tested this model for both h_test = 1/2 and h_test = 1/4. + +3) I didn’t understand the connection between Lady Windermere’s Fan and the XOR problem. Does running Lady Windermere’s Fan on R^2 with an XOR labelling lead to trajectory end points that are linearly separable? If so, how did you discover this? + +4) You mention at the end of section 2.2 that “The current implementation of Neural ODEs does not ensure that the model is driven towards continuous semantics as there are no checks in the gradient update ensuring that the model remains a valid ODE nor are there penalties in the loss function if the Neural ODE model becomes tied to a specific numerical configuration.” Do you have any ideas of what directions you might head in in terms of regularising neural ODEs so that they manage to learn continuous semantics, even when trained at larger step-sizes? In particular, why did you go in the direction of an adaptive optimization algorithm, instead of, say, training the neural ODE with a randomly chosen step-size each iteration or even step? + +5) Why was the CIFAR-10 classification accuracy (~55%)? 
Previous work on neural ODEs has obtained accuracy in the 80 -95% range. Is this just due to the limited expressiveness of the upstream classifier, cf. “For all our experiments, we do not use an upstream block f_u similar to the architectures proposed in Dupont et al. (2019). We chose such an architectural scheme to maximize the modeling contributions of the ODE block.” + +------------------------------------------ +Typos and minor edits: + +- Write Initial Value Problem (IVP) on first usage of IVP. +- Fig.2 caption – “The model was trained …, we used …” -> “The model was trained …, and we used …” +- Page 8, Conclusion, line 3 – “… an continuous…” -> “… a continuous …”",5,4.0,ICLR2021 +rkeQQzTK3m,1,BygfghAcYX,BygfghAcYX,The authors present a novel bound for the generalization error of 1-layer neural networks with multiple outputs and ReLU activations. ,"It is shown empirically that common algorithms used in supervised learning (SGD) yield networks for which such upper bound decreases as the number of hidden units increases. This might explain why in some cases overparametrized models have better generalization properties. + +This paper tackles the important question of why in the context of supervised learning, overparametrized neural networks in practice generalize better. First, the concepts of \textit{capacity} and \textit{impact} of a hidden unit are introduced. Then, {\bf Theorem 1} provides an upper bound for the empirical Rademacher complexity of the class of 1-layer networks with hidden units of bounded \textit{capacity} and \textit{impact}. Next, {\bf Theorem 2} which is the main result, presents a new upper bound for the generalization error of 1-layer networks. An empirical comparison with existing generalization bounds is made and the presented bound is the only one that in practice decreases when the number of hidden units grows. Finally {\bf Theorem 3} is presented, which provides a lower bound for the Rademacher complexity of a class of neural networks, and such bound is compared with existing lower bounds. + +## Strengths +- The paper is theoretically sound, the statement of the theorems + are clear and the authors seem knowledgeable when bounding the + generalization error via Rademacher complexity estimation. + +- The paper is readable and the notation is consistent throughout. + +- The experimental section is well described, provides enough empirical + evidence for the claims made, and the plots are readable and well + presented, although they are best viewed on a screen. + +- The appendix provides proofs for the theoretical claims in the + paper. However, I cannot certify that they are correct. + +- The problem studied is not new, but to my knowledge the + presented bounds are novel and the concepts of capacity and + impact are new. Theorem 3 improves substantially over + previous results. + +- The ideas presented in the paper might be useful for other researchers + that could build upon them, and attempt to extend and generalize + the results to different network architectures. + +- The authors acknowledge that there might be other reasons + that could also explain the better generalization properties in the + over-parameterized regime, and tone down their claims accordingly. 
+ +## Weaknesses +\begin{itemize} +- The abstract reads ""Our capacity bound correlates with the behavior + of test error with increasing network sizes ..."", it should + be pointed out that the actual bound increases with increasing + network size (because of a sqrt(h/m) term), and that such claim + holds only in practice. + +- In page 8 (discussion following Theorem 3) the claim + ""... all the previous capacity lower bounds for spectral + norm bounded classes of neural networks (...) correspond to + the Lipschitz constant of the network. Our lower bound strictly + improves over this ..."", is not clear. Perhaps a more concise + presentation of the argument is needed. In particular it is not clear + how a lower bound for the Rademacher complexity of F_W translates into a + lower bound for the rademacher complexity of l_\gamma F_W. This makes the claim of tightness of Theorem 1 not clear. Also this makes + the initial claim about the tightness of Theorem 2 not clear. +",7,3.0,ICLR2019 +BJgRgt7j5r,3,HkewNJStDr,HkewNJStDr,Official Blind Review #3,"===== Update after author response + +Thanks for the clarifications and edits in the paper. + +I recommend acceptance of the paper. + +Other comments: +Definition 1 in the updated version is still too vague (""difference of what?"" -- function values? distance in norm between iterates?) -- this should be clarified. + +======== + +This paper considers the problem of sparsity-constrained ERM and asks whether one can design a variant of the stochastic hard thresholding approaches where the hard-thresholding complexity does not depend on a (sparsity dependent) condition number, unlike all previous approaches (Table 1). It proposes a method which combines SVRG-type variance reduction, with block-coordinate updates, leaving the hard thresholding operation outside the inner loop, to accomplish this goal. It provides a convergence analysis which significantly improves the previous best rates (by having both the sparsity level shat which is significantly lower (kappa_shat vs. kappa_stilde^2) as well as a condition number independent hard thresholding complexity (Table 1). An asynchronous and sparse (in the features) variant is also proposed, with even better complexity. Some standard experiments on sparse linear regression and sparse logistic regression is presented showing an improvement in both number of iterations as well as CPU time. + +I think the clarity of the paper should be quite improved (see detailed comment), hence why I think the paper is borderline, but I am leaning towards an accept given the significant theoretical improvements over the past literature (and positive empirical results), even though the algorithmic suggestion is somewhat incremental. + +The proposed Algorithm 1 seems very close to the one of Chen & Gu (2016), the paper should be more clear about this. There seems to be mainly two changes: a) extending the support projection of the gradient to the union of the sampled block with the one of the support of the reference parameter wtilde (vs. just the sampled block in Chen & Gu (2016) and b) moving the hard-thresholding iteration outside of the SVRG-inner loop. These small tweaks to the algorithm yield a significant theoretical improvement, though. + +== Detailed comments == + +Clarity: the number of long abbreviations with only one letter change make it hard to follow the different algorithms; perhaps a better more differentiating naming scheme could be used. 
Moreover, I think more background on the sparse optimization setup should be provided in the introduction or at least in the preliminaries, as I do not think the wider ICLR community is very familiar with it (in particular, no cited paper was at ICLR). For example, define early the separation in optimization error and statistical error; and point out that F(w_t) might even be lower than F(w*) as the sparsity threshold s might be much higher than s*. This will make Table 1 more concrete and less abstract for people who not are not yet experts on this particular analysis framework. + +- Table 1: I would suggest to put the rate for S2BCD-HTP instead on the last row and mention instead that the rate for ASBCD is similar under conditions on the delay; as it is interesting to already have a better gradient complexity for S2BCD vs. SBCD. + +** Questions: ** +1) In Corollary 1, how is the gradient oracle complexity defined or computed? And more specifically, how does one compare fairly the cost of doing a gradient update in Algorithm 1 on the *bigger set* S = Gtilde U G_jt vs. just G_jt for the Chen & Gu ASBCD algorithm? Is this accounted in the computation? + +2) In Figure 1, which ""minimum"" is referred to and how is it found? I suspect it is not F(w*) (as it could be higher than F(w_t)), i.e. it is *not* the minimum of (1) with s*. One natural guess is that it might be min_w F(w) s.t. ||w||_0 <= s, though I do not see any guarantee in the main paper that running the algorithm would make F(w_t) converge to such a value (i.e. all we know from Thm 1 is that F(w_t) might be within O(||nabla_Itilde F(w*)||^2) of F(w*) ultimately. Please explain and clarify! + +== Potential improvement == + +The current result in Theorem 1, which is building on a similar proof technique as the original SVRG paper, has the annoying property of requiring the knowledge of the condition number in setting the size of the inner loop iteration. I suspect that this is an artifact of using an outdated version of the SVRG algorithm. This has been solved since then by considering a ""loopless"" version of SVRG which implicitly defines the size of the inner loop in a random manner using a quantity *which does not depend on the condition number*. This was proposed first by Hofmann et al. [2015], and then re-used by Lei & Jordan [2016] and more recently by Kovalev et al. [2019] e.g. Note that Leblond et al. (2017) that you cited profusely also used this variant of SVRG. I suspect that this technique could be re-used in your case to obtain a similar result with a loopless variant (which also gives cleaner complexity results). (Though I only skimmed through your proof.) + +Caveat: the sensibility of the theory in the main paper seems reasonable, but I did not check the proofs in the appendix. + += References: +- Hofmann et al. [2015]: Variance Reduced Stochastic Gradient Descent with Neighbors, Thomas Hofmann, Aurelien Lucchi, Simon Lacoste-Julien and Brian McWilliams, NeurIPS 2015 +- Lei & Jordan [2016]: Less than a Single Pass: Stochastically Controlled Stochastic Gradient Method, Lihua Lei and Michael I. Jordan, AISTATS 2016 +- Kovalev et al. [2019]: Don't Jump Through Hoops and Remove Those Loops: SVRG and Katyusha are Better Without the Outer Loop, Dmitry Kovalev, Samuel Horvath and Peter Richtarik, arXiv 2019 +",6,,ICLR2020 +Hy0bdarZG,3,ry9tUX_6-,ry9tUX_6-,Reasonably good idea (but with lots of strong assumptions) connecting generalization of entropy SGD and PAC-Bayes risk bound. 
,"Brief summary: + Assume any neural net model with weights w. Assume a prior P on the weights. PAC-Bayes risk bound show that for ALL other distributions Q on the weights, the the sample risk (w.r.t to the samples in the data set) and expected risk (w.r.t distribution generating samples) of the random classifier chosen according to Q, averaged over Q, are close by a fudge factor that is KL divergence of P and Q scaled by m^{-1} + some constant. + +Now, the authors first show that optimizing the objective of the Entropy SGD algorithm is equivalent to optimizing the empiricial risk term + fudge term over all data dependent priors P and the best Q for that prior. However, PAC-Bayes bound holds only when P is NOT dependent on the data. So the authors invoke results from differential privacy to show that as long as the prior choosing mechanism in the optimization algorithm is differentially private with respect to data, differentially private priors can be substituted for valid PAC-Bayes bounds rectifying the issue. They show that when entrop SGD is implemented with pure gibbs sampling steps (as in Algorithm 3), the bounds hold. + +Weakness that remains is that the gibbs sampling step in Entropy SGD (as in algo 3 in the appendix) is actually approximated by samples from SGLD that converges to this gibbs distribution when run for infinite hops. The authors leave this hole unsolved. But under the very strong sampling assumption, the bound holds. The authors do some experiments with MNIST to demonstrate that their bounds are not trivial. + +Strengths: + Simple connections between PAC-Bayes bound and entropy SGD objective is the first novelty. Invoking results from differential privacy for fixing the issue of validity of PAC-Bayes bound is the second novelty. Although technically the paper is not very deep, leveraging existing results (with strong assumptions) to show generalization properties of entropy-SGD is good. + +Weakness: + a) Obvious issue : that analysis assumes the strong gibbs sampling step. + b) Experimental results are ok. I see that the bounds computed are non-vacuous. - but can the authors clarify what exactly they seek to justify ? + c) Typos: + Page 4 footnote ""the local entropy should not be .."" - with is missing. + Eq 14 typo - r(h) instead of e(h) + Definition A.2 in appendix - must have S and S' in the inequality -both seem S. + +d) Most important clarification: The way Thm 5.1, 5.2 and the exact gibbs sampling step connect with each other to produce Thm 6.1 is in Thm B.1. How do multiple calls on the same data sample do not degrade the loss ? Explanation is needed. Because the whole process of optimization in TRAIN with may steps is the final 'data dependent prior choosing mechanism' that has to be shown to be differentially private. Can the authors argue why the number of iterations of this does not matter at all ?? If I get run this long enough, and if I get several w's in the process (like step 8 repeated many times in algorithm 3) I should have more leakage about the data sample S intuitively right ? + +e) The paper is unclear in many places. Intro could be better written to highlight the connection at the expression level of PAC-Bayes bound and entropy SGD objective and the subsequent fix using differentially private prior choosing mechanism to make the connection provably correct. Why are all the algorithms in the appendix on which the theorems are claimed in the paper ?? + +Final decision: I waver between 6 and 7 actually. 
However I am willing to upgrade to 7 if the authors can provide sound arguments to my above concerns.",6,3.0,ICLR2018 +6gxj0JxV-v1,4,uUX49ez8P06,uUX49ez8P06,Solving continual learning problems with minimal expansion of network parameters.,"Efficient Architecture Search for Continual Learning +- Summary +This paper aims to solve continual learning problems with minimal expansion of network parameters. The authors propose Continual Learning with Efficient Architecture Search (CLEAS), which is equipped with a neuron-level NAS controller. The controller selects 1) the most useful previous neurons to model the new task (knowledge transfer) and 2) a minimum number of additional neurons. The experimental results show that the proposed method outperforms state-of-the-art methods on several continual learning benchmark tasks such as MNIST Permutation, Rotation MNIST, and Incremental CIFAR-100. + +- Strong points + 1. The proposed framework selectively comprises neurons for new tasks, and training only newly added weights enables zero-forgetting of previously learned tasks. + 2. The experimental results showed performance improvements compared with the previous algorithms (PGN, DEN, RCL) while preserving or reducing the number of parameters, especially in the case of CIFAR-100. + +- Weakness + 1. From the perspective of real-world problems, the neuron-level decision through the RNN controller costs a long training time. The authors in this paper demonstrated small-scale neural networks such as 3-layers. Although the authors mentioned like “On the positive note, the increase in the running time is not substantial.”, I’m wondering it is plausible, and I couldn’t find what the running time in the figure 7 means. + 2. It is not clear for me the rationale behind the sequential states of neurons and the authors’ claim that “This state definition deviates from the current practice in related problems that would define a state as an observation of a single neuron.”. Also, what does “standard model” mean in page 4? Is the sequential state invariant to permutations in neuron topology? + +- Questions + 1. How did you set the maximum of u_i neurons in your experiments? + 2. Do you have any plan to publish the source codes for reproducibility? + 3. Please address and clarify the cons above. + +- Additional feedback + - Typo: wrong citation for ENAS (line 7, “Neural Architecture Search” paragraph in section 2) + - I would like to recommend denoting the accuracy of MWC in section 4.2.",6,3.0,ICLR2021 +uZYcBOsX7-_,1,OLOr1K5zbDu,OLOr1K5zbDu,Review for Submission 2589,"########################################################################## +Summary: + +The authors present TRIPS, a Triple-Search framework to jointly search for the optimal network, precision and accelerator for a given task with max accuracy and efficiency in a differentiable manner. TRIPS focuses on efficiently exploring the large search space that previous reinforcement learning based solution cannot afford to due to poor scalability. + +The authors propose a heterogeneous sampling approach that enables simultaneous update of weights and precision without biasing the precisions options or exploding the memory consumption. Further, they also formulate a novel co-search pipeline that enables a differentiable search for optimal accelerator. 
+ +Lastly, the authors present ablation studies and comparison with SOTA solutions that either solve the triple search problem described above or search (network, precision, accelerator) over a subset of its search space. + +########################################################################## +Reasons for score: + +The paper presents a thorough description of TRIPS a joint search algorithm that enables scalable search of optimal network, precision and accelerator for maximum accuracy and efficiency. The experiments demonstrate the improvement in accuracy and efficiency compared to a wide range of previous solutions. Further ablation studies show breakdowns of accuracy and efficiency improvements derived from key enablers of TRIPS. With a few clarifications for the questions mentioned in the Cons section this paper should be accepted. + + +########################################################################## +Pros: + +1. This work presents a novel attempt to explore the large search space of network, precision and accelerator design, including detailed parameters such as tiling strategies, buffer sizing, processing elements (PE) count and connections etc. +2. Section 2 presents a detailed overview of the past papers that have attempted to optimize all subsets of the search space explored by the proposed solutions. +3. Authors do a great job of either estimating or collecting data for comparing the proposed solution with the many different approaches discussed in Section 2. +4. Figure 3 demonstrates good data establishing the superiority of TRIPS optimal solution compared with other baselines. It would be helpful if the authors could shed light on how they tune TRIPS to generate a range of different solutions that tradeoff max accuracy with max FPS. +5. Figure 4 presents a useful comparison to previous solutions that do not optimize over precision space. TRIPS shows better accuracy and FPS even with fixed precision. +6. Ablation studies in section 4.3, are useful in breaking down the incremental improvement in the accuracy vs FPS tradeoff curve of TRIPS. +7. Appendix sections provide useful details about TRIPS training and accelerator design space explored. Additionally, section C shows another useful contrast of TRIPS with previous solutions based on SOTA expert-designed and tool-generated solutions. Remaining sections provide useful insights on the searched space. + +########################################################################## +Cons: + +1. Authors should consider citing the following in Section 2, DNN accelerators para, since it explores mapping network to accelerator theme of this section: +Gao, Mingyu, et al. ""Tetris: Scalable and efficient neural network acceleration with 3d memory."" Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems. 2017. +2. Towards the end of Section 2, authors mention that TRIPS selects network, precision and accelerator that enable better transferability to other tasks/applications. However, there are no results or discussion on transferability of TRIPS models to other applications, consider removing this line. +3. While Section 3 is comprehensive in its detailed description of TRIPS implementation and key enablers. It is somewhat difficult to read, consider reordering the implementation discussion and Section 3.2/3.3. +4. Section 4.2, it’s not clear why the authors selected the hardware limitation of 512DSP units. +5. 
Further, Section 4.2, Co-exploration of networks and accelerators part, compares previous solutions that optimize network and accelerator while keeping the precision fixed (searched optimal value). It would be helpful if the authors added the precisions used for the previous work datapoints as well. Are the selected datapoints for previous papers, the max accuracy points? How do authors select the 16bit value they choose for Figure 4 fixed precision TRIPS? +6. Additionally, the authors do not present TRIPS-16bit datapoint for Figure 4c), please add that. +7. The analysis presented in Table 2 seems highly skewed, with almost 1000x improvement in Area and EDP. Knowing that the support for heterogenous functionalities is costing baseline ASIC implementations dearly makes the comparison unfair. I recommend that authors attempt to minimize the impact of such features, since they make the baseline ASIC accelerators more general which the proposed solution does not support. + +########################################################################## +Questions during rebuttal period: + +Kindly address a few questions in the Cons section. + +######################################################################### +Some typos: +1. Section 1: “to optimize the acceleration efficiency for* a* given* DNN” +2. Section 2: Hardware-aware NAS subsection, “acceleration efficiency, thus can lead* to sub-optimal solutions.” +3. Section 2: Co-exploration/search techniques subsection: “Built upon prior art*, our” +4. Section 4.3: Comparison with sequential optimization subsection: “and hardware side, a natural* design” +5. Appendix D: Insights for the searched network subsection: “find that while wide-shallow* networks”, reword the whole sentence. +",6,2.0,ICLR2021 +uC4zX3qy49r,2,zQTezqCCtNx,zQTezqCCtNx,Easy to understand and interesting while requires more explanations,"This paper investigates the adversarial robustness from the activation perspective. Specifically, the authors analyzed the difference in the magnitude and distribution of activation between adversarial examples and clean examples: the activation magnitudes of adversarial examples are higher and the activation channels are more uniform by adversarial examples. Based on the above interesting findings, the authors claim that different channels of intermediate layers contribute differently to the class prediction and propose a Channel-wise Activation Suppressing (CAS) method to suppress redundant activations, which can improve the DNN robustness. + +Some highlights in this paper: ++ The CAS strategy is simple and can be easily applied to existing models. Combining CAS with the existing adversarial training methods leads to better DNN robustness. ++ The experiments are well-conducted and convincing. The authors not only provided ablation experiments to verify the effectiveness of CAS, but also provided both the performance of the last epoch and the performance of early stop, which confirmed that CAS can improve the DNN robustness. ++ The paper is well-written and the idea is easy to follow. + +However, there are some downsides. I’d like more details about: +- Adversarial training inhibits the magnitude of activation, what is the connection between this and network robustness? +- The closer the activation distribution of the adversarial example is to that of the clean example, the better the robustness of the network. It would be good to provide more discussions and explanations here. + +Overall the paper is easy to understand and interesting. 
+",7,4.0,ICLR2021 +rkhhu9bEg,2,Byiy-Pqlx,Byiy-Pqlx,interesting new ,"The paper proposes a new memory access scheme based on Lie group actions for NTMs. + +Pros: +* Well written +* Novel addressing scheme as an extension to NTM. +* Seems to work slightly better than normal NTMs. +* Some interesting theory about the novel addressing scheme based on Lie groups. + +Cons: +* In the results, the LANTM only seems to be slightly better than the normal NTM. +* The result tables are a bit confusing. +* No source code available. +* The difference to the properties of normal NTM doesn't become too clear. Esp it is said that LANTM are better than NTM because they are differentiable end-to-end and provide a robust relative indexing scheme but NTM are also differentiable end-to-end and also provide a robust indexing scheme. +* It is said that the head is discrete in NTM but actually it is in space R^n, i.e. it is already continuous. It doesn't become clear what is meant here. +* No tests on real-world tasks, only some toy tasks. +* No comparisons to some of the other NTM extensions such as D-NTM or Sparse Access Memory (SAM) (https://arxiv.org/abs/1610.09027). Although the motivations of other NTM extensions might be different, such comparisons still would have been interesting. +",6,3.0,ICLR2017 +5Mc64UkeQFv,1,ohdw3t-8VCY,ohdw3t-8VCY,Official Blind Review #1,"Paper Summary: +* This paper proposes a framework for controllable summarization, CTRLsum. It is different from standard summarization models that CTRLsum uses a set of keywords extracted from the source text automatically or descriptive prompts to control the summary. Experiments with three domains of summarization datasets and five control aspects. + +Strengthes: +* The authors investigated the effectiveness of the proposed model through extensive experiments. + +Weaknesses: +* The proposed method that uses keywords as an additional input text is almost the same as CIT (Saito et al., 2020), and the scores of CTRLsum on the CNNDM dataset reported in Table 7 does not outperformed those of CIT. Also, it is not novel to use descriptive prompts to control natural language generation. + * Saito et al.: Abstractive Summarization with Combination of Pre-trained Sequence-to-Sequence and Saliency Models. CoRR abs/2003.13028 (2020) +* I think that the author's claim, ""keywords and prompts are complementary"", is not evaluated fully. + +Questions: + +* With respect to contribution summarization, did you evaluate CTRLsum(keyword without prompt) and CTRLsum(prompt without keywords)? The control tokens ""the main contributions of this paper are : ( 1 )"" is far from the keywords used during training, and so I think that the keywords are not effective for contribution summarization. In fact, BART that uses prompt worked well for contribution summarization. + +* Did you evaluate the ablation tests with respect to the special token ""|"" and keyword dropout? + +* Can CTRLsum control the generation with multiple aspects (length and entity control, length and QA control, etc.) simultaneously? The length of summaries generated by CTRLsum is strongly dependent the number of keywords, and so I think it is difficult to simultaneously control multiple aspects including length control. + +Update: +Thank you for the answers to my questions and additional experiments. 
",6,4.0,ICLR2021 +S1eXG9ujFr,2,B1xeZJHKPB,B1xeZJHKPB,Official Blind Review #3,"The paper has two main messages: 1- Averaging over the explanation (saliency map in the case of image data) of different methods results in a smaller error than an expected error of a single explanation method. 2- Introducing a new saliency map evaluation method by seeking to mitigate the effect of high spatial correlation in image data through grouping pixels into coherent segments. The paper then reports experimental results of the methods introduced in the first message being superior to existing saliency map methods using the second message (and an additional saliency map evaluation method in the literature). They also seek to magnify the capability of the 2nd message's evaluation method by showing its better capability at distinguishing between a random explanation and an explanation method with a signal in it. + + +I vote for rejecting this paper for two main reasons: the contributions are not enough for this veneue, and the paper's introduced methods are not backed by convincing motivations. The first message of the paper is trivial and cannot be considered as a novel contribution: the ''proof'' is basically the error of the mean is smaller than the mean of the errors. Additionally, this could have been useful if the case was that there was a need for playing the safe-card: that is, all of the existing methods have equal saliency map error and averaging will decrease the risk. Not only authors do not provide any evidence but also both the experimental results of the paper itself (results in Table 2 and Fig 4 are disproving this assumption) and the existing literature disprove it. Even considering this assumption to be correct, the contribution is minimal to the field and benefits of averaging saliency maps have been known since the SmoothGrad paper. The second contribution is an extension of existing evaluation methods (e.g. SDC) where instead of removing (replacing by mean) individual pixels, the first segment the image and remove the segments. The method, apart from being very similar to what is already there in the literature, is not introduced in a well-motivated manner. The authors claim that their evaluation method is able to circumvent the problem with removing individual pixels (which is the removed information of one pixel is mitigated by the spatial correlations in the image and therefore will not result in a proportional loss of prediction power) by removing ''features'' instead. Their definition of a feature, though, are segments generated by simple segmentation methods. There is a long line of literature showing the incorrectness of this assumption; i.e. a group of coherent nearby pixels does not necessarily constitute a feature seen by the network and does not necessarily remove the mentioned problem of the high correlation of pixels. This method does not remove ""the interdependency of inputs"" for the saliency evalatuion metric. Even assuming the correctness of this assumption, the contribution over what already exists in the literature is not enough for this venue. + +A few suggestions: + +* The authors talk about a ''true explanation''. This concept needs to be discussed more clearly and extensively. What does it mean to be a true evaluation? It is also important to prove that the introduced evaluation metric of IROF would assign perfect score for a given true explanation. 
+ +* The mentioned problem of pixel correlations that IROF seeks to mitigate is also existing in other modalities of data and the authors do not talk about how IROF could potentially be extended. + +* The qualitative results in the text and the appendix do not show an advantage. It would be more crips if the authors could run simple tests on human subjects following the methods in the previous literature. + +* There are many many grammatical and spelling errors in the paper. The font size for Figures is very small and unreadable unless by zooming in. + +* Many of the introduced heuristics are not backed by evidence or arguments. One example is normalizing individual saliency maps between 0-1 which can naturally be harmful; e.g. aggregating a noisy low-variance method with almost equal importance everywhere (plus additive noise) and a high-variance one which does a good job at distinguishing important pixels - AGG-VAR will not mitigate this issue. + +One question: + +The authors introduced aggregation as a method for a ''better explanation''. It has been known that another problem with saliency maps is robustness: one can generate adversarial examples against saliency maps. It would be an interesting question to see whether aggregation would improve robustness rather than how good the map itself is.",3,,ICLR2020 +y8ty_4C6Md,3,8znruLfUZnT,8znruLfUZnT,The method seems to be a small extension of CDLNet and the evaluation could be improved.,"## Summary + +The paper proposes a denoising method with a neural network inspired from convolutional dictionary learning. In the proposed method, one atom of the dictionary is constrained to be a low frequency filters and all other atoms are to be high frequency atoms. The authors also propose to make the threshold depends on the noise level to better adapt to different noise level and to use strided convolution to reduce the computational cost of the method. The method is then evaluated on images from BSD68. + +---- + +*For extra citations, see bibtex at the end. Sorry if I ended up putting a lot of them but I feel the bibliography is a bit lacking.* + +## Overall assessment + +- A major weakness of this work is that it seems to be equivalent to doing CDL on a signal filtered with high pass filter $g^{-1} * y_i$ because $\theta_0 = 0$. Indeed, using parseval on the data fitting term and the equivalence between convolution and point wise multiplication in the frequency space, one can easily show that $z^1_i$ will correspond to the low frequency part of $y_i$ and as the filters $d^i$ all only have high frequency, they will try to reconstruct the high frequency part of $y_i$. This means that only keeping high frequency of $y_i$, one can use the classical CDL approach algorithm to recover the same method as the one proposed in this paper. To me, the method boils down to stating an equivalent model where the preprocessing is integrated in the model. If the authors think I am wrong, they could try to perform experiment 3.1 and show that there is significant difference between filtering $y$ and using the proposed method. At least, this point should be mentionned and discussed in the manuscript. Note that such approach of integrating multiple component can be related to the work on morphological component analysis (see `[Elad2005]`) and the integration of preprocessing step in CDL have recently been proposed with detrending in `[Lalanne2020]`. +- Apart from this point, the novelty of this work is not very significant. 
The adaptation of the thresholds with the input noise level in the context of denoising has been proposed in `[Isogawa2017]` and `[Ramzi2020]` and the use of strided convolution are proposed in `Simon & Elad (2019)`. +- The effect of the stride on the denoising is not evaluated. This would be interesting to look at how the denoising performance change when changing the stride. In particular, how does FCDLNet compare with FCDLNet with `stride` $\in\\{1, 3, 4\\}$. Does it impact the performances a lot? +- It would also be interesting to study the impact of the noise level estimator. If the estimator is biased, how does it impact the performances? This is also related to Question 3, as I suppose if the estimator is biased but used for training, the thresholds $\nu$ can also be adapted to cope for this bias. This would be interesting to add such experiments. +- For the computational complexity, I would be interested to see comparison with modern convolutional sparse coding algorithm such as the one in `sporco` (`[Wohlberg2017]`) or the LGCD algorithm from `[Moreau2019]`, which scales to much larger images. +- The writing is not very clear and not always correct and there are many typos (see bellow). + + +## Some question + +1. Could the authors comment on why the learned network is more interpretable than a classical network? +2. Why does the authors change the constraint to the one in (7) ? This makes the model somewhat incomparable with other approaches as $d^j$ won't have the unit norm property. Moreover, I don't really see the point, as simply stating that $\\|d^j\\| \le 1$ also constrains appropriatly $\\|\tilde d^j\\|$ to be smaller than roughly $\frac{1}{\\|g\\|}$, which is a similar constraint but comparable to the original one. +3. When training the network, is the true noise level given as an input of the model or is it also estimated using the wavelet based estimator? This is unclear from the text and should be clarified. +4. What is the training time for the proposed model? This is not discussed, but I guess this is similar to the training time of other models. + + +## Minor comments, nitpicks and typos + +- For the citation, when they are between parenthesis, could the authors use `\citep` to have proper formating. +- p.1: `Mairal et al. (2014)`: This is not an arXiv paper, the proper citation is `[Mairal2014a]`. +- p.1: The first paragraph could be improved a lot. It is unclear why this would be an inverse problem (there is no sensing matrix here, the dictionary correspond to the prior knowldge in the inverse problem literature). This is mainly denoising so the paragraph should be fixed to reflect this. +- p.1: `a linear combination of a collection of vectors`: this is simply linear representation. The sparse linear representation also promotes the usage of only a few atoms. +- p.1: `where $n_i \sim \mathcal N(0, \sigma^2 \bm I)$.` The authors should add in plain word that this `is an additive Gaussian white noise.` +- p.2: `is nontrivialy` -> `non trivialy`? What does it mean to be trivialy related? I would remove this as it is unclear. +- p.2: `the Convolutional Sparse Coding (CSC) model has been introduced`: the original work introducing such model is `[Grosse2007]`. +- p.3: `interpretabile` -> `interpretable` +- p.4: Note that there is a third option for training LISTA networks which is to use loss in Eq.(5) as a training loss, as it is done for instance in `[Ablin2019]`. 
+- p.5: `the Guassian distribution prior` -> `Guassian` +- Table.1: Could the authors highlight the leading method in the table? +- p.7: It is unclear whether the timing for FCDLNet is performed for full image denoising or on 128x128 patches. +- Figure.4: Could the authors add the error bars in this plot? As the gain is small between CDLNet and FCDLNet, it would be interesting to see the confidence here. + +### References + +```bibtex +@inproceedings{Ablin2019, + title = {Learning Step Sizes for Unfolded Sparse Coding}, + booktitle = {Advances in {{Neural Information Processing Systems}} ({{NeurIPS}})}, + author = {Ablin, Pierre and Moreau, Thomas and Massias, Mathurin and Gramfort, Alexandre}, + year = {2019}, + pages = {13100--13110}, + address = {{Vancouver, BC, Canada}}, + archivePrefix = {arXiv}, + copyright = {All rights reserved}, + eprint = {1905.11071}, + eprinttype = {arxiv} +} + +@article{Elad2005, + title = {Simultaneous Cartoon and Texture Image Inpainting Using Morphological Component Analysis ({{MCA}})}, + author = {Elad, Michael and Starck, J. L. and Querre, P. and Donoho, D. L.}, + year = {2005}, + volume = {19}, + pages = {340--358}, + journal = {Applied and Computational Harmonic Analysis}, + number = {3}, + pmid = {16370462} +} + +@article{Grosse2007, + title = {Shift-{{Invariant Sparse Coding}} for {{Audio Classification}}}, + author = {Grosse, Roger and Raina, Rajat and Kwong, Helen and Ng, Andrew Y.}, + year = {2007}, + volume = {8}, + pages = {9}, + journal = {Cortex} +} + +@article{Isogawa2017, + title={Deep shrinkage convolutional neural network for adaptive noise reduction}, + author={Isogawa, Kenzo and Ida, Takashi and Shiodera, Taichiro and Takeguchi, Tomoyuki}, + journal={IEEE Signal Processing Letters}, + volume={25}, + number={2}, + pages={224--228}, + year={2017}, + publisher={IEEE} +} + +@inproceedings{Lalanne2020, + title = {Extraction of {{Nystagmus Patterns}} from {{Eye}}-{{Tracker Data}} with {{Convolutional Sparse Coding}}}, + booktitle = {2020 42nd {{Annual International Conference}} of the {{IEEE Engineering}} in {{Medicine}} \& {{Biology Society}} ({{EMBC}})}, + author = {Lalanne, Clement and Rateaux, Maxence and Oudre, Laurent and Robert, Matthieu P. 
and Moreau, Thomas}, + year = {2020}, + month = jul, + pages = {928--931}, + publisher = {{IEEE}}, + address = {{Montreal, QC, Canada}}, +} + +@article{Mairal2014a, + title = {Sparse {{Modeling}} for {{Image}} and {{Vision Processing}}}, + author = {Mairal, Julien and Bach, Francis and Ponce, Jean}, + year = {2014}, + volume = {8}, + pages = {85--283}, + journal = {Foundations and Trends\textregistered{} in Computer Graphics and Vision}, + number = {2-3} +} + +@article{Moreau2019, + title={Distributed Convolutional Dictionary Learning (DiCoDiLe): Pattern Discovery in Large Images and Signals}, + author={Moreau, Thomas and Gramfort, Alexandre}, + journal={arXiv preprint arXiv:1901.09235}, + year={2019} +} + +@inproceedings{Ramzi2020, + title = {Wavelets in the {{Deep Learning Era}}}, + booktitle = {European {{Signal Processing Conference}} ({{EUSIPCO}})}, + author = {Ramzi, Zaccharie and Starck, Jean-Luc and Moreau, Thomas and Ciuciu, Philippe}, + year = {2020}, + month = jul, + pages = {1417--1421}, +} + +@inproceedings{Tolooshams2018, + title = {Scalable Convolutional Dictionary Learning with Constrained Recurrent Sparse Auto-Encoders}, + booktitle = {{{IEEE International Workshop}} on {{Machine Learning}} for {{Signal Processing}} ({{MLSP}})}, + author = {Tolooshams, Bahareh and Dey, Sourav and Ba, Demba}, + year = {2018}, + archivePrefix = {arXiv}, + eprint = {1807.04734v1}, + eprinttype = {arxiv} +} + +@inproceedings{Wohlberg2017, + title={SPORCO: A Python package for standard and convolutional sparse representations}, + author={Wohlberg, Brendt}, + booktitle={Proceedings of the 15th Python in Science Conference, Austin, TX, USA}, + pages={1--8}, + year={2017} +} + +```",4,4.0,ICLR2021 +WIoGXqSMrOU,3,zfO1MwBFu-,zfO1MwBFu-,"Good main idea, but hard to judge whether approximations are justified and whether method would work well on more complex data","**Update after discussion with authors** +I want to thank the authors for their incredibly detailed responses and engaging so actively in the discussion. Some of my criticism could be addressed, while other issues are still somewhat open. If the main merit of the paper is to make the ""sequential VAE community"" aware of issues that have been discussed and addressed before, then I think the paper does an OK job at that (though I'm not entirely sure what community that is, the issues have been discussed before in the fields of vision and VAEs with autoregressive decoders). + +I want to strongly encourage the authors to be as precise as possible when describing the novelty - maximizing the mutual information that a representation carries w.r.t. some relevance variable while simultaneously minimizing information that it carries w.r.t. to another variable is NOT novel. What is novel is the application of that principle to separating ""global"" from ""local"" information in sequential data (and how to actually perform this originally intractable optimization in practice). I also want to encourage the authors to state what's known and what's new as clearly as possible and improve the quality and clarity of the ""educational"" review of why maximizing mutual information is not enough as much as possible. + +Viewing the paper as ""showing how a known problem also appears when separating global from local information, and how to apply known solution-approaches to the problem in this specific context"", shifts the relative importance of the issues raised by me. 
Essentially, that view emphasizes the paper as mostly an application paper (rather than a novel theoretical contribution). Accordingly (but please make sure that that shift in view is also clear in the final paper), I am weakly in favor of accepting the paper and have updated my score accordingly. + +--- + +**Summary** +The paper tackles the problem of separating ‘global’ from ‘local’ features in unsupervised representation learning. In particular, the paper tackles a common problem in autoencoders where the decoder (generative model) is autoregressive and conditioned on a variable. Ideally, the latter variable captures all global information (such as e.g. speaker identity) whereas the autoregressive model deals with generating local structure (such as e.g. phonemes). As the paper points out, capturing only global information (and all of it) in the conditioning variable alone is notoriously difficult with standard variational autoencoder objectives, and several solutions have been proposed in the past. In this paper the idea is to add an explicit penalty for statistical dependence (mutual information) between the global and the local random variable. This intractable objective is simplified with a series of approximations, leading to a novel training objective. Results are shown for speech data, and MNIST/FashionMNIST, where the proposed training objective outperforms a beta-VAE objective and an objective that explicitly aims to maximize information on the global variable. + +--- +**Contributions, Novelty, Impact** + +1) Analysis of shortcomings of mutual-information maximization to regularize latent representations into capturing ‘global’ features. This topic has been widely discussed in the literature before, typically in the context separating nuisance factors from relevant variations in the data, or more broadly: separating relevant from irrelevant information (which is canonically addressed in the information bottleneck framework of course). Most of this previous discussion was aimed at supervised learning ([2]), but there is a considerable body of work in unsupervised learning as well ([1] discusses the same issue but with more clarity), and some recent, very relevant work targeting VAEs with autoregressive decoders as well ([3] is among the state-of-the-art models). The paper provides a recap of this literature, but misses some key references, and the clarity of the writing (pages 1-4) could be improved (see my comments on clarity below). Therefore I would rate the impact of this contribution as low. + +2) Proposal of a novel regularizer. The main idea behind the regularizer Eq. (8) is good, but certainly not novel - it has been broadly discussed in the literature and implemented in various ways. The merit is thus in the particular derivation and approximations that lead to the objective in Eq. (13) and (14). To me the derivation seems correct, though the precise motivation is somewhat unclear (what shortcomings of alternative approaches are addressed here, e.g. using the density ration trick?). I personally think that there is sufficient novelty, but in the current manuscript it is hard to assess whether the novel method has benefits compared to strong competitor methods (which are unfortunately missing from the experiments). + +3) Experiments on a speech dataset (using a state-space-model decoder), and MNIST/FashionMNIST (using a PixelCNN). 
Results indicate that the extracted latent space does capture global features slightly better than a beta-VAE, or (quite a bit better than) a MI-maximizing VAE. There is also some indication that local features capture less global information with the proposed method compared to a beta-VAE. These results are promising, but not surprising since beta-VAE and MI-VAE were not designed to solve the shortcomings that the method is trying to address. For results to be more convincing and stronger, it would be good to compare against alternative approaches that have the same objective, such as e.g. [1] and [3]. Additionally more control experiments and ablations as well as reporting more metrics (l(x;z) and I(z;s), or proxies/approximations thereof) would strengthen the findings and thus the potential impact (see my comments on improvements below). + +(I am not an author of any of these) +[1] Moyer et al. Invariant Representations without Adversarial Training, 2018 +[2] Jaiswal et al. Discovery and Separation of Features for Invariant Representation Learning, 2019 +[3] Razavi et al. Generating Diverse High-Fidelity Images with VQ-VAE-2, 2019 + +--- +**Score and reasons for score** +The paper addresses an important problem that has received attention in the literature for at least two decades (the InfoBottleneck framework lays the theoretical foundations here). The particular application to: (i) unsupervised learning, and (ii) global-conditioned autoregressive (VAE) models is very timely and has received less attention in the literature (but there are some papers). + +My main issue is that the paper addresses two problems: (A) separating global from local information, (B) avoiding that autoregressive decoders ignore the global latent variable. Clearly stating both problems, reviewing the literature for each of them, and then showing how the paper solves both of them (and showing experimental results for both of them) would really help with clarity and readability of the paper. It would also help flesh out the novel contributions made by the paper. Additionally, it is not entirely clear how well the proposed objective actually addresses (A) and (B) in the experiments. There is some good indication for (A), but it is not directly measured (e.g. by estimating I(x:z) and I(s;z)), the effect is only shown indirectly via AoLR and mCAS (or Err(z) and Err(s)). The same is true for (B): there is some that the generative model does not ignore the global latent code via the mCAS experiments, but it is quite indirect (also looking at the generated examples in appendix K raises some doubts about diversity of the generative model). + +Overall I think the ingredients for a good paper are there, but they are not quite coming together yet. A deeper look into the empirical results (control experiments, additional metrics), and a comparison against strong competitor methods are needed. My recommendation would also be to really focus on the new objectives (Eq. 13 and 14) and discuss in more detail how they differ from competitor approaches and what the theoretical/empirical advantages of these differences are (for instance I am personally not yet fully convinced that using the density-ratio-trick with a neural network classifier will always work well in practice). If all of that were in place, I think the paper would be significantly stronger and could potentially have wide impact. I thus recommend a major revision of the work, which is not possible within the rebuttal period. 
Below are concrete suggestions for improvements, and I will of course take into account the other reviews and authors’ response. + +--- +**Strengths** + +1) The problem (separation of global and local info in variational unsupervised representation learning with autoregressive decoders) is timely and important, and has been somewhat neglected in the representation learning community (though there is work out there, and the same problem has been discussed extensively in a related context, such as e.g. supervised learning). + +2) The Method builds on a body of previous great work and applies it in an interesting context (global vs. local features). + +--- +**Weaknesses** + +1) Merits of the method somewhat unclear - the motivation/derivation when going from Eq. 8 to Eq. 13, 14 is a bit ad-hoc. What alternatives are there to the choices/approximations made? What are the advantages/disadvantages of these? Answering this might also involve some control experiments and ablations. + +2) The experiments show somewhat indirectly that the goals were achieved. There is some empirical evidence that the method is working to some degree, but it remains unclear whether e.g. the learned z capture only global information (and all of it), and whether s captures only local information (and how much of it). What’s needed here are additional results/experiments. + +3) The current writing is ok but could be improved. I think it would be helpful to clearly state the problems and previous approaches of solving them (and the issues with these previous approaches). This would make it easier to see how the proposed method fits into the wider picture and which specific problem it addresses/improves upon. I think Alemi 2018 (cited in the paper) and [1] mentioned above do a very good job of describing the overall problem.. + +--- +**Correctness** +The derivation of the method looks ok to me, though it would be nice to justify the approximations made and attempt to empirically verify that they do not have a severe impact on the solution. The conclusions drawn from the experiments are broadly ok, but since the evaluation measures the desired properties in a quite indirect way, the generality of the findings and the extent to which the method solves the problem (quantitatively) remain somewhat unclear. + +--- +**Clarity** +It took me a bit longer to follow the main line of arguments than it should have (which might of course be my fault). It’s a bit hard to pinpoint to a specific paragraph, but perhaps the following suggestions are helpful for improving readability. It might be worth clearly stating the main problems (denoted (A) and (B) further up in ‘Score and Reasons for Score’) and separately discussion how they have been addressed (and what the remaining issues are) and how the paper addresses them. Currently this is entangled in the derivation of the method. + +It would probably also help to have a short paragraph that summarises the novel contributions clearly (which makes it clear what’s novel and what’s been proposed before). + +--- +**Improvements (that would make me raise my score) / major issues** + +1) Comment on all assumptions made when going from Eq (8) to (13), (14). Are these assumptions justified in practice? Would there be alternative choices, and if yes what are the downsides of these alternative choices? Some of the assumptions will then lead to further control experiments that would ideally be included in the paper. 
One example (perhaps the most important one) is below in 2) + +2) Neural network classifiers are notoriously known to be ill-calibrated (typically having over-confident probability estimates). This could be problematic in the DRT approximation since the discriminator’s output probability crucially matters! Is the discriminator well calibrated in practice? How robust is the method against calibration issues? Is the problem expected to get worse when scaling to more complex data and bigger network architectures? This needs to be discussed, but ideally some points are also verified empirically. + +3) beta-VAE and MI-VAE are ok baselines, but are not sufficient to show that the method performs very well. These two baseline methods have not been designed to address the main issue (separating global from local information) - it is thus not too surprising that the proposed method performs well. What’s needed is comparisons against strong baselines, e.g. a (hierarchical) VQ-VAE2 (ref [3] further up). Given that the method only slightly outperforms beta-VAE on the metrics shown (which has no explicit incentive to capture global information) this is important. + +4) Report additional metrics (for each experiment it would be good to also report: reconstruction error, estimates of I(x;z) and I(z;s)). As \gamma is varied, does the method lead to consistent increase in I(x;z) and decrease in I(z;s)? Are the values for the latter two significantly better than when using beta-VAE/MI-VAE? + +5) Reporting AoLR and mCAS with a logistic regressor/classifier is ok, because it says something about latent-space geometry which could be interesting. But for the paper it is more important to capture the exact amount of global information captured by z and s. Therefore it would be good to show additional results for AoLR and mCAS where the regressor/classifier is a powerful nonlinear model (a deep neural net). + +6a) Alemi et al. 2018 gets cited quite a bit in the paper, but is not very well represented in the paper. In particular: the paper proposes a quantitative measure as a diagnostic to see how much information is captured by the latent variable and how much of that is used/ignored by the decoder (which leads to the definition of several operating regimes, such as the “auto-decoding” regime). Why not report the same measure in the current method? + +6b) Alemi et al. 2018 actually propose a modified objective to target a specific rate. They empirically observe that a beta-VAE with beta<1 *in their experiments* leads to the VAE operating in the desired regime. As far as I understand they do not propose this as a general solution to fix decoders ignoring latent codes. This should be mentioned in the paper. As a consequence 6a) becomes even more important, or without any verification beta-VAE becomes an even weaker baseline, meaning that comparison against strong methods becomes more important. + +--- +**Minor comments** + +a) Eq. (10) should be an inequality, because I(z;s) is upper bounded on r.h.s.? + +b) How was it determined that “alpha=1 works reasonably”, is this based on some control experiments? + +c) Eq (13). Why this particular mixing in of the KL-term, why not multiply KL(s) with (1-\gamma) as well? + +d) Table 1: report the reconstruction error. 
In particular, for high \gamma is there still reasonable reconstruction performance (and thus separation into global z and local s), or is all information except global information discarded and s essentially does not capture much meaningful information anymore, making good reconstruction impossible? + +e) Fig 3a - is the x-axis ELBO or KL? + +f) Fig 3, Table 1: ideally report multiple repetitions with error bars. + +g) For \gamma=0.6 in appendix K, there seems to be very little diversity in samples drawn from either model. This should be mentioned more clearly in the main text. +",6,3.0,ICLR2021 +ryxaWOnSKH,1,BJgZBxBYPB,BJgZBxBYPB,Official Blind Review #3,"The problem addressed by this paper is the estimation of trajectories of moving objects thrown / launched by a user, in particular in computer games like angry birds or basketball simulation games. A deep neural network is trained on a small dataset of ~ 300 trajectories and estimates the underlying physical properties of the trajectory (initial position, direction and strength of initial force etc.). A new variant of deep network is introduced, which is based on an encoder-decoder model, the decoder being a fully handcrafted module using known physics (projectile motion). + +I have several objections, which can be summarized by the simplicity of the task (parabolic trajectories without any object/object or object/environment collisions / interactions), the interest of the task for the community (how does this generalize to other problems?), and the writing and structuring of the paper. I will detail these objections further in the rest of the review. + +Learning physical interactions is a problem which has received considerable attention in the computer vision and ML communities. The problem is certainly interesting, but I think we should be clear on what kind of scientific knowledge we want to gain by studying a certain problem and by proposing solutions. The tasks studied by the community are mostly quite complex physical phenomena including multiple objects of different shapes and properties and which interact with each other. All these phenomena can be simulated with almost arbitrary precision with physics engines, and these engines are mostly also used for generating the data. In other words, the simulation itself is solved and is not the goal of this body of work. The goal is to learn differentiable models, which can be used as inductive bias in larger models targeting more general tasks in AI. + +Compared to this goal, the proposed goal is far too easy: learning projectile motion is very easy, as these trajectories can be described by simple functions with a small number of parameters, which also have a clear and interpretable meaning. The simplicity of the task is also further corroborated by the small number of samples used to estimate these parameters (in the order of 300). A further indication is the fact, that the decoder in the model is fully hardcoded. No noise modelling was even necessary, which further corroborates that a very simple problems is addressed. + +In order words, I am not really sure what kind of scientific problem is solved by this work, and how this knowledge can help us to solve other problems, harder problems. + +My second objection is with the written form of the paper. The paper is not well enough structured and written, many things are left unsaid. First of all, the problem has never been formally introduced, we don’t know exactly what needs to be estimated. What are the inputs, outputs? 
Is computer vision used anywhere? How are the positions of the objects determined if not with computer vision? How are the user forces gathered? What are “in game variables” mentioned multiple times in the document? No notation has been introduced, no symbols have been introduced (or too late in the document). For instance, there is no notation for the latent space of the encoder-decoder model. + +The figures are not very helpful, as the labelling of the blocks and labels is very fuzzy. As an example, For InferNet, inputs and trajectories are “Trajectories”, so what is the difference? Of course we can guess that (inputs are measured trajectories, outputs are reconstructed trajectories), but we should not guess things when reading papers. + +The figure for encoder-decoder model is very confusing, as the different arrows have different functional meanings and we have no idea what they mean. The outputs of the encoder and the MLP both point to the latent space and at a first glance the reader might think that they are concatenated, which raises several questions. Reading the text, we infer that first a model is trained using on one of the arrows (the one coming from the encoder) and ignoring the other one, and then the MLP is learned to reconstruct the latent space using the other arrow (the one coming from the MLP), but this is absolutely impossible to understand looking at the figure, which does not make much sense. We can infer all this from the text around equations (1) to (3), which is itself quite fuzzy and difficult to understand, in spite of the simplicity of the underlying maths. + +The relationship of RelateNet and InferNet is not clear. While the task of InferNet is clear, the role of InferNet in the underlying problem is not clear and it has not been described how it interacts with RelateNet. + +It is unclear how the transfer between science birds and basketball has been performed and what exactly has been done there. + +As mentioned above, the role of “in game variables” is unclear. What are those? I suggest to more clearly define their roles early in the document and use terms from well-defined fields like control (are they “control inputs”) or HCI (are they “user actions”?). + +In the evaluation section, we have baseline models BM1 and BM2, but they have never been introduced. We need to guess which of the models described in the paper correspond to these. + +The related work section is very short and mostly consists of an enumeration of references. The work should be properly described and related to the proposed work. How does the proposed work address topics which have not yet been solved by existing work? +",1,,ICLR2020 +r1lMN7j9FS,1,r1eWdlBFwS,r1eWdlBFwS,Official Blind Review #1,"This paper studied the problem of learning the latent representation from a complex data set which followed the independent but not identically distributions. The main contributions of this paper are to explicitly learn the commonly shared and private latent factors for different data populations in a unified VAE framework, and propose a mutual information regularized inference in order to avoid the “leaking” induced by the shared representations across different populations. The isolation of the commonly shared and population specific latent representations learned by the proposed are empirically demonstrated on several applications. However, I have some concerns regarding this paper as follows. 
+(1) It is not clear why the private representation exhibits latent features from the shared space when using equation 3 and how this phenomenon hurts this CPVAE model. +(2) In equation 1, how to define the isotropic diagonal covariance matrix in the Gaussian distribution p? Is it parameterized by g? +(3) In equation 3, what is the prior distribution of p(z_ki, t_ki)? +(4) In equation (4)(5), why could the marginal KL term be canceled out when using I_q(x_k; t_k) - I_q(x_-k; \tilde{t}_k)? +(5) The mutual information regularized inference involved the KL term between any two private factors from different populations. It might be not efficient for optimization. Thus, it will be helpful if the authors provide the model efficiency analysis compared with other baseline methods. + +Minor comments: +(1) what is the symbol “n_k”? Did it denote the number of examples for the k-th population? +(2) For mutual information regularized inference, it used two different notations: “I_q(x_k; t_k) - I_q(x_-k; t_k)” and “I_q(x^k; t^k) - I_q(x^-k; t^k)”. +",6,,ICLR2020 +Byl5H39iqS,3,S1gFvANKDS,S1gFvANKDS,Official Blind Review #4,"This is a positive review. Feel free to skip to the feedback. + +SUMMARY OF PAPER +This paper explains how to use cluster graphs to easily compute the asymptotic behaviour of any given correlation function (Definition 1) for deep linear networks. By ""asymptotic behaviour"" I mean that it upper-bounds correlation functions by c·n^s, where s is a nonnegative integer given by the particular cluster graph, and c a ""constant"" (which I think depends on the particular input, x, to correlation function, among other things). + +The authors then conjecture (Conjecture 1) that these bounds transfer to deep nonlinear networks, and that they are tight: +- Appendix C proves that these upper bounds also hold for deep ReLU networks, and 1-hidden-layer networks with smooth nonlinearity. This is also mentioned in page 3. +- Section 2.3 empirically shows that these bounds are pretty tight. (in terms of the exponent, none of the theory here gives a value for the constant c) + +The tool provided above is the main result. The authors then use it to provide some results about wide networks +- They give a different proof that for large width, the Neural Tangent Kernel stays constant during training. This is because its derivative wrt. time as a function of width n is O(n^-1), and thus 0 for n->infinity. +- Using the ease of calculation from the tool, they approximate the change in the NTK over training time for any network. They do this using its value at initialization + a term that depends on n^-1. These results are numerically verified in Figure 1. +- They present numerical evidence for the accuracy of this approximation to the change in NTK over time. + +The authors spend the last 2 pages explaining how cluster graphs derive from Feynman diagrams (FDs), and why these help compute asymptotics. + +WHY I AM ACCEPTING THIS PAPER + +The paper adapts FDs and cluster graphs, which is a potentially very useful tool for other wide-network researchers, and could accelerate research in this whole sub-field. It also shows their power by providing a surprisingly large amount of novel theoretical results. + +FEEDBACK + +At a very high level, there is only one thing that I think isn't made quite clear by the presentation, and it should be. If I understand correctly, Feynman diagrams (or cluster graphs) are only used here to calculate correlation functions for deep *linear* networks. 
Then, other results establish that the width-dependent asymptotic behaviour for linear networks holds as-is for nonlinear networks, and these results with FDs constitute Conjecture 1. There are proofs for ReLU and 1-layer smooth networks, mentioned in pg. 3; and the experiments in the paper support it for common nonlinear deep networks as well. I think that asymptotics for linear networks transfer to nonlinear ones is an interesting result, which doesn't depend on FDs. + +What follows are details. + +It is unclear to me whether cluster graphs are as ""powerful"" as FDs, i.e. whether the bound at the end of the Proof in page 8 is always saturated. Are there some cases in which you need to use FDs to get a tighter upper bound? + +In Table 1 you should say that the values under ""lin. ReLU tanh"" are the fitted exponent s_C. This is not explained. Perhaps you can mark the only 2 cases (in the 5th row) where the bound is not tight. It would be nice to know how much error remains between the fitted c·n^(s_C) and the empirical values. + +Please explain x1 <-> x2 in eq. 8 + +In figure 1b, consider adding the finite-width limit prediction for the training dynamics. You have already done so for the prediction of the NTK during training in figure 5c, you could indicate it in the same way in 5b. + + +Typos: + +Figure 2 caption: feynman -> Feynman + +pg 8. anlytic evicence -> analytic",8,,ICLR2020 +uuqsJUpG2Ks,1,uMNWbpIQP26,uMNWbpIQP26,Review of Paper2008,"This paper studied an algorithm for solving unconstrained smooth finite-sum optimization, called stochastic generalized mirror descent (SGMD). The algorithm SGMD is a generalization of several existing, popular algorithms, including stochastic gradient descent, mirror descent and Adagrad. + +The main contribution lies in the convergence rate analysis of the algorithm SGMD based on the Polyak-Lojasiewicz (PL) inequality, which in turn yields linear convergence rate results for some existing methods such as Adagrad. Specifically, the author(s) showed in Theorem 3 that if the objective function satisfies the PL inequality and has a Lipschitz gradient and if the potential function (or called the mirror function) satisfies certain technical assumptions, then SGMD converges linearly to a global minimum. If the PL inequality is satisfied only locally, then local linear convergence result for the GMP (the deterministic version) was also proved. As another contribution, the paper showed that the GMD exhibits an implicit regularization phenomenon in the sense that it converges to a particular optimizers among others. + +The Taylor-series-based analysis for stochastic algorithms seems to be new and deserves some merits. However, I do have some doubts about the main results. + +First, I'm not sure if one can obtain new, useful algorithms from the general algorithmic framework in the paper and/or deduce new convergence rate results for existing algorithms. If yes, the paper should point it out explicitly, discuss such consequences and compare with the related algorithms/theoretical results. These are not clear from the current presentation of the paper. + +The practical implication of the implicit regularization result is also not clear. 
More efforts should be spent on discussing the meaning or interpretation of the interpolating optimal solution SGMD prioritizes, especially in the context of machine learning problems (e.g., when the optimization problem is the training of neural network or some supervised learning tasks).",5,4.0,ICLR2021 +SJelB0Pj5H,3,BJlnmgrFvS,BJlnmgrFvS,Official Blind Review #4,"Summary: +This paper studies the problem of learning a policy from a fixed dataset. The authors propose to estimate a smooth upper envelope of the episodic returns from the dataset as a state-value function. The policy is then learned by imitating the state action pairs from the dataset whose actual episodic return is close to the estimated envelope. + +Recommended decision: +The direction of imitating ""good"" actions from the dataset is interesting. The intuition of estimating an upper envelope of the value function seems reasonable. However, I feel like this paper is not ready to be published in terms of its overall quality, mainly due to the lack of correctness, rigorousness and justification in statements and approaches. + +Major comments: + +- On the top of page 4: ""Because the Mujoco environments are continuing tasks, it is desirable to approximate the return over the infinite horizon, particularly for i values that are close to the (artificial) end of an episode. To do this, we note that the data-generation policy from one episode to the next typically changes slowly. We therefore apply a simple augmentation heuristic of concatenating the subsequent episode to the current episode, and running the sum in (1) to infinity."" I cannot see how this approach is validated. The reset of initial state makes cross-episode cumulative reward from a state s not an approximation to the real return from state s. Estimating the infinite horizon return from finite horizon data is indeed a challenge here and simply cut the return at the end of an episode is be problematic. But the solution proposed by the authors is wrong in principle and cannot be simply justified by ""good empirical performance"". I feel hard to regard this choice a valid part of an algorithm unless further justification can be provided. + +- Statements of theorems (4.1 and 4.2) are non-rigorous and contain irrelevant information: ""lambda-smooth"" is not an appropriate terminology when lambda is the weight of the regularizer. The actual ""smoothness"" also depends on the other term in the loss (same lambda does not indicate same smoothness in different objectives). For the same reason, Theorem 4.2 is wrong as changing K also changes the smoothness of the learned function. Proof of Theorem 4.2 in appendix is wrong as the authors ignore the coefficients in the last equation. Theorem 4.1-(1) cannot be true unless how V_\phi is parameterized is given: e.g. if there is no bias term or the regularization is applies to the bias term V will always output 0 as lambda 0-> \infty. The ""2m+d"" in Theorem 4.1-(2) is irrelevant to this work and cannot be justified without more detailed statements about how the network is parameterized. I appreciate the motivation that the authors try to validate the use of their objective to learn a ""smooth upper envelope"" but most of these statements are somewhat trivial and/or wrong section 4.1 does not actually deliver a valid justification. + +- The use of ""smooth upper envelope"" itself can bring both over-estimation and under-estimation. 
For example, if one can concatenate different parts from different episodes to get a trajectory with higher return, the episodic return for the states along this trajectory is an under-estimate. Although it is fine to use a conservative estimate it would be better to be explicit about this and explain why this may not be a concern. On the other hand, it can bring over estimation to the state-values due to the smoothness enhanced to the fitted V. It would be better to see e.g. when these concerns do not matter (theoretically) or they are not real concerns in practice (by further inspecting the experiments). + +- Regarding Experiments: Why Hopper, Walker, HalfCheetah are trained with DDPG while Ant is trained by SAC? The performance of Final-DDPG/SAC after training for 1m steps looks way below what SAC and TD3 can get. Is it because they are just partially trained or noise is added to them? The baseline online-trained policy should not contain noise for a fair comparison. That said, in batch RL setting it is not necessary to compare to online-trained policy because it is a different setting. But if the authors want to compare to those, choice of baseline should be careful. An important baseline which is missing is to run vanilla DDPG/TD3/SAC as a batch-mode algorithm. + + +Minor comments: + +- Section 3, first paragraph: It is not very meaningful to say ""simulators are deterministic so deterministic environments are important"". Simulators are made by humans so they can be either deterministic or stochastic. ""many robotic tasks are expected to be deterministic environments"" is probably not true. I do not view ""assuming deterministic envs"" as a major limitation but I do not find these statements convincing as well. Similarly, the argument for studying non-stationary policy seems unsupportive: if the dataset comes from training a policy online then why do we care about learning another offline policy rather than just use or continue training the online policy. One argument I can see is that the online policy is worse. But the fact that these policies are worst than running e.g. SAC for a million steps makes the motivation questionable. Again, I do not view ""choice of setting"" as a limitation but I just find these statements a bit unsupportive. + + +Potential directions for improvement: + +To me the main part of the paper that looks problematic is Section 4.1 (both the approximation of infinite horizon returns and the theorems). It would be better to see a more rigorous and coherent justification of this approach (or some improved version), e.g. by either presenting analysis that is rigorous, correct and actually relevant or leave the space for more detailed empirical justification (e.g. whether potential over/under-estimating happens or not, comparing the estimated V to real episodic return of the learned policy). +",3,,ICLR2020 +WtwaY2lQ26h,4,pbXQtKXwLS,pbXQtKXwLS,Finding initialization rules through the study of Gaussian Processes: one-layer case,"### Summary +The authors propose a rule for neural network (NN) initialization, which takes into account input data. + +They suppose that weights and biases of a NN are randomly drawn resp. from $\mathcal{N}(0, \sigma_w^2 / N)$ and $\mathcal{N}(0, \sigma_b^2)$, where $N$ is the number of inputs. Then, they are able to compute *explicitly* the covariance matrix of the corresponding Gaussian Process. Since an explicit result is needed, they concentrate on one-layer NNs. 
+ +They use their explicit formula for the covariance matrix to compute the likelihood of the data, given $\sigma_w^2$ and $\sigma_b^2$. Therefore, they are able to select the best pair $(\sigma_w^2, \sigma_b^2)$ according to data, i.e., maximizing the likelihood. + +### Clarity +I did not understand the experimental setup presented in Section 2.5. For instance, the train/test location are supposed to lie in the interval $[0, 1]$, but the test location points lie in $[0, 2]$ in the graphs. +Besides, all the computation of the covariance should be put in appendix. + +### Significance +Since the paper relies on an explicit computation in one-layer NNs, the presented method has a very low significance. At least, the authors should propose an application in deeper NNs, even by making strong approximations. As such, no clue about any generalization is provided. + +Edit: +### Rebuttal +I did read the authors' rebuttal, and the main issue, i.e. the significance, has not been addressed. I cannot take into account the new experiments, since they are not in the paper. Anyway, an experiment with a 2-layer network would be a significant modification of the present paper, which would be not acceptable during the rebuttal phase.",3,3.0,ICLR2021 +HyirgqDgM,1,SJ1Xmf-Rb,SJ1Xmf-Rb,The paper presents an interesting problem of incremental classification inspired by the dual memory-system of brain. I feel the paper explicitly describes the problem and explains the proposed methodology in great detail.,"Quality: The paper presents a novel solution to an incremental classification problem based on a dual memory system. The proposed solution is inspired by the memory storage mechanism in brain. + +Clarity: The problem has been clearly described and the proposed solution is described in detail. The results of numerical experiments and the real data analysis are satisfactory and clearly shows the superior performance of the method compared to the existing ones. + +Originality: The solution proposed is a novel one based on a dual memory system inspired by the memory storage mechanism in brain. The memory consolidation is inspired by the mechanisms that occur during sleep. The numerical experiments showing the FearNet performance with sleep frequency also validate the comparison with the brain memory system. + +Significance: The work discusses a significant problem of incremental classification. Many of the shelf deep neural net methods require storage of previous training samples too and that slows up the application to larger dataset. Further the traditional deep neural net also suffers from the catastrophic forgetting. Hence, the proposed work provides a novel and scalable solution to the existing problem. + +pros: (a) a scalable solution to the incremental classification problem using a brain inspired dual memory system + (b) mitigates the catastrophic forgetting problem using a memory consolidation by pseudorehearsal. + (c) introduction of a subsystem that allows which memory system to use for the classification + +cons: (a) How FearNet would perform if imbalanced classes are seen in more than one study sessions? + (b) Storage of class statistics during pseudo rehearsal could be computationally expensive. How to cope with that? 
+ (c) How FearNet would handle if there are multiple data sources?",7,4.0,ICLR2018 +6isphtcv46i,3,S9MPX7ejmv,S9MPX7ejmv,"Interesting problem, but lacking clarity and motivation","Summary: +This paper seeks to train multi-objective RL policies that are robust to environmental uncertainties. There are two main contributions: a novel approach to solve this problem, and a novel metric to evaluate Pareto fronts. The metric combines the typical hypervolume metric (that captures the quality/performance of a Pareto front) with a novel ""evenness"" metric, that captures how well solutions are spread out across the space of preferences. The proposed approach, called BRMORL, consists of training a protagonist policy that maximizes utility alongside an adversarial policy that seeks to minimize utility (motivated by zero-sum game theory), while using Bayesian optimization to select preferences to train on, in order to optimize the hypervolume-and-evennesss metric. Both the protagonist and adversarial policy are conditioned on preferences. + +Recommendation: +This paper connects two seemingly orthogonal problems, multi-objective RL and robustness. This is an interesting topic, but there are several issues regarding clarity and the motivation (as detailed in the cons list below). I think this paper could be a valuable contribution for MORL, but _not_ for MORL that is robust to environmental uncertainty, which is what the claim is. Thus I recommend rejection. + +Pros: +* Training policies that are robust _and_ flexibly trade off between preferences is an interesting and relevant problem. +* The empirical evaluation shows that the approach outperforms ablations and an existing state-of-the-art MORL approach (Xu et al. 2020) on continuous control tasks. + +Cons: +* Clarity: the introduction should clearly define what _robustness_ means. Currently it's unclear what problem this paper is trying to solve. Does the approach try to achieve robustness to environment dynamics / perturbations, or robustness across preferences, or both? My interpretation is that robustness refers to both kinds. I can understand how BRMORL would improve robustness across preferences, and perhaps also perturbations, but am skeptical about whether it improves robustness to environment dynamics (see next point). +* The motivation behind this approach is questionable: I'm not convinced that BRMORL actually leads to training policies that are more robust, with respect to environment dynamics or perturbations. This is not shown clearly in the empirical evaluation, and also is not obvious from the approach itself. I don't see the connection between having an adversarial policy and being robust to the dynamics of the system (e.g., masses of limbs). Figure 6 shows that BRMORL has better robustness to environmental uncertainty than SMORL, but that could just be because SMORL is the worst-performing ablation, and just doesn't find particularly high-performing policies (as shown in Figure 5c). How does BRMORL compare to RMORL or SRMORL? +* It would help to have an algorithm box for BRMORL, that clarifies how the adversary policy, protagonist policy, and Bayesian optimization are used to gather data and for training. +* The proposed metric is questionable. The goal is to capture both diversity and quality of solutions, but in Figure 3, I would argue that Pareto front 1 is indeed better, because these points dominate _all_ of the points on Pareto fronts 2 and 3, and the purpose of MORL is to find non-dominated policies. 
+* The chosen scalar utility function $U$ is not properly justified. In particular, does $M$ (in Equation 2) still make sense when the objectives have significantly different reward scales (e.g., if one objective's return is typically from 0 to 10, and the other's is from 10 to 100)? Even after normalizing, the Q-value term will only be in a portion of the first quadrant, whereas the $w$ term can cover the entire first quadrant. +* Unjustified hyperparameters for trading off between terms in the losses: $k$ in the scalar utility function, $\beta$ for the two terms in the Q-function loss, and $\lambda$ for the comprehensive metric that combines hypervolume and evenness. How should these be chosen? +* The Related Work doesn't give enough credit to existing MORL approaches. First, Xu et al. (2020) is actually able to find a well-distributed set of Pareto-optimal solutions. In addition, existing methods are stated to only be able to find solutions on the convex portions of a Pareto front. Bringing up this point implies that BRMORL does better (i.e., is able to find solutions on concave portions of the Pareto front), but this is not shown empirically. Finally, the related work states that most existing approaches are only applied to domains with discrete action spaces. It should acknowledge that both Abdolmaleki et al. (2020) and Xu et al. (2020) are applied to high-dimensional continuous control tasks. +* Lack of experimental details for reproducibility, e.g., network architetures and DDPG hyperparameters. + +Other comments: +* There are quite a few grammatical errors and typos throughout the paper. +* Definition 3 is imprecise. First, is $a$ a policy or an action? It seems like it should be a policy because it's a member of the policy set, but it's used to denote actions in the previous section, Section 3.1. Also, why are $I$ and $II$ included in the game definition, when they are already represented by the policy sets? +* There is not enough explanation given for Figure 1. Where do the uniformly-sampled preferences come from (the gray dashed lines)? What is the ""optimal guess point""? Does Bayesian optimization only suggest one preference at a time (in red)? What is the acquisition function? (This is defined too late in the paper, and only in the caption for Figure 4.) +* It would be more accurate to make the $k$ explicit in equations 8 and 9, because it's different in $M(\cdot)$ for the two equations, but the current notation implies it's the same. +* In the empirical evaluation, SRMORL, an ablation of BRMORL, finds policies that dominate those found by BRMORL (Figure 5c). How can this be interpreted / explained? +* Table 3 needs an accompanying explanation of the different MORL methods.",5,4.0,ICLR2021 +JGO_OAVdEQU,2,tij5dHg5Hk,tij5dHg5Hk,RAFT,"This paper analyses the recently proposed Bootstrap Your Own Latent (BYOL) algorithm for self-supervised learning and image representation. +The authors first derive an alternative training procedure called BYOL' by computing an upper bound of the BYOL objective function. +After diverse analyses, the authors then introduce Run Away From Your Teacher (RAFT), where RAFT is another BYOL variant that resembles contrastive method by having an attractive and repealing term in the training objective. According to the authors, this decomposition allows for a better understanding of the training dynamics. 
+ +Finally, the authors made the following transitivity reasoning: + - BYOL and BYOL' are almost equivalent + - RAFT and BYOL' are shown to be equivalent under some assumptions. +Thus, conclusions that are drawn from analyzing RAFT should still hold while analyzing BYOL. They thus link the interest of BYOL's predictor and the EMA through the RAFT loss decomposition. + +I have multiple strong concerns regarding this paper. These concerns are both on the paper results, shortcuts in the analysis, and the writing style. + + +Results: +-------------- + + - In section 4, the authors introduce BYOL' as a variant of BYOL. To do so, they derive an upper bound on the BYOL loss, i.e. the L2 distance between the projection and the projector, and they try to minimize it. However, this approach disregards that BYOL does not minimize a loss (due to the stop gradient). In other words, the BYOL objective keeps evolving during training; the target distribution is non-stationary. As mentioned in the BYOL paper: ""Similar to GANs, where there is no loss that is jointly minimized w.r.t. both the discriminator and generator parameters; there is therefore no a priori reason why BYOL’s parameters would +converge to a minimum of L_BYOL given the online and target parameters"". Minimizing an upper-bound is at best insufficient, at worst a non-sense. The sentences, ""minimizing L_{BYOL'} would yield similar performance as minimizing L_{BYOL}"" and ""we conclude that optimizing L_{BYOL'} is almost equivalent to L_{BYOL}"" are unfortunately wrong. This is somewhat highlighted different qualitative results in Appendix F.1.b != F.1.d. +A better approach would be to ensure that the *gradients* go in a similar direction (so the training dynamics are similar rather than the objective function). However, even such a demonstration could be insufficient due to compounding factors in the training dynamics. + - The 1-1 mapping between BYOL' and RAFT rely on three hypotheses. While (i) and (ii) are reasonable, hypothesis (iii) is quite strong, and more importantly, neither elaborated nor discussed. In other words, I am unable to validate/invalidate the interest of the theoretical results. Would it be possible to measure the normal gradient empirically? To bound it? + - In section 3, i would recommend the author to mention that multiple components were also in the BYOL paper; especially when writing ""therefore, we conclude the predictor is essential to the collapse prevention of BYOL."" + - Although I acknowledge that self-supervised learning requires heavy computational requirement, and few teams may run experiments on ImageNet. Yet, I would recommend the authors to not use CIFAR10 as the dataset has multiple known issues (few classes, small images, few discriminative features). Other variants such at STL or ImageNete can be trained on a single GPU over a day, and are less prone to misinterpretation in the results. Besides, I want to point out that BYOL was not correctly tuned: the experiments are based on a different optimizer (Adam vs LARS) and no cosine decay were used for the EMA, while these two components seem to be critical, as mentioned in BYOL and arxiv:2010.1024. + +Overall, I have a serious concern about the paper's core contributions. However, there are still some good elements in the paper that I think are under-exploited: + - RAFT is itself an original, new and interesting algorithm. 
The potential link to BYOL is indeed an interesting lead, but in its current state, I would make it a discussion more than a key contribution. + - Table D.3 shows that RAFT/BYOL' does not collapse without predictors when \beta is high. Albeit providing low accuracy, a non-collapse is quite surprising. Unfortunately, the authors leave it for future work + + +Shortcuts: +-------------- +I was surprised by multiple shortcuts in the reasoning process or undiscussed conclusions: + - The authors mention that the predictor is a dissatisfactory property of BYOL. Could they elaborate? This is actual the key component of the method (if not the only one!), and such pro/cons could be detailed in light of other methods. + - In section 4.1, the authors mention that: similar accuracies and losses are sufficient somewhat confirm that BYOL and BYOL' are similar. Two completely different methods may have the same errors while being radically different... + - In Section 4.2, the authors mention that ""Based on the form of BYOL, we conclude that MT is used to regularize the alignment loss"". However, there is no experiments to try to contradict/validate this claim. Differently, the EMA may ease the optimization process or it may have different properties. Even if I understand the logic behind this statement, I regret that the authors do not try to confort it. +- In section 4.2, the authors mention that there exist multiple works (while only citing one...) demonstrating that EMA is ""roughly"" equivalent to sample averaging and may encourage diversity. While this is sometimes true in specific settings (cf. markov game and fictitious play), this is also known to ease optimization (cf. target network in DQN). Stating that RAFT is better than BYOL because it better leverage the EMA target is tricky without proper analysis. +- Albeit understandable, the transitivity between BYOL and RAFT is difficult to defend due to multiple approximations and hypothesis. Therefore, it is of paramount importance that the approximations and hypothesis are validated, which is not sufficiently done in the paper. + + +Writing: +-------------- + - Although papers' writing quality remain subjective, I tend to expect a formal language. I kind of feel ill-at-ease when reading sentences including ""BYOL works like a charm"", ""disclosing the mistery"", ""to go harsher"", ""bizarre phenomon"". Other sentences also expresses judgement such as ""inconsistent behavior"", ""dissatisfactory property of BYOL"" or ""has admirable property"" without proper argumentation. + - It is non-trivial to follow the different version of the algorithms... which are defined in the appendix. Please consider renaming BYOL'. + - A related work section would have been useful to put in perspective BYOL that are theoretically motivated e.g. AMDIM, InfoMin, other self-supervised learning methods without negative example, e.g. DeepCluster, SwAV. Section 2 is more about the background, not related work. + - there are a few confusions in the notation, \alpha \beta have different meaning across equations (Eq 7 vs 8) + - In section 3, random is ill-defined. In Cifar10, random should be 10%, I assume that you refer to random projection. Please clarify. + - Figure 1 is clear, and I recommend to keep it as it is. + - From my perspective, the mathematical explanation in Section 5 is quite obfuscated, and I would recommend a full rewriting. + - Please avoid unnecessary taxonomy, e.g. uniformity optimizer, effective regulizers and others. 
+ - In conclusion, you mentioned some results about the projector. However, you never detail them in the paper. Please, do not discuss unpublished results. + +Overall, I had difficulties following the paper: I keep alternating between the appendix, previous sections, and the text. Again, the phrasing makes me ill-at-ease. + + +Conclusion: +-------------------- +I have some serious concerns about the core results of the paper. Importantly, Theorem 4.1 follows a misinterpretation of the BYOL training dynamics. From my perspective, there are too many unjustified claims, and I cannot recommend paper acceptance. However, there is some good idea in the paper, and I strongly encourage the authors to study RAFT independently of BYOL in the future.",3,5.0,ICLR2021 +rJx81Y1ph7,3,H1eH4n09KX,H1eH4n09KX,Official review," +The paper presents a model to perform audio super resolution. The proposed model trains a neural network to produce a high-resolution audio sample given a low resolution input. It uses three losses: sample reconstructon, adversarialy loss and feature matching on a representation learned on an unsupervised way. + +From a technical perspective, I do not find the proposed approach very novel. It uses architectures following closely what has been done for Image supre-resolution. I am not aware of an effective use of GANs in the audio processing domain. This would be a good point for the paper. However, the evidence presented does not seem very convincing in my view. While this is an audio processing paper, it lacks domain insights (even the terminology feels borrowed from the image domain). Again, most of the modeling decisions seem to follow what has been done for images. The empirical results seem good, but the generated audio does not match the quality of the state-of-the-art. + +The presentation of the paper is correct. It would be good to list or summarize the contributions of this work. + +Recent works have shown the amazing power of auto-regressive generative models (WaveNet) in producing audio signals. This is, as far as I know, the state-of-the-art in audio generation. The authors should motivate why the proposed model is better or worth studying in light of those approaches. In particular, a recent work [A] has shown very high quality results in the problem of speech conversion (which seems harder than bandwidth extension). It would seem to me that applying such models to the bandwith extension task should also lead to very high quality results as well. What is the advantage of the proposed approach? Would a WaveNet decoder also be improved by including these auxiliary losses? + +While the audio samples seem to be good, they are also a bit noisy even compared with the baseline. This is not the case in the samples generated by [A] (which is of course a different problem). + +The qualitative results are evaluated using PESQ. While this is a good proxy it is much better to perform blind tests with listeners. That would certainly improve the paper. + +Feature spaces are used in super resolution to provide a space in which the an L2 loss is perceptually more relevant. There are many such representations for audio signals. Specifically the magnitude of time-frequency representations (like spectrograms) or more sophisticated features such as scattering coefficients. In my view, the paper would be much stronger if these features would be evaluated as alternative to the features provided by the proposed autoencoder. 
+ +One of the motivations for defining the loss in the feature space is the lack (or difficulty to train) auxiliary classifiers on large amounts of data. However, speech recognition models using neural networks are quite common. It would be good to also test features obtained from an off-the-shelf speech recognition system. How would this compare to the proposed model? + +The L2 ""pixel"" loss seems a bit strange in my view. Particularly in audio processing, the recovered high frequency components can be synthesized with an arbitrary phase. This means that imposing an exact match seems like a constraint as the phase cannot be predicted from the low resolution signal (which is what a GAN loss could achieve). + +The paper should present ablations on the use of the different losses. In particular, one of the main contributions is the inclusion of the loss measured in the learned feature space. The authors mention that not including it leads to audible artifacts. I think that more studies should be presented (including quantitative evaluations and audio samples). + +How where the hyper parameters chosen? is there a lot of sensitivity to their values? + + +[A] van den Oord, Aaron, and Oriol Vinyals. ""Neural discrete representation learning."" Advances in Neural Information Processing Systems. 2017. +",4,4.0,ICLR2019 +Jrf1dulzrls,1,mPmCP2CXc7p,mPmCP2CXc7p,"I wonder if it can be applied directly to the online setting, which gradually decreases the number of features.","This paper proposes an RNN model for adaptive dynamic feature selection, for efficient and interpretable human activity recognition (HAR). From the intuition that human activity can be predictable by using a small number of sensors, the paper introduces an l0-norm minimization problem with parameter regularization, and provide a logic on formulating a dynamic feature selection model with relaxations. The difficulty of the discrete optimization problem is solved by differentiable relaxation, which is known as Gumbel-Softmax reparameterization techniques. The formulation is naturally led to an RNN model that uses histories as input with an additional sigmoid unit for adaptive feature selection. + +Empirical studies are performed to show the superiority of the adaptive feature selection network. Results are shown on the task of 1) UCI-HAR smartphone dataset with 561 features, 2) UCI Opportunity sensor dataset with 242 features, 3) ExtraSensory dataset with 225 features for multilabel binary classification. In particular, by using the adaptive feature selection technique, the average number of features necessary for HAR prediction can be very small (0.3%, 15.9%, 11.3% among all features) at any given time. Overall, the paper is well written. In particular, analysis results on three datasets are clear and detailed, so that the reader would be available to understand what sensors were necessary for HAR prediction. + +The key concern about the paper is that the algorithm lacks practicality. To show the adaptive selection algorithm is efficient, it should be shown that the algorithm drastically reduces features that are not necessary for prediction over time, while maintaining the performance even in the lighter feature space. Although the average number of features selected by the adaptive selection algorithm for each snapshot is small, all features are entered as input, which may not help to speed up the algorithm. To claim that the algorithm is efficient, it is required to show that the computation cost can be saved. 
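As a side note for other readers, my understanding of the selection mechanism (my own paraphrase with invented names, not the authors' code) is roughly a relaxed per-feature binary gate driven by the recurrent state:

    import torch

    def sample_feature_gates(logits, temperature=0.5):
        # binary-concrete / Gumbel-Softmax style relaxation of a Bernoulli gate per feature
        u = torch.rand_like(logits).clamp(1e-6, 1.0 - 1e-6)
        logistic_noise = torch.log(u) - torch.log(1.0 - u)
        return torch.sigmoid((logits + logistic_noise) / temperature)

If this reading is correct, it reinforces my concern above: all features must still be available to compute the gates, so the saving is in the sparsity pattern rather than in actual sensing or compute cost.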
Also, based on the current experimental results, it is difficult to say that features that were not used in earlier timestamp will not be used in later timestamp with a different context. + +Minor comments and questions: +- Can you report the running time of each model? +- Is this model working in an online setting without tuning? If yes, would you like to clarify? If no, may I think this technique is for maintaining a dashboard that informs important features every time to users by calculating feature importance over time? +- The performance of the adaptive method on the NTU-RGB-D dataset is quite poor. What part of the dataset do you think caused the difficulty in feature selection? Do all features important? +- The technical novelty seems to be low if the proposed model is an RNN with an additional sigmoid layer. +- Figure 2a does not have a ground truth blue line. +",4,3.0,ICLR2021 +yEDt2Q8FmU-,2,7ODIasgLJlU,7ODIasgLJlU,Proposes an adaptive heuristic for deciding when to update the simulation policy for deep RL,"###################################### + +Summary: +In many real world applications for RL such as medicine, there are limits on the number of policies from which we can simulate data. This paper proposes an approach that adaptively decides when to update the simulation policy, based on the difference between it and the current learned policy. Experiments on a medical treatment environment and Atari show that the approach obtains similar performance to on-policy RL with fewer changes of the simulation policy. + +###################################### + +Pros: +1. The proposed approach is straightforward, adaptive, and achieves results comparable to the classical on-policy setting with fewer policy switches on all environments shown. It is also applicable to both model-based and model-free RL. +2. The paper is organized well, and the algorithms are clearly explained. + +Cons: +1. I did not find the theoretical justification for the proposed approach to be very convincing for RL, since it is based on a construction in a simplified linear regression case. However, I think this is ok since the paper is application-focused. +2. Based on the results on GYMIC, the proposed approach seems to have much greater variance than the other algorithms, especially at the early stages of training. This is often detrimental for the applications considered, such as medicine, in which robustness is also desirable. + +###################################### + +Overall: +I would lean toward accepting this paper. I am not completely familiar with the literature on RL with low switching cost, but the proposed approach appears to be novel. The experiments show that when combined with Rainbow DQN, it effectively reduces the switching cost on a range of environments and is based on the training path, requiring less environment-specific hand-tuning than fixed or adaptive interval switching. + +###################################### + +Further comments and questions: +1. How were the six Atari games chosen? How do the different approaches compare in other games? +2. There are a number of typos, e.g. ""deno"" in the first paragraph of section 3.1. + +###################################### + +Update after reading other reviews and author response: + +I have decided to lower my score from 6 to 5, as I agree with Reviewer 3 that more experimental analysis of the method is needed (ablations, sensitivities, etc.) given that the theoretical backing is not convincing. 
The authors also did not directly answer our questions.",5,3.0,ICLR2021 +MIKpdxPvX1,2,pavee2r1N01,pavee2r1N01,some questions on the experimental results,"The paper present some regularization schemes for ReLU networks, based in geometrical properties (polytopes and analytical centers), aiming at robustifying networks against adversarial examples. The idea is to train the network such that the partition of the input space contains many less linear regions, and such that in each region, training points are gathered around the analytical center (and thus far from boundaries), making more difficult the attack task. +Doing so the authors can give certificates for robustness. The proposed approach appears to be competitive in terms of bounds and do so with less hyper-parameters than other methods. + +The idea of using regularization terms to make networks more robust makes sense, the particular proposed regularizers have some quite easy to catch motivation. +The presented results match state-of-the-art, however I have some (maybe naive) questions : +-> why TE are missing sometimes? +-> how can LB>UB (CNN l_inf) ? I would have say that max(LB)=min(UB)=true minimal distortion ? +Fig 3 is unreadable which is a pity since I think it could be some relevant information, its interpretation is quite brief too. + +Overall I think the presented work is interesting and quite well executed. Theoretical part seems ok, but I'm not familiar with some geometric notions so I might have missed things. +I tend to accept the paper but I'm waiting for more explanations on the experimental results. ",6,3.0,ICLR2021 +HylsrDYTtS,1,BJglA3NKwS,BJglA3NKwS,Official Blind Review #3,"SUMMARY: A new similarity function replacing the dot product of key and query in attention modules - instead take a shared-weighted sum + +Good paper, sound theory, very clear explanations. Literature review was sufficient to explain the problem and underlying theory. + +Results look convincing, although they cannot be verified unless code is shared. + +Reasonable direction of exploration - there are several possible similarity functions, this paper explores one of them that offers significantly less computational resources, which are essential for on-device applications. Thorough exploration of this idea was done and I am convinced this is a good alternative to regular attention. + +All the rest of the paper was to incorporate this new method in different tasks in different architectures using or not using attention, and seeing the differences in terms of computational resources. Well thought experiments and results. Again, cannot be verified unless code is shared. ",6,,ICLR2020 +KxBPhdn-j08,3,K3qa-sMHpQX,K3qa-sMHpQX,Report on ForceNet,"########################################################################## +Summary: +This article presents the force fields predictor ForceNet. The authors show that ForceNet reduces the estimation error of atomic forces by 30% compared to existing ML models. Reconstructing molecular force fields is a very active and thriving field, hence this paper is highly relevant. In short, this article presents a very good model based a selection of very good methodologies and combining them in a clever way. + +########################################################################## +Reasons for score:  +Overall, I vote for not accepting, but if the authors address most of the comments I don't have a problem if it is accepted. 
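Before the detailed comments below, let me be explicit about the terminology I use: by an energy-conserving force field I mean one in which the forces are obtained as the gradient of a scalar potential,

$$ F_i = -\nabla_{r_i} E(r_1, \dots, r_N), $$

so that the predicted forces are conservative by construction. If I understand the force-centric design correctly, a model that regresses forces directly does not have to satisfy this, which is the root of my main criticism below.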
As mentioned before, this problem is of high relevance and is likely to have a considerable impact in the physics and simulations community. Nevertheless, this is a non-energy conserving and non-rotationaly covariant force field, meaning that for most of the applications of a force field, this will provide unphysical results. + +########################################################################## + +Pros:   +1. ForceNet reduces the estimation error of atomic forces by 30% compared to existing ML models. +  +2. It is based on strong and reliable previous results, which makes it a robust model. +  +########################################################################## +Comments:  + +-The authors state ”However, if successful, accurate and fast ML-based models may lead to significant practical impact by accelerating simulations from O(hours-days) to O(ms-s), which in turn accelerates applications such as catalyst discovery.” And “If successful, ML could be applied to problems such as catalyst discovery which is key to solving many societal…” These statements hint that this has not been done yet, which is a half truth. As exposed in the references cited in the paper, there are already very successful ML-methodologies in this field, but it has to be stressed that the successes come from applications in local domains. This mean that errors of sub-0.1 kcal/mol have been achieved for molecules already couple of years ago (Chmiela et al Nat Commun 9, 3887 (2018), Christensen et al J. Chem. Phys. 152, 044107 (2020), and Unke et al J. Chem. Theory Comput. 2019, 15, 6, 3678–3693). Hence, phrasing the manuscript like this, it could be misleading for the reader. +-The authors state “This is possibly because the force-centric models capture the dependency of atomic interactions on atomic forces more explicitly than the energy-centric models.” This topic has been extensively analysed by Prof. Müller’s group (Sci. Adv. 3 (5), e1603015, 2017; Nat Commun 9, 3887 (2018)) and recently by Prof. von Lilienfeld (Mach. Learn.: Sci. Technol. 1 045018, 2020). In short, learning forces is equivalent to learn linearisation of the energy surface which is much more informative. +-MLP is not defined prior usage. +-The directed message e_{st} is a vector or scalar? In some parts appeared just as e_{st} and on other as \textbf{e}_{st}. +-The atomic radii is a given physical value or is it a learning parameter? +-The authors state: “As we demonstrate in Section 4, the replacement of ReLU with Swish consistently and significantly improves the predictive accuracy while maintaining scalability across all choices of basis functions.” This is not relevant, since it is a direct statement. A more interesting comparison would be to shifted tanh function. + +-From my point of view, the main downside of this article is the fact that the ForceNet is neither exactly covariant not energy conserving.",5,5.0,ICLR2021 +ysTPLuahTjB,1,Xa3iM4C1nqd,Xa3iM4C1nqd,A Work on Making Self-Supervised Learning Endow Robustness,"This paper uses a different data augmentation (AugMix) scheme to improve self-supervised representation learning. It improves accuracy and (corruption and adversarial) robustness by a sufficiently interesting amount. The paper's presentation is clear, but the paper could be more thorough. Since the technique is simple and general, it could easily be broadly applicable to the burgeoning area of self-supervised learning. 
+ +Other pros: +_On the surprising similarities between supervised and self-supervised models_ shows that most techniques for self-supervised learning don't improve robustness at all, so self-supervised learning produces representations that have limited use downstream. +This work counteracts this limitation by also assessing the robustness of self-supervised representations, and the technique they propose improves robustness. They improve robustness with data augmentation, a volatile ingredient in self-supervised learning. SimCLR reminds us that the choice of image modifications is crucial and must be carefully selected, so it was not obvious a priori that their technique should help. + +Other cons: +The paper would be stronger if it showed their augmentation scheme helping with more than MoCo. That would demonstrate generality. I'll boost my score if it improves another self-supervised learning technique. +The paper could show results with another robustness benchmark, such as ImageNet-R. This would make a stronger case for robustness. This would be quick to add. +URRL-SR uses embedding distances, but this seems like logit pairing. Logit pairing can lead to an overestimate of adversarial robustness. Please include results with an adversarial attack with restarts. Also state whether you're using l2 or l_infty attacks, rather than having refer to Ilyas et al. +The paper has one only figure, and Table 2's formatting is slightly unintuitive. + +Update: I am happy with the changes and am keeping my score.",7,5.0,ICLR2021 +NTpqIcJOSoT,4,yqPnIRhHtZv,yqPnIRhHtZv,Insufficient Experiments,"In this paper, the authors proposed a new representation of persistence diagrams that can include `''essential features''. Essential features correspond to the intrinsic topology of the underlying space that will not die during the filtration. To include the fact that the essential features are infinitely far from other normal features in the diagram, the authors proposed to use a Poincare ball representation, which maps the diagram into a disk whose boundary is infinitely far from inside. + +The authors further proposed a classifier that learns the parameterization of the embedding of a diagram in the Poincare ball. The presentation learning procedure seems to be similar to (Hofer et al. 17), except for using a Poincare ball representation instead of the Euclidean square/triangle representation. Experiments are carried on graph classification tasks and image classification task. + +On the positive side, I think the proposed representation well unified essential and non-essential topological structures. It is elegant and well-thought. A stability theorem is proven (in a similar manner as other known representations). The references are also reasonably complete. + +However, I am having doubts on the practical motivation of the representation in the learning context. To me, essential features can be considered by simply adding the histogram of their birth times as additional features (to the neural networks at an appropriate layer). I think this approach is a natural and necessary baseline to be compared with. + +Generally, the empirical results are not particularly strong. On graph datasets, the proposed method is only winning 2 out of 5 times. Many important baselines are also missing. For graph classification methods, state-of-the-art classifiers such as GIN and GraphSAGE should be at least compared with. 
For topological classifiers, many kernel methods could be compared with: PWGK, PI (both of which were used in the other experiment), also Sliced-Wasserstein Kernel. I honestly think an ensemble of these methods and the histogram of birth times of essential diagrams can be easy to tune and perform better. + +Other questions/comments: + +The references are not standard and need to be fixed. + +For graph with Rips filtration, why are there 1-dim essential homology? Wouldn't all 1D homology features be killed eventually? + +Details of the classifier (how the representation is learned) is still not clear even after reading the section in the supplemental material. + +Experimental details are missing. I understand this is 80% training 20% testing. But how many folds were used to evaluate? Some baseline numbers are very similar to the numbers in the GIN paper. However, the GIN paper was using a 10-fold cross-validation. So there is some discrepancy in the experiments. I would appreciate if the authors could kindly elaborate. + +Overall I think this is a nice mathematical formulation to incorporate essential features into the representation of persistent homology. But the practical usage in learning is not very convincing. Fundamentally, the essential features are completely different from non-essential ones. There is nothing in between them. Thus the benefit of a unified representation does not bring much more information than simply treating them separately. + + +** After rebuttal: +I am increasing the score to 6. I appreciate the authors' response to the reviews. They did the additional experiments I asked for. + +As I stated in the original review, I really liked the unified approach. It is elegant and is nicely presented. + +After reading the authors' response to R4, it is clarified that the 1D essential homology is because the computation over all threshold is too expensive. I think there might be an opportunity to better justify this paper: we often have to stop the filtration early due to computational concern. This unified representation could potentially be a good solution for this: without computing the actually death time, the unified solution can still 'learn' the real death time of the 1D essential classes. The authors might want to discuss or ideally empirically verify this in the final version. For example, can you show that using the new approach, and stopping earlier during the filtration, the unified classifier can be as good as when we run the whole filtration and compute the real death time for all 1D classes. Moreover, it will be ideal if the authors can manage to show that the unified approach can actually learn the real death time for these fake essential classes (I do not know how). This way the paper can potentially have a bigger impact. + + +",6,4.0,ICLR2021 +rJgVWTWJ5S,3,SyxV9ANFDH,SyxV9ANFDH,Official Blind Review #2,"the paper attempts to infer Granger causality between nonlinearly interacting stochastic processes from their time series measurements. instead of using MLP/LSTM etc to to model time series measurement, the paper proposed to use component-wise time series prediction model with Statistical Recurrent Units to model the measurements. they consider a low-dimensional version of SRU, which they call economy-SRU. in particular, they use group-wise regularizing to accompany the particular structure of the model to aid interpretability. they compared the performance with existing models with MLP/LSTM and show some gains in a few examples (but not all.) 
the proposal is interesting, but the experiment section might need further strengthening. currently, the experimental results do not immediately pop out as showing eSRU particularly useful.",6,,ICLR2020 +SklAX8hm5H,3,rklMnyBtPB,rklMnyBtPB,Official Blind Review #1,"The paper proposes to do adversarial training on multiple L_p norm perturbation models simultaneously, to make the model robust against various types of attacks. + +[Novelty] I feel this is just a natural extension of adversarial training. If we define the perturbation set in PGD to be S, then in general S can be union of perturbation set of several L_p norm, and the resulting algorithm will be MSD (everytime you do a gradient update and then find the worst case projection in S). It would be interesting to study the convergence of this kind of algorithms, since S is no longer convex, the projection is trickier to define. Unfortunately this is not discussed in the paper. + +In terms of experiments, this is an interesting data point to show that we can have a model that is (weakly) robust to L1, L2 and Linf norms simultaneously. However, the results are not surprising since there's more than 10% performance decreases compared to the original adversarial training under each particular attack. So it's still not clear whether we can get a model that simultaneously achieves L1, L2, Linf robust error comparable to original PGD training. + +[Performance] +- It seems MSD is not always better than others (worst PGD and PGD Aug). For MNIST, MSD performs poorly on Linf norm and it's not clear why. +- There's significant performance drop in clean accuracy, especially MSD on MNIST data. + +[Suggestions] +- As mentioned before, studying the convergence properties of the proposed methods will be interesting. +- It will be interesting if you can train on a set of perturbation models and make it also robust to another perturbation not in the training phase. For instance, can we apply the proposed method to L{1,inf} in training and generalize to L2 perturbation? + +===== +Thanks for the response. I still have concerns about novelty so would like to keep my rating unchanged. + +",3,,ICLR2020 +KVBDzmNOdTe,3,QFYnKlBJYR,QFYnKlBJYR,Interesting setting with straightforward solution,"Post rebuttal: The updates clearly explain the resampling procedure of this paper, and strengthen the theoretical part of this paper. As a result, I'd like to change my rating to 6 and recommend an acceptance. + +Additional thoughts emerged from the discussion with authors (which are irrelevant to the rating): I agree that experiments in the paper demonstrate the modified SAC algorithm has a significant improvement compared to baseline algorithms. And I believe that the paper could benefit from including some theoretical justifications to the loss function and data collection scheme (though I completely understand the difficulty of theoretically justify deep RL algorithm). For example: +- Assuming all the loss functions can be optimized to optimal, will the policy converge to optimal or near-optimal solutions? +- Assuming the value net can be optimized to optimal, how the resampling process change the gradient of policy net? In which case would the on-policy sample with truncated trajectory (i.e., the value function computed by Eq. (8) where the length of the trajectory is $n$) out-perform off-policy sample with full trajectory (i.e. SAC without resampling)? 
If I understand correctly, without resampling the error of the value net suffers from the amplification caused by distribution mismatch (which is potentially exponential?). And with resampling, would the error of value net come from the truncation? + +Additional minor issues: +- Definition 5: There is no base (i.e., $n=0$) in the recursive definition of $\hat{v}_n^{soft}$. + +--------- +The paper considers reinforcement learning with delays, motivated by real-world control problems. Novelty of the setting is that the delay is random and changing. Algorithm proposed by the paper uses importance sampling to create on-policy samples of augmented observations, and is empirically shown to out perform base line SAC algorithm. + +While the idea of the paper is clean, I found it a bit hard to follow the complicated notations and definitions without intuition explanation. I appreciate the effort of making the paper mathematically rigorous, but I believe that the paper could be more easy-to-follow and have a larger impact to the community if there were more explanations before/after each definition, especially when the math behind this paper is not super complicated. + +Some additional questions: +1. What is the observation if the delay decrease by more than one? If I understand correctly, does Eq. (1) imply that the agent can only observe the last state? In other words, suppose there are some delay of the network such that there are no observations for 5 time steps, and after that all the network packages arrive at the same time. Will the agent discard the information of time steps 1, 2, 3 and 4? +2. In order to have a Markovian transition, definition 1 requires $K$ being the maximum possible total delay. However, Theorem 1 assumes that total delays are longer than the trajectory. Does it imply that $w_i+\alpha_i$ is a constant? Otherwise at least one of the assumptions can not be true. +3. I couldn't follow the proof of Theorem 1 (and Lemma 6). In Definition 3, Eq. (2), what is $u_0^*$? For the induction in Lemma 6, what is the induction base? If I understand correctly (please correct me if I'm wrong), the operator $\sigma_n^\pi(\tau_n^\star\mid x_0;\tau_n)$ is similar to probability ratio in standard importance sampling method, and can only assign non-zero value to trajectories $\tau_n^\star$ such that $s_i^\star=s_i,\forall i$. If this is the case, I'm not convinced that Eq. (3) can hold. For example, there might be a sequence $\tau_n^\star$ such that $p_n^\mu(\tau_n\mid x_0)=0$ for every $\tau_n$ such that $s_i=s_i^\star$. And policy $\pi$ can reach such sequence (i.e., $\pi_n^\pi(\tau_n^\star\mid x_0)>0$). Will it violate Eq. (3)? + +In summary, I believe that the RL with delay setting is important and interesting. However, due to over-complicated notations and theorem statement, I'm not able to verify the soundness/correctness of the method. I can not recommend acceptance at this point, and I'm willing to discuss and change my score if my main concerns are answered.",6,3.0,ICLR2021 +SyliaCJBYr,1,rJgLlAVYPr,rJgLlAVYPr,Official Blind Review #3,"This paper investigates the question of identifying concise equations from data to understand the functional relations. In particular, a set of base functions are given in hand and the goal is to obtain the right composition of these functions which fits the target function. The main contribution of the paper is to introduce a selection layer, which enhances sparse connections in the network. 
Several experiments are conducted to show the effectiveness of the method. + +My main concern of the paper is about the novelty and the lack of comparison of existing methods. The framework of finding functional relations is set up in [1,2], the main contribution of the paper is a refine architecture with the introduction of the selection layer. However, this selection layer is nothing but incorporating a softmax function. The idea of combining softmax functions in the hidden layers is not novel neither, which could be found in [3,4]. As a result, I find the contribution of the paper very limited, which could be summarized as applying an existing technique on a specific problem. Moreover, in the experimental section, there is a lack of comparison with existing methods such as EQL[1,2] and I consider it a major omission. + +Overall, due to the novelty concern and the lack of comparison, I do not support publication of the paper. + +[1] Sahoo et al. Learning Equations for Extrapolation and Control +[2] Martius et al. Extrapolation and learning equations +[3] Graves et al. Neural turing machines +[4] Graves et al. Hybrid computing using a neural network with dynamic external memory",1,,ICLR2020 +SyYWBfzNl,2,HyEeMu_xx,HyEeMu_xx,"Good paper, but would help to have experiments on a more benchmarked dataset","This paper presents a hierarchical attention model that uses multiple stacked layers of soft attention in a convnet. The authors provide results on a synthetic dataset in addition to doing attribute prediction on the Visual Genome dataset. + +Overall I think this is a well executed paper, with good experimental results and nice qualitative visualizations. The main thing I believe it is missing would be experiments on a dataset like VQA which would help better place the significance of this work in context of other approaches. + +An important missing citation is Graves 2013 which had an early version of the attention model. + +Minor typo: +""It confins possible attributes.."" -> It confines.. +""ImageNet (Deng et al., 2009), is used, and three additional"" -> "".., are used,""",6,3.0,ICLR2017 +HJeOJlV3FS,1,BkgRe1SFDS,BkgRe1SFDS,Official Blind Review #2,"This paper proposes a novel approach to hierarchical reinforcement learning approach by first learning a graph decomposition of the state space through a recurrent VAE and then use the learned graph to efficiently explore the environment. The algorithm is separated into 2 stages where in the first stage random walk and goal conditioned policy is used to explore the environment and simultaneous use a recurrent binary VAE to compress the trajectory. The inference network is given the observation and action and the reconstruction is to, given the hidden state or hidden state+observation, reconstruct the action taken. The approximate posterior takes on the form of a hard Kumaraswamy distribution which can differentiably approximate a binary variable; when the approximate posterior is 0, the decoder must reconstruct the action using the hidden state alone. The nodes of the world graph are roughly states that are used to reconstruct the trajectories in the environment. After the graph is constructed, the agent can use a combination of high-level policy and classical planning to solve tasks with sparse reward. + +Personally, I quite like the idea of decomposing the world into important states -- it is closely related to the concept of empowerment [1] which the authors might want to take a further look into. 
I believe extracting meaningful abstraction from the environment will be a key component for general purpose RL agent. One concept I really like in the paper is using the reconstruction error as the reward for the RL agent, which has some flavors of adversarial representation learning. Further, I also really like the idea of doing structured exploration in the world graph and I believe doing so can help efficiently solve difficult tasks. + +However, I cannot recommend accepting this paper in its current draft as there might be potential major technical flaw and I also have worries about the generality of the algorithm. My main concerns are the following: + 1. The ELBO given in the paper is wrong -- the KL divergence should be negative. I want to give the paper the benefit of doubts since this could be just a typo and some (very rare) researchers use a reverse convention; however, this sign is wrong everywhere in the paper including the appendix yet the KL between Kuma and beta distributions uses the regular convention. I tried to check the source code provided by the authors but the code only contains architecture but not the training objective, training loops or environments. As such, I have to assume that the ELBO was wrongfully implemented, unless the author can provide the full source code, or, if the ELBO is indeed incorrectly implemented, rerun the experiments with the correct implementation. + + 2. The proposed method for learning the graph does not have to be a VAE at all. The appendix shows that the paper uses a 0.01 coefficient on the KL, which is an extremely small value for VAE (in fact, most VAE’s have beta larger than 1 for disentanglement). Thus, I suspect the KL term is not actually doing anything; instead, the main reason why the model worked might be due to the sparsity constraints L_0 and L_T. In other words, the model is simply behaving like a sequence autoencoder with some sort of hard attention mechanism on the hidden code, which might explain why the model still worked well even with the wrong ELBO. To clarify, I think this is a perfectly acceptable approach for learning the graph and it would still be very novel, but the manuscript should be revised accordingly to reflect this. If the (fixed) VAE is important, then this comparison (0 KL regularization) would be a nice ablation regardless. + + 3. Algorithm 1 requires navigating the agent to the key points from \mathcal{V}_p. This assumption is quite strong. When the transition dynamic is deterministic and fully reversible like the ones considered in the paper, using the reverse of replay buffer can indeed take the agent back to s_p, but in settings where the transitions are stochastic or the transitions are non-linear or non-reversible, how should the algorithm be used? + + 4. It is not clear how \mathcal{V}_p are maintained. If multiple new nodes are added every iteration, wouldn't there be more than necessary nodes in \mathcal{V}_p? It seems to me some pruning criteria were used unless the model converged within small number of iterations? Are the older ones are discarded in favor of newer ones? + + 5. How are the actions sequences “normalized”? + + 6. In what way are the Door-Key environment stochastic? It seems like the other environments also have randomness, so is the only difference the lava pool? + +I believe the propose method is sound, so if the revision can address either 1 or 2, I am willing to raise my score to weakly accept. 
If the revision in addition addresses 3, 4, 5, 6 in a reasonable manner, I am willing to raise my score to accept. + +======================================================================= +Minor comments that did not affect my decision: + - I think mentioning the names of the environment in the abstract might be uninformative since the readers do not know what they are a priori. + +Reference: +[1] Empowerment -- An Introduction, Salge et al. 2014 +",6,,ICLR2020 +SkxelqZCFH,1,ryl4-pEKvB,ryl4-pEKvB,Official Blind Review #2,"This paper presents DeepAGREL, a framework for biologically plausible deep learning that is modified to use reinforcement learning as a training mechanism. This framework is shown to perform similarly to error-backpropagation on a set of architectures. The idea relies on feedback mechanism that can resemble local connections between real neurons. + +This paper is an interesting approach to provide a reinforcement learning paradigm for training deep networks, it is well written and the experiments are convincing, although more explanation about why these specific architectures were tested would be more convincing. I also think the assumptions about feedback connections in real neurons should be visited and more neuroscientific evidence from the literature should be included in the paper. Do we expect feedback to happen at each level of a neuron-neuron interaction and between each pair of connected neurons? Is there a possibility that feedback is more general to sets of neurons, or skips entire layers of neurons? I think more neuroscience background would help this paper (and others on the topic). Nonetheless, I think the paper does offer an interesting proposal of a more biologically plausible form of deep learning. +",6,,ICLR2020 +DviyGBgd1TU,3,cO1IH43yUF,cO1IH43yUF,Thorough Analysis of Various Finetuning Factors,"The paper focuses on instability issues in BERT finetuning on small datasets. They list three factors which leads to instability, and provide simple fixes for each: +1. Lack of bias correction term in BertAdam -- Fix was to use standard Adam +2. Using all pretrained layers for finetuning -- Reinitializing the last few layers before finetuning. +3. Training for a predefined number of epochs -- Train for a large number of epochs. + +The fixes proposed reduces the variance in the results, and in most cases also improves performance. They also show that several proposed solutions to fix training instability lose their impact when the aforementioned fixes are incorporated. + +Overall, I like the paper; the observation about reinitializing top layers of BERT was interesting and counter intuitive to me; and I think this will be the most important contribution of the paper. + +Although not directly related to BERT, this paper (https://arxiv.org/pdf/1804.00247.pdf) also suggests training for longer epochs. This paper should be cited here. The tasks considered in the original BERT paper had large datasets, so I think the 2-3 epoch suggestion was tuned to those. + +The result about BertAdam being unstable in low data settings, was a nice contribution. I feel this algorithm was also suggested considering the large datasets considered in the BERT paper. + + ",6,4.0,ICLR2021 +rJUCC0pQg,1,Sy8gdB9xx,Sy8gdB9xx,"Memorization, overfitting, generalization","the authors of this work shed light on the generalization properties of deep neural networks. Specifically, the consider various regularization methods (data augmentation, weight decay, and dropout). 
They also show that quality of the labels, namely label noise also significantly affects the generalization ability of the network. + +There are a number of experimental results, most of which are intuitive. Here are some specific questions that were not addressed in the paper: +1. Given two different DNN architectures with the same number of parameters, why do certain architectures generalize better than others? In other words, is it enough to consider only the size (# of parameters) of the network and the size of the input (number of samples and their dimensionality), to be able to reason about the generalization properties of a given network? + +2. Does it make sense to study the stability of predictions given added dropout during inference? + +Finally, provided a number of experiments and results, the authors do not draw a conclusion or offer a strong insight into what is going on with generalization in DNNs or how to proceed forward. ",10,4.0,ICLR2017 +hZSUGrCtoG,2,xHKVVHGDOEk,xHKVVHGDOEk,"Review on ""Influence functions in deep learning are fragile""","########################################################################## + +Summary: +The paper provides an extensive experimental study on using influence functions in deep neural networks. The authors suggest that the estimation via influence functions can be quite fragile depending on various architectures of neural networks. To point out such information, approximation with inverse-Hessian Vector product techniques might be incorrect, which leads to low quality influence estimates. They show several meaningful findings which address the true nature of using influence functions in deep neural networks. One of their findings through various experimental study is that the network depth and width strongly affect influence estimates. + +########################################################################## + +Reasons for score:  +The introduced through experimental settings seem to be highly interesting and reasonable. I like the idea of addressing weakness of influence functions in deep neural network and the authors’ findings are interesting. I believe it will be better if the authors perform the ablation studies with different parameters and more datasets to generalize their findings. + +########################################################################## + +Pros:  +1. The paper address the important issue of influence functions, which can affect the field of model interpretability. +2. The introduced experimental settings seem to be reasonable to empirically show that using influence functions in deep neural networks can be very sensitive depending on the architectures of neural networks, the use of weight decay, and the choice of hyper-parameters. +3. The experiments are well designed with multiple architectures of neural networks and detasets. + +In the case of non-convex loss functions, the assumption of fist-order influence functions is not generally true. They empirically found that the Taylor’s gap is strongly affected by common hyper-parameters for deep networks. + +########################################################################## + +Cons:  +More ablation studies with different set of hyper parameters and multiple datasets can improve the quality of this paper. 
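For readers less familiar with the setup, the first-order influence estimate whose fragility is being probed is, in the standard notation (mine, not necessarily the paper's):

$$ \mathcal{I}(z, z_{\mathrm{test}}) = -\nabla_\theta L(z_{\mathrm{test}}, \hat{\theta})^\top \, H_{\hat{\theta}}^{-1} \, \nabla_\theta L(z, \hat{\theta}), $$

where $H_{\hat{\theta}}$ is the Hessian of the training loss at the estimated parameters. The inverse-Hessian-vector product is what has to be approximated in practice (e.g. with stochastic estimation or conjugate gradients), which is exactly where the fragility discussed in the summary can enter for non-convex deep networks.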
+ +########################################################################## + +Questions during rebuttal period:  +In the case study using a CNN architecture on the small MNIST, the authors need to perform the experiments with different number of selected training samples with both the highest and the lowest influence scores to amplify their point. +  +#########################################################################",6,4.0,ICLR2021 +5u9L-cafLA8,3,E_U8Zvx7zrf,E_U8Zvx7zrf,The novelty is not too high. Comparison with more advanced baselines is needed.,"This paper proposes a method, called OLCO3, to reduce the communication cost in distributed learning. Experiments on real datasets are used for evaluation. The paper is well written. + +The main idea of OLCO3 is to combine many existing communication reduction methods, including pipelining, gradient compression and periodic averaging. There does exist some novelty in OLCO3, but the novelty is not high. Furthermore, there has appeared one similar paper[A] which combines sparsification, quantization and local SGD into the same framework for communication reduction. But this paper does not cite [A], and empirical comparison with [A] is also not provided. + +Another shortcoming of OLCO3 is that the computation-communication overlapping technique will introduce an extra memory cost of O(sd), which might be unacceptable for large deep models with a huge d when s is relatively large. + +For experiments, the convincingness can be improved if test accuracy/training error vs. wall-clock time is also provided. + +[A]. Basu, Debraj, et al. ""Qsparse-local-SGD: Distributed SGD with quantization, sparsification and local computations."" Advances in Neural Information Processing Systems. 2019. + + +------------------------ +After discussion: + +The authors do not provide rebuttal. Hence, I keep the original opinion to give this paper a weak reject. +",5,4.0,ICLR2021 +heM21lRzQCX,3,AjrRA6WYSW,AjrRA6WYSW,Interesting paper with some work needed.,"Post Rebuttal: + +I think the paper is much improved, and I'm bumping up my score accordingly. + +Small point - in corollary 3.2, to say 'with high probability' still requires that explicit conditions such as $|\beta|/\sqrt{d} \to \infty$ are stated. Thm 3.1 makes no claim of _high_ probability - it instead has an explicit bound, and does not impose conditions needed to make this bound 'high'. As a conseuqnce, the conditions of Thm 3.1 do not suffice for corollary 3.2. + +---- + + +Summary: + +The paper studies the estimation of the number of communities in a graph drawn from a stochastic block model (SBM) using spectral properties of the Bethe Hessian of the observed graph. While this scheme has been proposed in the prior literature, the focus here is on sparse graphs, with average degrees bounded as $o(\log N)$. The authors show a sufficient condition for consistency of this procedure in this regime, and provide simulations as evidence for the effectiveness of the resulting method. These simulations incorporate the challenges of having to estimate the various parameters of the generative model. + +--- + +Strengths: + +The problem is contexaulised well, and related work is adequately discussed and compared to. The theoretical analyses provided are crisp, and use existing results on concentration of random graphs in a nice way. I particularly appreciate that the simulations dealt with the fact that parameters of the models need to be estimated. 
The procedure to do so is well described, with the performance gains clearly demonstrated for small $K$. + +Weaknesses: + +For me there are two main weaknesses: + +a) Section 3 needs some polish. + +First, The section is begun by directly launching into a proof sketch for the relevant method. This comes out of nowhere, and would be helped by describing the precise main result before launching into a proof of it. This could be done, perhaps, by stating Thm 3.3 before the argument, and then launching into the proof sketch. + +Second, I don't think corrolary 3.4 is entirely accurate. There's a minor issue of an additive $(d/N)$ term that's missed on the right hand side. More importantly, in order to show consistency in the usual setting of large graphs that is taken in SBMs, you need to be able to drive the probability of error to $0$, which requires that $\zeta/\sqrt{d} \to \infty$ as $N$ blows up. The necessary condition stated doesn't allow that _unless_ $d_{\max}$ blows up with $N$. This is worth saying since the result of this corollary is ultimately compared to weak recovery of the $2$-SBM, where the claim made is not valid because the condition for weak recovery that is stated holds even when $d_{\max}$ and $d$ are $O(1)$. + +Speaking of, I want to point out that the condition on the top of page 4 is for _weak recovery_, that is, the problem of obtaining a partition which is _slightly_ better than random guessing. The phrasing of the paragraph suggests that the problem being compared to is that of exact recovery, for which the SNR threshold is very different (e.g. degrees need to blow up logarithmically). Also, it is worth noting here that even the distinguishability problem for the SBM (i.e., hypothesis testing between an SBM and an unstructured Erdos-Renyi graph of the same average degree) in the $2$-SBM requires $(a-b)^2 > 2(a+b)$. This is a weaker problem than recovering the number of communities, and so effectively serves as a lower bound. See, e.g. [1]. + +Minor aside - it might be better to state Corollary 3.4 in terms of $\lambda$ instead of $\lambda_K^{\downarrow}$. The reason for this is that as it stands, it looks like the left hand side grows linearly with $N$, while the right hand side grows sublogarithmically. This is, of course, not the case because $\lambda_K^{\downarrow}$ itself must behave at the scale $o(\log N)/N$ in order for the degrees to be $o(\log N)$. The normalised $\lambda$ gets rid of this effect. + +b) In my opinion, the experiments need to be deepened. + +Ultimately the result given can be seen as a thresholding of $\frac{d\lambda}{\sqrt{d_{\max} -1}} \frac{N_{\min}}{N},$ or, in the multinomial prior chosen, in terms of $\frac{d\lambda}{\sqrt{d_\max}-1} \frac{1}{K}$ (with high probability, up to small corrections). The experiments vary the first term by altering $\eta$, but the values of $K$ explored are very small - only $K = 3$ and $4$ are explored. This effectively means that the behaviour of the scheme with respect to the ground truth $K$ is left largely unexplored. (It is claimed that Fig 5.1 shows the behaviour with respect to $N_{\min}$ but this is not the case - for instance, $N_{\min}$ increases by a factor of $1.75$ when going for Fig 5.1 b) to c), but the threshold remains the same. I think what's actually being seen is the limiting effect in that the data aligns more strongly with the large graph limit as you increase $N$.) + +In fact, the behaviour with respect to $K$ has twofold importance. 
First, it enters as a parameter of the threshold that is developed, and this needs to be empirically confirmed. Secondly, a critical issue in many community SBMs is the behaviour of the achievability thresholds as $K$ grows large relative to $N$ - for instance, for large $K$, the Kesten-Stigum theshold and the information theoretic thresholds separate, and a computational-statistical gap is conjectured for weak recovery. Thus exploring how the scheme empirically behaves as $K$ grows is of technical interest. + +Minor comment - in section 5.1, it might be worth explicitly mentioning that you set $\tilde{d} = 3\sqrt{\log N}$ and vary $\rho$ to make the average degree come out correct. Also, it would be valuable to explicitly show the threshold condition of Corollary 3.4 in terms of these new parameters. + +--- + +Overall impressions: + +I quite like this paper. The problem is certainly relevant, and the analyses provided are the first, in my knowledge, to explore the sublogarithmic degree regime in the recovery of the number of communities in a graph with planted partitions with both theoretical and parameter-aware simulation studies. The method is interesting and well fleshed out. I think the paper does have some flaws - most critically the empirical study with respect to $K$, and the issues I mentioned regarding Corollary 3.4 (which is a bit oversold as of now). These are the main bottlenecks in my current rating of marginally below acceptance. I can fully see myself bumping the rating to 7 if they are dealt with. + +--- + +Nitpicking: + +- Operators in mathmode should be escaped via ``````\operatorname {XYZ}, which gives $\operatorname{Diag}(A)$ instead of $Diag(A)$. Similarly \min, \max, \log are available as mathmode commands, giving $N_{\min}$ and not $N_{min}$. + +- I'm not sure that the line plots in section 5 are the best way to present this data, mainly because there are so few data points. I'm also not sure what should be used instead, so this is not totally constructive, but it may be worth exploring other ways (maybe just a table?). Also, i have no clue what CND.MTD.correct means. + +[1]: Banks, J., Moore, C., Neeman, J., & Netrapalli, P. (2016, June). Information-theoretic thresholds for community detection in sparse networks. In Conference on Learning Theory (pp. 383-416).",7,4.0,ICLR2021 +S_P6fy7E5qA,2,#NAME?,#NAME?,This paper proposes a stage-wise strategy to train Generative Adversarial Networks for videos. The contribution of this paper is very limited and the experiments are not convincing.,"Pros: +1. A stage-wise approach to train GANs for video is defined to reduce the computational costs needed to generate long high resolution videos. +2. The authors provide some quality results of the proposed approach. + +Cons: +1. The contribution of this paper is very limited. The authors just do some incremental improvement based on current GAN models, and the theoretical analysis for the stage-wise training approach is not enough. +2. The experiments are not convincing. The authors only compared the baseline methods in the experiments. Besides, the proposed training strategy should be applied in different generation models based on GAN to show the effectiveness in different cases. +3. This paper aims to reduce the computation cost of the model training, but do not achieve significant effect, which takes 23 days for model training. 
+",3,4.0,ICLR2021 +ILghL-IzS3y,3,TEtO5qiBYvE,TEtO5qiBYvE,"Review of ""Continual Memory: Can We Reason After Long-Term Memorization?""","-------------------------------------------------------------------------------------------------------------------------------- +Summary: + +In this paper, the authors propose the Continual Memory (CM) targeted towards a reasoning scenario called “reasoning after memorization”. The main goal of CM is to enable long-term memorization as opposed to memory networks that suffer from gradual forgetting. They evaluate their model both on synthetic data as well as a few downstream benchmarks. + +-------------------------------------------------------------------------------------------------------------------------------- +Overall assessment: + +I really struggled with this one and I think there are some interesting ideas in there. However, it was very hard for me to understand the main motivation and story behind the proposed model and its design choices. Moreover, the task itself is not clearly defined until the experiments section making it really hard to understand the claims and motivations of the work. I will provide detailed feedback below. + +-------------------------------------------------------------------------------------------------------------------------------- +Feedback: + +(1) One thing that can improve the paper substantially is re-structuring the introduction to clearly state the motivation, studied task, proposed solution and the main contributions of the work. + +(1-1) For example, the authors briefly mention QA/VQA/Recommendation in the beginning of the introduction and then do not formally present/discuss their studied task is in the introduction. The QA/VQA/Recommendation are large research areas with many different benchmarks and approaches. Some references to reasoning has also been mentioned in the introduction, but what area in reasoning is this paper specifically studying? It would be very helpful for the reader to understand early on what the target of the paper is. + +(1-2) Some concepts used in the introduction are not well defined. For example, the authors refer multiple times to “Reasoning while experiencing” and “reasoning after memorizing” without formally defining them. I was not familiar with these notions and wasn’t able to find any pointers through online search. However, if these are known concepts in a sub-area, it would be very helpful if the authors can add a citation to where they were originally defined. If not, it would be helpful to formally define them. Another vague concept is ""raw content"". It is not clear what it is referring to. Is it the input? perhaps some source of knowledge? If the task is defined the authors can use examples to make these concepts more clear. + +(1-3) The introduction makes some connections to human cognition all throughout that read a bit subjective and are stated without any citations. For example paragraph 2 in the introduction. + +It is really hard for me to understand the main contributions of the paper and to make a fair assessment until the paper text has been revised. If the authors are willing to submit a modified version during the author response period, I will re-read and re-evaluate my score. ",4,2.0,ICLR2021 +rkxBU7f0qH,4,rklhqkHFDB,rklhqkHFDB,Official Blind Review #3,"The paper presents a Neural Network based method for learning ordinal embeddings only from triplet comparisons. +A nice, easy to read paper, with an original idea. 
+ +Still, there are some issues the authors should address: + +- for the experiment with Imagenet images, it is not very clear how many pictures are used. Is this number 2500? +- the authors state that they use ""the power of DNNs"" while they are experimenting with a neural network with only 4 layers. While there is no clear line between shallow and deep neural networks, I would argue that a 4 layer NN is rather shallow. +- the authors fix the number of layers of the used network based on ""our experience"". For the sake of completeness, more experiments in this area would be nice. +- for Figure 6, there is not a clear conclusion. While, it supports that "" that logarithmic growth of the layer width respect to n is enough to obtain desirable performance."" I don't see a clear conclusion of how to pick the width of hidden layers, maybe a better representation could be used. +- I don't see a discussion about the downsides of the method (for example, the large number of triplet comparison examples needed for training; and possible methods to overcome this problem). +- in section 4.4 when comparing the proposed approach with another methods why not use more complex datasets (like those used in section 4.3) +- in section 4.3, there is no guarantee that the intersection between the training set and test set is empty. +- in section 4.3 how is the reconstruction built (Figure 3b)? + +A few typos found: +- In figure 3 (c) ""number |T of input"" should be ""number |T| of input"" +- In figure 5 (a) ""cencept"" should be ""concept"" +- In figure 8 ""Each column corresponds to ..."" should be ""Each row corresponds to ..."". +- In the last paragraph of A1 ""growth of the layer width respect"" should be ""growth of the layer width with respect"" +- In the second paragraph of A2 ""hypothesize the that relation"" should be ""hypothesize that the relation"". +- In section 4.3 last paragraph, first sentence: ""with the maximunm number"" should be ""with the maximum number"" +",6,,ICLR2020 +B1EW3dbEg,1,H1Heentlx,H1Heentlx,Linear VCCA,"7 + +Summary: +This paper describes the use of variational autoencoders for multi-view representation learning as an alternative to canonical correlation analysis (CCA), deep CCA (DCCA), and multi-view autoencoders (MVAE). Two variants of variational autoencoders (which the authors call VCCA and VCCA-private) are investigated. The method’s performances are compared on a synthetic MNIST dataset, the XRMB speech-articulation dataset, and the MIR-Flickr dataset. + +Review: +Variational autoencoders are widely used and their performance for multi-view representation learning should be of interest to the ICLR community. The paper is well written and clear. The experiments are thorough. It is interesting that the performance of MVAE and VCCA is quite different given the similarity of their objective functions. I further find the analyses of the effects of dropout and private variables useful. + +As the authors point out, “VCCA does not optimize the same criterion, nor produce the same solution, as any linear or nonlinear CCA”. It would have been interesting to discuss the differences of a linear variant of VCCA and linear CCA, and to compare it quantitatively. While it might not make sense to use variational inference in the linear case, it would nevertheless help to understand the differences better. + +The derivations in Equation 3 and Equation 13 seem unnecessarily detailed given that VCCA and VCCA-p are special cases of VAE, only with certain architectural constraints. 
Perhaps move to the Appendix? + +In Section 3 the authors claim that “if we are able to generate realistic samples from the learned distribution, we can infer that we have discovered the underlying structure of the data”. This is not correct, a model which hasn’t learned a thing can have perfectly realistic samples (see Theis et al., 2016). Please remove or revise the sentence. + +Minor: +In the equation between Equation 8 and 9, using notation N(x; g(z, theta), I) as in Equation 6 would make it clearer.",7,4.0,ICLR2017 +B1ggkv9ptr,2,Hye00pVtPS,Hye00pVtPS,Official Blind Review #2,"The authors propose a learning strategy to fit predictive models on data separated across nodes, and for which different set of features are available within each node. +This concept is developed by introducing the concept of two degree separation across horizontal (nodes) and vertical (feature) axis. The proposed approach consists in an iterative scheme where i) models at independentently trained at each site, and ii) models' parameters are subsequently averaged and redistributed for the next optimisation round. + +The problem tackled in this work is interesting, with an important application on medical records from > 100,000 individuals followed over time. Unfortunately the paper is not clear in several aspects, and presents methodological issues. Here my main comments on this work: + +- The authors should definitely refer to the concept of meta-learning [1], which addresses modelling problems very close to the one presented in this work: training a meta-model by aggregating information from different learning tasks. The paper should definitely compare the proposed methodology with respect to this paradigm. + +- The fact that the parameters can be averaged across nodes implies that they must be of same dimension. This is counterintuitive, as the dimension of the data represented at each site may significantly differ depending on the kind of considered feature. This aspect points to some methodological inconsistency. + +- There is no comparison with any other federated method, neither with any classification method besides a NN, at least with the aggregated data. Also it could have been possible to reduce the number of input features using simple dimensionality reduction previous to the NN, such as PCA. + +- Vertical separation importance: At the end it looks like diagnosis is the main driver for the classification, showing results that are comparable to the ones obtained with the aggregated data. It is therefore not clear whether the proposed application allows to clearly illustrate the benefit of using this method with regard to vertical separation. + +- All in all, the paper appears in a draft form, and the text is often inconsistent. For example, there is often inconsistency in the number of branches, or types of data considered, figures are not self-explanatory and present notation and symbols not defined anywhere. The bibliography is given in a non-standard format. + +[1] Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. Finn, C., Abbeel, P., & Levine, S. Proceedings of the 34th International Conference on Machine Learning-Volume 70 (pp. 1126-1135). ",1,,ICLR2020 +Sklmiy_6FH,2,rygw7aNYDS,rygw7aNYDS,Official Blind Review #1,"This paper studies the asymptotic properties of action value function $Q$ and value function $V$. Specifically, the authors assume that we can collect $n$ data points $\{(s_t,a_t,r_t(s_t,a_t),s_{t+1})\}_{t=1}^{n}$. 
Based on the collected data, the authors calculated the sample mean of the unknown reward function $\widehat r_n$ and transition probability $\widehat{P}(\cdot|s,a)$. They further defined an estimator $\widehat Q_n$, which is the fixed point of the empirical Bellman operator $\widehat{\mathcal{T}}_n$ derived from $\widehat r_n$ and $\widehat P_n$. The authors proved under certain assumptions that $\widehat Q_n\rightarrow Q^*$ almost surely. Based on this argument, they also derived a similar convergence of value function $\widehat V_n$ in distribution. Confidence intervals can also be established based on these results. + +The proof of convergence in distribution is established based on the following idea: $\widehat r_n$ and $\widehat P_n$ converge to $r$ and $P$ by central limit theorem. By examining the Jacobian matrix of $\widehat{\mathcal{T}}_n$ with respect to $\widehat Q$, it is found that the implicit function $\widehat{ \mathcal{T}}_n \widehat Q -\widehat Q=0$ can be solved in small neighborhoods of $\widehat r_n, \widehat P_n$ and $\widehat Q_n$ as +$$ +\widehat{Q}_n = \phi(\widehat{r}_n, \widehat P_n) \qquad\text{(implicit function theorem)} +$$ +Then by Slutsky theorem, $\widehat Q_n = \phi(\widehat r_n, \widehat P_n)$ also converges to $Q^*$ in distribution. The proof is straightforward and easy to follow. However, I have some concerns about the above analysis. + +1. The data $\{(s_t,a_t,r_t(s_t,a_t),s_{t+1})\}_{t=1}^{n}$ are collected by executing a fixed policy for n steps, which means the data points are not i.i.d. Therefore, the convergence of $\widehat r_n$ and $\widehat P_n$ by central limit theorem may not hold ground, which further leads the later proofs potentially problematic. + +2. The $\phi(\cdot,\cdot)$ function in implicit function theorem only exists in neighborhoods of $\widehat r_n, \widehat P_n$ and $\widehat Q_n$. It is unclear from the current proof that $\widehat Q_n$ fall into the same neighborhood of $Q^*$. Therefore, it needs to be justified that the limit point is $Q^*$. + +3. Even for the empirical estimator $\widehat Q_n$, it is still the fixed point of $\widehat{ \mathcal{T}}_n$, which is hard to compute or solve. Thus Algorithm 1 may be inefficient in practice. + +Other comments: + +The font and margin of this paper does not conform the format requirement of ICLR. +What is $\sigma_R(i,j)$ in Theorem 1 and Corollary 1? It seems to be undefined. + +I found it hard to read when the authors vectorize all the matrices and tensors to $N$ dimension vectors and $N\times N$ matrices. In particular, the rearranged notations such $\mu((i-1)m_a+j)$, $\tilde P^{\pi}((i-1)m_a+j,(i’-1)m_a+j’)$ are much more complicated that $\mu(i,j)$ and $\tilde P^{\pi}(i’,j’|i,j)$. Similarly, the mapping $F(Q’,r’,P’)$ defined in the proof of Theorem 1 can be represented by the Bellman operator $\mathcal{T}$ defined on page 3. + +====post rebuttal== +I read other reviewers' comments and the author's response. I do not think the authors have addressed all my concerns. I will keep my rating. ",3,,ICLR2020 +Hy2hrgMEe,3,BkfiXiUlg,BkfiXiUlg,"interesting idea, weak experiments","This paper introduces a novel hierarchical memory architecture for neural networks, based on a binary tree with leaves corresponding to memory cells. This allows for O(log n) memory access, and experiments additionally demonstrate ability to solve more challenging tasks such as sorting from pure input-output examples and dealing with longer sequences. 
+ +The idea of the paper is novel and well-presented, and the memory structure seems reasonable to have advantages in practice. However, the main weakness of the paper is the experiments. There is no experimental comparison with other external memory-based approaches (e.g. those discussed in Related Work), or experimental analysis of computational efficiency given overhead costs (beyond just computational complexity) despite that being one of the main advantages. Furthermore, the experimental setups are relatively weak, all on artificial tasks with moderate increases in sequence length. Improving on these would greatly strengthen the paper, as the core idea is interesting.",5,5.0,ICLR2017 +HyxvH-iRFS,3,Bkgwp3NtDH,Bkgwp3NtDH,Official Blind Review #1,"This paper proposed a ""learnable"" trojan by training a neural network that takes a sample (e.g. an image) as an input and generates a programmable trigger pattern as an output. The authors argue that the proposed method can support dynamic and out scope target classes, which are particularly applicable to backdoor attacks in the transfer learning setting. The authors conducted experiments on large-scale models (ImageNet) in two settings: (1) outsourced training attack; (2) transfer learning attack. + +Although the idea of making a backdoor attack more robust (e.g. more transferrable) and programmable is interesting, I don't think the current results fully substantiate the claimed benefits. Below are my concerns: + +1. Lack of performance comparison: in the outsourced training attack, the attack success rate is quite low ( the best top-1 attack rate is ~50% on VGG). In the existing literature, the backdoor attack success rate can be made nearly 100%. This makes me wonder whether the low attack success rate only occurs in the proposed attacking method. Since there are no results from existing attacks, there is no way to evaluate how good the proposed attack is. + +2. How about small dataset/model? I understand that the authors want to emphasize the scalability of their attack to large models like ImageNet. However, there are no comparisons on ImageNet (my comment 1). The authors are suggested to compare performance on standard datasets in backdoor attack literature (e.g. CIFAR-10, traffic sign) + +3. It was unclear how ""dynamic"" the proposed method can be. Based on the attack formulation in equation (3), in order to make the attack ""dynamic"" in terms of changing different target classes, the attackers need to train a neural trojan network for every target class, which does not seem to be dynamic to me. Can the authors further justify the advantage of the dynamic feature in the proposed attack? And I have concerns about how many target classes an attacker can ""dynamically"" change. Some experiments showing the number of target classes vs attack performance and clean data accuracy will be very helpful. + +4. The defense argument against detection methods is weak. Unless the authors can show the proposed attack has the ability to simultaneously backdoor all possible target classes, simply arguing the attack is dynamic and thus can evade detection is not convincing, not to mention in the backdoor setting, attacker should make the first move before the defender takes action. + +*** Post-rebuttal comments +I thank the authors for the clarification. However, I feel my comments have not been fully addressed, especially on the part on justifying 50% attack success rate on VGG should be considered as significant in the considered setup. 
Without any valid comparisons, I find it difficult to assess the contributions. In addition, the authors did not add new empirical evidence regarding my questions but mainly re-iterated the applicability of the proposed method, so I will remain my review rating. +***",1,,ICLR2020 +QN95cZ5R_m,4,M_KwRsbhi5e,M_KwRsbhi5e,Interesting and successful approach to branch selection,"The paper proposes a method of choosing variables for branching in branch and bound approaches to MIP solving. The approach is based on reinforcement learning. + +The paper presents an adequate overview of previous approaches to the problem. There is not a lot of detail about how these approaches work but the overview of the techniques given allows the reader to see how the proposed approach differs from this earlier work and motivates the technique. + +The MDP formulation used appears not to be novel, but the paper presents a novel way to use reinforcement learning to find good strategies. + +The paper makes a clear argument about why simply choosing the same variable as strong branch would is not the best variable selection strategy. This is supported by later experimental results. I am not sufficiently close to the field to know whether this is a novel argument or accepted fact. In any case, it seems worth stating here as it illustrates a significant fault with many existing alternative approaches. + +There are quite a lot of decisions made in the design of the algorithm. While it is clear what has been done, I am unclear on why many of these choices have been made. For example, what is it about NS-ES that makes the authors think it is suited to this task? Is there some feature of the problem that means this is an appropriate method? Ditto the novelty metric. Is is clear how to compute the metric, but there is a lack of argument or intution on why this is an appropriate way to measure the distance between two sets of polytopes. + +The benchmarks used in the testing are sensible. It is always easy to raise questions about whether testing could be on a larger set of problems, but those presented here seem suitable for a conference paper. The competing methods evaluated against are appropriate and fair. The setting does not seem biased towards any of the methods. + +The novel approach appears significantly to beat other learning approaches on two of the problem types. The new approach seems about the same as the other approaches on the facility location problems. It would be interesting to understand what it is about this problem that gives different results. + +In table 1, wins are defined by number of nodes visited. In table 2, the time taken is used instead. Either method could be justified, but to change between experiments without good justification looks dubious. + +A side benefit of the paper is that it results in determining branching rules which seem to perform well while not (intentionally) imitating strong branching. There is some investigation of why this is and why the technique works. Further research could build on this and the result may lead to further work on manually constructed branching rules. + +There is a discussion section in the appendices. This seems very odd. + +The paper is very well written - the language is easy to understand and the arguments being made are clear. +",8,3.0,ICLR2021 +SJgLFl3CFB,1,ryx6daEtwr,ryx6daEtwr,Official Blind Review #1,"This paper introduces a method to detect cars from a single image. 
The method imposes several handcrafted constraints specific to the dataset in order to achieve higher improvement and efficiency. These constraints are quite strong and they not generalize to new situations (eg. a car in the sky, a car upside down a car with multiple wheels). The results do not seem particularly strong because the dataset seems easy and the improvements over previous works is small. + +I would suggest to emphasise the improvements over previous works and send this paper to a specialized journal or venue in vehicle detection. ",1,,ICLR2020 +B1xpAhx63m,2,B1xOYoA5tQ,B1xOYoA5tQ,"Novel approach to classification for resiliance against adversial attacks, supported by multiple experiments.","This paper argues that a random orthogonal output vector encoding is more robust to adversarial attacks than the ubiquitous softmax. The reasoning is as follows: + +1. different models that share the same final softmax layer will have highly correlated gradients in this final layer +2. this correlation can be carried all the way back to the input pertubations +3. the use of a multi-way encoding results in a weaker correlation in gradients between models + +I found (2) to be a surprising assumption, but it does seem to be supported by the experiments. These show a lower correlation in input gradients between models when using the proposed RO encoding. They also show an increased resiliance to attack in a number of different settings. + +Overall, the results seem to be impressive. However, I think the paper would be a lot stronger if there was a more thorough investigation of the correlation between gradients in all layers of the models. I did not find the discussion around Figure 1 to be very compelling, since it is only relevant to the encoding layer, while we are only interested in gradients at the input layer. The correlation numbers in Table 2 are unexpected and interesting. I would like to see a deeper investigation of these correlations. + +I am not familiar with the broader literature in this area, so giving myself low confidence. +",6,2.0,ICLR2019 +r1orV4zNg,3,HkE0Nvqlg,HkE0Nvqlg,review,"This is a very nice paper. The writing of the paper is clear. It starts from the traditional attention mechanism case. By interpreting the attention variable z as a distribution conditioned on the input x and query q, the proposed method naturally treat them as latent variables in graphical models. The potentials are computed using the neural network. + +Under this view, the paper shows traditional dependencies between variables (i.e. structures) can be modeled explicitly into attentions. This enables the use of classical graphical models such as CRF and semi-markov CRF in the attention mechanism to capture the dependencies naturally inherit in the linguistic structures. + +The experiments of the paper prove the usefulness of the model in various level — seq2seq and tree structure etc. I think it’s solid and the experiments are carefully done. It also includes careful engineering such as normalizing the marginals in the model. + +In sum, I think this is a solid contribution and the approach will benefit the research in other problems. 
+",8,4.0,ICLR2017 +rkenLPW8tS,1,Hkxvl0EtDH,Hkxvl0EtDH,Official Blind Review #1,"The paper tackles a very important problem, manipulation robustness of modern machine learning models, by applying a chain of design choices perfectly well: + * VAEs as probabilistic generators + * Manipulations as independent causes that can be generated by interventions + * Causal inference for manipulation-corrected training + * Bayesian inference for robust prediction + +Sticking to the Bayesian point of view, the method can perform model averaging across potential manipulations at the test time. This is an extremely elegant property, which is also proven in the experiments to be very effective. + +Being able to learn previously unseen types of manipulations only by proper application of causal inference tools is a very important news for the adversarial robustness community. + +Figure 9 is a spectacular proof of concept to illustrate the disentanglement property of the proposed method. + +I have only one point for improvement. I do not buy the argument that we should do q(m|x) but not q(m|x,y). Why should we exclude class-specific manipulations? This wouldn't affect the cause (the class) but only the outcome, so would be a valid manipulation. I would actually expect the model to work still well with q(m|x,y). Could the authors comment on what the concrete benefit of leaving out y from manipulations is? + +The construction of the loss in Eq 5 is in spirit semi-supervised learning on m. We show the true m=0 cases to the model during training but assume not to know the labels of the manipulated samples. It could be beneficial to draw a link to semi-supervised learning here and even tie it to the arguments about the relationship between causal inference and semi-supervised learning: + +Schölkopf et all., On Causal and Anticausal Learning, ICML, 2012. +",8,,ICLR2020 +SkxkjiCatr,3,Hye87grYDH,Hye87grYDH,Official Blind Review #1,"The paper proposes ""sparse self-attention"", where only top K activations are kept in the softmax. The resulting transformer model is applied to NMT, image caption generation and language modeling, where it outperformed a vanilla Transformer model. + +In general, the idea is quite simple and easy to implement. It doesn't add any computational or memory cost. The paper is well written and easy to read. The diverse experimental results show that it brings an improvement. And I think this can be combined with other improvements of Transformer. + +However, there are quite many baselines are missing from the tables. The sota on De-En is actually 35.7 by Fonollosa et.al. On enwik8, Transformer XL is not the best medium sized model as the authors claimed. See below: + +NTM En-De: +- Wu et.al. Pay Less Attention with Lightweight and Dynamic Convolutions, 2019 +- Ott et.al. Scaling Neural Machine Translation, 2018 +NTM En-Vi: +- Wang et.al. SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation, 2018 +NTM De-En: +- Wu et.al. Pay Less Attention with Lightweight and Dynamic Convolutions, 2019 +- Fonollosa et.al. Joint Source-Target Self Attention with Locality Constraints, 2019 +- He et.al. Layer-Wise Coordination between Encoder and Decoder for Neural Machine Translation, 2018 +LM Enwik8: +- Sukhbaatar et.al, Adaptive Attention Span in Transformers, 2019 + +Other comments: +- More experimental details are needed. What is the value K? How different K values affect performance? What is the number of parameters of NMT models. 
+- The claim ""top layer of the vanilla Transformer focuses on the end position of the text"" can't be true generally. Probably only true for a certain task. +- Where the numbers in Figure 1 come from? Is it a single attention head or average of all? +- Page 4, ""the high are ..."" probably typo? +- The related work is missing ""Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes"" by Rae et.al., which also uses sparse attention.",6,,ICLR2020 +SkSfxq9xM,2,r1hsJCe0Z,r1hsJCe0Z,"Interesting and challenging application with impressive results, but maybe a bit narrowly focused in its scope. ","This paper introduces a neural network architecture for fixing semantic bugs in code. Focusing on four specific types of bugs, the proposed two-stage approach first generates a set of candidate repairs and then scores the repair candidates using a neural network trained on synthetically introduced bug/repair examples. Comparing to a prior sequence-to-sequence approach, the proposed approach achieved dominantly better accuracy on both synthetic and real bug datasets. On a real bug dataset constructed from GitHub commits, it was shown to outperform human. + +I find the application of neural networks to the problem of code repair to be highly interesting. The proposed approach is highly specialized for the specific four types of bugs considered here and appears to be effective for fixing these specific bug types, especially in comparison to the sequence-to-sequence model based approach. However, I was wondering whether limiting the output choices (based on the bug type) is going a long way toward improving the performance compared to seq-2-seq, which does not utilize such output constraints. What if we introduce the same type of constraints for the seq-2-seq model? For example, one can simply modifying the decoding process such that for locations that are not in the candidate set, the network simply makes no change, and for candidate-repair locations, the output space is limited to the specific choices provided in the candidate set. This will provide a more fair comparison between the different models. +Right now it is not clear how much of the observed performance gain is due to the use of these constraints on the output space. + +Is there any control mechanism used to ensure that the real bug test set do not overlap with the training set? This is not clear to me. + +I find the comparison result to human performance to be interesting and somewhat surprising. This seems quite impressive. The presented example where human makes a mistake but the algorithm is correct is informative and provides some potential explanation to this. But it also raises a question. The specific example snippet could be considered to be correct when placed in a different context. Bugs are context sensitive artifacts. The setup of considering each function independently without any context seems like an inherent limitation in the types of bugs that this method could potentially address. Some discussion on the limitation of the proposed method seems to be warranted. + + + + +Pro: +Interesting application +Impressive results on a difficult task +Nice discussion of results and informative examples +Clear presentation, easy to read. + +Con: +The comparison to baseline seq-2-seq does not seem quite fair +The method appears to be highly specialized to the four bug types. 
It is not clear how generalizable it will be to more complex bugs, and to the real application scenarios where we are dealing with open world classification and there is not fixed set of possible bugs. +",6,4.0,ICLR2018 +X6Ibs-oztA,3,XvOH0v2hsph,XvOH0v2hsph,Official Blind Review #3,"Instead of using validation accuracy to determine the efficacy of a network, this paper recommends to use Sum over Training Losses (SOTL). SOTL-E is a variant where the sum of training losses begins to be computed after the first E epochs. +They also designed an early stopping mechanism based on Baker et al's SVR where they extrapolate SOTL instead of validation accuracy. + +Questions: +1. How can training loss be used to identify a good network? It should theoretically lead to overfitting and poor generalization. Going by this argument, if we apply any kind of regularization such as dropout or weight decay, the training loss would not be low while the test accuracy might still improve. +It is surprising that SOTL-E is able to rank the networks better than TestL at 200 for cifar10 and cifar100. Why do you think this is the case? + +2. DARTS Experiment: + (a) In Figure 3(a), how is DARTS search replicated using just 100 random architectures? As it uses a SuperNet, it requires all possible architectures possible with the chosen operations. + (b) In Figure 3(b) and 3(c), the final architecture needs to be trained for 600 epochs. So it is natural that the rank correlation of SOVL, SOVL-E etc is poor for the first 100 epochs. + +3. What would be more interesting is, DARTS currently they perform a bi-level optimization. So instead of the architecture parameters $\alpha$ trying to minimize the validation loss, can they also minimize the training loss (theoretically this should not generalize well)? If not this, can you devise a way to plug in SOTL in the bi-level optimization and choosing the best architecture? If SOTL performs well even in that case, then you could a stronger case. + +4. Training very deep networks is not easy and takes more than 100 epochs to obtain good accuracy. So your observation might be a side effect of that too. As SOTL could be applied to any deep learning networks, can you also repeat the experiment by training 100 networks sampled from a smaller search space, such as mobile search space (mobilenet, squeezenet, shufflenet etc), that takes less than 80 epochs to finish training, to see if it still holds true? Then use this SOTL and SOTL-E to determine the best network. Also compare with the baselines. + +5. In Figure 4, how do SOVL, SOVL-E and Validation accuracy fare? Please include those too in the plots. + +6. What is the difference between the setup of 1 and 2 (a) to (c) apart from the fact that SOVL-E and TestL are not included in 2? + +7. If SOTL-E is the average training loss of the final epoch, why not call it that? (I understand that it is still the sum of losses for all the batches but as it not across epochs it is misleading). + +As this is paper is proposing something that is fundamentally opposite to what has been studied widely thus far, it requires a lot more scrutiny. I do not think we can accept it with just empirical results and the theoretical motivation currently provided. + +_____________________________ +_____________________________ +Post Rebuttal: + +Thank you for replying to all of my questions.. + +Plugging in your new metric to DARTS seems to be promising, especially if it alleviates the DARTS collapse problem. 
Given that the community is more interested in one-shot NAS algorithms, this might be worthwhile pursuing + +From the new plot in Figure 4 and NAS experiment in Figure 5, it is evident that the sum of training loss is able to rank the networks more effectively in the first 50 epochs. So one could use SOTL-E for early stopping rather than validation accuracy. This would also be effective in hyperband where the architectures are discarded after training them for very few epochs. + + +",6,4.0,ICLR2021 +ByeL_F-q9B,3,rJxe3xSYDS,rJxe3xSYDS,Official Blind Review #2,"This work addresses the problem of training softmax classifiers when the number of classes is extreme. The authors improve the negative sampling method which is based on reducing the multi-class problem to a binary class problem by introducing randomly chosen labels in the training. Their idea is generating the fake labels nonuniformly from an adversarial model (a decision tree). They show convincing results of improved learning rate. +The work is very technical in nature, but the proposal is presented in detail and in a didactic way with appropriate connections to alternative methods, so that it may be useful for the non-expert (as me). +That is the reason why I recommend to accept this work: even not being an expert I found the paper educative in introducing the problem and interesting in explaining the proposal. ",8,,ICLR2020 +rJenDhy5nX,3,BJfvAoC9YQ,BJfvAoC9YQ,Continual learning approach with increasing computational cost over time,"This paper proposes a continual learning approach which transforms intermediate representations of new data obtained by a previously trained model into new intermediate representations that are suitable for a task of interest. +When a new task and/or data following a different distribution arrives, the proposed method creates a new transformation layer, which means that the model’s capacity grows proportional to the number of tasks or data sets being addressed over time. Intermediate data representations are stored in memory and its size also grows. +The authors have demonstrated that the proposed method is robust to catastrophic forgetting and it is attributed to the feature transformation component. However, I’m not convinced by the experimental results because the proposed method accesses all data in the past stored in memory that keeps increasing infinitely. The authors discuss very briefly in Section 5.2 on the performance degradation when the memory size is restricted. In my opinion, the authors should discuss this limitation more clearly on experimental results with various memory sizes. + +The proposed approach would make sense and benefit from storing lower dimensional representations of image data even though it learns from the entire data over and over again. +But it is unsure the authors are able to claim the same argument on a different type of data such as text and graph.",4,3.0,ICLR2019 +Yq6DO9S_lf,3,UwOMufsTqCy,UwOMufsTqCy,"Good paper, accept","Summary: + +This paper presents a new rule based classifier called Rule based Representation Learner (RRL), that automatically learns interpretable non-fuzzy rules for data representation. In order to train this model efficiently the paper presents a learning algorithm that projects the discrete RRL model to a continuous space and hence optimise the model using gradient descent. Through experiments on 9 small and 4 large datasets shows that RRL improves over other methods, has low complexity and interpretable. 
+ +Reasons for score: + +I think this is a Good paper and vote for accepting. This paper presents a solid contribution in learning rule based classifiers and hence to interpretable machine learning. The paper is clearly written and superiority of the presented models are backed by strong experimental results. + +Pros: + +1. The paper is clearly written and easy to follow. +2. The paper addressees an important problem of interpretable machine learning and presents a novel model which is scalable. +I find the technique of gradient grafting particularly interesting. +3. Very strong experimental results and ablation studies. + +",7,2.0,ICLR2021 +tkbFYWlIjG,4,IUaOP8jQfHn,IUaOP8jQfHn,A timely and well-executed evaluation of existing methods,"The paper presents an empirical evaluation of a number of recent models for unsupervised object-based video modelling. Five different models are evaluated on three (partially novel) benchmarks, providing a unifying perspective on the relative performance of these models. Several common issues are identified and highlighted using challenge datasets: The reliance on color as a cue for object segmentation, occlusion, object size, and change in object appearance. The paper concludes with several ideas for alleviating these issues. + +Strengths: + 1. The paper represents a much needed comparison of several related models which have previously not been evaluated on common benchmarks. Given the rapidly increasing number of competing models in this space, I believe analysis papers like this one serve an important role. + 2. The paper highlights important weaknesses of unsupervised object models, such as the overreliance on color or the difficulties with handling occlusion. While I believe that some of these weaknesses were already known to the people working with these methods, they have not always been formally documented in the respective publications. This paper rectifies this, and also provides guidance as to the relative vulnerability of the different methods. + 3. The methodology is convincing and thorough. Architecture and hyperparameter choices are clearly documented in the appendix, and the datasets have been published. + +Weaknesses: +1. As an analysis paper, this publication does not provide specific new technical contributions to the issues it is evaluating. +2. The results are not entirely conclusive, in that there is no clear best model, and their relative quality varies with datasets and metrics. That said, some things can be very clearly observed, for instance the importance of object size for the performance of TBA. + +Overall, the paper serves an important role in consolidating the ecosystem of unsupervised object representations. Given the increasing need for such analysis papers, and the competent execution, I recommend acceptance.",7,3.0,ICLR2021 +xsSRr2EEso,4,sgNhTKrZjaT,sgNhTKrZjaT,Poor literature review and poor experiments ,"The paper focused on the issue of learning a policy for a given task using the learned representations a pre-trained VAE. The authors visualize that using a learned latent space of a pre-trained VAE is not good enough for learning policies and propose a solution for this problem: back-propagate gradient policies through the VAE encoder. The authors proposed two versions on this method, one with pre-training and one fully online. + +This paper suffers from fundamental issues. The biggest one in my opinion is its poor literature review. 
There is a bold statement in the introduction which shapes up the problem that the paper is trying to address and I quote: ""Recently, many such approaches employed the VAE framework which aims to learn a smooth representation of its domain. Most of these approaches follow the same pattern: First, they build a dataset of states from the RL environment. Second, they train the VAE on this static dataset and lastly train the RL mode using the VAE’s representation."" But the authors do not provide any reference for this. The related work section is also more focused on VAEs rather than its combination with RL. And as one would expect, there is no comparison with previous work either. Joint training of a VAE with a reward predictor or a policy network is not new. There is a massive body of research doing this. PlaNet, Dreamer by Danijar Hafner et al and their variations being the latest ones. + +The experiments are also done on a single Atari game (Breakout) which is not sufficient to demonstrate the capability of the proposed method in a variety of tasks. And again, with no comparison to ANY of the previous work. + +Overall, the paper is suffering from multiple fundamental issues which makes it hard to be accepted as a scientific contribution in a top conference. ",1,5.0,ICLR2021 +cV6iNfZge1v,3,Xa3iM4C1nqd,Xa3iM4C1nqd,A new similarity loss to fine-tune representations using auxiliary tasks.,"The authors improve the fine-tuning of a data representation allowing +resistance and recognition of adversarial or corrupt inputs. Their method, +URRL, uses an auxiliary unsupervised task and robust data augementation to +improve performance for both clean and altered inputs. A key addition is that, +while fine-tuning on the auxiliary (downstream) task, performance on the +original data is maintained by favoring small differences between original +robust representation and the fine-tuned representation. + +The work is well presented, and the main significance is in providing a fast +way to partially gain some of the robustness available in much more difficult +adversarial training. This might be of significance in applications like +online learning, where only fast fine-tuning is practical. + +Their approach limits how much fine-tuning is allowed to deviate from the +original, robust representation. It is a *fast* way to partially recover some +of the benefits of a much more expensive adversarial training. The problem +they address is important. For example, observations in related works often +note a slight decrease in performance on the original supervised task during +later fine-tuning. The issue to resolve is how to nicely limit degradation, +even in face of aggressive or long-term fine-tuning. + +To maintain robustness during fine-tuning, one (expensive) option is to +adversarially fine-tune the downstream task. The authors propose a quicker +approach: a loss function maintaining similarity between fine-tuned and +original representation (e.g. low fine-tuning distortion of dot-product +similarity). + +The experiments nicely evaluate robustness with a number of relevant metrics, +using ImageNet with many corruption types. Results are presented as improvements +to the MoCo v2 codebase. The use of a linear classifier stage seemed reasonable to me. + +Firstly they address what sets of data augmentations are most effective for +images, using a mechanism that probabilistically switches between augmentation +types for supervised and unsupervised tasks. 
+ +Secondly, the more interesting results with similarity loss showed very good (usually best) +adversarial performance. Accuracy on clean data was degraded by ~ 10%, still less than the +degradation caused by adversarial training. Similarity loss can be applied to other existing +techniques (like their MoCo baseline). + +Q: Does the similarity measure loss still make sense if fine-tuning uses +data sampled from a new/changing input distribution? It might +(see also ""Test-Time Training with Self-Supervision +for Generalization under Distribution Shifts"", Sun et al.) + +--- + +The authors' comments strengthen the argument for the method. I (perhaps liberally) conceptualize +their approach as one of tethering ""close to a rotation"" but still don't have a simple understanding of +why/when this should be better than tethering ""close to original parameters"". My original rating is unchanged.",7,4.0,ICLR2021 +HJeJHrJc27,3,B1lKS2AqtX,B1lKS2AqtX,"The authors propose a futurue video prediction model based on recurrent 3D-CNNs and propose a novel memory mechanism (Eidetic Memory) to capture long term relationships inside the recurrent layer itself. They obtain surpass the state of the art on two commonly used, (relatively) simple benchmark video prediction datasets. They further apply their model to early action recognition, performing an ablation study to evaluate the strengths of each model building block.","AFTER REBUTTAL: + +This is an overall good work, and I do think proves its point. The results on the TaxiBJ dataset (not TatxtBJ, please correct the name in the paper) are compelling, and the concerns regarding some of the text explainations have been corrected. + +----- + +The proposed model uses a 3D-CNN with a new kind of 3D-conv. recurrent layer named E3D-LSTM, an extension of 3D-RCNN layers where the recall mechanism is extended by using an attentional mechanism, allowing it to update the recurrent state not only based on the previous state, but on a mixture of previous states from all previous time steps. + +Pros: +The new approach displays outstanding results for future video prediction. Firstly, it obtains better results in short term predictions thanks to the 3D-Convolutional topology. Secondly, the recall mechanism is shown to be more stable over time: The prediction accuracy is sustained over longer preiods of time (longer prediction sequences) with a much smaller degradation. Regarding early action recognition, the use of future video prediction as a jointly learned auxiliary task is shown to significantly increase the prediction accuracy. The ablation study is compelling. + +Cons: +The model does not compare against other methods regarding early action recognition. Since this is a novel field of study in computer vision, and not too much work exists on the subject, it is understandable. Also, it is not the main focus of the work. + +In the introduction, the authors state that they account for uncertainty by better modelling the temporal sequence. Please, remove or rephrase this part. Uncertainty in video prediction is not due to the lack of modelling ability, but due to the inherent uncertainty of the task. In real world scenarios (eg. the KTH dataset used here) there is a continuous space of possible futures. In the case of variational models, this is captured as a distribution from which to sample. Adversarial models collapse this space into a single future in order to create more realistic-looking predictions. 
I don't believe your approach should necessarily model that space (after all, the novelty is on better modelling the sequence itself, not the possible futures, and the model can be easily extended to do so, either through GANs or VAEs), but it is important to not mislead the reader. + +It would have been interesting to analyse the work on more complex settings, such as UCF101. While KTH is already a real-world dataset, its variability is very limited: A small set of backgrounds and actions, performed by a small group of individuals. + +",7,5.0,ICLR2019 +SJWidMceG,2,r17lFgZ0Z,r17lFgZ0Z,borderline,"1) This paper conducts an empirical study of different unsupervised metrics' correlations in task-oriented dialogue generation. This paper can be considered as an extension of Liu, et al, 2016 while the later one did an empirical study in non-task-oriented dialogue generation. + +2)My questions are as follows: +i) The author should give the more detailed definition of what is non-task-oriented and task-oriented dialogue system. The third paragraph in the introduction should include one use case about non-task-oriented dialogue system, such as chatbots. +ii) I do not think DSTC2 is good dataset here in the experiments. Maybe the dataset is too simple with limited options or the training/testing are very similar to each other, even the random could achieve very good performance in table 1 and 2. For example, the random solution is only 0.005 (out of 1) worse then d-scLSTM, and it also has a close performance compared with other metrics. Even the random could achieve 0.8 (out of 1) in BLEU, this is a very high performance. +iii) About the scatter plot Figure 3, the authors should include more points with a bad metric score (similar to Figure 1 in Liu 2016). +iv) About the correlations in figure b, especially for BLEU and METEOR, I do not think they have good correlations with human's judgments. +v) BLEU usually correlates with human better when 4 or more references are provided. I suggest the authors include some dataset with 4 or more references instead of just 2 references. +",5,4.0,ICLR2018 +SJB-0Mtlz,1,r1RQdCg0W,r1RQdCg0W,"Good ideas, but insufficient results","The manuscript proposes an efficient hashing method, namely MACH, for softmax approximation in the context of large output space, which saves both memory and computation. In particular, the proposed MACH uses 2-universal hashing to randomly group classes, and trains a classifier to predict the group membership. It does this procedure multiple times to reduce the collision and trains a classifier for each run. The final prediction is the average of all classifiers up to some constant bias and multiplier as shown in Eq (2). + +The manuscript is well written and easy to follow. The idea is novel as far as I know. And it saves both training time and prediction time. One unique advantage of the proposed method is that, during inference, the likelihood of a given class can be computed very efficiently without computing the expensive partition function as in traditional softmax and many other softmax variants. Another impressive advantage is that the training and prediction is embarrassingly parallel, and thus can be linearly sped up, which is very practical and rarely seen in other softmax approximation. + +Though the results on ODP dataset is very strong, the experiments still leave something to be desired. +(1) More baselines should be compared. 
There are lots of softmax variants for dealing with large output space, such as NCE, hierarchical softmax, adaptive softmax (""Efficient softmax approximation for GPUs"" by Grave et. al), LSH hashing (as cited in the manuscript) and matrix factorization (adding one more hidden layer). The results of MACH would be more significant if comparison to these or some of these baselines can be available. +(2) More datasets should be evaluated. In this manuscript, only ODP and imagenet are evaluated. However, there are also lots of other datasets available, especially in the area of language modeling, such as one billion word dataset (""One billion +word benchmark for measuring progress in statistical language modeling"" by Chelba et. al) and many others. +(3) Why the experiments only focus on simple logistic regression? With neural network, it could actually save computation and memory. For example, if one more hidden layer with M hidden units is added, then the memory consumption would be M(d+K) rather than Kd. And M could be a much smaller number, such as 512. I guess the accuracy might possibly be improved, though the memory is still linear in K. + +Minor issues: +(1) In Eq (3), it should be P^j_b rather than P^b_j? +(2) The proof of theorem 1 seems unfinished",6,4.0,ICLR2018 +rkeS5u3TYH,1,Bye4iaEFwr,Bye4iaEFwr,Official Blind Review #3,"This work studies the predictive uncertainty issue of deep learning models. In particular, this work focuses on the distributional uncertainty which is caused by distributional mismatch between training and test examples. The proposed method is developed based on the existing work called Dirichlet Prior Network (DPN). It aims to address the issue of DPN that its loss function is complicated and makes the optimization difficult. Instead, this paper proposes a new loss function for DPN, which consists of the commonly used cross-entropy loss term and a regularization term. Two loss functions are respectively defined over in-domain training examples and out-of-distribution (OOD) training examples. The final objective function is a weighted combination of the two loss functions. Experimental study is conducted on one synthetic dataset and two image datasets (CIFAR-10 and CIFAR-100) to demonstrate the properties of the proposed method and compare its performance with the relevant ones in the literature. The issue researched in this work is of significance because understanding the predictive uncertainty of a deep learning model has its both theoretical and practical value. The motivation, research issues and the proposed method are overall clearly presented. + +The current recommendation is Weak Reject because the experimental study is not convincing or comprehensive enough. + +1. Although the goal of this work is to deal with the inefficiency issue of the objective function of existing DPN with the newly proposed one, this experimental study does not seem to conduct sufficient experiments to demonstrate the advantages (say, in terms of training efficiency & the capability in making the network scalable for more challenging dataset) of the proposed objective function over the existing one; +2. Table 1 compares the proposed method with ODIN. However, as indicated in this work, ODIN is trained with in-domain examples only. Is this comparison fair? Actually, ODIN's setting seems to be more practical and more challenging than the setting used by the propose methods. +3. 
The evaluation criteria shall be better explained at the beginning of the experiment, especially how they can be collectively used to verify that the proposed method can better distinguish distributional uncertainty from other uncertainty types. +4. In addition, the experimental study can be clearer on the training and test splits. How many samples from CIFAR-10 and CIFAR-100 are used for training and test purpose, respectively? Also, since training examples are from CIFAR-10 and CIFAR-100 and the test examples are also from these two datasets, does this contradict with the motivation of “distributional mismatch between training and test examples” mentioned in the abstract? +5. The experimental study can have more comparison on challenging datasets with more classes since it is indicated that DPN has difficulty in dealing with a large number of classes. + +Minor: + +1. Please define the \hat\theta in Eq.(5). Also, is the dirac delta estimation a good enough approximation here? +2. The \lambda_{out} < \lambda_{in} in Eq.(11) needs to be better explained. In particular, are the first terms in Eq.(10) and Eq.(11) comparable in terms of magnitude? Otherwise, \lambda_{out} < \lambda_{in} may not make sense. +3. The novelty and significance of fine-tuning the proposed model with noisy OOD training images can be better justified.",3,,ICLR2020 +B1x6uRTsYB,3,Skltqh4KvB,Skltqh4KvB,Official Blind Review #1,"This work investigates the collection of methods that have been proposed to find units in neural networks that are selective for certain object classes. Previous works have used different measures of selectivity (with sometimes contradictory results), and the authors investigate the degree to which these units qualify as “object detectors”. + +This research area is important for understanding deep networks because claims have been made as to the relative importance (or lack thereof) of these individual units as identified by different measures vis-a-vis distributed representations -- the identification of such units would be interesting for understanding the predictions of classification networks. + +The authors find that (1) different proposed measures of selectivity are not consistent and (2) units identified as selective cannot be considered object detectors due to the high false alarm / low hit rates, analyzing a large number of selectivity measures. I would have liked to see experiments on more recent architectures (the focus of the paper is on a dated architecture (AlexNet)); there is analysis on units in GoogLeNet and VGG-16 but it would also be interesting to see results for more modern architectures (e.g. DenseNet and ResNet). + +Overall, I think that the authors have presented a strong meta-analysis and compelling argument for further study in rigorously identifying the presence (or lack thereof) of selective units in neural networks and the degree to which they may be considered ""object detectors.""",8,,ICLR2020 +r1lkMDOaKB,2,BJedt6VKPS,BJedt6VKPS,Official Blind Review #2,"The authors propose a new initialization scheme for training neural networks. The initialization considers fan-in and fan-out, to regularize the range of singular values of the Hessian matrix, under several assumptions. + +The proposed approach gives important insights for the problem of weight initialization in neural networks. Overall, the method makes sense. However, I have several concerns: + +- The authors do not consider more recent neural network designs such as normalization layers, skip connections, etc. 
It would be great if the authors could discuss how these layers would change the derivation of the initialization method. Also, preliminary experimental results using these layers are needed. Additionally, to me, normalization layers [Huang et al. Decorrelated Batch Normalization. CVPR 2018] implicitly precondition the Hessian matrix as well. It would be great if the authors also compare their approach to [Huang et al. 2018]. + +- The authors compared to other initialization schemes such as [He et al., 2015] and [ Glorot and Bengio 2010]. But as the authors mentioned, there are approaches that scales backpropagation gradients also [Martens and Grosse, 2015; Grosse and Martens, 2016; Ba et al., 2017; George et al., 2018]. Since these methods are highly related to the proposed method, it would be great if the authors could show time complexities and performance differences of these methods as well. + +- Experiments on the CIFAR-10 dataset with AlexNet seem not exciting: the proposed Preconditioned approach only outperforms the Fan-out approach marginally. I would say that training a [He et al. 2015]-initialized neural network for 500 more iterations than a preconditioned neural network, yields a similar or better loss. + +Overall I think the work is very important and interesting. However, it lacks comprehensive comparison and consideration of more recent neural network layers. + +Post Rebuttal Comments +I have read all reviewer comments and the author feedback. I appreciate that the authors addressed the skip connections in Appendix. + +1. The authors agree that batch norm requires different initialization schemes that are not included in this paper. +2. I agree with the authors that their approach is complementary to the baseline optimization methods; and both approaches can be applied together. However, I still believe that it is informative to compare the two approaches because: (a). Both approaches address the same problem. Since the optimization based approach adds complexity and computational overhead to implementation, it would be great to show if using the proposed approach eliminates the need for the optimization based approach. (b). Is it necessary to use both approaches, or one of them is good enough? +3. I understand that strong experimental evidence is not always required. However, I believe that the new technical insights of the paper alone is not significant enough (part of the reasons in point 1). Thus I was expecting stronger experimental evidences. + +Overall I agree with reviewer 1 that the topic is interesting, but in the paper’s current form, it is not ready. I keep my initial rating of weak reject.",3,,ICLR2020 +_lRJNs7Ocon,4,KpfasTaLUpq,KpfasTaLUpq,Deep encoders and shallow decoders for NMT,"Summary: +The paper proposes deep encoder and shallow decoder models for auto-regressive NMT. They compare rigorously to NAR models. They also study three factors: layer allocation, speed measurement and knowledge distillation. They include that with a 12E-D1 model they obtain significant speed-up and can outperform the standard 6-6 AR model and almost always beat the NAR model in terms of quality. They also show that NAR models need deep decoders because they need to handle reordering. + +Reasons for score: +I scored this paper a 9. I think this is an important paper which establishes very strong AR baselines for future NAR work in the field. They correctly point out the three issues with the comparisons that many NAR papers make. 
They conduct various meaningful ablation studies and validate their various hypothesis properly. They also show that certain factors like knowledge distillation need to be applied to both AR and NAR systems. Finally, they advocate for reporting both S_1 and S_max when comparing speed gains. + +Cons: +- One issue I had with the presentation of the results was the selection of different formats and language pairs for different experiments. For example, table 2, 3 and 4 report on different subsets of language-pairs. Same with the figures. This might raise questions of whether the authors are randomly subselecting or selecting favorable subsets. I would have liked to see all experiments done on all LPs. + + +Minor comments: +- Section 2.1: S_max - ""This is closer to practical scenarios where one wants to translate a large amount of text."" - this is a very subjective statement and I would tone this down. +- Section 2.2.2: ""Denote respectively by E and D the numbers of encoder and decoder layers."" -- please fix grammar + +Missing citations: +- Section 1: Along with Sutskever, Bahdanau and Vaswani. please also cite https://www.aclweb.org/anthology/D13-1176.pdf and Wu et al. 2016 (https://arxiv.org/abs/1609.08144) when you mention state-of-the-art NMT. +",9,5.0,ICLR2021 +S_uHz2VaL8N,5,ijVgDcvLmZ,ijVgDcvLmZ,Blind review,"This paper describes a new method for learning factored value functions in cooperative multi-agent reinforcement learning. The approach uses energy-based policies to generate this factorization. The method is presented and experiments are given for smaller domains as well as starcraft. + +The idea of learning factored value functions is promising for learning separate value functions for each agent that allow them to learn in a centralized manner and execute in a decentralized manner (centralized training and decentralized execution). Several methods have been proposed along these lines, but as the paper points out, they have limitations that makes them perform poorly in some problems. + +The proposed approach in this paper has some promising experimental results, but there are questions about the novelty and significance of the method. Furthermore, evaluating these contributions is difficult due to the lack of clear details in the paper. + +In particular, the details of the approach itself in 3 are not clear. Starting with Definition 1, it seems like IGO is using an optimal *centralized* policy. Is this what is meant? If so, why is this needed (as opposed to an optimal decentralized policy). It will typically be impossible to achieve a centralized policy with decentralized information. Furthermore, the energy-based policies are defined in 3.2, but 'key' ideas such as approximating the weight vector aren't fully explained making the exact approach hard to determine. Also, it is beneficial that the current theorems and proofs are included, but the lack of sufficient detail makes it hard to parse and evaluate them. + +There are also similar max entropy approaches, such as the paper below. + +Iqbal, S. & Sha, F.. (2019). Actor-Attention-Critic for Multi-Agent Reinforcement Learning. Proceedings of the 36th International Conference on Machine Learning, in PMLR 97:2961-2970 + +As well as other factorized methods, such as the papers below (which are admittedly new). + +Weighted QMIX: Expanding Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. Tabish Rashid, Gregory Farquhar, Bei Peng, Shimon Whiteson. NeurIPS 2020. 
+ +de Witt, Christian Schroeder, et al. ""Deep Multi-Agent Reinforcement Learning for Decentralized Continuous Cooperative Control."" arXiv preprint arXiv:2003.06709 (2020). + +The paper should discuss how the proposed method is an improvement over this other work and have a more comprehensive related work section. + +The experiments are promising, but the relevant related work is not included and there isn't sufficient detail describing how the methods were run and discussing the results. In terms of comparisons, the paper also needs to compare with non-factored state-of-the-art methods. It is, of course, natural to compare with other factored methods, but what matters is general state-of-the-art performance on the domains. + +As noted, the clarity and writing of the paper should be improved. Beyond the examples above, some other instances are below. + +- If the reader doesn't already understand the relative overgeneralization problem, Section 2.3 probably isn't sufficient. Figure 1 is helpful, but it should be described in the text to make the issue clear. + +- The connection between the overgeneralization problem and factored representations isn't completely clear. Factored representations have problems because they typically cannot represent the optimal value function (or policy). That is a separate issue from getting stuck in a local optimum (which can happen with any type of method).",3,5.0,ICLR2021 +Byeq6TEfn7,1,H1eSS3CcKX,H1eSS3CcKX,"Interesting theoretical results, but connection to the experimental results is not clear","In many machine learning applications, sorting is an important step in tasks such as ranking. However, the sorting operator is not differentiable with respect to its inputs. The main idea of the paper is to introduce a continuous relaxation of the sorting operator in order to construct an end-to-end gradient-based optimization. This relaxation is introduced as \hat{P}_{sort(s)} (see Equation 4). The paper also introduces a stochastic extension of its method +using Plackett-Luce distributions and Monte Carlo. Finally, the introduced deterministic and stochastic methods are evaluated experimentally in 3 different applications: 1. sorting handwritten numbers, 2. quantile regression, and 3. end-to-end differentiable k-nearest neighbors. + +The introduction of the differentiable approximation of the sorting operator is interesting and seems novel. However, the paper is not well written and it is hard to follow, especially from Section 4 onward. It is not clear how the theoretical results in Sections 3 and 4 are used for the experiments in Section 6. For instance: +** In page 4, what is ""s"" in the machine learning application? +** In page 4, in Equation 6, what are theta, s, L and f exactly in the machine learning applications? + +Remark: +** The phrase ""Sorting Networks"" in the title of the paper is confusing. This term typically refers to a network of comparators applied to a set of N wires (see e.g. [1]). +** Page 2 -- Section 2 PRELIMINARIES -- It seems that sort(s) must be [1,4,2,3]. + +[1] Ajtai M, Komlós J, Szemerédi E. An O(n log n) sorting network. In Proceedings of the Fifteenth Annual ACM Symposium on Theory of Computing, 1983 (pp. 1-9). ACM. +",6,3.0,ICLR2019 +n9yGBO3v817,3,NPab8GcO5Pw,NPab8GcO5Pw,Lack novelty and significance,"This paper studies the loss landscapes of sparse linear networks.
It proves that under squared loss, (1) spurious local minimum does not exist when the output dimension is one, or with separated first layer and orthogonal training data; and (2) for two-layer sparse linear networks, the good property in (1) does not exist anymore when the conditions are violated. The authors also report experimental results to show that two-layer sparse linear networks with two hidden neurons have spurious local minima. + +Overall, I vote for rejection. The proofs are detailed and seem correct. However, I was worried that (1) the proofs make incremental contribution compared with two existing works; (2) assuming activations are linear is too strict; (3) the insight given by the theorems is not clear; and (4) the applicable domain is not clear. + +Pros: + ++ The authors prove that (1) sparse linear networks do not have any spurious local minima under some assumptions; and (2) for two-layer sparse linear networks, the previous properties do not stand anymore when the assumptions no longer hold. The authors have given detailed proofs for their arguments. + +Cons: I was concerned about the technical novelty and significance. They are not justified in this paper. + +- From technical aspects, the proves are simple extensions of existing works on linear neural networks; such as Kawaguchi (2016) and Lu & Kawaguchi (2017). The authors do not state clearly what new proof techniques are used in this paper. + +- This paper assumes all activations are linear and then proves in many cases spurious local minima do not exist. However, most neural networks in practice _have_ nonlinear activations. Some papers have suggested the nonlinearity in activations significantly change the loss landscape, such as [1-3] amongst others. I understand that it would be good to study a simpler model when a comprehensive model is intractable. But the simpler model needs to share similar properties with the real-world case. The three aforementioned papers concern me. + +- The insights of the given theories are not clear. Did the authors hope to suggest pruning would bring something new into linear networks? It would be good if the authors can clearly discuss this and give some potential practical implications. + +- The potential application would be limited. Understanding the loss landscapes of sparse linear networks is important if we need to train them from scratch. However, most pruning methods are applied when neural networks have been trained. Some works even do not need fine-tuning after pruning. This undermines the importance of this paper. + +Questions: It would be good if the authors can comment on the cons. + +[1] Yun, C., Sra, S., and Jadbabaie, A., “Small nonlinearities in activation functions create bad local minima in neural networks,” ICLR 2019. +[2] He, F., Wang, B., and Tao, D., “Piecewise linear activations substantially shape the loss surfaces of neural networks,” ICLR 2020. +[3] Goldblum, M., Geiping, J., Schwarzschild, A., Moeller, M., and Goldstein, T., “Truth or backpropaganda? An empirical investigation of deep learning theory,” ICLR 2020.",4,4.0,ICLR2021 +aEgFjI22Qh,4,EoVmlONgI9e,EoVmlONgI9e,Solve Dec-POMDP problem based on MAAC and QMIX. Interesting and timely work~,"This article solves the Dec-POMDP problem, based on MAAC and QMIX algorithms. When the agents are homogeneous and optimized only through team rewards, it is easy to learn similar policies for each agent. This makes the multi-agent algorithm finally converge to a locally optimal joint policy. 
If each agent can complete different parts of the overall goal, the joint policy they converge to is obviously better. Many current works model the above problems as task assignment or role assignment problems. However, the agents with different roles in these works are basically rule-based, and the tasks are all manually defined. There are also some works that generate different strategies and roles in an unsupervised way by introducing diversity constraints, but the generation process has nothing to do with the task. In order to address the shortcomings of the above methods, this paper proposes a MARL method based on reward shaping to encourage the division of labor between agents, and at the same time introduces two regularization terms to solve the problem of overly similar agent policies at the initial training stage. At the same time, reward shaping and reinforcement learning are optimized simultaneously, forming a bi-level optimization problem. The paper also designs three tasks that emphasize division of labor to verify the effectiveness of the algorithm. +1. Neither the MAAC nor the QMIX algorithm on which the paper is based has good scalability. Although the independent learning algorithm is simple, it can achieve better performance on many tasks and has good scalability. This paper should additionally use independent learning algorithms as baselines, and apply the intrinsic rewards proposed in this paper to independent learning algorithms. +2. The three tasks used to verify the algorithm in this paper are all specially designed, with a strong emphasis on division of labor. I think the paper should additionally explain the limitations of the algorithm, such as in which scenarios it will be more effective and which scenarios will limit the learning ability of the agent. +3. The optimization process of bi-level problems is very unstable. The algorithm proposed in this paper contains many hyperparameters, and the sensitivity of the algorithm to hyperparameters should be shown in the experimental part.",6,5.0,ICLR2021 +SkgQVvQ0tS,1,rJl5rRVFvH,rJl5rRVFvH,Official Blind Review #1,"The paper introduces an off-policy batch deep reinforcement learning method for learning human preferences in dialog generation. The key technical contribution is controlling the KL divergence between the learned policy and a prior policy that was learned on other dialogs. The method is able to constrain the policy to not deviate too much from the prior to avoid extrapolation error. Another technical contribution is the use of dropout uncertainty estimates to estimate the Q values, which is more scalable compared to double Q-learning. The authors also incorporate intermediate implicit rewards in dialogs to encourage more positive conversations. +In terms of methodology, the use of dropout for uncertainty estimates seems to be used for the first time in deep RL, to my knowledge. The experiments seem to be properly conducted, with particular focus on inferring what rewards encourage positive conversations. + +However, I am not fully convinced by the motivation of using KL control. If there is too much weight on the rewards for learning human preferences compared to generating reasonable sentences, one could imagine other compared methods such as batch Q-learning could fail, since they would learn to encourage positive conversations and diverge away from realistic language. Using KL control to keep the generated dialog close to the prior is a way to prevent it from diverging.
But I would think that with a proper tuning of the different type of rewards could also prevent the policy from diverging to generating non-realistic dialogs especially when there is a large amount of dialogs available. If possible I would hope to see tuning on the weights of different types of rewards used in the problem setup. Also I am curious if you use the same initialization (using the same pretrained language model) for other baselines? I did not seem to find it the paper. +Also the proposed method using KL control shares similarity to the one in Distral [1] where a distilled policy is used as prior policy and discounted KL divergence is used as regularizer, which limits it novelty in this perspective. + +Overall, the paper is fairly well-written. The paper develops a method for learning human preferences in dialogs in off-policy reinforcement learning and use KL control to avoid extrapolation error issues. The technical novelty of the method is limited as there is a similar method proposed in [1]. The experiments are illustrative but more comparisons with the other methods will be appreciated to make it more convincing. + +[1] Teh, Yee, Victor Bapst, Wojciech M. Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. ""Distral: Robust multitask reinforcement learning."" In Advances in Neural Information Processing Systems, pp. 4496-4506. 2017.",3,,ICLR2020 +0ah7qI9IRNz,4,Rhsu5qD36cL,Rhsu5qD36cL,Very well written paper proposing an algorithm minimizing the divergence between estimated and true Log-Likelihood Ratios of SPRT and making it thereby Bayes optimal for various real-world applications. ,"The paper proposes a novel SPRT-TANDEM algorithm minimizing the divergence between estimated and true Log-Likelihood Ratios of SPRT and making it thereby Bayes optimal for various real-world applications. The paper is very well written, clear and scientifically sound and provides extensive contributions, e.g. a database in addition to the algorithm. Performance of the algorithm is demonstrated via three experiments. + +Previous research is given sufficient credit. The only thing I would still like to see more is the discussion at the conclusions. Why does this seemingly simple modification to the existing SPRT method provide so superior performance. + +The appendices are referred a lot in the text but they are missing from the paper? + +A very minor comment: The following sentence is a bit vaguely written: +Long short-term memory (LSTM)-s/LSTM-m impose monotonicity on classification ... +I guess it should be : Long short-term memory (LSTM) variants LSTM-S and LSTM-M impose monotonicity on classification ... +",9,4.0,ICLR2021 +SkPLQan7l,1,H1hoFU9xe,H1hoFU9xe,Official review for 'Generative Adversarial Networks for Image Steganography',"I reviewed the manuscript as of December 6th. + +Summary: +The authors build upon generative adversarial networks for the purpose of steganalysis -- i.e. detecting hidden messages in a payload. The authors describe a new model architecture in which a new element, a 'steganalyser' is added a training objective to the GAN model. + +Major Comments: +The authors introduce an interesting new direction for applying generative networks. That said, I think the premise of the paper could stand some additional exposition. How exactly would a SGAN method be employed? This is not clear from the paper. Why does the model require a generative model? Steganalysis by itself seems like a classification problem (i.e. 
a binary decision about whether there is a hidden message?) Would you envision that a user has a message to send and does not care about the image (container) that it is being sent with? Or does the user have an image and the network generates a synthetic version of the image as a container and then hides the message in the container? Or is the SGAN somehow trained as a method for detecting hidden codes produced by any algorithm in an image? Explicitly describing the use-case would help with interpreting the results in the paper. + +Additionally, the experiments and analysis in this paper are quite light, as the authors only report a few steganalysis performance numbers in the tables (Tables 1, 2, 3). A more extensive analysis seems warranted to explore the parameter space and provide a quantitative comparison with other methods discussed (e.g. HUGO, WOW, LSB, etc.). When is it appropriate to use this method over the others? Why does the seed affect the quality of results? Does a fixed seed correspond to a realistic scenario for employing this method? + +Minor comments: +- Is Figure 1 necessary? +- Why does the seed value affect the quality of the predictive performance of the model?",4,3.0,ICLR2017 +rJgVurunjX,1,r1xRW3A9YX,r1xRW3A9YX,Lack of clarity and limited experimental success,"This paper presents a generalization of TransE to Riemannian manifolds. While this work falls into the class of interesting recent approaches for using non-Euclidean spaces for knowledge graph embeddings, I found it very hard to digest (e.g. the first paragraph in Section 3.3).
benefits of normalization to generalization (potentially for multilayer networks), etc. See comments below. + + +Comments: +1. Extending to multilayer? +2. Show normalization provably reduces the level of over-parameterization needed for convergence. This requires a lower bound for the width that NN cannot converge if the width is below the threshold. +3. What can we say about generalization: normalization vs no normalization? e.g. [5] + + + + + +[1] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in +neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018. +[2] Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes overparameterized neural networks. +[3]Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization +and generalization for overparameterized two-layer neural networks. In International Conference on Machine +Learning, pages 322–332, 2019. +[4]Samet Oymak, Mahdi Soltanolkotabi. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks +[5] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization +and generalization for overparameterized two-layer neural networks. In International Conference on Machine +Learning, pages 322–332, 2019. +",3,,ICLR2020 +r1gFw9KcFH,1,SygkSkSFDB,SygkSkSFDB,Official Blind Review #2,"The paper studies the problem of the number of first-order-oracle calls for the SGD type of algorithms to find a stationary point of the objective function. The main results in the paper are built upon a new, general framework to analyze the SGD type of algorithms. + + +The main framework can be summarized as follows: At each iteration, the algorithm receives h_t, a (potentially biased) estimator of the gradient at a given point x_t, and performs a simple update x_{t + 1} = x_t - \eta * h_t. The framework says that as long as the norm (V_t) of \Delta_t = h_t - v_t (where v_t is an unbiased estimator of the true gradient with bounded variance) satisfies a particular Lyapunov-type inequality, then the algorithm can find an epsilon-stationary point as long as epsilon is not too small. + + +The analysis of the framework is quite standard, one only needs to write the decrement in function value at each iteration into the following three terms: the norm of the true gradient of the function, \delta_t: the difference between v_t and the true gradient (so E[\delta_t] = 0) and \Delta_t: the difference between the received gradient h_t and v_t. + + +The authors showed some application of this framework in Stacked SGD and decentralized SGD. The main intuitions of these applications are (1). \Delta_t comes from the synchronization difference of the nodes when computing the gradient. (2). The shrinking of V_t is due to the (better) synchronization at each iteration. (3). The increment of V_t is due to the gradient update. + +Overall, I find the general framework quite interesting and potentially useful for future research and could be used as a guide for choosing the proper algorithm in distributed computation. The bounds in this paper are also in principle tight. The only question I have about this result is the dependency of m (the number of iterations between each evaluation of the gradient norm of the underlying function). (1). 
How can this (the evaluation of the gradient norm of the underlying function)) be done in a decentralized environment? What is the computation overhead? (For example in DSGD, how can we compute \bar{x}_t?) (2). It seems that the computation cost (number of IFO) scales quadratically with respect to m. What is the intuition for this scaling? It appears to me that the scaling should be linear or better (the worst case is that within the ""m"" iterations, only one iteration has gradient >= epsilon). The authors should elaborate more on this point. + + +",6,,ICLR2020 +ryefwM-CtS,3,SyxJU64twr,SyxJU64twr,Official Blind Review #2," +This paper proposes an auxiliary reward for model-based reinforcement learning. The proposed method uses ensembles to build a better model of environment dynamics and suggests some rules to optimize the new ensemble-based dynamics and to estimate the intrinsic reward. + +I am torn on this paper. I like the derivation of the method and the ideas behind it. I think it is an interesting direction of research. However, the experiments are limited to one domain and the paper needs proofreading. I will vote ""weak accept"" for this paper, as I think it is incremental and the experiments are too limited. + +As I said above, the paper could use some proofreading. Some sections (like pages 1-2) are well written, but others are full of grammatical mistakes. There is also a lot of redundant information. + +It is often stated that the ensemble model has better capacity than the single model. Some experimental proof of this better modelling capacity could help convince a reader that the ensemble is indeed beneficial and warranted (e.g., show that P_K is better than the single model P). + +Some evaluation or discussion on the computational costs of the method would be beneficial. I assume the ensemble-based method is more computationally intensive. Would it perform better than single surprise if they were compared according to wall clock time? + + +Examples of minor issues: + +- Page 3, after equation 7, sentence beginning with ""In addition to the advantage that the mixture model (4) has the increased model order for modelling"" is confusing, contains redundant elements, and is not bringing useful information. It should be revised +- page 4, in paragraph after eq. 11: the following sentence is grammatically incorrect, please revise: ""Propositions 1 and 2 suggest a way to intrinsic reward design under the mixture model for tight gap between η(π ̃∗) and η(π∗)"" +- in the same sentence, revise: ""be close to the true η(π∗) of the true optimal policy π∗, as desired."", +- page 6 text above Figure 1: ""single-modal"" should be unimodal + + +POST REBUTTAL + +Thanks for writing your rebuttal. I have read it, as well as the other reviews. I think reviewer 1 touches on some important points, especially regarding the engineered sparse rewards. It seems the method is not properly justified given the environments used for its evaluation. Based on this, and the fact that the method is rather incremental, I would like to change my score to a weak reject. The method should be evaluated in a setting­ with truly sparse rewards.",3,,ICLR2020 +rkg1ktuu3Q,1,H1ecDoR5Y7,H1ecDoR5Y7,"Implications of local stability of dynamical system to ""real world"" setting not clear","In the paper, WGAN with a squared zero centered gradient penalty term w.r.t. to a general measure is studied. 
Under strong assumptions, local stability of a time-continuous gradient ascent/descent dynamical system near an equilibrium point are proven for the new GP term. Experiments show comparable results to the original WGAN-GP formulation w.r.t. FID and inception score. + +Overall, I vote for rejecting the paper due to the following reasons: +- The proven convergence theorem is for a time-continuous ""full-batch"" dynamical system, which is very far from what happens in practice (stochastic + time discrete optimization with momentum etc). I don't believe that one can make any conclusions about what is actually happening for GANs from such an idealized setting. Overall, I don't understand why I should care about local stability of that dynamical system. +- Given the previous point I feel the authors draw too strong conclusions from their results. I don't think Theorem 1 gives too many insights about the success of gradient penalty terms. +- There are only marginal improvements in practice over WGAN-GP when using other penalty measures. + +Further remarks: +- In the introduction it is claimed that mode collapse is due to JS divergence and ""low-dimensionality of the data manifold"". This is just a conjecture and the statement should be made more weak. + +- The preliminaries on measure theory are unnecessarily complicated (e.g. partly developed in general metric spaces). I suggest that the authors try to simplify the presentation for the considered case of R^n and avoid unnecessarily complicated (""mathy"") definitions as they distract from the actual results. + +==after rebuttal== +After reading the authors rebuttal I increased the my rating to 6 as they addressed some of my doubts. I still think that the studied setting is too idealized, but it is a first step towards an analysis.",6,4.0,ICLR2019 +S1IwT5bNl,1,SJQNqLFgl,SJQNqLFgl,interesting direction - but not really solid,"The authors take on the task of figuring out a set of design patterns for current deep architectures - namely themes that are recurring in the literature. If one may say so, a distributed representation of deep architectures. + +There are two aspects of the paper that I particularly valued: firstly, the excellent review of recent works, which made me realize how many things I have been missing myself. Secondly, the ""community service"" aspect of helping someone who starts figure out the ""coordinate system"" for deep architectures - this could potentially be more important than introducing yet-another trick of the trade, as most other submissions may do. + +However I think this work is still half-done, and even though working on this project is a great idea, the authors do not yet do it properly. + +Firstly, I am not too sure how the choice of these 14 patterns was made. Maxout for instance (pattern 14) is one of the many nonlinearities (PreLU, ReLU, ...) and I do not see how it stands on the same grounds as something as general as ""3 Strive for simplicity"". + +Similarly some of the patterns are as vague as ""Increase symmetry"" and are backed up by statements such as ""we noted a special degree of elegance in the FractalNet"". I do not see how this leads to a design pattern that can be applied to a new architecture - or if it applies to anything other than the FractalNet. + +Some other patterns are phrased with weird names ""7 Cover the problem space"" - which I guess stands for dataset augmentation; or ""6 over-train"" which is not backed up by a single reference. 
Unless the authors relate it to regularization (text preceding ""overtrain""), which then has no connection to the description of ""over-train"" provided by the authors (""training a network on a harder problem to improve generalization""). If ""harder problem"" means one where one adds an additional term (i.e. the regularizer), the authors are doing harm to the unexperienced reader, confusing ""regularization"" with something that sounds like ""overfitting"" (i.e. the exact opposite). + +Furthermore, the extensions proposed in Section 4 seem a bit off tune - in particular I could not figure out +-how the Taylor Series networks stem from any of the design patterns proposed in the rest of the paper. +-whether the text between 4.1 and 4.1.1 is another of the architecture innovations (and if yes, why it is not in the 4.1.2, or 4.1.0) +-and, most importantly, how these design patterns would be deployed in practice to think of a new network. + +To be more concrete, the authors mention that they propose the ""freeze-drop-path"" variant from ""symmetry considerations"" to ""drop-path"". +Is this an application of the ""increase symmetry"" pattern? How would ""freeze-drop-path"" be more symmetric that ""drop-path""? + Can this be expressed concretely, or is it some intuitive guess? If the second, it is not really part of applying a pattern, in my understanding. If the first, this is missing. + + +What I would have appreciated more (and would like to see in a revised version) would have been a table of ""design patterns"" on one axis, ""Deep network"" on another, and a breakdown of which network applies which design pattern. + +A big part of the previous work is also covered in cryptic language - some minimal explanation of what is taking place in the alternative works would be useful. + + + +",4,3.0,ICLR2017 +BJ9i5HFgSqx,3,_Tf6jEzbH9,_Tf6jEzbH9,Interesting paper,"The paper defines the task of context-agnostic learning and proposes an algorithm to solve the problem while assuming the ability to sample objects and contexts independently. They propose to decompose factors contributing to the risk into two, context bias and object error. Based on this interpretation, an algorithm is designed to 'greedily correct bias' while employing adversarial training (or robustness training) for 'local refinement'. The method achieves high accuracy on two synthetic visual tasks, digits and traffic sign classification, when a model is trained using one sample per class from the source domain and tested on an unseen target domain. + ++) Theorem 3.1 provides a new view on risk. Risk is decomposed into two factors, context bias and object error. I think this gives new insight to consider the effect of context bias and object modeling on risk separately. + ++) The experimental results are impressive because the model is trained using a very limited number of samples (one sample per class from the source domain) but showed high generalization performance on an unseen domain. The proposed method achieves promising performance on two synthetic classification tasks compared to other existing methods for few-shot learning and domain adaptation, which requires more labeled or unlabeled data during training. + +-) Their underlying assumption for greedy bias correction is that a classifier learns a strong bias on recent training inputs when taken as contexts. However, if stochastic gradient descent is used for optimization, I think it is unlikely because the model changes continuously. 
Therefore, it is uncertain how effective this greedy selection strategy can sample contexts with large bias. + +-) Relating to the above point, I also have a concern about the experimental validation. In all the experiments, gamma is defined as a function that takes object and context images and outputs their overlap. It is not guaranteed that the proposed heuristic sampling strategy generalizes to other gamma functions. + +-) Also, all the experiments are performed for a relatively small number of classes (up to 50), and synthetic images are small iconic images with objects in the center. Although the method shows promising results under this specific setting, it is hard to conclude that the proposed heuristics will generalize other settings, such as when there are more classes, image resolutions are higher, and objects have a larger variation in their appearance. I think evaluation on additional datasets with different characteristics (such as CIFAR-100, Caltech-256, CUB-200) would be necessary. + +-) The assumption that one can sample objects and contexts independently may restrict its application. + +Though I found this paper proposes an interesting view on risk, I would recommend 'reject' due to concerns stated above. + +--- +Thanks for the detailed response. I can agree that the observation function gamma in the form of addition can model many noisy signals. However, the argument in the paper that the proposed method works for an arbitrary gamma still lacks experimental validation. So I would like to keep my recommendation. My other concerns have been addressed. +",5,3.0,ICLR2021 +upSsI5_wrEo,3,g11CZSghXyY,g11CZSghXyY,Official Review (Reviewer1),"After the discussion, my concerns were fixed. The paper explores the interesting relations of Mix Up and Uncertainty, which is useful and will be the right fit for the conference. + +**Summary:** + +The work studies how a better calibration of individual members of an ensemble affects the calibration of an ensemble. It is demonstrated that i) better calibration of individual members of the ensemble may lead to the worse calibration of the ensemble predictions ii) this is the case when mix-up / label smoothing are used during training. + +To fix the issue Confidence Adjusted mixup Ensembles (CAMixup) is proposed. The CAMixup is an adaptive mixup data augmentation based on per-class calibration criteria. The core idea of CAMixup is to use powerful (unconfidence encouraging) mixup data-augmentation on examples of overconfident classes, and do not use mixup for the under-confident classes. The confidence criteria are computed once in an epoch. + +The empirical results are provided in- and out-of-domain for CIFARs(C) and ImageNet(C). + +**The concerns:** + +1) ECE is a biased estimate of true calibration with a different bias for each model, so it is not a valid metric to compare even models trained on the same data [Vaicenavicius2019]. In other words, the measured ECE has no guaranty to have something to do with the real calibration but reflects the bias of the measured metric. Other metrics, that are based on histogram estimates have the same problem. Please put extra attention to this concern. + +What I suggest is the following: + +a. Mesure NLL for in- and out-of-domain data. It seems to be still an adequate (indirect) criterion of calibration, and is an adequate criterion of uncertainty estimation. 
According to [Ashukha2020], the NLL needs a pre-calibrated model with temperature scaling for in-domain data (called calibrated NLL / calibrated LL). + +b. To use the squared kernel calibration error (SKCE) proposed in [Widmann2019] along with de facto standard, but biased ECE. The SKCE is an unbiased estimate of calibration. There might be some pitfalls of this metric that I'm not aware of, but the paper looks solid and convincing. Also, please put attention to Figure 83 in the arХiv version. + +Yes, ECE is the standard in the field, but it is the wrong standard that prevents us from meaningful scientific progress, so we should stop using it. + +2) The standard deviation needs to be reported everywhere. Especially the differences between close values like (97.52, 97.52, 97.47 in Table 2) may appear not statistically significant. The same touches Fig 5 and other figures that are reported. Otherwise, it is impossible to stay how solid results are. + +**Minor comments:** + +1) Maybe it worth to provide plot mean-λi vs epoch to illustrate this ""Notice that λi is dynamically updated at the end of each epoch.""? + +2) Figure 4(a) is done slightly disorderly. + +3) In the paper, ECE is measured in percentages. As far as I can tell ECE is dimensionless quantities. It is not clear what is intended. + +**Final comment:** I put the ""marginally below acceptance threshold"" score, but I'm willing to increase it after the update with corrections (and hope that these corrections will be done). I like the direction and CAMixup, but in-domain results are not very consistent (see Fig 5 (d)), the ECE has uncontrollable model-specific biases that ruin all the presented results. + +[Widmann2019] Widmann D, Lindsten F, Zachariah D. Calibration tests in multi-class classification: A unifying framework. In Advances in Neural Information Processing Systems 2019 (pp. 12257-12267). https://arxiv.org/pdf/1910.11385.pdf + +[Ashukha2020] Ashukha A, Lyzhov A, Molchanov D, Vetrov D. Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. ICLR, 2020.",7,4.0,ICLR2021 +HJlStebC5r,3,ryxUMREYPr,ryxUMREYPr,Official Blind Review #3,"This work addresses the important problem of generation bias and a lack of diversity in generative models, which is often called model collapse. It proposed a new metric to measure the diversity of the generative model's ""worst"" outputs based on the sample clustering patterns. Furthermore, it proposed two blackbox approaches to increasing the model diversity through resampling the latent z. Unlike most existing works that address the model collapse problem, a blackbox approach does not make assumptions about having access to model weights or the artifacts produced during model training, making it more widely applicable than the white-box approaches. +In terms of experiment setup, the authors chooses face generation as the area to investigate and measures the diversity by detecting the generated face identity. With the proposed methods, the authors showed that most STOA methods have a wide gap between the top p faces of the most popular face identities and randomly sampled faces. It further showed that the proposed blackbox approaches increases the proposed diversity metric without sacrificing image quality. + +The proposed diversity measuring metric is lacking both in terms of experimental proofs and intuitive motivations. 
While the black-box calibration of a GAN model may be attractive under specific settings, the authors did not consider the restrictions under those situations and their design may be hard to implement as a result. For those reasons, I propose to REJECT this paper. + +Missing key experiments that will provide more motivation that 1. the new metric reflects human perception of diversity 2. the new metric works better than existing ones: +1. Please provide experiments and/or citation for using the face identity as a proxy for face image diversity. this is important since all your experiments rely on that assumption. +2. Were there experiments that applies your metric to the training datasets like CelebA and FFHQ? In theory your metric should show no gap between N_R_obs and N_R_ref measured on the training dataset since that's the sampled ground truth. + +Missing assumptions about blackbox calibration approaches: +1. If we do not have access to the model parameter, the training data, or the artifacts during training like the discriminator, what are some of the real world situations that fit this description? In those cases, is it too much to assume that we can control the random seed input to G? +2. Is it reasonable to assume some constraints on how much data we can get from the blackbox generator? A website that just exposes the image generation API may not allow you to ping their service 100k times to improve the generation diversity. If you are allowed to do that, it may be reasonable to assume that you can contact the API provider to get access to the rest of the model. + +Minor improvements that did not have a huge impact on the score +1. I found the argument about FID in section 2.1 unconvincing. Are there proofs or citations for the claim that real images don't follow multivariate gaussian distribution after applying FID? Copying is indeed an issue that FID cannot detect, but it may be tangential to model collapse for real world concerns like privacy. +2. The statement ""IS, FID and MODE score takes both visual fidelity and diversity into account."" under ""Evaluation of Mode Collapse"" is contradictory to the description in sec 2.1 that IS in fact does not measure diversity. +3. You may want to consider stating the work as ""a pilot study"" (sec 6.) earlier in the abstract or in the introduction, so that the reader knows what to expect. +",1,,ICLR2020 +SygGzfHpFS,2,rJxotpNYPS,rJxotpNYPS,Official Blind Review #3,"In this work, the authors propose a domain invariant variational autoencoder for domain generalization problem. Specifically, the data is assumed to be constructed from three independent variables, one for the domain, one for the class and one for the residual variations. The method can be used for both unsupervised and semi-supervised cases. Experimental studies on rotated MNIST dataset and a malaria cell images dataset verify the effectiveness of the proposed method. +The paper is well-written and easy to follow. The proposed generative model is simple and technically sound. However, I have the following concerns. +(1) It is not clear how the problem setting, i.e., domain generalization, matters in DIVA. In another words, the proposed DIVA is not specific to domain generalization problem, but can be used for domain adaptation, multiple source transfer learning etc. Actually, I find that the authors compare with DA, which is a conventional domain adaptation method, in the experiment. 
Moreover, the experimental setups in section 4.1.3 is a multi-source transfer setting, and DIVA can be well be applied. In this sense, I am not very convinced on the claim of the contribution that DIVA is proposed for domain generalization. For me, DIVA is a more general method. +(2) With point 1, more related works on VAE for domain adaptation need to be discussed. +(3) The idea of constructing data from disentangled latent variables is not new, see a latest work [ref1]. The main difference of DIVA from [ref1] is the residual variation variable. Two baselines are necessary for comparisons: (1) [ref1], and (2) DIVA without the residual variation variable. Actually, the semantic meaning of z_x is not well discussed. Even in the experimental studies figures 2 and 3, it is hard to tell what Z_x actually represents. +(4) Regarding the 4.1.2, why DA is selected as a baseline? How does DA deal with the multiple domains? Can any other domain adaptation methods, e.g., [ref2], or multiple source transfer methods, e.g. [ref3], be compared? +(5) In the right side of Table 1, the improvements seem to be very marginal considering the variance. This makes the ability of DIVA to use unlabelled data less convincing. +(6) Regarding 4.1.3, it seems that the domain similarity plays an important role in the performance, comparing the results of M_{30} with M_{60}. Without the labelled M_{60}, which is very similar to the target M_{75}, the performance degenerates dramatically. The current DIVA treats all the domains equally, is it possible to have a weighted form of DIVA that distinguishes the contributions of different domains? +(7) What is the task of the malaria cell images experiments? Is it to classify the parasitized and uninfected cells? For a given patient, it makes more sense that all the cells belongs to one category, either infected or healthy. How is the class distribution for a person (in this case a domain)? Is it very unbalanced? +(8) For figure 3, It is hard to judge the cell images parasitized or uninfected without domain knowledge, can you give the label for each image? Again, the semantic meaning of Z-x is hard to tell. I am not convinced by the shape of the cell for Z-x. +(9) For 4.2.2, why and how DA is compared? As far as I know, it is for unsupervised domain adaptation. Moreover, the improvements are quite marginal. +Some minor comments: +(1) Page1, first para, 3rd line, “present” -> “presented”. +(2) Page2, first para, 2nd line, “Y” - > “Y denotes” +(3) Some references lack of page information +The paper should be self-contained. I would suggest the authors move some paragraphs in appendix to the paper, for instance, 5.1.1, 5.2.2, and 5.2.3. +Overall, the paper is presented with extensive empirical evaluations, but less theoretical justification. The significance of the paper is moderate as the key idea of learning disentangled latent variables has been studied, and the paper lacks of evidence to show the pure benefits of introducing Z_x as well as the comparison with the related work [ref1]. +[ref1] Learning Disentangled Semantic Representation for Domain Adaptation +[ref2] Conditional Adversarial Domain Adaptation +[ref3] Multiple Source Domain Adaptation with Adversarial Learning +",3,,ICLR2020 +ZwanKxEK1eK,1,aAY23UgDBv0,aAY23UgDBv0,A deep probabilistic model for time series forecasting is proposed. Detailed description for mathematics is required.,"This paper presented the variational dynamic mixtures as a deep probabilistic model for time series forecasting. 
The research issue, called the taxi trajectory prediction problem, is addressed. Some comments are provided. + +Pros: +A new solution to mixture density network as a kind of generative model with latent states and multinomial observations was proposed. The detailed experiments were addressed. New evaluation metric was introduced. + +Cons: +There are a number of notations and variables which were not clearly defined. This matter made the reading to be easily confused. A clear algorithm or working flow for complicated system was missing. Some descriptions were not clear.",4,3.0,ICLR2021 +SkxFdq2Y9r,3,rJgQkT4twH,rJgQkT4twH,Official Blind Review #4,"The paper presents a case study of training a video classifier using convolutional networks, and on how the learned features related to previous, hand-designed ones. The particular domain considered is of importance for biologists/medical doctors/neuroscientists: zebra fish swim bout classification. + +In order to identify which particular features the neural networks are paying attention to, the paper used Deep Taylor Decomposition, which allowed the authors to identify ""clever-hans""-type phenomena (network attending to meaningless experimental setup differences that actually gave a way the ground truth classes). This allowed for the authors to mask out such features and make the network attend to more meaningful ones. In particular observations like "" looking for salient features in the trunk of the tail while largely disregarding the tip"", are typically absent from most deep learning studies and it's quite interesting. + +Overall the paper is well written, the experiments are well designed; everything seems very rigorous and well executed. It makes for a very good quality practitioner-level case study on video understanding, which may also be useful for people studying zebra fish or related simple life forms. My main concern with the paper is whether ICLR is an appropriate venue, as it does not provide pure machine learning contributions in the form of new techniques of generally applicable insights. ",6,,ICLR2020 +ByaL9g21M,1,SkxqZngC-,SkxqZngC-,"well rounded work with good experimention and nice ideas, but some questions","""topic modeling of text documents one of most important tasks"" +Does this claim have any backing? + +""inference of HDP is more complicated and not easy to be applied to new models"" Really an artifact of the misguided nature of earlier work. The posterior for the $\vec\pi$ of a elements of DP or HDP can be made a Dirichlet, made finite by keeping a ""remainder"" term and appropriate augmentation. Hughes, Kim and Sudderth (2015) have avoided stick-breaking and CRPs altogether, as have others in earlier work. Extensive models building on simple HDP doing all sorts of things have been developed. + +Variational stick-breaking methods never seemed to have worked well. I suspect you could achieve better results by replacing them as well, but you would have to replace the tree of betas and extend your Kumaraswamy distribution, so it may not work. Anyway, perhaps an avenue for future work. + +""infinite topic models"" I've always taken the view that the use of the word ""infinite"" in machine learning is a kind of NIPSian machismo. 
In HDP-LDA at least, the major benefit in model performance comes from fitting what you call $\vec\pi$, which is uniform in vanilla LDA, and note that the number of topics ""found"" by a HDP-LDA sampler can be made to vary quite widely by varying what you call $\alpha$, so any statement about the ""right"" number of topics is questionable. So the claim in 3rd paragraph of Section 2, ""superior"" and ""self-determined topic number"" I'd say are misguided. Plenty of experimental work to support this. + +In Related Work, you seem to only mention HDP for non-parametric topic models. More work exists, for instance using Pitman-Yor distributions for modelling words and using Gibbs samplers that are efficient and don't rely on the memory hungry HCRP. + +Good to see a prior is placed on the concentration parameter. Very important and not well done in the community, usually. +ADDED: Originally done by Teh et al for HDP-LDA, and subsequently done +by several, including Kim et al 2016. Others stress the importance of this. You need to +cite at least Teh et al. in 5.4 to show this isn't new and the importance is well known. + +The Prod version is a very nice idea. Great results. This looks original, but I'm not expert enough in the huge masses of new deep neural network research popping up. + +You've upped the standard a bit by doing good experimental work. Oftentimes this is done poorly and one is left wondering. A lot of effort went into this. +ADDED: usually like to see more data sets experimented with + +What code is used for HDP-LDA? Teh's original Matlab HCRP sampler does pretty well because at least he samples hyperparameters and can scale to 100k documents (yes, I tried). The comparison with LDA makes me suspicious. For instance, on 20News, a good non-parametric LDA will find well over 400 topics and roundly beat LDA on just 50 or 200. If reporting LDA, or HDP-LDA, it should be standard to do hyperparameter fitting and you need to mention what you did as this makes a big difference. +ADDED: 20News results still poor for HPD, but its probably the implementation used ... their + online variational algorithm only has advantages for large data sets + +Pros: +* interesting new prod model with good results +* alternative ""deep"" approach to a HDL-LDA model +* good(-ish) experimental work +Cons: +* could do with a competitive non-parametric LDA implementation + +ADDED: good review responses generally +",7,4.0,ICLR2018 +Sy-QZtjgz,3,HkbmWqxCZ,HkbmWqxCZ,THE MUTUAL AUTOENCODER: CONTROLLING INFORMATION IN LATENT CODE REPRESENTATIONS,"The authors propose a variational autoencoder constrained in such a way that the mutual information between the observed variables and their latent representation is constant and user specified. To do so, they leverage the penalty function method as a relaxation of the original problem, and a variational bound (infomax) to approximate the mutual information term in their objective. + +I really enjoyed reading the paper, the proposed approach is well motivated and clearly described. However, the experiments section is very weak. Although I like the illustrative toy problem, in that it clearly highlights how the method works, the experiment on real data is not very convincing. Further, the authors do not consider a more rigorous benchmark including additional datasets and state-of-the-art modelling approaches for text. 
+ +- {\cal Z} in (1) not defined, same for \Theta.",4,4.0,ICLR2018 +JsVoR1DlVCq,4,8CCwiOHx_17,8CCwiOHx_17,"Interesting application, seemingly solid work","This paper presents a technique for adversarial generation of environments for the interesting problem of web navigation, and provides an environment that enables learning to design complex websites out of a set of compositional primitives. Then, it also proposes a method to adversarially generate a curriculum of increasingly complicated websites, and uses it to train agents which can navigate more challenging, high-dimensional websites. + +Strengths: +1. An interesting novel problem domain; which is going to be very useful in a number of human computer interaction applications. +2. The web navigation environment is interesting - it will hopefully spur more research for this problem +3. The training of agents with an autoregressive adversary policy is interesting. + +Weaknesses: +1. The discussion on related work on this application seems sparse; hence, for me, it was hard to judge the novelty of this work. +2. More discussion of the environment - some examples, what makes it hard, or easy would help the reader understand the key challenges. +3. More discussion on past work in interactive learning with autoregressive adversarial policies would be helpful. It will help the reader understand why this is a different interactive task and what makes it more interesting or challenging. +4. The experimental section is too sparse. Some more ablation studies on different parts of the model - e.g. budget enforcing on the adversary would be helpful. +5. The b-paired and flexible b-paired agents seem to be very similar to each other - especially for some problems. Some more analysis of this would be useful. +",6,1.0,ICLR2021 +H1aZfRINx,3,rJq_YBqxx,rJq_YBqxx,A well written paper," +* Summary: This paper proposes a neural machine translation model that translates the source and the target texts in an end to end manner from characters to characters. The model can learn morphology in the encoder and in the decoder the authors use a hierarchical decoder. Authors provide very compelling results on various bilingual corpora for different language pairs. The paper is well-written, the results are competitive compared to other baselines in the literature. + + +* Review: + - I think the paper is very well written, I like the analysis presented in this paper. It is clean and precise. + - The idea of using hierarchical decoders have been explored before, e.g. [1]. Can you cite those papers? + - This paper is mainly an application paper and it is mainly the application of several existing components on the character-level NMT tasks. In this sense, it is good that authors made their codes available online. However, the contributions from the general ML point of view is still limited. + +* Some Requests: + -Can you add the size of the models to the Table 1? +- Can you add some of the failure cases of your model, where the model failed to translate correctly? + +* An Overview of the Review: + +Pros: + - The paper is well written + - Extensive analysis of the model on various language pairs + - Convincing experimental results. + +Cons: + - The model is complicated. + - Mainly an architecture engineering/application paper(bringing together various well-known techniques), not much novelty. + - The proposed model is potentially slower than the regular models since it needs to operate over the characters instead of the words and uses several RNNs. 
+ +[1] Serban IV, Sordoni A, Bengio Y, Courville A, Pineau J. Hierarchical neural network generative models for movie dialogues. arXiv preprint arXiv:1507.04808. 2015 Jul 17. +",6,4.0,ICLR2017 +Bkl3cLLIFS,1,rJgDT04twH,rJgDT04twH,Official Blind Review #3,"This paper introduces several methods for training reinforcement learning agents from implicit human feedback gained through the use of EEG sensors affixed to human subjects. The EEG data is interpreted into error-related event potential which are then incorporated as a form of noisy feedback for training three different reinforcement learning agents. The first agent (full-access) requires feedback for each state-action selected. The second agent (active-learning) requires feedback on trajectories generated every N_E episodes. The third (learning from imperfect demonstrations) requires the human to provide EEG feedback over an initial set of demonstration trajectories, and subsequently requires no further human access. These methods are evaluated across several handmade gridworld-like environments including a 10x10 maze, a game of catch, and 1-D cursor game called Wobble. Using the EEG training procedures is shown to improve the speed of reaching a good RL policy for all of the three different training algorithms. Additional experiments are conducted to test the generalizability of the error-related event potentials across games, with results indicating a reasonable degree of generalization. + +I like the idea of using EEG a way to reduce the burden of collecting human feedback for training reinforcement learning agents. This paper does a good job of investigating several different methodologies for combining ErrP feedback into the main loop of DQN-based RL agents. Additionally, the fact that ErrP feedback seems to generalize between domains is a promising indicator that a single person may be able to provide feedback across many different domains without re-training the ErrP decoder. While I like the paper as an interesting idea and proof of concept, there are some flaws that make me doubt it would be realizable for more complex tasks. + +The drawback of this paper are the many open questions relating to the experiments: + +1) In Figures 4b, 6b, and 6d, what is the meaning of 'Complete Episode'? + +2) In order to assess how efficient each of these methods was in terms of the number of human labels required, how many human responses were needed for the ""full-access"" and ""First Framework"" experiments? + +3) In Figure 6 - what happened to the ""No Errp"" baseline? + +4) In Figure 5c - what are Game 1 and Game 2? + +5) Why are all the results shown on the Maze domain? Why are no results shown for Catch or Wobble? + +6) At an action speed of 1.5 seconds per action, I imagine that EEG is not much faster than having a human subject press a button to indicate their label. What prevents the use of faster speeds? + +More broadly, I think it would be interesting to compare how effective is EEG at collecting human preferences versus pressing buttons (such as in Knox et al) or selecting preferences between trajectories (as in Christiano et al)? + +It's my feeling that the experiments are more of a proof of concept and many open questions exist about whether this method would scale beyond these simple domains that DQN masters in ~300 episodes. In particular, scaling up to actual Atari games as a would go a long way towards showing scalability to a well-studied RL domain. 
+ +I thought the overall clarity of the writing was somewhat lacking with many grammatical mistakes throughout, and the necessity to refer repeatedly to the Appendices in order to understand the basic functioning of the RL algorithms and reward learning (7.4). It took several passes to understand the full approach.",3,,ICLR2020 +rJeu7N5hKS,2,r1gRTCVFvB,r1gRTCVFvB,Official Blind Review #2,"The paper tries to handle the class imbalance problem by decoupling the learning process into representation learning and classification, in contrast to the current methods that jointly learn both of them. They comprehensively study several sampling methods for representation learning and different strategies for classification. They find that instance-balanced sampling gives the best representation, and simply adjusting the classifier will equip the model with long-tailed recognition ability. They achieve start of art on long-tailed data (ImageNet-LT, Places-LT and iNaturalist). + +In general, this is paper is an interesting paper. The author propose that instance-balanced sampling already learns the best and most generalizable representations, which is out of common expectation. They perform extensive experiment to illustrate their points. + +--Writing: +This paper is well written in English and is well structured. And there are two typos. One is in the second row of page 3, ""… a more continuous decrease [in in] class labels …"" and the other one is in the first paragraph of section 5.4, ""… report state-of-art results [on on] three common long-tailed benchmarks …"". + +--Introduction and review: +The authors do a comprehensive literature review, listing the main directions for solving the long-tailed recognition problem. They emphasis that these methods all jointly learn the representations and classifiers, which ""make it unclear how the long-tailed recognition ability is achieved-is it from learning a better representation or by handling the data imbalance better via shifting classifier decision boundaries"". This motivate them to decouple representation learning and classification. + +--Experiment: +Since this paper decouples the representation learning and classification to ""make it clear"" whether the long-tailed recognition ability is achieved from better representation or more balanced classifier, I recommend that authors show us some visualization of the feature map besides number on performance. Because I am confused and difficult to image what ""better representation"" actually looks like. + +The authors conduct experiment with ResNeXt-{10,50,101,151}, and mainly use ResNeXt-50 for analysis. Will other networks get similar results as that of ResNeXt-50 shown in Figure 1? + +When showing the results, like Figure 1, 2 and Table 2, it would be better to mention the parameters chosen for \tau-normalization and other methods. + +Conclusion: +I tend to accept this paper since it is interesting and renews our understanding of the long-tailed recognition ability of neural network and sampling strategies. What's more, he experiment is comprehensive and rigorous.",8,,ICLR2020 +rJgVurunjX,1,r1xRW3A9YX,r1xRW3A9YX,Lack of clarity and limited experimental success,"This paper presents a generalization of TransE to Riemannian manifolds. While this work falls into the class of interesting recent approaches for using non-Euclidean spaces for knowledge graph embeddings, I found it very hard to digest (e.g. the first paragraph in Section 3.3). 
Figure 3 and 4 confused me more than helping me to understand the method. Furthermore, current neural link prediction methods are usually evaluated on FB15k and WN18. In fact, often on the harder variants FB15k-237 and WN18RR. For FB15k and WN18, Riemannian TransE seems to underperform compared to baselines -- even for low embedding dimensions, so I have doubts how useful this method will be to the community and believe further experiments on FB15k-237 and WN18RR need to be carried out and the clarity of the paper, particularly the figures, needs to be improved. Lastly, I would be curious about how the different Riemannian TransE variants compare to TransE in terms of speed? + +Update: I thank the authors for their response and revision of the paper. To me, results on WN18RR and FB15k-237 are inconclusive w.r.t. to the choice of using Riemannian as opposed to Euclidean space. I therefore still believe this paper needs more work before acceptance.",5,2.0,ICLR2019 +BkeUdm9QcB,3,HkxWXkStDB,HkxWXkStDB,Official Blind Review #1,"The paper proposes a novel data augmentations approach that improves the robustness of a model on the CIFAR-10 and ImageNet Common Corruptions benchmarks while maintaining training accuracy on clean data. To achieve this, the paper proposes a rather simple augmentation mechanism that is inspired by CutOut (DeVries & Taylor 2017) and Gaussian (Grandvalet & Kanu, 1997): adding Gaussian noise to random patches in the image. This simple approach is shown to work surprisingly well on the corruption benchmarks. It seems reasonable that while adding Gaussian noise makes the model robust to high frequency noise, since Gaussian noise is not added everywhere, the model is able to exploit high frequency signal when available in the input. The paper is reasonably well written and the experimental validation is convincing. + +Overall, the approach could become one of the standard mechanisms for data augmentation in the toolset of a practical ML engineer. +",8,,ICLR2020 +S1KZ3x5ef,1,HkUR_y-RZ,HkUR_y-RZ,Fascinating and well investigated extension of L2S to RNNs,"This paper extends the concept of global rather than local optimization from the learning to search (L2S) literature to RNNs, specifically in the formation and implementation of SEARNN. Their work takes steps to consider and resolve issues that arise from restricting optimization to only local ground truth choices, which traditionally results in label / transition bias from the teacher forced model. + +The underlying issue (MLE training of RNNs) is well founded and referenced, their introduction and extension to the L2S techniques that may help resolve the issue are promising, and their experiments, both small and large, show the efficacy of their technique. + +I am also glad to see the exploration of scaling SEARNN to the IWSLT'14 de-en machine translation dataset. As noted by the authors, it is a dataset that has been tackled by related papers and importantly a well scaled dataset. For SEARNN and related techniques to see widespread adoption, the scaling analysis this paper provides is a fundamental component. + +This reviewer, whilst not having read all of the appendix in detail, also appreciates the additional insights provided by it, such as including losses that were attempted but did not result in appreciable gains. + +Overall I believe this is a paper that tackles an important topic area and provides a novel and persuasive potential solution to many of the issues it highlights. 
+ +(extremely minor typo: ""One popular possibility from L2S is go the full reduction route down to binary classification"")",8,4.0,ICLR2018 +4hTy4SxESlT,4,n7wIfYPdVet,n7wIfYPdVet,The intuition makes sense. Comprehensive experiments are done.,"This paper studies a variant of multi-task learning, auxiliary learning, where one main task dominates, and other tasks are used to learn a good representation. To achieve this goal, the authors propose a learning-to-learn algorithm. In particular, the auxiliary losses are represented by a vector and then transformed to a new loss term via linear or nonlinear function $h$. They also made two more contributions. First, an approach of new auxiliary task generation is proposed. Second, an implicit differentiation based optimization method is proposed to find the solution. Both theoretical analysis and empirical studies demonstrate the superiority of their proposed model. + +Pros: +1. The intuition makes sense. When the main task dominates, its generalization can help guide the weighting of the weights of auxiliary tasks. +2. The theoretical analysis further supports the claim of this work. +3. Comprehensive experiments are made to verify the effectiveness of the proposal. + +Cons: +1. The idea of learning new auxiliary tasks is wired and less intuitive. The learning of the whole system is still the main task loss, without any useful supervision (or self-supervision). Therefore, it is doubtful whether the learned task is meaningful, or may involve some chaos in the system. Overall, I think this part does harm to this work and the authors may consider removing it. +2. Notations are sort of unclear. In section 3.2, $\mathbf{l}$ is a $K+1$-dim vector, but it seems $g(\mathbf{l};\phi)$ only maps the $K$ auxiliary tasks' losses to a scalar. +3. There are some typos and the authors need to polish this paper again.",6,3.0,ICLR2021 +B3A2cuPoz2i,1,5Spjp0zDYt,5Spjp0zDYt,"Interesting topic for paper, but serious shortfalls in approach","This paper looks to investigate when VAEs fail to learn the maximum marginal likelihood (MML) model and some of the implications this can have for downstream tasks. In particular, it introduces a theorem (Theorem 1) that provides assumptions under which MML will not be found, and then performs experiments to assess behavior in scenarios where the authors believe these assumptions are satisfied or not. + +Though the topic of the paper is interesting and the long term aims are a good line of research, I believe the actual approach taken is rather misguided and that the arguments and conclusions are not properly supported. In short, I do not believe that the key claimed contribution of describing *when* pathologies occur in VAEs is actually accurate or that the paper adds notable additional insight on this compared to previous work. As such, I do not believe it is suitable for publication at ICLR in its present state. + +*Strengths* +- The problem the work is trying to tackle is important. +- The supplement is very comprehensive and a lot of experiments have been run. +- The paper is generally well written (though the hand-waviness of some of the arguments and reliance on the supplement do significantly detract from the clarity in the latter sections). +- The work is mostly well referenced. 
One important missing reference that should be added is https://arxiv.org/abs/2006.10102 which already makes important related arguments about trade-offs in M2-style semi-supervised VAEs (in particular, they discuss the fact that this type of semi-supervision imposes a mismatch between the marginal posterior q(z) and the prior p(z), which closely relates to this paper's discussion of mismatch between p(z|x) and q(z|x) but without needing to assume a ground truth posterior). Direct discussion about the fact that a VAE need not learn a ""ground truth likelihood"" to perfectly match the data distribution should also be added (see e.g. http://ruishu.io/2017/01/14/one-bit/).
- Though I am not sure I agree with all the associated conclusions or assumptions being made, it is clear that the authors have put noticeable effort into trying to link the findings of the paper to practical implications.

*Weaknesses*
- I believe the core result of the paper (Theorem 1) is redundant (see below).
- The work makes lots of unnecessary and restrictive assumptions about a ""ground truth"" model that do not match up to typical real-world situations where such a model does not generally exist (in particular, the latent variables are generally an arbitrary construction such that the concept of a ground truth is actually meaningless). Moreover, it is already well established that VAEs are not, in general, capable of uncovering a ground truth model even when this does exist (due to equivalence classes amongst other things); the paper offers little beyond this already well-known fact. In fact, it is written in a way that often implies that we would expect such a ground truth to be uncovered even though various works have explained why this is not a reasonable expectation (e.g. Locatello et al 2019).
- Most of the paper is extremely hand-wavy and imprecise. For example, various claims are made about the assumptions in Theorem 1 holding or not for different experiments, but the justifications for these are never really properly explained, let alone formally demonstrated. Furthermore, I actually do not necessarily agree that the assertions made are always correct. At the very least, it is certainly not the case that all the conclusions of the work have been fully demonstrated.
- The paper constantly implicitly implies that a failure to satisfy the assumptions of Theorem 1 means that the associated pathologies will be avoided. However, this logic does not hold, as the assumptions in the theorem are sufficient, not necessary, conditions. Moreover, I show later in the review that the result actually holds under far weaker assumptions, strongly implying that the inverse assertions the paper relies on are likely to be false.
- The paper is written in a way that incorrectly implies various well-known behaviors are surprising insights (e.g. ""the ELBO can prefer learning likelihood functions $f_{\theta}$ that reconstruct $p(x)$ poorly, even when learning the ground truth likelihood is possible!"").
- Too many of the experimental results are partially relegated to the appendices. The paper would be stronger for being more focused on certain aspects of the experiments gone through carefully (with the results themselves actually in the paper!) and some completely relegated to the appendices. At the moment the paper feels almost unfinished because it is not at all self-contained without the supplement, which is not really acceptable in this format. 
+- The paper often talks in overly general terms that are not properly justified. For example, the abstract just talks about generic ""pathologies"" whereas the topic of the paper is about some quite specific issues rather than a general analysis of different VAE pathologies. +- The figures are scruffy and rather difficult to read (in particular their font size is ridiculously small). + +*Redundency of Theorem 1* + +I believe that the core result, Theorem 1, is a rather convoluted way of making somewhat simpler points (see below) and does not provide any particular insights; its first assumption is very much reversed engineered rather than a natural starting point. As such, it is difficult to take intuitions from it or to understand what is required to satisfy it. Moreover, the experiments seem to demonstrate that the conditions cannot formally be demonstrated in practice as only very hand-wavy explanations are given rather than concrete demonstrations. + +Even more problematically, I believe that the Theorem itself is actually a vacuous result in light of simpler, well-known, ideas. In short, it is obvious and well-known that we will not achieve maximum marginal likelihood (MML) solutions by maximizing the ELBO if the expected KL cannot be driven to zero at the MML value of theta (except in the bizarre edge case where the expected KL does not vary with theta). It is also straightforward and well-known that this will give a non-zero KL(p(x)||p_{theta}(x)) except in the special case where there is an alternative likelihood that gives the same p_{\theta}(x) while allowing an expected KL of zero (interestingly, this special case may well occur with infinite data though as this allows the encoder variance to tend to zero). As such, the result of the Theorem is somewhat obvious even without the first assumption holding: this core assumption is much stronger than it needs to be (it is only a sufficient condition, not a necessary one), which not only makes the theorem predominantly redundant, it also undermines most of the subsequent conclusions the paper derives from this result (e.g. the suggestion that pathologies will not occur if this assumption does not hold). + +To demonstrate my misgivings more concretely, and verify that the result is indeed vacuous, I developed the following, which I believe to be a stronger and more intuitive alternative result (that also has the key advantage of not making any assumptions about some ground truth model) + +*Alternative Theorem* + +Consider a VAE with a fixed prior $p(z)$. Define the sets +$$\Theta_{MML} = argmin_{\theta} KL(p(x)||p_{\theta}(x)),$$ +and +$$\Theta_{opt} = argmin_{\theta\in\Theta_{MML}} \min_{\phi} E[KL(q_{\phi}(z|x) || p_{\theta}(z|x))],$$ +such that $\Theta_{opt} \subseteq \Theta_{MML}$ (note both of these will often be the same single point). If the following assumptions hold: +1. $\nexists \theta \in \Theta_{MML} : E_{p(x)}[KL(q_{\phi}(z|x) || p_{\theta}(z|x))]=0$, i.e. the encoder cannot match the posterior (this assumption is not strictly necessary but is included for intuition because the next assumption cannot hold if this one does not, while it will almost always hold in practice when this assumption does hold), +2. 
$\exists \theta \in \Theta_{opt} : \lVert \nabla_{\theta} \min_{\phi} E[KL(q_{\phi}(z|x) || p_{\theta}(z|x))] \rVert >0$ and $\nabla_{\theta} \min_{\phi} E[KL(q_{\phi}(z|x) || p_{\theta}(z|x))]$ is locally absolutely continuous for this $\theta$ (see discussion of this assumption below) + +then the global minima of the negative ELBO, {$\theta',\phi'$}, is non-optimal for the marginal maximum likelihood in the sense that $\theta'\notin\Theta_{MML}$ and thus: +$$KL(p(x)||p_{\theta'}(x))>KL(p(x)||p_{\theta_{MML}}(x))\ge0$$ +where $\theta_{MML}$ be an arbitrary element in $\Theta_{MML}$. + +*Proof*: +The proof follows by considering an arbitrary $\theta$ for which the second assumption is satisfied. Here $\nabla_{\theta} KL(p(x)||p_{\theta}(x))=0$ but $\nabla_{\theta} \min_{\phi} E[KL(q_{\phi}(z|x) || p_{\theta}(z|x))] \neq 0$ so, by our continuity assumption, it must be possible to improve the ELBO by moving $\theta$ in the direction $-\nabla_{\theta} \min_{\phi} E[KL(q_{\phi}(z|x) || p_{\theta}(z|x))]$. However, because we are improving $\min_{\phi} E[KL(q_{\phi}(z|x) || p_{\theta}(z|x))]$, it impossible for this change to produce a $\theta$ that remains in the set $\Theta_{MML}$ (as this would imply our original point was not in $\Theta_{opt}$). As we have a $\theta$ that improves the ELBO but is no longer in $\Theta_{MML},$ we can conclude that the optimum of the ELBO is no longer optimal from the perspective of the maximum marginal likelihood. ☐ + +Here that the second assumption in my Theorem above is very weak whenever the first assumption holds because it effectively equates to saying that the global optima for the MML are not all also local optima for the attainable expected KL divergence between the encoder and posterior. For example, one would usually not expect the $\theta \in \Theta_{MML}$ to be connected, in which case the assumption can only be violated if the gradient of the attainable expected KL divergence coincidentally happens to be zero for every MML optimal $\theta$. In light of this, my suggested theorem is making far weaker and (arguably) more intuitive assumptions than Theorem 1 in the paper, while it is also a slightly stronger final result as it does not require there to be a ""ground truth"" (which is a massive assumption to be able to drop). + +Now, of course, the existence of an alternative theorem does not undermine the contributions of the work in itself, but the problem here is that my result above shows that Theorem 1 in the paper is vacuous because its result effectively always holds in practice if the encoder cannot exactly match the posterior (or more precisely, $\nexists \theta \in \Theta_{MML} : E_{p(x)}[KL(q_{\phi}(z|x) || p_{\theta}(z|x))]=0$). As such, the complex first assumption in Theorem 1 is generally not necessary and offers little insight. Moreover, because it is obvious that optimizing the ELBO does produce the MML parameters if the encoder can match the posterior, we see that Theorem 1 provides very little insight other than this already well-known fact. In my opinion, this severely undermines the contribution of the work.",4,5.0,ICLR2021 +zp3LWFlt1n0,2,5UY7aZ_h37,5UY7aZ_h37,Great paper with jarring flaws,"The paper investigates the oft-overlooked aspect of knowledge distillation (KD) -- why it works. The paper highlights the ability of KD for transferring not just the soft labels, but the inductive bias (assumptions inherent in the method, e.g. 
LSTM's notion of sequentiality, and CNN's translational invariance/equivariance) from the teacher so that the student exhibits, to an extent, the teacher's generalization properties as well. The paper explores doing KD between LSTMs and several versions of Transformers (with varying structural constraints) on a subject-verb-agreement dataset, and between CNNs and MLPs on MNIST and corrupted MNIST. Compared to prior work showing that better teacher performance leads to better student performance, this paper also shows that the student's performance on different aspects becomes more similar to the teacher's -- (1) if the teacher is strong on metric A and weak on metric B compared to a student on its own, the student can become stronger on A and weaker on B when distilled using the teacher; (2) if the teacher can generalize well to a separate, previously unseen dataset but the student generalizes poorly on its own, after distillation the student can generalize much better than it can possibly learn to on its own.

Pros:
- Very interesting hypothesis and sheds light on the inner working of KD. (see above)
- Interesting and novel set of experiments. Some (not all) experiments shed light on how the hypothesis seems to be true. (see above)
- Comes up with ways to measure transferred inductive bias, by highlighting different aspects of generalization for a student and comparing with and without distillation.

Cons:
- The writing is very confusing and cryptic, especially the first page until its last paragraph.
 1. The abstract is especially not telling the readers much about what is in the paper. I personally would be confused and skip reading this paper because I thought the paper discusses ""can we distill knowledge using knowledge distillation"". Inductive bias comes in many forms and is not often discussed, and it helps to use examples to tell the story directly, e.g. by mentioning the specific differences between inherent priors in CNNs/MLPs or LSTMs/Transformers in the abstract *and* the first paragraphs of the introduction.
 1. Second page, bold ""Second"" and ""Third"" are the same thing.
- Although after extensive thinking I believe the paper is distinct from previous KD analyses, the paper does not itself distinguish its findings enough from what is known in the literature.
 1. Granted, it is hard to distinguish the inductive bias transfer aspect of KD from other aspects of KD, and it is hard to prove experimentally because the field does not quite know which aspects of KD make it work. But the paper does not do a good job explaining which behavior is certainly due to inductive bias transfer, rather than possibly being caused by other hypotheses in the field, such as KD transferring ""knowledge"" of inter-class relationships, or the effect of soft labels.
 1. Note that the ECE results don't tell readers much. People expect soft labels to help not because they make models better calibrated, but because they boost performance, and it's not clear if people think better calibration leads to better main performance. Even if people do, in Fig 3(b) the ECE improves quite a lot for the Transformer student with better teachers, so it is wrong to claim ""Given the lack of significant improvement in ECE..."".
 1. The CNN/MLP experiment only has tasks where CNNs outperform MLPs. It would make it more interesting to see a task where the MLP outperforms CNN, e.g. 
a made up task whose ground truth is the xor of a few pixel positions, which could be hard for CNNs while easy for MLPs. +- Relatively small number of datasets. Just two datasets and two sets of networks is not very convincing to claim these findings generalize to other architectural changes. +- Experiments are sometimes not apple-to-apple comparisons. And some experiments are not convincing or irrelevant. + 1. The MNIST experiments only have two networks, and arguably CNN is absolutely better than MLP. It would make the point clearer if a worse CNN is used such that the MNIST-vanilla performance is the same as the MLP, and show improved generalization results on MNIST-C. + 1. Figure 1 does not tell readers much, because latent representation can be both inductive bias and regular representation power, and we already know that KD can improve the student's representation power. Same for the third bullet point in page 2. + 1. Suspect of cherry-picking results from which loss (LM or classification) to show. Figures 2,4 are experiments using the LM loss, and Figures 3,5 are using the classification loss, without giving a clear explanation why. + +Summary: +Given the interesting hypothesis and set of experiments, I think the community can benefit from this paper's findings in understanding KD so we use it more wisely, or at least generate more discussions of why KD is working. Despite the relatively unclear writing of the introduction and some experiments being unconvincing, the impact of the paper still outweighs these flaws. + +============================= +Update +While I agree in principle with Reviewer 1 that this paper has jarring flaws in writing and the rebuttal version does not adequately address it, I disagree that the writing warrants such a low score. I have seen worse papers with outrageous claims (e.g. try to claim significance with p=0.1) and I would not give those a 2. I would also disagree with R1 that there is no interesting result in this paper, because there is no prior work I know that even considers how distilled models generalize like their teacher. +If I were to grade this paper based on different aspects, the originality and significance would be both 9's, quality a 6 due to experiment issues and careless generalization, and clarity a 3-4 due to unclear motivation in the abstract/early intro and poor differentiation from prior work in terms of experiment design and analysis. +That said, the rebuttal did not change my mind that the writing probably will not be improved enough post-rebuttal, I would thus not be able to consider this a top paper despite the interesting observations.",7,4.0,ICLR2021 +GQEzKZb5dGL,4,PBfaUXYZzU,PBfaUXYZzU,Novel insights but more works need to be done,"Summary: + +This paper proposes a new evaluation framework for imbalanced data. Specifically, they introduce an additional weighted term in the formulation of balanced accuracy. By varying the weights, their framework can be adapted to many application domains. Finally, they use two case studies to illustrate the effectiveness of the proposed measure. + +Pros: + +- The proposed framework is simple and effective. +- The proposed framework can be used in many application domains. + + +Cons: + +I have several comments regarding the experimental results: + +- How do the authors compute the “skew” score for each dataset? This term is not clearly explained in the paper. 
+ +- Since the weighted terms e.g., $w_i$ play an important role in the proposed framework, more examples should be given to explain how to choose these parameters in different application domains. + +- In section 4.2, the authors state “ … by modifying loss functions of DNNs to capture class importance weights…”. It is not very clear to see how to implement this. + +- When using WBA in model training, what are the performances for the baseline measures, e.g., precision, recall? Also, why choose these values for $w_i$, e.g., w1=0.209, w2=0.368, and w3=0.255? + + +Overall, I think the novelty of this work is limited and the experimental results are not very convincing. +",4,4.0,ICLR2021 +BJlvxzmJaQ,3,Hyx6Bi0qYm,Hyx6Bi0qYm,Review of adversarial domain adaptation for stable brain-machine interfaces,"This contribution describes a novel approach for implanted brain-machine interface in order to address calibration problem and covariate shift. A latent representation is extracted from SEEG signals and is the input of a LTSM trained to predict muscle activity. To mitigate the variation of neural activities across days, the authors compare a CCA approach, a Kullback-Leibler divergence minimization and a novel adversarial approach called ADAN. + +The authors evaluate their approach on 16-days recording of neurons from the motor cortex of rhesus monkey, along with EMG recording of corresponding the arm and hand. The results show that the domain adaptation from the first recording is best handled with the proposed adversarial scheme. Compared to CCA-based and KL-based approaches, the ADAN scheme is able to significantly improve the EMG prediction, requiring a relatively small calibration dataset. + +The individual variability in day-to-day brain signal is difficult to harness and this work offers an interesting approach to address this problem. The contributions are well described, the limitation of CCA and KL are convincing and are supported by the experimental results. The important work on the figure help to provide a good understanding of the benefit of this approach. + +Some parts could be improved. The results of Fig. 2B to investigate the role of latent variables extracted from the trained autoencoder are not clear, the simultaneous training could be better explained. As the authors claimed that their method allows to make an unsupervised alignment neural recording, independently of the task, an experiment on another dataset could enforce this claim.",9,4.0,ICLR2019 +SJxNx3Optr,3,HyxjNyrtPr,HyxjNyrtPr,Official Blind Review #3,"SUMMARY: Unsupervised/Self-supervised generative model for image synthesis using 3D depth and RGB consistency across camera views + +CLAIMS: +- New technique for RGBD synthesis using loss in 3D space +- Can disentangle camera parameters from content (I disagree slightly with ""disentangle"" since you are conditioning on camera parameters in the first place) +- Different generator architectures can be used + +METHOD: +Generate RGBD images of 2 different views, have an adversarial loss on the RGB image, have a content loss between RGB1 and warp(RGB2), have a depth loss between D1 and warp(D2) +Equation 5: +- Possibly either ""c_{1->2}"" needs to be replaced by ""c_{2->1}"", or ""G_{RGB}(z, c_1) - warp(G_{RGB}(z, c_2), c_{1->2})"" needs to be replaced by ""warp(G_{RGB}(z, c_1), c_{1->2}) - G_{RGB}(z, c_2)"" (or am I missing something?) 
+- Not entirely sure why there is a different ""projection"" operation, since both ""warp"" and ""projection"" are calculated from Equation 3. I understand that ""warp"" is the combined Rt matrix that is estimated using the two views and Equation 3, assuming that the ""d""s are correct. Not sure what ""projection"" does though, possibly explain it better? + +DECISION: Very clearly written paper, simple idea executed well + +The paper is clearly written and well organized. It uses a simple idea, and performs sufficient number of experiments to explore the idea. It is not very novel, but the paper shows its applicability with multiple architectures as a bonus. + +The figures showed results almost only from their method. It would be great to pick one generator architecture, and elucidate more on the differences between not using their 3D loss and using it. Good attempt though. + +ADDITIONAL FEEDBACK: +- Might not be ""representation learning"", instead it is learning a generative model. +- ""3 EXPERIMETNS"" -> ""3 EXPERIMENTS"" +- The appendix should have more details on the equations and the specific formulations of warp and projection operations",6,,ICLR2020 +QzIWxLsPRn4,4,HQoCa9WODc0,HQoCa9WODc0,Official Review 4,"### Summary +The paper describes a new method for detecting outliers with deep autoencoders by suppressing the reconstruction of out-of-distribution data. The article first investigates the reasons why standard autoencoders (AEs) reconstruct outlier datapoints fairly well, and are therefore problematic when used to detect anomalies via the reconstruction loss. The main novel contribution is the Energy-based Autoencoder (EBAE), a variant of an autoencoder in which the reconstruction loss is directly used as an energy function, and therefore outliers should have high energy. This is done with a new gradient formulation that enforces normalization of the probability distribution by sampling from the learned model via a variant of Langevin Monte-Carlo sampling. This second term is supposed to punish the reconstruction of outliers. Results are shown on a MNIST holdout task (leave one class out), and for OOD detection in CIFAR-10 and a downsampled version of ImageNet, showing that EBAE learns to reconstruct samples from different datasets, as well as constant and noise inputs, with significantly higher reconstruction error than competing methods. + +### Evaluation +The field of anomaly detection with deep autoencoders is relevant for a large part of the ICLR community. The paper contributes a novel method that improves outlier detection on the tested benchmarks, so the paper has some significance for the field of anomaly detection. The paper is overall relatively well written, although there are some sections with quite a few grammar mistakes (e.g. Sec. 4.2) that should be corrected, and neither the figures nor their captions are very clear. The originality of the paper is quite limited, because it is to a large part based on known observations, and adds an incremental improvement over previous approaches tackling the same problem. The methods and the related work are described in sufficient detail. In the section on outlier reconstruction it is not clear which of the proposed reasons for outlier reconstruction by AEs are novel insights, and which are known facts from the literature. Only for the first argument (smoothness) a source is quoted. 
The experiments are quite extensive in comparing to different algorithms and with multiple outlier datasets, but the interpretation of these results is very short, almost non-existent. + +Overall I think this paper is about at the acceptance threshold. It is a nice but not a groundbreaking new method, and although the results on the tested benchmarks look good, these are relatively easy benchmarks for outlier detection, and so the relevance for real-world problems is not quite clear. In particular it is not clear if the sampling mechanism can deal with difficult outlier detection problems where inliers and outliers are similar (e.g. outliers produced via corruptions). Additionally I am concerned about the computational complexity of the method compared to other approaches. I think the paper would greatly benefit from a more exhaustive interpretation of the results. + +**Pros**: +1. The paper shows a new method for outlier detection that improves existing AE-based techniques. +2. There are a lot of experiments with different datasets and different competing algorithms for the three main tested datasets (MNIST, CIFAR-10, and ImageNet32). +3. The experimental results show an advantage on the tested datasets both for class holdout and outliers from other datasets. + +**Cons**: +1. Energy-based interpretations of autoencoders are well-known, so this is a rather incremental addition to the existing literature. +2. The method adds computational complexity through the sampling process, and it is not clear how many samples are needed for the method to become effective. The supplementary material provides the number of steps to generate a sample, but I could not find the total number of samples generated (maybe I missed it). The authors say that the complexity is OK for the tested datasets, but I would assume for larger images and more complex datasets this could become a big issue, because there will not be an efficient sampling of outliers in such cases. +3. There is almost no interpretation of the results. Why e.g. do almost all other methods have great difficulty with the constant inputs, but no problems with noise? Why are methods such as PXCNN++ performing so poorly on most datasets? +4. The benchmarks are rather simple, because the tested outlier datasets are quite different from the training set. Tests e.g. on corrupted versions of the training data would be interesting to test the power of the method. Robust anomaly detection, i.e. detecting outliers under the assumption that a small fraction of training points are anomalies could also prove challenging, since such datapoints might be reproduced by the sampling process (see e.g. Beggel, L., Pfeiffer, M., & Bischl, B. (2019). Robust anomaly detection in images using adversarial autoencoders. ECML (pp. 206-222); Choi, H., Jang, E., & Alemi, A. A. (2018). Waic, but why? generative ensembles for robust anomaly detection. arXiv preprint arXiv:1810.01392.) It would be good to talk more in the paper about the limitations of the method. +5. In the tables on outlier detection I was missing the reconstruction error for inliers as a reference. +6. The figures are not properly described in the captions, and some of them e.g. Fig. 2 cannot be understood without reading the full text. + + +### Additional Comments +1. In Fig. 1 (left) and Fig. 5 (right) it appears that all reconstructions of EBAE have much higher brightness than the original and reconstructions of AE. Why is that? +2. In Fig. 
1 (right) it is unclear why the reconstruction loss of EBAE is overall so much higher than for AE (the axes are shifted). The figure suggests a pretty high loss also for inliers, compared to AE, that's why I am also asking to provide the inlier values for Tables 2 and 3. +3. Please improve the caption of Fig. 2, this is not understandable. +4. In section 3 (towards the end) I don't understand the following sentence: ""The latent representation is more distribution for large D_z..."". Please correct. +5. In Section 4.2. I am not sure about the reasoning why the sampling should produce outliers at all. It is only mentioned that sampling produces high-likelihood samples, but if only inliers are sampled, how does this help distinguishing from real outliers? It seems unlikely that proper outliers are sampled at all from this process. +6. Section 6.1, Typo: ""hyperpamraeters"" + + + + + +",5,4.0,ICLR2021 +ByrfSMcgz,2,SyfiiMZA-,SyfiiMZA-,"Nice to see this topic pop up again, but paper is lacking comparisons and insights.","I'm glad to see the concept of jointly learning to control and evolve pop up again! + +Unfortunately, this paper has a number of weak points that - I believe - make it unfit for publication in its current state. +Main weak points: +- No comparisons to other methods (e.g. switch between policy optimization for the controller and CMA-ES for the mechanical parameters). The basic result of the paper is that allowing PPO to optimize more parameters, achieves better results... +- One can argue that this is not true joint optimization Mechanical and control parameters are still treated differently. This begs the question: How should one define mechanical ""variables"" in order for them to behave similarly to other optimization variables (assuming that mechanical and control parameters influence the performance in a similar way)? + +Additional relevant papers (slightly different approach): +http://www.pnas.org/content/108/4/1234.full#sec-1 +http://ai2-s2-pdfs.s3.amazonaws.com/ad27/0104325010f54d1765fdced3af925ecbfeda.pdf + +Minor issues: +Figure 1: please add labels/captions +Figure 2: please label the axes +",4,4.0,ICLR2018 +4HcPPbtXgnp,2,CU0APx9LMaL,CU0APx9LMaL,A sound paper with somewhat discouraging results,"This paper studies neural architecture search for automatic speech recognition. The approach is to first search over small, reusable networks, called cells, and then applies the cells to a template network. The cells are learned with phonetic recognition on TIMIT and validated on letter recognition on LibriSpeech. + +The approach strikes a good balance between having a large search space and the computation cost of the search. I will discuss a few weaknesses in detail, but these weaknesses won't be known prior to performing the experiments in the paper. + +The presentation of the paper is also done well. I have no trouble following the paper from start to finish. + +The weakness of the paper is the absolute PERs on TIMIT and WERs LibriSpeech. The best PER achieved in this paper, 21.1% is quite high in today's standard. In (Graves et al., 2013), the numbers are around 18%. The best WERs achieved in Figure 7 are high in the teens. In (Hsu et al., 2020), the number for training on the 100 hours of LibriSpeech is around 14%. + +It is unclear if the paper uses any regularizer at all when training the models. Even adding some amount of dropout would help the final numbers. 
+ +The discrepancy between the numbers in the paper and others makes me wonder the search over the cells is a wrong direction to begin with. Maybe it is the things held fixed that play the role of achieving the best numbers. For example, the macro architecture is held fixed, the optimizer is held fixed, the learning rate schedules are more or less fixed. The macro architecture might play a critical role here. Typically, the competitive architectures require many layers of LSTMs instead of one used in the paper. It is quite discouraging that the models, discovered by NAS after spending so much compute, are not competitive to baseline models reported in other papers. + + +Hybrid Speech Recognition with Deep Bidirectional LSTM +Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed +ASRU, 2013 + +Semi-Supervised Speech Recognition via Local Prior Matching +Wei-Ning Hsu, Ann Lee, Gabriel Synnaeve, and Awni Hannun +arXiv:2002.10336",6,4.0,ICLR2021 +SkgumSL19S,3,rJl5rRVFvH,rJl5rRVFvH,Official Blind Review #3,"This paper proposes an off-policy batch reinforcement learning algorithm, Way Off-Policy(WOP). The authors address the extrapolation error which is a general problem of batch reinforcement learning. WOP algorithm uses KL-control to penalize divergence between prior and policy. Detailed comments are as follows: + +- KL penalize is the first in batch RL, but many approaches have already been proposed in RL. Moreover, I'm not sure if it is novel because I think that it is not much different from KL penalize proposed in RL. +- And DBCQ seems to simply remove the perturbation of BCQ. Is there any other difference? (If so, a detailed explanation needs to be added in the paper.) +- Are there any comparisons with more recent batch RL algorithms for discrete actions? (ex. SPIBB(Safe Policy Improvement with Baseline Bootstrapping, Laroche R., et al 2019)) +- Although the main experiment is about dialog, the section of ‘traditional RL experiments’ seems not enough for cartpole only. + +This paper is well organized and clearly written, but I cannot agree with the main contribution (KL-control penalize) of this paper (because the KL penalize idea is very similar to proposed approaches in RL). And, it seems to be compared with more recent batch RL algorithms in the experiments. + +[Minor errors] +- c first appears in Equation (6), which seems to have no definition and explanation in the paper. +- It is shown that the normalization constant is missing in Equation (9). +- In section 3.4, line 4, pi_\theta(a_t, s_t) -> pi_\theta(a_t | s_t) +",3,,ICLR2020 +mhgXH9UbPgo,1,ueiBFzt7CiK,ueiBFzt7CiK,"The paper is well presented with detailed motivations and analysis, but has weakness in experiments","In this paper, the authors proposed a framework for differentiable graph algorithm discovery (DAD). The framework is developed by improving two the discovery processes, i.e., designing a larger search space, and an effective explainer model. To enlarge the search space, the proposed DAD augments GNNs with cheap global graph features, which consist of solutions on the spanning trees of the original graph, and approximate solutions of greedy algorithms. All the features are concatenated together as input to GNNs. Experiments indicate that the proposed DAD is better than the approximate solutions. The explainer model is developed to explain the learned GNN, by employing the learning to explain (L2X) framework in (Chen et al., 2018). In general, the paper is well-written, and the method is interesting. 
+ +Some questions and comments: + +1. Although the authors have provided an ablation study to verify the contributions of tree solution versus greedy solution in Figure 11 in appendix, it is still not very clear about the impact of the global features on the model. The ablation study is short and lacks detailed discussions. In Figure 11, what is ""raw"" model? The global features are concatenated to the input, which actually increases the dimension of features. It would be necessary to see whether the global information or the extra degrees of freedom improve the performance. + +2. How to determine how many spanning tree solutions or greedy solutions is enough for performance improvement? The authors said that ""We tuned the hyperparameters of GNN models for each graph category, such as the number of layers and the number of spanning trees, using grid search. "" It sounds like the proposed model is hard to generalize to different datasets. + +3. The novelty of the explainer model is not very clear. It seems that the authors just applied the learning to explain (L2X) framework in (Chen et al., 2018) to the GNN settings. + +4. There many typos, e.g., +(1) In page 4, ""treat the l constraints seperately for efficient computations"", ---> separately +(2) In page 5, above Eq.(11), ""is irrelavent to nodes of distrance larger than T"" --> irrelevant, distance +(3) In page 5, ""is defined to be the origianl GNN"" --> original +(4) In page 6, ""add a global read-out functionto"" --> function + + +I have read the authors' response, and I would like to keep my current rating. + + + +",7,3.0,ICLR2021 +-5X8P9agrUX,1,cO1IH43yUF,cO1IH43yUF,"An exhaustive study on the instabilities of BERT fine-tuning, and simple and intuitive methods to circumvent these problems and improve the average performance of fine-tuning on different tasks","Large language models (LM) architectures, such as BERT, XLNet, etc., are not generally trained from scratch, but rather used as pretrained models. Among all, BERT is one of the most widely used ones, and its use on downstream tasks mainly consists on a stage of fine-tuning, where the new layers added are trained, and the rest of parameters of the network are left unfrozen, and hence are adjusted slightly to better fit the new task. However, this step of fine-tuning BERT is known to be quite unstable, and depends on a large set of factors, especially the initialization. Since the final performance on these downstream tasks can vary notably, different approaches has been proposed to circumvent this, but still the most common solution consists simply on choosing the best performing model, from a few random initialisation, using the validation set. + +##### Summary +In the current paper, the authors aim at tackling the aforementioned problem in a more grounded way. They investigate deeper the possible causes for these instabilities, and propose methods to counteract these pitfalls. Hence, they propose three simple, yet effective, approaches to stabilise the training and ensure better performances: a modified optimiser, the use of randomly initialised top layers, and more training steps. They provide a large collection of results, compare all these solutions to previous works, and discuss differences and similarities. Thanks to the analyses carried out, the current paper results in an exhaustive study on how “safely” fine-tune BERT, and the different factors that are to be taken into account when making use of these models. 
+ +##### Strong and weak points +I would like to start with the weakest point of the paper: it actually does not present anything clearly novel, nor innovative or groundbreaking. All the solutions proposed are inspired by previous approaches, or are just slight modifications of existing methods. But, this does not mean the paper is not valuable, as I do believe it is. The instability while fine-tuning large LMs on downstream tasks is a well known problem, but yet it has not been tackle exhaustively, and I do believe there does not exist clear guidelines and/or modifications that enable easily circumventing a critical weakness of these models. But I consider this paper succeeds at precisely this important task, thanks to the extended and exhaustive study it presents, and how it proposes three simple modifications that seem to solve this pitfall on most scenarios. + +Besides, the paper is quite well written, and presents in a clear manner the problems with the models, some intuition about the cause of those issues, and then, the solutions to overcome them. All the solutions are sufficiently justified, and are intuitive and simple. The latter, instead of being a weak point, for this precise problem it is more an advantage, as will allow an effortless adoption. Its improved performance is ensured thanks to the large set of benchmarks, on various datasets, the authors have compiled on the current manuscript. This is indeed another strong point, as all the solutions proposed are also tested under different conditions, with more or less training steps, and different numbers of top layers randomly initialised. + +##### Decision, and key reasons +I believe the paper is ready to be accepted. Overall, it is an interesting and useful paper that will help many NLP researchers, and end-users of BERT, fine-tune better models, obtain improved performances, and therefore, start from a better baseline for their endeavours. And all this, with just some simple and intuitive modifications and guidelines. All the proposed methods and suggestions are not drawn from a few bunch of tests, but rather from a large collection of simulations, for different and varied datasets, with disparate starting conditions, and run over a fair amount of random initialisations. Therefore, I believe the authors have taken their time, and simulation time, to ensure that the presented results are robust and consistent, which is something to remark also. + +##### Questions and additional evidence +Although I believe the paper is nicely written, and compiles all the required results and tests, I would appreciate if the authors could comment further on the following points: +* I do believe there is a reason for not performing bias-correction on BERTAdam, and therefore, introducing it back might be affecting BERT training and fine-tuning in some specific, I guess negative, way. Could the authors comment on this? Or their understanding on why the correction was removed for BERTAdam. +* In Figure 4, you suggest that with 5 to 10 random trials, the bias correction will achieve good results. However, observing the plots for all the datasets, we realise that indeed that number of random trials may benefit more the non-corrected version, as in most of the datasets the performance is either higher, or at least comparable. And although the variance is larger, we might still ensure at least a similar result . Could you comment on this? Would not be the corrected version a better option when no random initialisations are envisaged? 
+* For the re-init, when just training for 3 epochs, it surprises me that indeed we could train the last 6 layers with just this reduced amount of data and training steps. And more surprisingly, according to Figures 14-16, is that the weights for these last 6 layers are the first to stabilise, even though they started from scratch, and they are supposed to be critical for the downstream tasks. Could you comment on this? I guess my understanding is wrong, and I would appreciate therefore some further insights. +* Also, on the Re-init scheme, you mention that the number of layers to re-initialize depends on the task. Could you in any case offer here a general rule of thumb? + +##### Extra feedback +Finally, I would like to conclude listing some small typos and errors I could spot in the manuscript: +* Page 7, after Results, the reference to the Table is wrong. +* Page 8, table 2: I believe the result for the RTE - Int. Task is mistype. I guess it should be something around 71.8. +* Page 14, section E, Effect of Re-init… : the reference to the figure. +* The caption for all figures 14 to 17 is wrong, as it should read fine-tuning. + +These are the ones I could find, but it is not an exhaustive list. In any case, I would like to highlight the quality of the present manuscript, in terms of clearness and writing. ",7,3.0,ICLR2021 +r1INGfVNl,2,HJTXaw9gx,HJTXaw9gx,"somewhat interesting paper, wrong conference","Approximating solutions to PDEs with NN approximators is very hard. In particular the HJB and HJI eqs have in general discontinuous and non-differentiable solutions making them particularly tricky (unless the underlying process is a diffusion in which case the Ito term makes everything smooth, but this paper doesn't do that). What's worse, there is no direct correlation between a small PDE residual and a well performing-policy [tsitsiklis? beard? todorov?, I forget]. There's been lots of work on this which is not properly cited. + +The 2D toy examples are inadequate. What reason is there to think this will scale to do anything useful? + +There are a bunch of typos (""Range-Kutta""?) . + +More than anything, this paper is submitted to the wrong venue. There are no learned representations here. You're just using a NN. That's not what ICLR is about. Resubmit to ACC, ADPRL or CDC. + +Sorry for terseness. Despite rough review, I absolutely love this direction of research. More than anything, you have to solve harder control problems for people to take notice...",3,5.0,ICLR2017 +BkeqaMZJ5S,2,S1gwC1StwS,S1gwC1StwS,Official Blind Review #1,"The paper aims to study the topology of loss surfaces of neural networks using tools from algebraic topology. From what I understood, the idea is to effectively (1) take a grid over the parameters of a function (say a parameters of a neural net), (2) evaluate the function at those points, (3) compute sub-levelset persistent homology and (4) study the resulting barcode (for 0/1-dim features) (i.e., the mentioned ""canonical form"" invariants). Some experiments are presented on extremely simple toy data. + +Overall, the paper is very hard to read, as different concepts and terminology appear all over the place without a precise definition (see comments below). Given the problems in the writing of the paper, my assessment is that this idea boils down to computing persistent homology of the sub-levelset filtration of the loss surface sampled at fixed parameter realizations. 
I do not think that this will be feasible to do, even for small-scale real-world neural networks, simply due to the difficulty of finding a suitable grid, let alone the vast number of function evaluations involved. + +The paper is also unclear in many parts. A selection is listed below: + +(1) What do you mean by gradient flow? One can define a gradient flow in a linear space X and for a function F: X->R, e.g., as a smooth curve R->X, such that x'(t) = -\nabla F(x(t)); is that what is meant? + +(2) What do you mean by ""TDA package""? There are many TDA packages these days (maybe the CRAN TDA package?) + +(3) ""It was tested in dimensions up to 16 ..."" What is meant by dimension here? The dimensionality of the parameter space? + +(4) The author's talk about the ""minima's barcode"" - I have no idea what is meant by that either; the barcode is the result of sub-levelset persistent homology of a function -> it's not associated to a minima. + +(5) Is Theorem 2.3. not just a restatement of a theorem from Barannikov '94? At least the proof in the appendix seems to be . + +(6) Right before Theorem 2.3., what does the notation F_sC_* mean? This needs to be introduced somewhere. + +From my perspective, the whole story revolves around how to compute persistence barcodes from the sub-levelset filtration of the loss surface, obtained from function values taken on a grid over the parameters. The paper devotes quite some time to the introduction of these concepts, but not in a very clear or understandable manner. The experiments are effectively done on toy data, which is fine, but the paper stops at that point. I do not buy the argument that ""it is possible to apply it [the method] to large-scale modern neural networks"". Without a clear strategy to extend this, or at least some preliminary ""larger""-scale results, the paper does not meet the ICLR threshold. The more theoretical part is too convoluted and, from my perspective, just a restatement of earlier results. + + + + + + + + + + +",1,,ICLR2020 +r1Ojef4gf,1,ryRh0bb0Z,ryRh0bb0Z,"The authors propose a GAN formulation for multi-view learning trained to disentangle content from other aspects (""view"") influencing the image, presented from a slightly narrow perspective.","The paper proposes a GAN-based method for image generation that attempts to separate latent variables describing fixed ""content"" of objects from latent variables describing properties of ""view"" (all dynamic properties such as lighting, viewpoint, accessories, etc). The model is further extended for conditional generation and demonstrated on a range of image benchmark data sets. + +The core idea is to train the model on pairs of images corresponding to the same content but varying in views, using adversarial training to discriminate such examples from generated pairs. This is a reasonable procedure and it seems to work well, but also conceptually quite straightforward -- this is quite likely how most people working in the field would solve this problem, standard GAN techniques are used for training the generator and discriminator, and the network architecture is directly borrowed from Radford et al. (2015) and not even explained at all in the paper. The conditional variant is less obvious, requiring two kinds of negative images, and again the proposed approach seems technically sound. 
+ +Given the simplicity of the algorithmic choices, the potential novelty of the paper lies more in the problem formulation itself, which considers the question of separating two sets of latent variables from each other in setups where one of them (the ""view"") can vary from pair to pair in arbitrary manner and no attributes characterising the view are provided. This is an interesting problem setup, but not novel as such and unfortunately the paper does not do a very good job in putting it into the right context. The work is contrasted only against recent GAN-based image generation literature (where covariates for the views are often included) and the aspects related to multi-view learning are described only at the level of general intuition, instead of relating to the existing literature on the topic. The only relevant work cited from this angle is Mathieu et al. (2016), but even that is dismissed lightly by saying it is worse in generative tasks. How about the differences (theoretical and empirical) between the proposed approach and theirs in disentangling the latent variables? One would expect to see more discussion on this, given the importance of this property as motivation for the method. + +The generative story using three sets of latent variables, one shared, to describe a pair of objects corresponds to inter-battery factor analysis (IBFA) and is hence very closely related to canonical correlation analysis as well (Tucker ""An inter-battery method of factor analysis"", Psychometrika, 1958; Klami et al. ""Bayesian canonical correlation analysis"", JMLR, 2013). Linear CCA naturally would not be sufficient for generative modeling and its non-linear variants (e.g. Wang et al. ""Deep variational canonical correlation analysis"", arXiv:1610.03454, 2016; Damianou et al. ""Manifold relevance determination"", ICML, 2012) would not produce visually pleasing generative samples either, but the relationship is so close that these models have even been used for analysing setups identical to yours (e.g. Li et al. ""Cross-pose face recognition by canonical correlation analysis"", arXiv:1507.08076, 2015) but with goals other than generation. Consequently, the reader would expect to learn something about the relationship between the proposed method and the earlier literature building on the same latent variable formulation. A particularly interesting question would be whether the proposed model actually is a direct GAN-based extension of IBFA, and if not then how does it differ. Use of adversarial training to encourage separation of latent variables is clearly a reasonable idea and quite likely does better job than the earlier solutions (typically based on some sort of group-sparsity assumption in shared-private factorisation) with the possible or even likely exception of Mathieu at al. (2016), and aspects like this should be explicitly discussed to extend the contribution from pure image generation to multi-view literature in general. + +The empirical experiments are somewhat non-informative, relying heavily on visual comparisons and only satisfying the minimum requirement of demonstrating that the method does its job. The results look aesthetically more pleasing than the baselines, but the reader does not learn much about how the method actually behaves in practice; when does it break down, how sensitive it is to various choices (network structure, learning algorithm, amount of data, how well the content and view can be disentangled from each other, etc.). 
In other words, the evaluation is a bit lazy, somewhat in the same sense as the writing and treatment of related work; the authors implemented the model and ran it on a collection of public data sets, but did not venture further into scientific reporting of the merits and limitations of the approach.

Finally, Table 1 seems to have some min/max values the wrong way around.


Revision of the review in light of the author response:
The authors have adequately addressed my main remarks, and while doing so have improved both the positioning of the paper amongst relevant literature and the somewhat limited empirical comparisons. In particular, the authors now discuss alternative multi-view generative models not based on GANs, and the revised paper includes a considerably extended set of numerical comparisons that better illustrate the advantage over earlier techniques. I have increased my preliminary rating to account for these improvements.",7,3.0,ICLR2018
2VCXC5IR5Nf,1,GKLLd9FOe5l,GKLLd9FOe5l,Official Blind Review #4,"*Summary*

This paper proposes a new algorithm (SUBTLE) to conduct online A/B testing that 1) allows for continuous monitoring, and 2) detects subgroups with enhanced treatment effect (if such subgroups exist). The authors formalize the problem into a clean hypothesis testing problem (9) that tests whether the value gap between the optimal policy and the all-control policy is 0, propose the algorithm SUBTLE, and prove that it is able to control type I error at any time. Experiment results compare SUBTLE with the existing baseline SST, and show that SUBTLE is able to control type I error (while SST fails when the treatment has a clear negative impact) and achieves competitive detection power.

*Overall Assessment*

Overall I vote for acceptance. This paper focuses on an important real-world problem that many ICLR readers care about, and is easy to follow in general.

*Pros*
- The problem considered in this paper looks real and relevant to me. I like the idea of ""testing the existence of a beneficial subgroup"", which should be more useful in practice than ""testing whether there's a difference between treated and untreated"" as SST does. The testing hypotheses (9) look clean and reasonable to me.
- The proposed algorithm (SUBTLE) adapts mSPRT to address the current problem, which is easy to understand and achieves provable type I error guarantees. It also enjoys relatively good performance in both simulated and real data experiments.
- The paper is clearly written and easy to follow.

*Cons and Questions*
- SUBTLE performs extremely well in model V (the high-dimensional model), achieving lower type I error and higher power than in models I-IV. I'm somewhat doubtful about this result, since model V is essentially like models I-IV with more noise covariates, and as shown in section 4.2 adding more noise covariates lowers the estimated power. Is there any good explanation for this? Is it because the treatment effect structure in model V makes random forest a more favorable method? If so, how does SUBTLE perform when we have different $\theta$ structures (like linear)?
- SUBTLE looks relatively more conservative than SST when $c$ is small. In many real-world applications, the treatment effect $\theta(X)$ is often smaller in magnitude than $\mu(X)$, which corresponds to a small $c$ in the experiments. I am curious how the results would look when $c$ is smaller than $0.6$. 
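As a side note for readers less familiar with the always-valid testing machinery that SUBTLE builds on, below is a minimal sketch of a textbook mixture SPRT for a normal mean. This is my own illustration with made-up parameter names, not the authors' algorithm (which works with the estimated value gap and candidate subgroups); it only shows why the statistic can be monitored continuously while keeping type I error control.

```python
import numpy as np

def msprt_normal(xs, sigma2=1.0, tau2=1.0, alpha=0.05):
    # Mixture SPRT for H0: mean = 0, with known variance sigma2 and a
    # N(0, tau2) mixing distribution over the alternative mean.
    # Under H0 the statistic is a nonnegative martingale with mean 1,
    # so rejecting the first time it crosses 1/alpha controls the
    # type I error at any stopping time (continuous monitoring).
    s, n = 0.0, 0
    for x in xs:
        s += x
        n += 1
        lam = np.sqrt(sigma2 / (sigma2 + n * tau2)) * \
              np.exp(tau2 * s ** 2 / (2.0 * sigma2 * (sigma2 + n * tau2)))
        if lam >= 1.0 / alpha:
            return n  # stop and reject H0 after n observations
    return None  # never rejected
```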
",7,3.0,ICLR2021 +TmhImuIOn8,4,npkSFg-ktnW,npkSFg-ktnW,"Need to clearly state scope, method, contributions and to provide a more rigorous experimental setup.","# Summary +This paper proposes a new method called Disentangled Exploration Auto-Encoder (DEAE). This new method is based on “(Ge at al.) Zero-shot synthesis with group-supervised learning” to which a modified cyclic loss term is added. This method is trained on datasets with label supervision. + +# Pros +1. Compared to “(Ge at al.) Zero-shot synthesis with group-supervised learning” the results seem to be more visually pleasing. + +# Cons +1. The method is not clearly presented. In particular the loss terms are not mathematically expressed as equations. I don’t know if we’re expected to read the paper this method is based on to have detailed equations. In any case, the added cyclic loss is expressed as an equation either, so it’s really hard to say what the method really does. It’s also unspecified how the latent space is allocated to attributes. +2. The scope of the method is not clearly presented either: I was under the impression that the proposed method was comparable to other unsupervised auto-encoders while in fact it requires label supervision +3. It seems the paper claims to be more than it actually is: if I understand correctly, the sole contribution of this paper is a cyclic loss term, compared the (Ge et al) paper which can be seen as a regularizer. +4. Experiments are mostly focused on visual inspection of images and do not appear impressive to me (except for the toy dataset of colored letters, but then I don’t know what the SotA is like). Very few numerical results (save for dataset bias elimination) are presented. Results are not compared to other SotA methods, except the (Ge at al) paper which the method is based upon. Experimental setup is very incomplete. + +# Questions and nits +1. It would have benefited my comprehension to mention in the abstract that the proposed method requires attribute/label supervision. In its current form, it seems to claim to solve disentanglement for unsupervised auto-encoders and I find it misleading. +2. The mention of the generative ability without GAN based training in the abstract is also confusing: the proposed method seems orthogonal to GANs and could be combined with them. +3. Several times I see the term “perfect disentanglement” to describe the improved disentanglement that this method offers compared to (Ge et al). What I don’t understand is what makes it “perfect”, can’t it be further improved? +4. “. We propose a different solution to empower precise attribute controllable synthesis ability on autoencoders: DEAE”. To be fair, it’s also a solution to a different problem scope. The proposed method uses attributes while the cited methods don’t. So it’s really a solution to a different problem altogether. +5. Typo “whic h” => “which” +6. “Fig. 4 (d) shows that we can combine the UDVs to dicover new attribute values.” Aside from the typo on ""discover"", it’s unclear how you discover new attributes values (which I assume are centroid like the example you mentioned before for the blue color). Here instead, my understanding is that UDV just provides a vector along which values of interest may lie. It seems the eventual decision to make a value an attribute is manually decided by a human after inspecting the effects along an UDV axe. +7. Downstream task performance. It is a toy dataset and it’s hard to really tell the real power of the proposed method. 
A lot of information is missing: what are the sizes of $D_S$ and $D_L$? The only reported numbers are $D_S$ vs $D_{S+DEAE}$. What is the unreported accuracy gap between $D_{S+DEAE}$ and $D_{S+GSL-AE}$ that leads to the later conclusion that DEAE performs better? What is the accuracy for $D_L$? No information is given about the classifier network architecture(s), parameter sizes, or tuning, and whether or not they overfit or what other causes could be responsible for the observed results.

=====POST-REBUTTAL COMMENTS========
I thank the authors for the response and the efforts in the updated draft. Some of my queries were clarified, particularly concerning missing experimental details. However, unfortunately, I still think more needs to be clarified in the actual paper write-up, notably on the non-adversarial claim as well as on the method description.",3,4.0,ICLR2021
BJxrL1eLoQ,1,rJg8yhAqKm,rJg8yhAqKm,Potentially useful but poorly motivated and evaluated,"The paper proposes a method of regularising goal-conditioned policies with a mutual information term. While this is potentially useful, I found the motivation for the approach and the experimental results insufficient. On top of that, the presentation could also use some improvements. I do not recommend acceptance at this time.

The introduction is vague and involves undefined terms such as ""useful habits"". It is not clear what problems the authors have in mind and why exactly they propose their specific method. The presentation of the method itself is not self-contained and often relies on references to other papers, to the point where it is difficult to understand just by reading the paper. Some symbols are not defined, for example what is Z and why is it discrete?

The experiments are rather weak; they are missing comparisons to strong exploration baselines and goal-oriented baselines.",3,3.0,ICLR2019
SJlurqK4Tm,3,H1fF0iR9KX,H1fF0iR9KX,Experiments too limited to judge the merits,"This paper proposed to use graph-based deep learning methods to apply deep learning techniques to images coming from omnidirectional cameras. It solves the problem of distortions introduced by the projection of such images by replacing convolutions with graph-based convolutions, in particular with a combination of directed graphs which makes the network able to distinguish between orientations.

The paper is fairly well written and easy to follow, and the need for treating omnidirectional images differently is well motivated. However, since the novelty is not so much in the graph convolution method, or in the use of graph methods for treating spherical signals, but in the combined application of the particular graph method proposed to the domain of omnidirectional images, I would expect a more thorough experimental study of the merits of the method and architectural choices.

1. The projected MNIST dataset looks very localized on the sphere and therefore does not seem to leverage that much of the global connectivity of the graph, although it can integrate deformations. Since the dataset is manually projected, why not cover more of the sphere and allow for a more realistic setting with respect to omnidirectional images?
More generally, why not use a realistic high-resolution classification dataset and project it on the sphere? While it wouldn't allow for all the characteristics of omnidirectional images such as the wrapping around at the borders, it would lead to a more challenging classification problem. 
Papers such as [Khasanova & Frossard, 2017a] have at least used two toy-like datasets to discuss the merits of their classification method (MNIST-012, ETH-80), and a direct comparison with these baselines is not offered in this work. + +2. The method can be applied for a broad variety of tasks but by evaluating it in a classification setting only, it is difficult to have an estimate of its performance in a detection setting, where I would see more uses for the proposed methods in such settings (in particular with respect to rotationally invariant methods, which do not allow for localization). + +3. I fail to see the relevance of the experiments in Section 4.2 for a realistic application. Supposing a good model for spherical deformations of a lens is known, what prevents one from computing a reasonable inverse mapping and mapping the images back to a sphere? If the mapping is non-invertible (overlaps), then at least using an approximate inverse mapping would yield a competitive baseline. +I am surprised at the loss of accuracy in Table 2 with respect to the spherical baseline. Can you identify the source of this loss? Did you retrain the networks for the different deformations, or did you only change the projection of the network trained on a sphere? + +4. While the papers describes what happens at the level of the first filters, I did not find a clear explanation of what happens in upper layers, and find this point open to interpretation. Are graph convolutions used again based on the previous polynomial filter responses, sampling a bigger region on the sphere? Could you clarify this? + +5. I would also like to see a study of the choice of the different scales used (in particular, size of the neighborhood). + +Overall, I find that the paper introduces some interesting points but is too limited experimentally in its current form to allow for a fair evaluation of the merits of the method. Moreover, it leaves some important questions open as to how exactly it is applied (impact of sampling/neighborhood size, design of convolutions in upper layer...) which would need to be clarified and tested. + +Additional small details: +- please do not use notation $\mathbb{N}_p$ for the neighborhood, it suggests integers +- p. 4 ""While effective, these filters ... as according to Eq. (2) filter..."" -> article missing for the word ""filter"" +",4,4.0,ICLR2019 +bIOWQHLFaJN,4,8nXkyH2_s6,8nXkyH2_s6,MNIST alone is not enough to draw any decisive conclusions,"This draft proposes to use the relu activation pattern of the neurons in the neural network as the hash code for the input. Essentially the input features are bucketized into small piecewise linear regions. The authors show empirically that the proposed hash code has small collision and high accuracy given certain conditions including +1. The features are around the sample manifold +2. the training time is long enough +3. the network is wide enough +4. the training sample size is large enough +The authors also found empirically the effect of regularization is relatively small on the encoding properties. + +I feel this is an interesting thought but I am not sure if this is the first work on it. Some other possible issues: +1. Figure 4(a) somehow shows the redundancy is the smallest at epoch 0, then it goes high after 1 epoch and decreases slowly as the number epochs grows. Can authors provide some explanation on this? Does it suggest the random initialization of the neural network gives a good hash code in terms of the redundancy metric? 
(of course the accuracy will be bad) +2. The authors used K-means as another benchmark to compare. To me k-means is an unsupervised clustering algorithm. How do you get the accuracy from k-means? How do you match the cluster id to the labels? +3. My biggest complaint is that only the MNIST data set is investigated in the experiment. MNIST is too easy to show any conclusive results. You may need to work on other data sets such as image net, cifar 100, or NLP related data set to draw a convincing conclusion. +4. All the conclusions are purely empirical. Can authors provide some explanation or intuition on why the redundancy ratio decreases as the training time grows? Is this related to the type of optimizer being used? Why can a larger sample size also help reduce the redundancy ratio? + + +Overall I think this draft has some really good ideas but the empirical result is not quite conclusive due to the lack of extensive experimentation. +",5,4.0,ICLR2021 +S1xJXhf6tr,1,r1lIKlSYvH,r1lIKlSYvH,Official Blind Review #2,"1. Summary +The paper theoretically investigates the role of “local optima” of the variational objective in ignoring latent variables (leading to posterior collapse) in variational autoencoders. The paper first discusses various potential causes for posterior collapse before diving deeper into a particular cause: local optima. The paper considers a class of near-affine decoders and characterise the relationship between the variance (gamma) in the likelihood and local optima. The paper then extends this discussion for deeper architecture and vanilla autoencoders and illustrate how this can arise when the reconstruction cost is high. The paper considers several experiments to illustrate this issue. + +2. Opinion and rationales +I thank the authors for a good discussion paper on this important topic. However, at this stage, I’m leaning toward “weak reject”, due to the reasons below. That said, I’m willing to read the authors’ clarification and read the paper again during the rebuttal to correct my misunderstandings if there is any. The points below are all related. + +a. I would like to understand the use of “local optima” here. I think the paper specifically investigate local optima of the likelihood noise variance, and there are potentially other local optima. Wouldn’t this be an issue with hyperparameter optimisation in general? For example, for any regression tasks, high observation noise can be used to explain the data and all other modelling components can thus be ignored, so people have to initialise this to small values or constrain it during optimisation. + +b. I think there is one paper that the paper should discuss: Two problems with variational expectation maximisation for time-series models by Turner and Sahani. In this paper, the paper considers optimising the variational objective wrt noise likelihood hyperparameters and illustrates the “bias” issue of the bound towards high observation noise. + +c. I think it would be good to think about the intuition of this as well: “unavoidably high reconstruction errors, this implicitly constrains the corresponding VAE model to have a large optimal gamma value”: isn’t this intuitive to improve the likelihood of the hyperparameter gamma given the data? + +d. If all above are sensible and correct, I would like to understand the difference between this class of local minima and that of (ii). Aren’t they the same? + +e. 
The experiments consider training AEs/VAEs with increasingly complex decoders/encoders and suggest there is a strong relationship between the reconstruction errors in AEs and VAEs, and this and posterior collapse. But are these related to the minima in the decoder’s/encoder’s parameter spaces and not the hyperparameter space? So that is the message that the paper is trying to convey here? + +3. Minor: + +Sec 3 +(ii) assumI -> assuming +(v) fifth -> four, forth -> fifth +",3,,ICLR2020 +LxDr8fD0h85,2,xfmSoxdxFCG,xfmSoxdxFCG,The method is interesting but the experimental evidence is not good enough,"Although the paper does not say so, my understanding is that the proposed word embedding method actually first perform Kmeans-like clustering on the context vector shown in Figure 2. In the binary word embedding of each word, we set the dimensions corresponding to the k closest cluster centers to 1 and 0 otherwise. Most parts of the paper are describing how the simple word embedding method is related to the neural system of a fruit fly and showing the method achieves comparable performance compared with GloVe in word similarity tasks, context-sensitive word similarity datasets, and document classification tasks. + +Pros: +1. It is interesting to see the connection between a biological neural system and word embedding. +2. As far as I know, the method is novel. +3. Such a simple and biologically plausible method could reach reasonable performance. + +Cons: +1. The experiments have not demonstrated that the method is a useful word embedding method for NLP researchers. Some of the experiment designs, comparisons, and presentations are misleading. +2. Some related work and comparisons are missing. +3. The authors actually do not explain why this method works well in machine learning perspectives (similar to a biological neural system is not a very strong explanation for many people). + +Clarity: +The paper is easy to understand, but it does not explain why the method works from the ML perspective. + +Originality: +As far as I know, the method is novel. + +Significance of this work: +This might have a large impact on the neuroscience field. + +The method is compared with GloVe most of the time. First, GloVe actually does not perform well in word similarity tasks compared with word2vec/SGNS [3]. In terms of document classification, the results are mixed and I think the authors should try more datasets to know which method is better. In addition, GloVe is trained on a different dataset, so the result is not very comparable. Second, GloVe is a dense method. The proposed method should be compared with binary word embedding or compressed word embedding methods such as [1,2]. It will be even better if you can show that it is better than sparse representation such as PPMI in [3], even though PPMI is not binary and uses much more dimensions (you might want to use the dot product values instead of binary value in Equation (4) to improve your method and make the comparison fairer). + +As I mention at the beginning, I believe the word embedding is actually a binary projection to Kmeans-like cluster centers in the context vector space. Here is why. Let's first assume W_{mu} and v^A are normalized and ignore the normalization term /p to build the connection. Then, the l2 distance is (1 - dot product). Equation (1) tells us that we are minimizing the l2 distance between v^A and its closest cluster center (i.e., W_{mu}). Finally, Equation (4) clearly describes the top k projection. 
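To make this interpretation concrete, here is a rough sketch of what I believe Equations (1) and (4) amount to. This is my own reading, not the authors' code: the variable names, the explicit normalization, and the use of a standard Kmeans fit in place of the paper's local update rule for W are all my simplifications.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_centroids(context_vectors, n_centroids):
    # Kmeans-like step: cluster the normalized context vectors; the
    # centroid matrix plays the role of W in Equation (1).
    X = context_vectors / np.linalg.norm(context_vectors, axis=1, keepdims=True)
    return KMeans(n_clusters=n_centroids, n_init=10).fit(X).cluster_centers_

def binary_hash(v, centroids, k):
    # Top-k projection as I read Equation (4): for normalized vectors,
    # minimizing the l2 distance is the same as maximizing the dot product,
    # so the k most activated centroids get a 1 and all others get a 0.
    v = v / np.linalg.norm(v)
    scores = centroids @ v
    code = np.zeros(len(centroids), dtype=np.int8)
    code[np.argsort(-scores)[:k]] = 1
    return code
```

If this sketch is faithful, the learned representation is essentially a top-k assignment to cluster centroids in the context-vector space, which is why I describe the method as Kmeans-like clustering followed by a binary projection. 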
If I am right, I think adding this perspective will make readers understand the method better. If I am wrong, please explain why and try to explain what the method is actually doing in some machine learning perspectives. + +If I am right, we will find the description section 3.2 might be misleading. The main reason that the average cosine similarity within the cluster is higher might be because the binarization projection makes the word embeddings locally collapse to small cluster centers, which makes the word embeddings cannot capture the fine-grained statistics details. You can visualize the word embedding in a 2D space to see if I am right. If I am right, this is not an advantage in my opinion. + +If the authors could show that the proposed method is state of the art compared with binary word embeddings in a fair setting (e.g., train on the same dataset), I will vote acceptance. Otherwise, I am very likely to keep my weak rejection vote. + +Minor: +1. The definition of p before Equation (1) is unclear. Why do you have a mod here? +2. The optimization method in Equation (2) is unclear. What does dt mean? Is t the time step in iterative optimization? Is this a kind of gradient descent? I guess the differential equation comes from neural science and it needs more explanations to make NLP researchers understand. It would be better if the author can compare with other optimization methods such as gradient descent and EM algorithm for Kmeans clustering. + + +[1] Tissier, Julien, Christophe Gravier, and Amaury Habrard. ""Near-lossless binarization of word embeddings."" Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019. +[2] Wang, Yuwei, et al. ""Biological Neuron Coding Inspired Binary Word Embeddings."" Cognitive Computation 11.5 (2019): 676-684. +[3] Levy, Omer, Yoav Goldberg, and Ido Dagan. ""Improving distributional similarity with lessons learned from word embeddings."" Transactions of the Association for Computational Linguistics 3 (2015): 211-225.",7,3.0,ICLR2021 +Z8UEIjrPo4_,4,cP2fJWhYZe0,cP2fJWhYZe0,Interesting use of interpretability tools to highlight data biases/statistical artifacts.,"This work reports the problem of image classification datasets (CIFAR-10 and ImageNet) which contains statistical patterns present in both training and tests that can be leveraged by neural networks to achieve high accuracy, but would not be discerned as salient features by humans. Using Sufficient Input Subsets (SIS), they show that retaining the smallest SIS to keep a confidence of 99% leads to spare sets of about 5% of the original pixels and that these subsets of pixels are not salient features for humans. Most importantly they show that training NNs on these SIS from a previously trained network achieves similar results. + +=============================== + +Pros: + +1. The paper targets a very important subject. Benchmarks are a fundamental part of the progress in machine learning but overreliance on a single metric can be problematic. +2. The results on models trained on tiny SIS but achieving nevertheless high accuracies are surprising and a sign that statistical artifacts (or sampling bias) are present in both training and test sets (the authors name it ‘valid statistical patterns’). +3. The experiments with models ensembles and input dropout show that the issue of overinterpretation can be alleviated to some degree. + +Cons: + +1. I am strongly confident that the pathology should not be blamed on the models but rather on the data. 
I am confident humans trained on the tiny SIS can learn to classify the examples with much greater accuracy than 20%. (More on this in Additional observations below) +2. Related to the last point, there is no mention in the paper (or I missed it) that all datasets studied in the original paper where SIS is proposed (Carter et al. 2019) do not lead to overinterpretation issues. +3. SIS and Batched SIS are not clearly presented and defined in the main text. + +=============================== + +Reasons for score: + +I would vote for a weak accept. The message of the paper has important implications, we cannot use interpretation tools if models use incomprehensible features that are statistical artifacts, or rather quoting the authors ‘interpretability method that faithfully describes the model should output these nonsensical rationales’. However, the presentation of the paper should be improved, for instance there lacks explanation of SIS and Batched SIS in the main paper to help the reader follow. Also, the experiments about methods to alleviate overinterpretation should be presented with results of human scores to better convey the suspected decrease of overinterpretation. + +=============================== + +Additional observations + +I believe the observations are mainly pathologies of the data rather than the models. If such statistical patterns exist in the data that allows generalization, then a model learning such features is not pathological but rather well adapted to its task. Methods to alleviate such issues by biasing models to rely on larger subsets of the input is to me only a way of alleviating the data’s sampling bias. Cormier et al 2019 do not indeed report such pathologies, most probably because the data they studied did not have this issue. + +Humans are not trained only on CIFAR-10 to know what a frog or a car is. As authors say in section 4.2, there are statistical artifacts in the dataset that the humans do not know. I am fairly confident a human could be trained on the sparse versions and classify well the test example afterwards. The labels should also be changed to meaningless ones (ex: A, B, C, D, …) so that humans are learning from scratch like the CNNs. + + +I first thought the following to quotes to be contradictory: + +‘We also find SIS subsets confidently classified by one model do not transfer to other models. For instance, 5% pixel-subsets derived from CIFAR-10 test images using one ResNet18 model (which classifies them with 94.8% accuracy) are only classified with 25.8%, 29.2%, and 27.5% by another ResNet18 replicate, ResNet20, and VGG16 models , respectively, suggesting there exist many different statistical patterns that a flexible model might learn to rely on [...].’ + +‘We find models trained solely on these pixel-subsets can classify corresponding test image pixel-subsets with minimal accuracy loss compared to models trained on full images. [...] This result suggests that the highly sparse subsets found via backward selection offer a valid predictive signal in the CIFAR-10 benchmark explointed by models to attain high test accuracy.’ + +After rereading many times I realized the first quote was about using SIS for one replicate on test set and compute test accuracy of another replicate using the same SIS, while the second quote was about training another replicate on the SIS and evaluating it on them. I think this could be made more clear in the text. + +=============================== + +Questions + +What is the precise threshold for the SIS? 
Is it always 99% confidence? But not all examples are predicted with 99% confidence isn’t? Is it 99% of f(original x)? + +There must be a drawback with the Batched Gradient SIS algorithm, like lesser accuracy. Do the authors discuss it in the appendix? I have not found any discussion on that matter. + +=============================== + +Typos + +Page 3: [...] for the model to the same -> for the model to make the same + [...] a gradient-based to find -> a gradient-based method to find? + +=============================== + +Post-Rebuttal + +I thank the authors for their detailed answers. After reading the other reviews and the author's rebuttal, I maintain my rating of 6 for the paper. My concerns on the description of the SIS methods and results on the proposed mitigation are not addressed. + +I am not convinced as R1 and R4 that training on the SIS and testing on the full image is the correct way of testing if SIS is sufficient for the model's predictions. If we would present SIS images with unrelatable labels (A instead of Cat, B instead of Dog, C instead of Boat, etc) to humans and ask them to learn the mappings, I am confident they could achieve good results. As pointed out by other reviewers we can see some patterns in the SIS. Showing a full image afterwards and asking to predict (A, B, C, ...) would be quite difficult however. It's easy to infer the pattern from the full image, but the other way around is more difficult. To me the most important is that a given model architecture can be trained on the SIS of another trained model (with different random initializations) and still be able to learn and generalize. That alone shows in my opinion that the dataset contains undesirable statistical artifacts shared by training and test sets, and as the authors says in the paper '‘interpretability method that faithfully describes the model should output these nonsensical rationales’. + +I believe R4 makes a valuable point when saying '[...] I think it might be more useful to look at the mean SIS size of those that are wrongly classified by humans'. This seems to be a better way of gauging what threshold should be used for the size.",6,3.0,ICLR2021 +rJeLro9L67,3,BJEOOsCqKm,BJEOOsCqKm,Experimentally Limited,"This paper considers detecting anomalies in textures. For this task they use VGG-19 features and two human-inspired features from Portilla & Simoncelli and Schutt & Wichmann. +With these features, they train one-class anomaly detectors. One such anomaly detector is a one-class SVM, and they introduce a loss for one-class neural networks. + +The novelty in this paper comes from the problem setup which I have not seen treated before. The loss function they propose also appears original. + +However, comparisons are limited. They compare against OC-SVMs, but these are known to be weaker than several types of anomaly detectors [1]. This paper would also do well to ground itself in more recent research on deep anomaly detection [2]. Likewise, the problem setting is limited. In all, experimentation could use more breadth and depth. + +[1] Andrew F. Emmott, Shubhomoy Das, Thomas Dietterich, Alan Fern, Weng-Keen Wong. Systematic Construction of Anomaly Detection Benchmarks from Real Data. ODD, 2013. +[2] Dan Hendrycks and Kevin Gimpel. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. ICLR, 2017.",3,3.0,ICLR2019 +Ag0j1Q7XUU9,4,LxBFTZT3UOU,LxBFTZT3UOU,Unprincipled line-search approach for optimizing deep neural networks. 
Needs better comparison to past literature. ,"This paper proposes a line-search for optimizing deep neural networks. The method is rather unprincipled and quite close to the approach proposed in (Vaswani et al, 2019). I do not think that the paper proposes new ideas that haven't been already explored in the deterministic optimization literature. All in all, the paper needs to better connect to the algorithmic ideas in the deterministic optimization literature, compare against the new optimization methods proposed recently and have more representative plots. Detailed review below. + +- The claim ""adaptive methods concentrate more on finding good directions than on optimal step sizes, and could benefit from line search approaches"" needs justification. At this point, there has been substantial work that shows that adaptive methods like Adam are quite robust to their step size and do in fact, work well across problems. + +- Please cite the literature relevant to SGD, for example, Robins-Munro and Bottou et al, 2016. In the deterministic optimization literature, using Polyak step size is an alternative to line-search approaches. Please also cite the recently proposed methods based on the Polyak step-size, Berrada, et al ""Training Neural Networks for and by Interpolation"" and Loizou et al ""Stochastic polyak step-size for SGD: An adaptive learning rate for fast convergence"" that have been used to train deep neural networks. + +- In Figure 1, it is not enough to show that the loss landscapes are similar in the batch gradient direction. they should also be the same in the stochastic gradient direction, which depending on the batch, can be very different from the batch gradient direction. A more convincing approach would be to choose random directions and compute a metric of similarity for all such directions. Moreover, please explain how is ""t"" chosen for this plot. We know that the loss landscape is different at the start vs the end of the optimization. Consequently, this is not enough evidence to show that line-search can be done on a stochastic batch. In this figure, the batch-size is another confounding factor. What is the effect of the batch-size? + +- ""lemp has a simple shape and can be approximated well by lower order polynomials, splines or fourier series."" This is indeed the motivation behind line-search techniques for convex problems. Please cite Noecedal-Wright 2006 or the original line-search paper by Armijo. + +- In Section 2.2, backtracking line-search is used to overcome the first challenge, i.e. the algorithm picks the largest step-size that satisfies a sufficient decrease condition. The complexity of this is proportional to the number of backtracking iterations which is small. In the proposed approach, a number of points is sampled, meaning additional function evaluations. Please justify why the backtracking approach is not sufficient in this case, and explain why the proposed method would be better. + +- Similarly, for the second challenge, it is doing exactly the thing you cautioned against earlier ""that the batch need not be representative of the full loss function"". Doing a backtracking line-search on a batch is exactly the approach adopted by (Vaswani et al, 2019). Please clarify this and also explain why you do not experimentally compare against them. 
+ +- The idea of building a model of the loss function by using additional points (from the past) is well known in the deterministic optimization literature in the form of line-search with quadratic/cubic interpolation. This approach does not use additional function evaluations. Please explicitly make this connection and again, justify your approach against this less expensive method. Moreover, the paper proposes to build a highly accurate model of a stochastic loss which need not be representative of the full loss in any case. + +- ""The test error is determined by a 5-fold cross-validation. The second last polynomial degree is chosen and the polynomial is again +fitted on all loss values to get a more accurate fit. Consequently the closest minimum to the initial location is determined and additional losses are measured in a reasonable interval around it."" This is clearly very computationally expensive for determining the step-size for one iteration. Please clearly state what is the computational complexity of determining the step-size in one iteration. + +- ""ELF generalizes better if not performing a step to the minimum, but to perform a step that decreased the loss by a decrease factor"" This is almost the same as checking the Armijo sufficient decrease condition with a factor of \delta. Why not just do this and say it explicitly? + +- Experimentally, since the proposed approach is closest to the work of (Vaswani et al, 2019), please experimentally compare against their method. + +- In Figure 4, please plot the training/test loss vs the number of iterations. One point of information in the form of the test accuracy is not representative, especially since the metric being optimized is the training loss. And there are multiple confounding factors that influence the test error corresponding to any optimization method. Since this is more of an experimental paper, it would make sense to compare against the newer variants of Adam, such as RADAM and AdaBound that have found to work well.",3,5.0,ICLR2021 +SJxn7ibY3X,1,SkMwpiR9Y7,SkMwpiR9Y7,"Core idea is interesting, but the follow-through is kind of scattered with weak results in too many directions.","This paper proposes a method for functional regularization for training neural nets, such that the sequence of neural nets during training is stable in function space. Specifically, the authors define a L2 norm (i.e., a Hilbert norm), which can be used to measure distances in this space between two functions. The authors argue that this can aid in preventing catastrophic forgetting, which is demonstrated in a synthetic multi-task variant of MNIST. The authors also show how to regularize the gradient updates to be conservative in function space in standard stochastic gradient style learning, but with rather inconclusive empirical results. The authors also draw upon a connection to the natural gradient. + + +***Clarity*** + +The paper is reasonably well written. I think the logical flow could be improved at places. I think the major issue with clarity is the title. The authors use the term ""regularizing"" in a fairly narrow sense, in particular regularizing the training trajectory to be stable in function space. However, the more dominant usage for regularizing is to regularize the final learned function to some prior, which is not studied or even really discussed in the paper. + +Detailed comments: + +-- The notation in Section 2 could be cleaned up. The use of \mu is a bit disconnected from the rest of the notation. 
+ +-- Computing the empirical L2 distance accurately can also be NP hard. There's no stated guarantee of how large N needs to be to have a good empirical estimate. Figure 3 is nice, but I think a more thorough discussion on this point could be useful. + +-- L2-Space was never formally defined. + +-- Section 2.1 isn't explained clearly. For instance, in the last paragraph, the first sentence states ""the networks are initialized at very different point"", and halfway into the paragraph a sentence states ""all three initializations begin at approximately the same point in function space."". The upshot is that Figure 1 doesn't crisply capture the intuition the authors aim to convey. + + +***Originality*** + +Strictly speaking, the proposed formulation is novel as far as I am aware. However, the basic idea has been the air for a while. For instance, there are some related work in RL/IL on functional regularization: +-- https://arxiv.org/abs/1606.00968 + +The proposed formulation is, in some sense, the obvious thing to try (which is a good thing). The detailed connection to the natural gradient is nice. I do wish that the authors made stronger use of properties of a Hilbert space, as the usage of Hilbert spaces is fairly superficial. For instance, one can apply operators in a Hilbert space, or utilize an inner product. It just feels like there was a lost opportunity to really explore the implications. + + +***Significance*** + +This is the place where the contributions of this paper are most questionable. While the multi-task MNIST experiments are nice in demonstrating resilience against catastrophic forgetting, the experiments are pretty synthetic. What about a more ""real"" multi-task learning problem? + +More broadly, it feels like this paper is suffering from a bit of an identity crisis. It uses regularizing in a narrow sense to generate conservative updates. It argues that this can help in catastrophic forgetting. It also shows how to employ this to construct the standard bounded-update gradient descent rules, although without much rigorous discussion for the implications. There are some nice empirical results on a synthetic multi-task learning task, and inconclusive results otherwise. There's a nice little discussion on the connection to the natural gradient. It argues that that this form of regularization lives in a Hilbert space, but the usage of a Hilbert space is fairly superficial. All in all, there are some nice pieces of work here and there, but it's all together neither here or there in terms of an overall contribution. + + +***Overall Quality*** + +I think if the authors really pushed one of the angles to a more meaningful contribution, this paper would've been much stronger. As it stands, the paper just feels too scattered in its focus, without a truly compelling result, either theoretically or empirically.",6,4.0,ICLR2019 +SkgupY1-p7,3,rJgSV3AqKQ,rJgSV3AqKQ,A technical report rather than a research paper,"General: +In general, this looks like a technical report rather than a research paper to me. Most parts of the paper are about the empirical analysis of adaptive algorithms and hyper-gradient methods. The contribution of the paper itself is not sufficient to be accepted. + +Possible Improvements: +1. The study of such optimization problem should consider incorporating mathematics analysis with necessary proof. e.g. show the convergence rate under specific constraints. 
Even the paper is based on others' work, the author(s) could have extended their work by giving stronger theory analysis or experiment results. +2. Since this is an experimental-based paper, besides CIFAR10 and MNIST data sets, the result would be more convincing if the experiments were also done on ImageNet(probably should also try deeper neural networks). +3. The sensitivity study is interesting but the experiment results are not very meaningful to me. It would be better if the author(s) gave a more detailed analysis. +4. The paper could be more consistent. i.e. emphasize the contribution of your own work and be more logical. I might miss something, but I feel quite confused about what is the main idea after reading the paper. + +Conclusion: +I believe the paper has not reached the standard of ICLR. Although we need such paper to provide analysis towards existing methods, the paper itself is not strong enough.",3,4.0,ICLR2019 +g83rl7dHoct,3,HW4aTJHx0X,HW4aTJHx0X,Confused about the major contribution of this work,"This work presents a dataset based on S2ORC with additional annotations on the abstracts in terms of contribution and context. Based on this dataset, this paper also adopts two baseline models from prior work with a training strategy (also defined by prior work) to demonstrate the task of summarizing scientific literature from two aspects. + +Overall, I was confused about the contribution of this work. Based on my understanding it could either be presenting a newly annotated dataset or demonstrate this task using some baseline models. However, neither of them has enough evidence to convince me that this is a solid work. + +About the data, I have a few questions about the annotation procedure: + +1) about data curation, if there are more than 20 incoming citations, the proposed method suggests to keep only the top 20 papers based on their citations. However, how do we know these 20 papers are the most relevant to a specific work? + +2) I had trouble to understand the value of ""silver"" standard references. First of all, the classification accuracy is only 86.3%, without further guarantee, we have no idea about the annotation quality. Besides, if the major contribution of this work is about the dataset, then manually annotating 400 abstracts really did not sound like enough contribution. + +My confusion about the contribution of this work mainly came from section 5 and 6. Particularly, in section 6, the line ""to better understand the strengths and shortcomings of our models"" reminds me that maybe the contribution of this work is about those models. Another confusion is that, if this work is about the dataset, then I really think it is necessary to analyze the annotation quality, instead of the performance of existing models. ",4,5.0,ICLR2021 +MuS5HYVnPwx,4,SUyxNGzUsH,SUyxNGzUsH,Interesting idea; results need further clarification,"This paper studies the language grounding aspect of video-language problems. It proposes a Neural Module Network (NMN) for explicit reasoning of visually-grounded object/action entities and their relationships. The proposed method is demonstrated to be somewhat effective in the audio-visual dialogue task and has been shown superior to existing works on video QA. Overall, the paper is motivated clearly and is delivered with good clarity. The followings need to be clarified. + +i) The proposed model demonstrates impressive results on TGIF-QA but without any insightful justification. 
Since the questions in TGIF-QA are short and usually do not involve complicated reasoning, intuitively, a heavy reasoning scheme might not necessarily pay off. Please clarify the performance gain and possible reasons. Also, ""soft label programs"" lack the necessary context (and should be in bold instead in Tab. 4). + +ii) Including intense model variants in the main result table (Tab. 2) gives this paper a somewhat unfair advantage, especially when the best performing method on each metric comes from different model variants. The validation set (from both AVSD and TGIF-QA) is supposed to serve the purpose of model architecture search and ablation studies. Besides, the underlines in the lower part of Tab. 2 should go to method VGD-GPT2. + +========== Post-Rebuttal ========== + +Concerns on paper/results clarity still persist. Lowering my rating to 5.",5,2.0,ICLR2021 +HJK2Mt3lG,3,r1AoGNlC-,r1AoGNlC-,"The PQT algorithm is nice, but more analysis would be much appreciated. Also, BF is an odd language to target for program synthesis.","This paper introduces a method for regularizing the REINFORCE algorithm by keeping around a small set of known high-quality samples as part of the sample set when performing stochastic gradient estimation. + +I question the value of program synthesis in a language which is not human-readable. Typically, source code as function representation is desirable because it is human-interpretable. Code written in brainfuck is not readable by humans. In the related work, a paper by Nachum et al is criticized for providing a sequence of machine instructions, rather than code in a language. Since code in brainfuck is essentially a sequence of pointer arithmetic operations, and does not include any concept of compositionality or modularity of code (e.g. functions or variables), it is not clear what advantage this representation presents. Neither am I particularly convinced by the benchmark of a GA for generating BF code. None of these programs are particularly complex: most of the examples found in table 4 are quite short, over half of them 16 characters or fewer. 500 million evaluations is a lot. There are no program synthesis examples demonstrating types of functions which perform complex tasks involving e.g. recursion, such as sorting operations. + +There is also an odd attitude in the writing of this paper, reflected in the excerpt from the first paragraph describing that traditional approaches to program synthesis “… typically do not make use of machine learning and therefore require domain specific knowledge about the programming languages and hand-crafted heuristics to speed up the underlying combinatorial search. To create more generic programming tools without much domain specific knowledge …”. Why is this a goal? What is learned by restricting models to be unaware of obviously available domain-specific knowledge? + +All this said, the priority queue training presented here for reinforcement learning with sparse rewards is interesting, and appears to significantly improve the quality of results from a naive policy gradient approach. It would be nice to provide some sort of analysis of it, even an empirical one. For example, how frequently are the entries in the queue updated? Is this consistent over training time? How was the decision of K=10 reached? Is a separate queue per distributed training instance a choice made for implementation reasons, or because it provides helpful additional “regularization”? 
While the paper does demonstrate that PQT is helpful on this very particular task, it makes very little effort to investigate *why* it is helpful, or whether it will usefully generalize to other domains. + +Some analysis, perhaps even on just a small toy problem, of e.g. the effect of the PQT on the variance of the gradient estimates produced by REINFORCE, would go a long way towards convincing a skeptical reader of the value of this approach. It would also help clarify under what situations one should or should not use this. Any insight into how one should best set the lambda hyperparameters would also be very appreciated.",6,4.0,ICLR2018 +HJ_YkBz-G,3,rkhCSO4T-,rkhCSO4T-,"a specific architecture for action recognition, tested on one dataset -> limited contribution","Summary: the paper considers an architecture combining neural networks and Gaussian processes to classify actions in a video stream for *one* dataset. The neural network part employs inception networks and residual networks. Upon pretraining these networks on RGB and optical flow data, the features at the final layer are used as inputs to a GP classifier. To sidestep the intractability, a model using a product of independent GP experts is used, each expert using a small subset of data and the Laplace approximation for inference and learning. + +As it stands, I think the contributions of this paper is limited: + +* the paper considers a very specific architecture for a specific task (classifying actions in video streams) and a specific dataset (the HMDB-51 dataset). There is no new theoretical development. + +* the elements of the neural network architecture are not new/novel and, as cited in the paper, they have been used for action classification in Wang et al (2016), Ma et al (2017) and Sengupta and Qian (2017). I could not tell if there is any novelty on this part of the paper and it seems that the only difference between this paper and Sengupta and Qian (2017) is that Sengupta and Qian used SVM with multi-kernel learning and this paper uses GPs. + +* the paper considers a product of independent GP experts on the neural net features. It seems that combining predictions provided by the GPs helps. It is, however, not clear from the paper how the original dataset was divided into subsets. + +* it seems that the paper was written in a rush and many extensions and comparisons are only discussed briefly and left as future work, for example: using the Bayesian committee machine or modern sparse GP approximation techniques, end-to-end training and training with fewer training points.",3,4.0,ICLR2018 +SyeX2Lk6KH,1,HJlU-AVtvS,HJlU-AVtvS,Official Blind Review #2,"Updates: + +Thanks for the updates. + +I find the new theoretical results interesting and potentially useful, which shows, in the large $d$ setting, spectrums of CKs/NTKs for boolean cube, sphere and isotropic Gaussian are closed to each other in some sense. Thus, I raise my score to weakly accepted but lower down my confidence level since I am not that familiar with Boolean cube literature. + + +------------------------------------------------------ +The study of extremely over-parameterized networks (i.e. infinitely width networks) has become one of the most active research directions in theory deep learning. The key objects in understanding such networks are the conjugate kernel [1, 2] (CK defined in the paper) and the Neural tangent kernels [3] (NTK). 
The CK characterizes how the network looks like at initialization (connection to Gaussian processes as well) and the NTK is very useful to characterize the gradient descent training dynamics of large width networks in the kernel regime. Understanding properties of such kernels, in particular, their spectra distribution and eigenspace, could be potentially an important step towards a finer-gained understanding of generalization in neural networks. + +The main contribution of this paper is the development of the spectral theory of CK and NTK on boolean cube (similar or weaker results on uniform distribution in spheres and Gaussian distribution in R^n). More precisely, the authors show that, over the space of boolean cube, the CK/NTK could be diagonalized using the Fourier basis and the eigenvalues depend only on the frequency (i.e. the degree of the monomials); Thm 3.1. The authors also develop some computation tools to compute the spectra; Lemma 3.2. Using the tools developed in this paper, the authors are able to clarify some of the interesting observations found by other researchers. Most noticeably, the authors show that the observation in [4] 'neural network is biased towards simple functions' is NOT universal. Whether this statement is correct or not depends heavily on the choice of activation function (e.g. Relu v.s. Erf) and hyper-parameters (e.g. weight variance, depths). There are also some other interesting empirical findings: the optimal depth of a neural network depends on the complexity (i.e. degree in the boolean cube setting) of the function to learn, CK (i.e. training only the last layer) tends to be more useful for learning less complex functions, etc. + +Overall, this is a nice paper. I am leaning for a weakly accept. + + +[1] Amit Daniely, Roy Frostig, and Yoram Singer. Toward Deeper Understanding of Neural Networks: +The Power of Initialization and a Dual View on Expressivity. arXiv:1602.05897 [cs, stat], February +2016. +[2] Jaehoon Lee, Yasaman Bahri, Roman Novak, Sam Schoenholz, Jeffrey Pennington, and Jascha +Sohl-dickstein. Deep Neural Networks as Gaussian Processes. In International Conference on +Learning Representations, 2018. +[3] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural Tangent Kernel: Convergence and +Generalization in Neural Networks. arXiv:1806.07572 [cs, math, stat], June 2018. 00000 +[4] Guillermo Valle-Pérez, Chico Q. Camargo, and Ard A. Louis. Deep learning generalizes because +the parameter-function map is biased towards simple functions. arXiv:1805.08522 [cs, stat], May +2018. +",6,,ICLR2020 +ryg9Strt2m,1,HJex0o05F7,HJex0o05F7,Active anomaly detection technique employing existing approaches and lacking appropriate literature review,"(Since the reviewer was unclear about the OpenReview process, this review was earlier posted as public comment) + +Most claims of novelty can be clearly refuted such as the first sentence of the abstract ""...This work presents a new approach to active anomaly detection..."" and the paper does not give due credit to existing work. Current research such as Das et al. which is the most relevant has been deliberately not introduced upfront with other works (because it shows lack of the present paper's novelty) and instead deferred to later sections. The onus of a thorough literature review and laying down a proper context is on the authors, not the reviewers. Detailed comments are below. + + 1. 
Related Works: ""...active anomaly detection remains an under-explored approach to this problem..."" + - Active learning in anomaly detection is well-researched (AI2, etc.). See the related works section in Das et al. 2016 and: + - K. Veeramachaneni, I. Arnaldo, A. Cuesta-Infante, V. Korrapati, C. Bassias, and K. Li, ""Ai2: Training a big data machine to defend,"" International Conference on Big Data Security, 2016. + + 2. ""To deal with the cold start problem, for the first 10 calls of select_top..."": + - No principled approach to deal with cold start and one-sided labels (i.e., the ability to use labels when instances from only one class are labeled). + + 3. Many arbitrary hyperparameters as compared to simpler techniques: + - The number of layers and nodes in hidden layers. + - The number of instances (k) per iteration + - The number of pretraining iterations + - The number of times the network is retrained (100) after each labeling call + - Dealing with cold start (10 labeling iterations of 10 labels each, i.e. 100 labels). + + 4. The paper mentions that s(x) might not be differentiable. However, the sigmoid form of s(x) is differentiable. + + 5. Does not acknowledge the well-known result that mixture models are unidentifiable. The math in the paper is mostly redundant. Some references: + - Identifiability Of Nonparametric Mixture Models And Bayes Optimal Clustering (https://www.cs.cmu.edu/~pradeepr/arxiv_npmix_v.pdf) + - Semiparametric estimation of a two-component mixture model by Bordes, L., Kojadinovic, I., and Vandekerkhove, P., Annals of Statistics, 2006 (https://arxiv.org/pdf/math/0607812.pdf) + - Inference for mixtures of symmetric distributions by David R. Hunter, Shaoli Wang, Thomas P. Hettmansperger, Annals of Statistics, 2007 (https://arxiv.org/pdf/0708.0499.pdf) + - Inference on Mixtures Under Tail Restrictions by K. Jochmans, M. Henry, and B. Salanie, Econometric Theory, 2017 (http://econ.sciences-po.fr/sites/default/files/file/Inference.pdf) + + 6. Does not acknowledge existing work that adds a classifier over unsupervised detectors (such as AI2). This is very common. + - This is another linear model (logistic) on top of transformed features. The difference is that the transformed features are from a neural network and optimization can be performed in a joint fashion. The novelty is marginal. + + 7. While the paper argues that a prior needs to be assumed, it does not use any in the algorithm. There seems to be a disconnect. It also does not acknowledge that AAD (LODA/Tree) does use a prior. Priors for anomaly proportions in unsupervised algorithms are well-known (most AD algorithms support this, such as OC-SVM, Isolation Forest, LOF, etc.). + + 8. Does not compare against the current state-of-the-art Tree-based AAD: + - Incorporating Expert Feedback into Tree-based Anomaly Detection by Das et al., KDD, 2017. + + 9. The 'Generalized' in the title is incorrect and misleading. This is specific to deep networks. Stacking supervised classifiers on unsupervised detectors is very common. See comments on related works. + + 10. Does not propose any query strategies other than greedily selecting the top. + + 11. Question: Does this support streaming? +",3,4.0,ICLR2019 +DqhIHj9Qt9,3,HyGDdsCcFQ,HyGDdsCcFQ,Variational inference with discrete distribution for uncertainty estimation,"This paper proposes running variational inference with discrete mean-field distributions. 
The paper claims the proposed method is able to give a better estimation of uncertainty from the model. + +Rating of the paper in different aspects ( out of 10) +Quality 6, clarify 5, originality 8, significance of this work 5 + +Pros: + +1. The paper proposes a generic discrete distribution as the variational distribution to run inference for a wide range of models. + +Cons: + +1. When the method begins to use mean-field distributions, it begins to lose fidelity in approximating the posterior distributions. Even the model is able to do a good job in approximating marginal distributions, it is hard to evaluate whether the model is gaining benefit overall. + +2. I don't see a strong reason for using discrete distributions. In one dimensional space, a distribution can be approximated in different ways. Using discrete distributions only increases the difficulty of reparameterization. + +3. In the experiment evaluation, the algorithm seems only marginally outperforms competing methods. + + +Detailed comments: + +In the motivation of the paper, it cites low-precision neural networks. However, low-precision networks are for a different purpose -- small model size and saving energy. + +equation 6 is not clear to me. + +In equation 10, how are these conditional probabilities parameterized? Is it like: z ~ Bernoulli( sigmoid(wz) ) ? + +It is nice to have a brief introduction of the evaluation measure SGR. + +In table 3, 1st column, the third value seems to be the largest, but the fourth is bolded. +",5,4.0,ICLR2019 +aeMJsGSkbC,3,WGWzwdjm8mS,WGWzwdjm8mS,On the fairness of experimental comparison between CV and GD due to different sample size,"The paper proposes using gradient disparity between two batches for early stopping, and explains the reason by showing its connection with the generalization error. Experiments on limited-size dataset and noisy dataset are presented. + +My main concern is related to the experimental setting: The experiments in Section 5 compares + +(a) k-CV with (1-1/k)N training samples + N/k validation samples; with + +(b) Gradient Disparity (GD) using N training samples. + + +Why not also compare with + +(x1) Gradient Disparity (GD) using (1-1/k)N training samples; or maybe even + +(x2) use k-CV to determine stopping epoch, take an average to estimate the best stopping epoch \hat{n}, then re-train using all samples and stop at epoch \hat{n}. + +It seems unclear to me whether the advantage of GD comes from (i) better characterization of the generalization or (ii) sample size advantage. + +(x1) v.s. (a) and (x2) v.s. (b) could show whether GD better captures the generalization under same number of samples. + +(x1) v.s. (b) and (x2) v.s. (a) could show the benefit of increasing sample size. + +Finally, the experimental setting uses k-fold CV instead of a fixed validation set, and it is not clear what the standard deviation of the experimental results for k-CV means, e.g., the standard deviation describes (i) randomness due to data splitting; or (ii) randomness due to training algorithm? ",5,3.0,ICLR2021 +Ks2tW-5vOZ9,4,HZcDljfUljt,HZcDljfUljt,Effective idea with limited novelty,"This paper studies the effect of quantization during training together with batch normalization in quantized deep neural networks. The compound effect of convolution and batch normalization on the dynamic range of activations has implications on the progress of training. The authors propose a protocol for training a quantized neural network combining filter pruning, fine tuning and bias correction. 
The experiments show that the model size is reduced significantly, while keeping or improving the accuracy. + +Strengths +- The paper is clearly presented, and the effect of batch normalization in the dynamic range of a quantize layer is interesting. +- The method, in the paper settings, is effective in reducing the model size and sometimes improving accuracy. + +Weaknesses +- There is not qualitative analysis of the problem of the dynamic range and how the proposed method alleviates it. +- The novelty is limited in my opinion. The paper combines typical practices used in other works. For example, pruning filters with small magnitude is a common practice. it is also common to fine tune a network after pruning or compression, which not only reduces model size but often also improves the accuracy due to lower overfitting. And bias correction to account for the reconstruction error is also commonly used in similar cases (e.g. DFQ, Finkelstein2019, Masana2017). +- The number of iterations in the experiments is a fixed number. It would be more convincing using a validation set with early stopping. +- The pruning+fine tuning seems to be done once (twice in the proposed workflow). In that case the comparison may not be fair without several iterations of pruning+fine tuning, to compare models when they cannot be further pruned. + +Overall, I think the novelty of the paper is limited, with concerns about the experiments. + +Questions +Please address weaknesses + +Finkelstein et al., Fighting Quantization Bias With Bias, ECV@CVPR 019 +Masana et al., Domain-adaptive deep network compression, ICCV2017 + +-- Post rebuttal + +I appreciate the response by the authors and the new experiments. I also read the other reviews and responses. I think the paper has improved in the revised version. However, I'm still concerned about the novelty, which still remains relatively incremental, as also pointed by other reviewers. I update my rating to 5.",5,3.0,ICLR2021 +a3n5hP_ZZg2,2,Ux5zdAir9-U,Ux5zdAir9-U,Proposes a rule-based synthetic graph generator; Biases in the generation process lead to doubts about the paper's conclusions,"The authors propose a synthetic graph generator to evaluate graph neural networks. The generation process starts with defining rules, subset of rules are used to define a world, each world is then used to sample a graph. Test queries are generated by picking a pair of vertices u, v and generating a path connecting them via the rules. There are some biases in the generation process. For instance, the rules are exclusively open path or chain rules. And as noted above, the test queries mostly stick to a path (the authors allow some variations by adding nodes to vertices already on the path but the path seems to form the backbone of the test query). The remainder of the paper takes a few well known graph neural networks and evaluates them on data generated using GraphLog. Based on these results, the authors claim that E-GAT outperforms RGCNs. + +This reviewer acknowledges the need for a benchmark for knowledge base completion. However, GraphLog seems to have one too many biases baked into it for it to form a definitive benchmark. The first bias is the adherence to open path or chain rules. I don't understand the need for this. I can think of two different property nodes p1, p2 hanging off a vertex u leading to an edge with another vertex v. Can this be captured by a chain rule? In section 3, the paper states ""Path-based Horn clauses of this form ... 
encompass the types of logical rules learned by state-of-the-art rule induction systems"" and then goes on to cite \partial ILP and NeuralLP. \partial ILP captures more than just chain rules, it tries to capture recursive logic programs, and NeuralLP is based on TensorLog which is much more general than this. If you are going to cite related work then might as well cite it properly. Another bias baked into GraphLog is how it generates its test queries which is mostly a path between two vertices. I found myself wondering whether these biases are what's causing RGCNs to underperform on GraphLog generated data. Such doubts lead me to believe there's some gap in GraphLog that needs bridging. Please don't get me wrong, I'm sure GraphLog will still be used to evaluate KBC approaches but I remain unconvinced that it is general enough to warrant a full conference research paper. + +Writing quality wise, the presentation is clear enough with details delegated to the appendices. Related work about graph neural networks and knowledge base completion seems to be well covered. One aspect of related work that seems missing is previously proposed synthetic graph generators. For instance, did the authors try to look for generative approaches for scale-free random graphs? Are these not of interest to the KBC and graph neural networks communities for some reason?",4,4.0,ICLR2021 +jS9tykHuQj1,4,D51irFX8UOG,D51irFX8UOG,"Dense, but well motivated"," +Summary +--- + +Children generalize to new sights that combine known perceptual elements. +Children generalize to new instances of known abstract concepts like order and number. +Children know what actions they can take in new scenarios, because those actions have been available in similar contexts. +Machines should be able to generalize in the same ways. + +This paper proposes a new environment, HAMLA, where machines are tested on their +ability to generalize in all of these ways. +Agents navigate through a maze to a goal location using only carefully designed +signals extracted from carefully constructed mazes. +They must learn perception (MNIST digits, color, location), abstract concepts +(number, order), affordances (move up/down/left/right as available), and +efficient exploration strategies at once, like humans seem to be able to. + +Evaluation is dynamic, estimating what an agent already knows then generating +new test instances that test the model on something slightly outside what it knows. +There is no static test set or test environment. + +TD3 is used to train various agents based on different NN architectures +that incorporate more or less structure. +Agents often fail to navigate to the goal and do so efficiently in scenarios basically similar to those it was successful at. +Failure happens more often as the generalization gap becomes larger and when the agents must also learn perception in addition to concepts and exploration. + + +Strengths +--- + +The motivation is ambitious, interesting, and relevant. It makes sense. It pulls strongly from cognitive science. Explicitly attacking multiple specific modes of generalization at once is an interesting direction. + +This could establish a new baseline task that tests multiple kinds of generalization in a toy manner. + +The dynamically generated evaluation strategy is new and interesting. It may make it harder to compare performance across models, but clearer about how well a single model is actually performing. That could inspire a significant shift in evaluation methodology. 
+ +The paper is highly sylized and polished. + + +Weaknesses +--- + +1) The paper is hard to understand in just 8 pages. + +The writing is so dense and so many details are left to the 20 page appendix that it is hard to understand the approach or experiments at more than a very high level by reading just the main 8 pages of the paper. + +The main example here is the notation. Much of it is non-standard. It is also used frequently and most definitions are left for the appendix. + + +2) Novelty relative to some related work isn't clear. + +How does this compare to point-goal nav? [1] I think the test procedure and available senses make it different, but it's still fundamentally navigating to a goal location. Will scaling training help solve this problem as evaluated by the proposed metric? + +[1]: Wijmans, Erik et al. “DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames.” ICLR (2020). + + +3) Impact may be limited by difficulty. Tackling this whole problem may be too difficult right now. All approaches completely fail to solve the full problem including vision, as indicated by the bottom right of table 1, which is filled with 0s. If progress is limited to the symbolic setting then the problem is significantly less interesting. + + +Preliminary Evaluation +--- + +At the moment the paper is not very clear. That makes it hard to evaluate the quality of the experiments. The quality and novelty of the motivation is high, being fairly novel and interesting. Its significance is highly uncertain because of the paper's clarity and potential difficulty. The paper might be significant as either 1) a central reference for applying some cognitive science concepts to AI, 2) a benchmark that spurs new agent designs, or 3) inspiration for designing new evaluation metrics. + +My main uncertainty in this evaluation is because I haven't understood the paper in its full 30 pages of depth. + +",7,2.0,ICLR2021 +Byg8zj4RFS,2,H1gS364FwS,H1gS364FwS,Official Blind Review #1,"This paper deals with the important issue of information extraction from less-resourced languages. +In its current form, the paper has two main weaknesses: +- it is poorly written & organized +- it was a fairly weak empirical evaluation + +In order to address the first issue: +- the authors should significantly improve the quality of the prose, which can be confusing & difficult to undersrand +- the introduction needs to be significantly crisper: in its current form, it is far too general and does NOT describe the rest of the paper; the authors should explain ... + 1) what is the problem they are working on (currently present, but far too long) + 2) what is the proposed approach & why is it novel (missing) + 3) what are the main results & their significance (also missing) + +In order to address the second issue: +- 3.1 needs more details; it is this reviewer's understanding that the current corpus consists of 1065 documents (which is extremely small in size); how many sentences are there in these documents? how many on/off events? +- 3.2 is particularly difficult to follow +- 3.3 should not use 3-4 paragraphs on introducing NB, SVM, and Decision Trees; last but not least, why decision trees and not randomized forests or deep learning models? in the current form, it is also unclear whether the on/off event detection is performed at sentence or document level +- 4: what is the value of k in k-fold CV? 
+ +Last but not least, the authors could use the work below as inspiration on how to improve the overall quality of the paper +https://pqdtopen.proquest.com/doc/2025917601.html?FMT=ABS",1,,ICLR2020 +S1eu25Nqtr,2,Hyg5TRNtDH,Hyg5TRNtDH,Official Blind Review #1,"The authors present an algorithm for postprocessing neural networks to ensure calibration under domain shift. +Calibration under domain shift is an interesting challenge that has been receiving increasing attention and tackling this in an unsupervised manner is an interesting approach. However, I have 2 major concerns regarding the approach presented by the authors. + +What makes calibration under domain shift useful and appealing is that the model is then robust against any changes in the test distribution that can occur during the life cycle of a model. These often include erroneous/samples (corresponding to truly OOD samples), but also gradual domain shift, where the test distribution continuously moves away from the training distribution (e.g. due to a continuous drift in user behaviour/change in customer base) or unforeseen changes. My first major concern is regarding the requirements for UTS, which render this approach not very useful in many of these practical applications: UTS first requires knowledge of and access to the test distribution; in addition it assumes that the distribution of the labels remains unaffected under domain shift. These assumptions are violated in the practical applications described above, in particular those where a gradual, continuous domain shift occurs - in this case, access to the test distribution is difficult since it changes continuously. On this note I also would have liked to see some analysis on how performance depends on the number of samples that are available from the test set, since in practice this might be substantially smaller than the full test set used. +Furthermore, I find the assumption that the distribution of labels remains unchanged problematic (q_s(y) = q_t(y) and even q_s(y|x)=q_t(y|x)): once sufficiently out-of-domain, labels become meaningless and predictions for truly OOD samples should have maximum entropy. Even for small domain shifts in practical applications it is not clear why q_s(y|x)=q_t(y|x) should hold and it would have been useful to see a discussion and some robustness analysis on this. +Finally, the algorithm requires re-calibration whenever the test distribution changes, which in practice is often not clear (and part of the reason why dealing with predictions under domain shift is so challenging). + +In addition to doubts on practical applicability, my second major concern is regarding the depth of the evaluation. +First, while the authors present some comparisons to probabilistic methods, I am missing a crucial comparison to Evidential Deep Learning (Sensoy et al, NeurIPS 2018), which results in far superior performance than deep ensembles, SVI or dropout. Importantly, the comparisons to probabilistic approaches presented by the authors are very limited. The big advantage of those approaches is that, once trained, no further recalibration is necessary and well calibrated predictions can be made for any level of domain shift, whereas UTS requires a recalibration step for very level of domain shift. 
That is why I think it is crucial to not only show one arbitrarily picked level of domain shift for each dataset/perturbation, but calibration across all levels of domain shift, as for TS and TS-Target; since no recalibration is required for those probabilistic approaches + this is very straight-forward and would be very informative - especially since e.g Figure 5 shows that UTS has only very minor advantages over TS in many settings. +I appreciate that the authors report some performance in terms of ECE in the supplement, but I think it would be very informative to report performance in terms of ECE for all domain-shift experiments: The Brier score conflates accuracy with calibration (see eg the 2 component decomposition), whereas ECE directly quantifies calibration and is hence easier to interpret and arguably the more meaningful measure when quantifying calibration. + +Minor: I find the manuscript lacks clarity. Aspects such as the definition of calibration as well as implications and interpretation of Proposition 1 should be described in more detail in the manuscript. +",1,,ICLR2020 +rkeaNDmTKS,1,SkgGCkrKvH,SkgGCkrKvH,Official Blind Review #1,"This paper studies the convergence of CHOCO-SGD for nonconvex objectives and shows its linear speedup while the original paper of CHOCO-SGD only provides analysis for convex objectives. The momemtum version of CHOCO-SGD is also provided although no theoretical analysis is presented. + +Extensive empirical results are presented in this paper and the two use cases highlight some potential usage of the algorithm. However, there some concerns which could be addressed. + +First, the authors only provide analysis on CHOCO-SGD but the comparison with baselines are based on their momemtum versions. Moreover, some highly relevant baseline like DeepSqueeze are not cited and compared. Thus, the advantage of vanilla CHOCO-SGD over other alternatives is not convincing. + +Second, the cores of decentralized optimization include minimization of objective and consensus of the solution. However, no evaluation of the consensus is presented and this leads to the following point. + +Third, it seems the authors report the average performance over all nodes using their individual model. If this is the case, the reported perfromance and comparison are not convincing. Without consensus, different nodes can have individual minimizer. In this case, the obtained average loss can be even smaller than the optimal loss. Under current measurement, if we run SGD on each worker individually without any communication, we will still get pretty good performance but this does not achieve the goal of decentralized optimization. Further clarification on this is needed. + +Overall, I think the technical contribution of this paper is unclear and the evaluation is not convincing. +",3,,ICLR2020 +Byx4wYANtr,1,H1lhqpEYPr,H1lhqpEYPr,Official Blind Review #2,"Summary and Decision + +This paper studied an actor-critic learning algorithm for solving a mean field game. More specifically, the authors showed a particular actor-critic algorithm converges for linear-quadratic games with a quantitative bound in Theorem 4.1. Notably, results on learning algorithms for solving mean field games without prior knowledge of the parameters are rare, so results of this type is highly desirable. However, the algorithm studied in this paper is uniquely tailored for the (very special!) linear-quadratic setting, which is very unsatisfying. We will discuss concern this in detail below. 
+ +Overall, I recommend a weak accept for this paper. + +Background + +Mean field games is a theory of large population games, where we assume each agent has infinitesimal contributions to the dynamics in the limit as number of agents go to infinity. Similar to mean field theory of particles, the limiting dynamics can be completely characterized by a single distribution of agents, commonly know as McKean-Vlasov dynamics. The theory drastically simplifies computation for large population games: while it is essentially impossible to find a Nash equilibrium for a 100 agent game, we can compute the Nash equilibrium for the mean field limit and approximate the finite game. + +Mathematically, mean field games remain very difficult to solve even knowing the parameters and dynamics. Therefore it is often important to first study a simple case where we can solve the game analytically. In the context of optimal control and mean field games, we can often recover closed form solutions (up to the solution of a Riccati equation) when the dynamics are linear and the cost is quadratic. We call this class of games linear-quadratic mean field games (LQ-MFG). To interpret the LQ assumption, typical control problems in this setting can be recast into a convex optimization problem in the control (or strategy) using convex analysis techniques. Therefore LQ assumptions provides both theoretical and computational tractability. + +Here we will specifically note the paper of Elliot, Li, and Ni, where we can find a closed form solution of the discrete time LQ-MFG with finite horizon. +https://arxiv.org/abs/1302.6416 + +Furthermore, we will also distinguish between games with a finite horizon and infinite horizon. While there are difficulties associated with both cases, typically an ergodic infinite horizon problem removes the time variable from the equation, making the problem slightly easier. Hence many researchers in MFG prefer to begin by studying the ergodic problem. + +In the context of reinforcement learning, we are more interested in solving MFG without knowledge of underlying parameters, dynamics, or even the cost function. This direction is still relatively new for the MFG community, and many problems remain open. The ultimate goal of this line of research is to develop generic and scalable algorithms that can solve general MFGs without knowledge of the game parameters/dynamics/cost etc. + + +Discussion of Contributions + +This work is the first analysis of actor-critic algorithms for solving MFG. At the same time, the paper studies discrete time MFGs, which is generally less popular but no less interesting. Therefore a theoretical convergence result in this setting is highly desired. + +Overall, the mathematical set up of this problem is very convoluted. This likely motivated the authors to make more simplifying LQ type assumptions to recover stronger results. Even with these assumptions, to put all the pieces of the puzzle together is no easy task. The authors have to consider the interaction between the agent state and the mean field state, as well as the estimation of optimal controls and how to bound errors from estimation error. This led to a long appendix of proofs - while too lengthy to verify, the results seem sensible. + +From this, I believe the mathematical analysis itself is a worthy contribution. This paper will serve as a good starting point for future analysis of more complex problem settings and other learning algorithms in MFGs. 
+ + +Discussion of Limitations + +The main concern regarding this paper is on quantifying how much of a contribution the results add to the broader community. While results on LQ-MFGs are always nice to have, I believe the specific actor-critic algorithm depends too much on the LQ structure for this work to be useful. Two examples of these are: + +1. on page 5, above equation (2.1), the actor-critic algorithm will be only seeking policies that are linear in the state x. This is taking advantage of the fact that we know LQ-MFGs have linear optimal policies. +2. on page 7, below equation (3.7), the algorithm requires to know the form of the gradient of the cost function - and therefore leading to a direct estimation of matrices \Upsilon that form the gradient. This is only possible in LQ-MFGs. + +Therefore, results from this paper will be very difficult to generalize to other actor-critic algorithms for LQ-MFGs. At the same time, it's also difficult to generalize these results to the same actor-critic algorithm for non-LQ-MFGs. + +We also note that if we can assume knowledge of the LQ form of underlying dynamics and the form of the cost function, but no knowledge of the parameters, the problem reduces down to a parameter estimation problem. In this case, we can speculate the results of Elliot, Li, and Ni can be adapted to the ergodic case, and we can recover the approximate optimal controls given estimated parameters. Furthermore, in some sense, this particular actor-critic algorithm is implicitly estimating a sufficient set of parameters (the \Upsilon matrices) to find the optimal control. Essentially, if we rely too much on the LQ structure, the problem is then rendered much less interesting. + +In summary, the ultimate goal of this line of research is to approximately solve non-LQ games, therefore the value of this current paper is very incremental in the larger context of learning MFGs. While serving as a reference for future analysis of related algorithms for MFGs, it will be difficult to borrow concrete ideas and results from this paper to build on. + +*Edit:* I believe the discussions below have address several important concerns, and I will raise my score to accept.",8,,ICLR2020 +HkgFSQ1j2Q,3,HyGDdsCcFQ,HyGDdsCcFQ,Simple and interesting work,"Thanks for the rebuttal. But, I am still not very convinced with the proposed results. For CIFAR-100 (0%), you get about 0.2% gain, for ImageNet (0%), you get about 0.2% loss in top-5 accuracy, and for WebVision, you get about 0.3% gain. I am not sure whether you can call these as statistically significant gains. I believe such gain/loss can be obtained with many other tweaks, such as the learning rate scheduling, as the authors have done. + +I believe extensive testing the proposed method on many real noisy datasets, not the synthetically generated ones, and showing the consistent gains would much strengthen the paper. But, at the current version, the only such result is Table 5, which is, again, not very convincing to me. + +So, I still keep my rating. + +======= + +Summary: + +The authors propose a simple empirical method for cleaning the dataset for training. By using the implicit regularization property of SGD-based optimization method, the authors come up with a method of setting a threshold for the training loss statistics such that the examples that show losses above the threshold are regarded as noisy examples and are discarded. 
Their empirical results show that ODD (their method) can outperform other baselines when artificial random label noise is injected. They also show ablation studies on the hyperparameters and show the final result seems to be robust to those parameters. + +Pros: +- The method is very simple +- The empirical results, particularly on the synthetic noisy training data, seems to be encouraging. +- The ablation study argues that the method is robust to the hyperparameters, p, E, and h. + +Cons: +- I think the results remains to be highly empirical. While it is interesting to see the division of the loss statistics in Figure 2, I am not very convinced about the real usage of the proposed method. The result in Table 5 shows that ODD can outperform ERM for real world datasets, but the improvement seems to be marginal. Moreover, the hyperparameter p was set to 30 for that experiment, but how did the authors choose that parameter? Clearly, if you choose wrong p, I think the performance will degrade, and it is not clear how you can choose p in real applications. The ablation studies are only with synthetic noisy label data, so I think the result is somewhat limited. +- + +I think the paper shows interesting results, but my concern is that it seems to be quite empirical. The positive results are particularly on the synthetic data case. + + +",5,4.0,ICLR2019 +H16Hn-6lf,1,ry80wMW0W,ry80wMW0W,Novel model to discover subtasks in probabilistic planning (LMDP).,"This paper proposes a formulation for discovering subtasks in Linearly-solvable MDPs. The idea is to decompose the optimal value function into a fixed set of sub value functions (each corresponding to a subtask) in a way that they best approximate (e.g. in a KL-divergence sense) the original value. + +Automatically discovering hierarchies in planning/RL problems is an important problem that may provide important benefits especially in multi-task environments. In that sense, this paper makes a reasonable contribution to that goal for multitask LMDPs. The simulations also show that the discovered hierarchy can be interpreted. Although the contribution is a methodological one, from an empirical standpoint, it may be interesting to provide further evidence of the benefits of the proposed approach. Overall, it would also be useful to provide a short paragraph about similarities to the literature on discovering hierarchies in MDPs. + +A few other comments and questions: + +- This may be a fairly naive question but given your text I'm under the impression that the goal in LMDPs is to find z(s) for all states (and Z in the multitask formulation). Then, your formulation for discovery subtasks seems to assume that Z is given. Does that mean that the LMDPs must first be solved and only then can subtasks be discovered? (The first sentence in the introduction seems to imply that there's hope of faster learning by doing hierarchical decomposition). + +- You motivate your approach (Section 3) using a max-variance criterion (as in PCA), yet your formulation actually uses the KL-divergence. Are these equivalent objectives in this case? + + +Other (minor) comments: + +- In Section it would be good to define V(s) as well as 'i' in q_i (it's easy to mistake it for an index). ",6,2.0,ICLR2018 +uqa9HExYji7,2,Oj2hGyJwhwX,Oj2hGyJwhwX,Review,"1. 
Summary + +The paper presents two new methods to improve corruption robustness and domain generalization: SelfNorm, a way to adapt style information during inference, and CrossNorm, a simple data augmentation technique diversifying image style in feature space. Both methods are tested on Cifar10/100-C, ImageNet-C, Semi-Supervised Cifar and GTA V -> Cityscapes. + + +2. Strengths ++ The approach is straight forward and apparently very effective ++ The evaluation is performed reasonably ++ The method is tested not only on corruptions but also on domain adaptation. This is great as both tasks can be seen as two sides of the same metal and it is great to see more methods testing on both. + +3. Weaknesses +- The main problem is the write up. In the introduction ""texture sensitivity"" is introduced as a concept but never picked up later on. In general the motivation is very high level drawing a lot on the concepts of style, texture and content but the method itself is rather down to earth modifying instance normalization parameters. In my opinion the high level arguments and colorful terms (style, shape sensitivity, unity of opposites etc.) distract rather than add to the story. They may serve as inspirations but without extensive experiments it is hard to related concepts like style to the manipulation of network parameters. +- The ablation study was hard to follow and tbh I think it could have been shortened presenting only the most important results in the paper and the rest in the appendix. +- Conversely I felt the paper would profit from more figures and visualizations (e.g. Figure 7). The method is very simple and I think that is a major strength. It should thus also be possible to present it in a more concise form. + + +4. Recommendation + +As it is right now I think the paper has to be rejected because the write up is just too chaotic and vague. I do however like the method a lot and I would vote for accept if it was rewritten substantially to focus more on an understandable presentation of the method than abstract concepts and colorful terms. + + +4. Questions/Recommendations +- It is probably beyond the scope of the review period but a demonstration of the technique on some non image data would be amazing making the theoretical argument in Section 4.1 that the method can work on other domains much more powerful. +- It would be nice to include a discussion of ""Improving robustness against common corruptions by covariate shift adaptation"". + +6. Additional feedback +- Figures 1 and 2 are a bit hard to understand. +- IMHO it would be more interesting to include Figure 7 in the paper and move Figure 3 into the appendix. +- The use of IBN should be mentioned in Table 2",7,4.0,ICLR2021 +8O6NvQuedV,3,Pbj8H_jEHYv,Pbj8H_jEHYv,Another parameterization for Orthogonal Convolutional Layer,"The paper provides another parameterization for orthogonal convolutional layers using the Cayley transform, different from BCOP. To the best of my knowledge, this parameterization is novel. However, I have a few questions regarding the proposed method. + +(1) For 1D-convolutional layers, BCOP is a complete characterization. From the paper, the parameterization is not complete since the eigenvalues are all +1. While it is possible to multiply a diagonal matrix with either +1 or -1 entries, it is not clear such multiplication closes the gap. I am curious whether the composition is complete; otherwise, the proposed parameterization is strictly weaker than BCOP. 
+ +(2) For 2D-convolutional layers, I believe both BCOP and the proposed method using the Cayley transform are incomplete. So I am curious whether the proposed parameterization is a proper superset of BCOP. Without the argument, it is vague to state the proposed method is more expressive than BCOP --- the better results could come from optimization instead of parameterization. That being said, the paper is still interesting if the proposed parameterization covers a different subset of all orthogonal 2D-convolutional layers (i.e., neither a superset nor a subset of BCOP). In this case, the authors need to characterize the difference between these two subsets. Which layer can be parameterized by Cayley transform but not BCOP, and vice versa? + +If the authors can clarify the questions, I will definitely increase my score. Others are minor comments: + +(3) Comparing computational complexities for different methods, RKO, OSSN, SVCM, BCOP, and Cayley transform is desired. + +(4) The BCOP includes the experiment of Wasserstein distance estimation. Empirically, it is better to show the proposed method is better than BCOP in various scenarios if a theoretical justification is too hard, if not impossible. + +The questions above are well addressed in the response, and I would like to increase my score. ",7,4.0,ICLR2021 +SyeZPBHTYB,1,HJlSmC4FPS,HJlSmC4FPS,Official Blind Review #2,"This paper proposed to remove all bias terms in denoising networks to avoid overfitting when different noise levels exist. With analysis, the paper concludes that the dimensions of subspaces of image features are adaptively changing according to the noise level. An interesting result is that the MSE is proportional to sigma instead of sigma^2 when using bias-free networks, which provides some theoretical evidence of advantage of using BF-CNN. + +One main practical concern is that only Gaussian noise is considered in this paper which provides good theoretical analysis. It would be interesting to see if this BF-CNN is extendable to more noise types. + +One unclear statement is that for UNet, the Lemma 1 seems to be no longer valid as skip connections exist. Can you provide additional proof that the scale invariance still holds in this case? + +",6,,ICLR2020 +N5y1p8nA2_,4,9l0K4OM-oXE,9l0K4OM-oXE,Unconvincing results,"This paper presents an empirical study on the backdoor erasing in CNN via teacher-student alignment of the attention maps. + +S1: An unexpected way to erase backdoors without the need for additional information (provided that the experiments are done correctly). + +W1: The experimental settings and results are highly doubtable. +W2: The findings lack theoretical support. + +Detailed comments: + +The training progress and the degree of overfitting of the teacher model are important but not clarified in the paper. In figure 4, it shows the four possible combinations of teachers and students. But I can’t understand why ""4) C teacher and B student)"" can work well too. Isn’t it prone to overfit? Maybe you can draw a learning curve for these 4 methods as well. + +Please explain the possible source of information gain. For example: do the teacher and the student models use the same set of augmentation data? How does your approach compare with a baseline where only the teacher model is used and trained with both the data seen by the student and teacher models in your approach? + +Please visualize the attention maps when the model is overfitting and when it is not. 
This helps people better understand the behavior of the NAD. + +Please describe your data augmentation strategies in detail. What are their settings? Is the data augmentation methods used in each epoch or just at the beginning of the training process? + +Overall, I am not convinced by the claims and results. The current results may be simply due to different degrees of overfitting or different data augmentation strategies. The author is encouraged to conduct more experiments to clarify this. + + +",6,4.0,ICLR2021 +r1cviMjgf,2,r154_g-Rb,r154_g-Rb,review,"- This paper proposes a framework where the agent has access to a set of user defined attributes parametrizing features of interest. The agent learns a policy for transitioning between similar sets of attributes and given a test task, it can repurpose its attributes to reactively plan a policy to achieve the task. A grid world and tele-kinetically operated block stacking task is used to demonstrate the idea + +- This framework is exactly the same as semi-MDPs (Precup, Sutton) and its several generalizations to function approximators as cited in the paper. The authors claim that the novelty is in using the framework for test generalization. + +- So the main burden lies on experiments. I do not believe that the experiments alone demonstrate anything substantially new about semi-MDPs even within the deep RL setup. There is a lot of new vocabulary (e.g. sets of attributes) that is introduced, but it dosen't really add a new dimension to the setup. But I do believe in the general setup and I think its an important research direction. However the demonstrations are not strong enough yet and need further development. For instance automatically discovering attributes is the next big open question and authors allude to it. + +- I want to encourage the authors to scale up their stacking setup in the most realistic way possible to develop this idea further. I am sure this will greatly improve the paper and open new directions of researchers. + +",4,5.0,ICLR2018 +BJgqy28cTQ,3,BJlxm30cKm,BJlxm30cKm,Extra Review: Excellent paper which thoroughly explores a very interesting question,"This is an excellent analysis paper of a very interesting phenomenon in deep neural networks. + +Quality, Clarity, Originality: +As far as I know, the paper explores a very relevant and original question -- studying how the learning process of different examples in the dataset varies. In particular, the authors study whether some examples are harder to learn than others (examples that are forgotten and relearned multiple times through learning.) We can imagine that such examples are ""support vectors"" for neural networks, helping define the decision boundary. + +The paper is very clear and the experiments are of very high quality. I particularly appreciated the effort of the authors to use architectures that achieve close to SOTA on all datasets to ensure conclusions are valid in this setting. I also thought the multiple repetitions and analysing rank correlation over different random seeds was a good additional test. + +Significance +This paper has some very interesting and significant takeaways. +Some of the other experiments I thought were particularly insightful were the effect on test error of removing examples that aren't forgotten to examples that are forgotten more. In summary, the ""harder"" examples are more crucial to define the right decision boundaries. I also liked the experiment with noisy labels, showing that this results in networks forgetting faster. 
+ +My one suggestion would be to try this experiment with noisy *data* instead of noisy labels, as we are especially curious about the effect of the data (as opposed to a different labelling task.) + +I encourage the authors to followup with a larger scaled version of their experiments. It's possible that for a harder task like Imagenet, a combination of ""easy"" and ""hard"" examples might be needed to enable learning and define good decision boundaries. + +I argue strongly for this paper to be accepted to ICLR, I think it will be of great interest to the community.",9,5.0,ICLR2019 +SyozK3HVe,3,SkB-_mcel,SkB-_mcel,Good work with some limitations,"The work introduces a new regularization for learning domain-invariant representations with neural networks. The regularization aims at matching the higher order central moments of the hidden activations of the NNs of the source and target domain. The authors compared the proposed method vs MMD and two state-of-art NN domain adaptation algorithms on the Amazon review and office datasets, and showed comparable performance. + +The idea proposed is simple and straightforward, and the empirical results suggest that it is quite effective. The biggest limitation I can see with the proposed method is the assumption that the hidden activations are independently distributed. For example, this assumption will clearly be violated for the hidden activations of convolutional layers, where neighboring activations are dependent. I guess this is why the authors start with the output of dense layers for the image dataset. Do the authors have insight on if it is beneficial to start adaptation from lower level? If so, do the authors have insight on how to relax the assumption? In these scenarios, if MMD has an advantage as it does not make this assumption? + +Figure 3 does not seems to clearly support the boost of performance shown in table 2. The only class where the new regularization brings the source and target domain closer seem to be the mouse class pointed by the authors. Is the performance improvement only coming from this single class? ",6,4.0,ICLR2017 +rGrCw41wlM,2,TiXl51SCNw8,TiXl51SCNw8,A technique to serialize multi-bit computations and focus on generating high accuracy binary networks using sparsity,"Quantization of weights in DNNs is a very effective way to reduce the computational and storage costs which can enable deployment of deep learning at the edge. However, determining suitable layer-wise bit-widths while training is a difficult task due to the discrete nature of the optimization problem. This paper proposes to utilize bit-level sparsity as a proxy for bit-width and employ regularization techniques to formulate the problem so that precision can be reduced while training the model. + +The method proposed by the authors is sound. It leverages insights that have been employed in a neighboring area (pruning via regularization) and re-purposes those to the problem of quantization. The empirical evaluation is robust as well. + +One issue I have with the proposed method is that the parameter space is expanded by a large amount. Since for every scalar weight, we end up with a collection of binary weights. Doesn't this make training more difficult? It would be nice to discuss this issue. And more importantly how does the extra effort compare to other approaches (such as Dorefanet and others). + +Regarding the proposed regularization technique. 
Lasso (least absolute shrinkage and selection operator) is, as far as I am aware, an optimizer that regularizes the L_1 norm of the parameters. Why is the regularizer in eq. (4) using the L_2 norm? Maybe I am missing something and/or there is an inconsistency is the notation/wording. + +The authors do a good job comparing with related works. However, one of the main early claims is that all works trying to find per-layer precision do so manually. This is not true, there have been some works that have done exactly that. One example is [1] which analytically determines precisions at all layers using a noise gain concept. It would be nice to contrast with such works as well. + +Minor issue: +'comp x' is used in the results (tables) without being defined. It appears to indicate 'compression ratio'. This has to be explicitly stated at least once (maybe in the captions). + +references: +[1] Sakr, Charbel, and Naresh R. Shanbhag. ""Per-tensor fixed-point quantization of the back-propagation algorithm."" ICLR 2019.",6,4.0,ICLR2021 +SygZ4flAYH,1,B1l8L6EtDS,B1l8L6EtDS,Official Blind Review #1,"To alleviate the issues of reward sparsity and mode collapse in most text-generation GANs with a binary discriminator, this paper proposes a self-adversarial learning (SAL) framework with a novel comparative discriminator that takes pairs of text examples from real and generated examples and outputs better, worse, or indistinguishable. Inspired by self-play in reinforcement learning, SAL employs self-play to reward the generator to generate better samples than previous samples with self-improvement signals from the comparative discriminator. It is argued that, because the comparative discriminator always produces self-improvement signals during the training and the self-improvement signal will not be very strong when generated samples are already good enough, the issues of reward sparsity and mode collapses in conventional text GANs are reduced. Experimental results on synthetic data and benchmark datasets demonstrate that SAL outperforms SeqGAN, MaliGAN, and RankGAN both quantitatively and qualitatively. + +Pros: + +This paper is well-written. The studied problem is well-motivated and the method is clearly presented. The SAL framework with self-play and a comparative discriminator is novel and the results are convincing. + +Cons: + +1) SAL has a new discriminator, which can be viewed as an architecture change. Although it has a new training strategy and is very different from recent text GANs such as LeakGAN and RelGAN, the latest results of these recent methods on the benchmark datasets should be included in Table 4 and 5 for reference. + +2) The comparative discriminator is novel and well-suited for comparing pairs of samples. However, the only informative signals come from pairs of data with one from a real sample and the other from a generated sample. During the self-play process, why the signals from the comparative discriminator comparing two generated samples are always trustworthy? Do the reward signals always help train a better generator? + +3) In section 4.3, since the self-play and the comparative discriminators are shown to be the most significant, it is better to clearly show the algorithms as in Algorithm 1 for training these two baselines either without self-play or without the new discriminator. Does replacing the binary classifier with WGAN classifier help here? All these details should be included in the appendix. 
+ +4) Missing training details: It is unclear how the model architectures are chosen, and learning rate, optimizer, training epochs etc. are also missing. All these training details should be included in the appendix. + +In summary, this paper proposes a novel discriminator with a new training strategy suited for the discriminator for text generation. Experimental results demonstrate the effectiveness of the proposed framework. I like the idea in the paper and am happy to vote for acceptance. +",8,,ICLR2020 +xLp5A3Mxfxe,2,tu29GQT0JFy,tu29GQT0JFy,Relevant contribution to MNAR data imputation with potential for improvement in experimental validation,"Overall I very much enjoyed reading this paper! The manuscript is very well written, the related work section is sound, the motivation is clear, the methodology is well formalized and the experimental validation is strong in some aspects. + +The formalisation of the missingness process is very intuitive. Also the literature referenced provides one of the most comprehensive overviews on the topic in the statistics community, on the deep learning side it seems that there is a lot of research covered on the side of deep learning latent variable models based on variational auto encoders (VAEs), but the complementary work on Generative Adversarial Networks (GANs) appears to be less well covered. I’m myself more familiar with the VAE approach, and the authors correctly mention the implicit assumptions of GAN approaches, but as GANs offer an intuitive parametric and flexible way of modelling the missingness mask, I guess it would be helpful to see them in the comparison. + +In the first paragraph on page 5 the authors mention that the approach with the reparametrization only works if the data is continuous. I might be missing something, but rather than using a sampling based approach as in Mohamed et al, referenced by the authors, maybe it would be an option to use something like the Gumbel Softmax? In the experimental section the authors use that approach for the recommender system data, with limited success, it seems, compared to a plain gaussian likelihood. But conceptually it would be good to comment on why that’s not possible? + +My main concerns with this work is the experimental validation. The experimental settings explored are UCI data sets, image data and recommendation systems. It’s great that the authors provide such a comprehensive and heterogenous experimental validation in terms of data sets. What I found a bit limiting is that the not-missing-at-random process was so simple and restricted, and that the missingness ratio was not explored systematically. Many studies on imputation use experimental validation that explores missingness ratios between 0 and 100% of the values and in particular the MAR and MNAR settings are explored by, e.g. sampling a random quantile of a feature to condition the missingness on. This is very simple to implement but would allow for a much more realistic account of missingness structure. Especially for a study like this, which makes a presumably important and strong contribution to the field of missing data imputation I would recommend to demonstrate the effectiveness of the proposed approach by a more realistic experimental setting. + +Another recommendation would be to highlight the advantage of the proposed approach with more synthetic data experiments as in figure 1b, maybe one linear and one non-linear data manifold. 
That would allow to control for more parameters like the rank of the data and the noise, their covariance structure (strength and independence of features and noise, respectively). But that’s not really necessary I think, the authors did a great job with figure 1b! ",6,3.0,ICLR2021 +ryldwtXVKB,1,Bkf4XgrKvS,Bkf4XgrKvS,Official Blind Review #2,"This paper proposes a method to summarize a given graph based on the algebraic multigrid and optimal transport, which can be further used for the downstream ML tasks such as graph classification. +Although the problem of graph summarization is a relevant task, there are a number of unclear points in this paper listed below: + +- In Section 3.1, the coarsening method has been proposed, which is said to be achieved by finding S such that A_C = S^T A S. + However, A_C is usually not binary for S \in R^{n x m}, hence how to get the coarse graph G_C from A_C is not clear. Please carefully explain this point. +- In the proposed method, coarse nodes should be selected beforehand. Is there any guideline of how to choose them? +- In Section 3.2, optimal transport is introduced and the distance between G and G_C is measured via entropic optimal transport in Equation (4) or (7). + However, in Equation (4), a and b should come from the input G and G_C , and it is not clearly explained how to obtain them from the input. + Moreover, how to use the distance between G and G_C in the proposed coarsening method is also not clear. It seems that it is not used in Algorithm 1. +- I do not understand why the k-step optimal transport distance is needed. Since it converges to the global optimum as k becomes large, it is usually enough to set k to be large enough. +- In experiments, how is the proposed method used for graph classification? + Since the proposed method is for generating coarse graphs in an unsupervised manner, graph classification cannot be directly performed by itself. +- In addition to the above issue, to assess the effectiveness of the the proposed method, the following experiment is recommended: + Fix some classifier and compare performance of graph classification for the original graphs and for the coarse graphs. +- In the qualitative study in Section 4.4, while the authors discuss coarse nodes, they are just an input from the user and results are arbitrary. Hence such discussion is not informative. + +Minor comments: +- What is ""X"" in Equation (2)? +- I recommend to write domain for matrices when they used at the first time. +",3,,ICLR2020 +klQAnPwonZ,2,ueiBFzt7CiK,ueiBFzt7CiK,"Novel approach to explain DNN-based graph algorithms, but experiments can be improved and the explainer model seems flawed.","Summary: +This paper proposes a framework to discover graph-algorithms that are learned by neural networks to solve combinatorial optimization problems. To this end, the authors propose (a) augmenting the input of DNN-solver using features extracted by existing combinatorial algorithms and (b) explaining the DNN based on an additional ""explainer"" model trained based on maximizing the lower bound of mutual information between subgraph of the input and the labels predicted by the DNN solver. I think this paper pursues an important and promising direction to extract algorithms from DNN-based solvers. However, I think (a) additional baselines should be incorporated for evaluating the DNN-solver, (b) the proposed explainer does not generate practically useful outputs for discovering new algorithms, and (c) the proposed explainer seems a bit flawed. 
+
+Strength:
+- The proposed augmentation scheme gives interesting insights and can be applied to other DNN-based solvers for combinatorial optimization.
+- This paper tackles the interesting problem of discovering new algorithms from DNN-based solvers for combinatorial optimization.
+
+Weakness:
+- It is not clear why the authors only consider baselines with polynomial running time. To show that the proposed DNN-based solver is practically useful, the authors should compare with state-of-the-art solvers (both DNN-based and non-DNN-based) under a limited running time. To compare the algorithms based on the trade-off between complexity and approximation ratio, the authors should provide a theoretical analysis of the approximation ratio of the proposed algorithm.
+- It is not clear how to use the proposed algorithm for discovering new graph algorithms. In particular, the algorithm produces results that are not very ""explainable."" For example, how did the authors use the explainer to ""re-discover the node-degree greedy algorithm?"" The proposed framework seems to assume that humans can easily analyze the provided explanations (i.e., graphs with colored nodes). However, it seems hard for me to analyze the pattern in the node degrees just by glancing at several explanations provided by the proposed framework. Such a process becomes even harder if we aim to discover algorithms based on novel concepts (instead of node degree). Moreover, researchers are usually interested in developing algorithms for large-scale graphs (where exact solutions are intractable), yet in that case the explanations become notoriously hard for humans to analyze. The authors are encouraged to describe the actual process by which they extracted the analysis from the explanations, e.g., did they (1) look at hundreds of explanations provided by the explainer, (2) intuitively group explanations sharing a common pattern in node degree, and (3) infer a greedy pattern in the node degrees of the selected nodes?
+- The proposed explainer seems flawed as a way of explaining the DNN-based solver. Namely, the proposed explainer only accesses the DNN-based solver through its prediction probabilities. However, I do not think it makes sense to rely only on the predictions to explain a black-box algorithm. To demonstrate: both a brute-force search and an integer programming solver will give an identical prediction (i.e., the optimal solution) for the combinatorial optimization problem, so applying the proposed explainer to the two algorithms would give an identical explanation, even though brute-force search and integer programming operate in very different ways.
+
+---
+
+I appreciate the thoughtful rebuttal provided by the authors. My main concerns are on Q2, i.e., the practical usefulness of the algorithm. I do think that the authors provide a convincing argument on ""we can only understand what we can understand,"" hence we should set up a hypothesis and see if it aligns with the explanation. However, I think the usefulness is not well supported in the current state of the paper. The authors could come up with (a) a stronger example of such a hypothesis and (b) a better measurement of how well the hypothesis aligns with the explanation to strengthen the paper.
+
+Regarding Q3, I still think that it is not correct to provide the same explanation for different algorithms when they produce the same output (a toy illustration of this point is given below). Hence, the proposed algorithm should be modified to consider this aspect.
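+
+To spell out the Q3 point with a toy example of my own (not from the paper): any explainer that can only query output probabilities is, by construction, a function of those outputs alone, so two solvers that agree on every instance must receive identical explanations even if their internal computations are completely different.
+
+def solver_a(xs):   # stand-in for brute force: sort, then take the last element
+    return sorted(xs)[-1]
+
+def solver_b(xs):   # stand-in for a cleverer algorithm: a single linear scan
+    best = xs[0]
+    for v in xs[1:]:
+        if v > best:
+            best = v
+    return best
+
+def output_only_explainer(solver, instances):
+    # an 'explanation' that, like the explainer criticized above, depends only on the solver's outputs
+    return [solver(x) for x in instances]
+
+instances = [[3, 1, 4], [1, 5, 9, 2], [6]]
+assert output_only_explainer(solver_a, instances) == output_only_explainer(solver_b, instances)
+print('identical outputs, hence identical explanations, despite very different algorithms')
+
+An explainer that also probed, e.g., intermediate node embeddings or gradients of the solver would not collapse in this way, which is why I think the prediction-only design needs to be revisited.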
+
+",4,3.0,ICLR2021
+9aWUUo2nKJO,4,0EJjoRbFEcX,0EJjoRbFEcX,"Interesting study of learned neural features, with premature conclusion due to experiments on limited data","Motivated by the need to assess a classifier's confidence in its decision, the paper proposes to use the predicted likelihood of the learned features of an unknown sample, and shows that samples with low feature likelihood tend to receive a wrong decision. Furthermore, the paper argues that using a GMM to model the feature distributions is better than other methods like a VAE or an AR flow.
+
+This study is interesting in the sense that it attempts to analyze the distributions of the learned features, which is much needed for a better understanding of what a deep neural net attempts to do. The findings appear to make good sense, as the network training process is designed to push samples of the same class towards tight clusters when mapped to the learned feature space. Those lying in the outskirts signal difficulties in this process, and therefore they are likely to get erroneous decisions. The feature distributions are modeled class-blind, so that the prediction can be applied to unseen cases without class labels.
+
+The work could be improved by performing this analysis at different stages of a deep network, as one expects the levels closer to the output layer to demonstrate more of this clustering effect. It would be useful to confirm this with such an analysis.
+
+Beyond the numerical results, it would be more convincing to show example images that are identified by this method as having unreliable decisions, together with the errors the classifier makes on them.
+
+A more significant weakness is that the experiments are done only with an image classification problem, and a single dataset. This makes the conclusion somewhat premature, as images of physical objects tend to form good patterns. Is the GMM estimation good just because the data happen to be well clustered? Will this conclusion be confirmed by classification tasks on other types of data, e.g. text? What if the class labels are scrambled? What makes some network settings better than others in learning such confidence-suggestive features?
+
+Misc.:
+
+p.1, line 4 from bottom: ""... capture the distribution of the learned feature space"" -> ""capture the class-conditional distributions in the learned feature space"".
+
+p.4, line 5: ""... state-of-the-art models assigning higher likelihoods to samples ..."" -> What kind of models does this refer to? What is the model supposed to do? Why do they assign higher likelihoods to out-of-distribution samples? There is a lot to be filled in here; citing an external reference is not enough.
+
+p.6, line 7 from bottom: ""... assigned higher BPDs ..."": please expand the acronym BPD (bits per dimension) at its first mention.
+
+p.8, conclusion: ""... verified that features extracted from inputs consistently lie outside of the training distribution and can be detected by their low predicted log-probability."" This sentence is garbled. What have you verified? Do you mean to say this only for a special type of input? What lies consistently outside of what?
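+
+As a pointer for the kind of qualitative check I am asking for above, here is a minimal sketch (my own, with arbitrary choices for the number of mixture components and the rejection threshold) of how the 'unreliable' examples could be surfaced from penultimate-layer features:
+
+import numpy as np
+from sklearn.mixture import GaussianMixture
+
+def flag_unreliable(train_feats, test_feats, n_components=10, quantile=0.05, seed=0):
+    # class-blind density model of the learned features, matching the paper's setup
+    gmm = GaussianMixture(n_components=n_components, covariance_type='full',
+                          random_state=seed).fit(train_feats)
+    threshold = np.quantile(gmm.score_samples(train_feats), quantile)
+    test_loglik = gmm.score_samples(test_feats)  # per-sample log-likelihood
+    return test_loglik < threshold               # True = low likelihood, likely unreliable
+
+Sorting the flagged test images by log-likelihood and displaying the lowest-scoring ones next to the classifier's predictions would make the claimed link between low feature likelihood and wrong decisions far more tangible than the aggregate numbers alone.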
+
+",5,4.0,ICLR2021
+XkRa8Wwy-7e,3,jWkw45-9AbL,jWkw45-9AbL,"Solid paper providing a formal distributional view for controlled text generation and a framework of solution","The paper studies the controlled sequence generation problem based on pretrained language models, i.e., controlling a generic pretrained LM to satisfy certain constraints, e.g., removing certain biases in language models. Specifically, the paper proposes a distributional view and imposes constraints based on collective statistical properties. The problem is formalized as a constraint satisfaction problem, minimizing a divergence objective. The paper proposes to use the KL-Adaptive DPG algorithm to approximate the optimal energy-based model distribution. Experiments were conducted on both pointwise and distributional constraints, showing the effectiveness of the model over the compared baselines.
+
+Pros:
+- The problem under study is important and can have extensive impact on many downstream language generation applications.
+- This paper makes solid contributions by proposing a formal view on generation controlling. It provides a framework to handle pointwise, distributional, and hybrid constraints.
+- The method proposed to sample from the sequential EBM makes sense and is empirically verified to be effective.
+- The experiments and analyses support the claims and conclusions.
+- Overall, the paper is well organized and easy to understand.
+
+Cons:
+- The paper may benefit from some human evaluation for text generation.
+- It is somewhat hard to tell from figure 2 which model is better, GDC or Ziegler. It seems that Ziegler is superior in generating attribute-related sentences while inferior in diversity. The sentence quality might be similar, as the converged values of (π, a) are close.
+- The current submission contains a number of typos, grammatical and other style issues, in both the main sections and the appendices, but these are rather easy to fix.
+
+Questions:
+- For real-life applications, does the proposed framework have a scalability issue, e.g., if a task has a large number of constraints to consider or if the constraints are more complicated than those tested in Section 3?
+- Assuming one has already obtained an adjusted LM with some attributes based on GPT-2, which would be better if she/he wants to add a new attribute to generation: starting from scratch from GPT-2, or continuing with the adjusted model?
+",7,3.0,ICLR2021
+b7GaFdZXG81,1,dFBRrTMjlyL,dFBRrTMjlyL,"Nice theoretical treatment, but results equivalent to linearising deep models","The paper tackles the problem of vanishing/exploding gradients in neural networks by imposing constraints on the weights and activation functions that ensure a consistent Frobenius norm across updates to the weight matrices in all layers of the network.
+
+The paper is very well written, straightforward to follow, and the theory is well laid out and interesting. The only (minor) issue I have on the presentation side is the name ""bidirectional self-normalizing neural networks"". This phrasing initially made me think that the network is bidirectional… but the network itself is not; it is the self-normalizing that is bidirectional. Wouldn't it be more precise to say ""bidirectionally self-normalized neural networks""?
+
+However, despite a well outlined theory, I think that in the end the experimental evaluation demonstrates a major weakness of the proposed approach.
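+
+Before going into detail: the 'quick check' referred to below is nothing more exotic than a plain linear classifier on raw pixels. A sketch of the MNIST version follows (my own few lines; the solver settings are arbitrary, and the CIFAR-10 analogue simply flattens the 32x32x3 images in the same way):
+
+from sklearn.datasets import fetch_openml
+from sklearn.linear_model import LogisticRegression
+from sklearn.model_selection import train_test_split
+
+X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
+X_tr, X_te, y_tr, y_te = train_test_split(X / 255.0, y, test_size=10000, random_state=0)
+clf = LogisticRegression(max_iter=200).fit(X_tr, y_tr)  # multinomial logistic regression on flattened pixels
+print('linear-classifier test accuracy:', clf.score(X_te, y_te))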
Credit to the authors for including Table 2 with evaluation on real datasets, but the generalisation performance there seems to show quite clearly that -GPN variants of the networks are equivalent to linear classifiers. A quick check of the performance of linear classifier on MNIST and CIFAR10 without augmentation reveals 93% and 41% testing accuracy - pretty close to the performance shown in Table 2. Sure, -GPNising allows for training a 200-layer model, but it seem to come at the cost of reducing the model to barely more than a linear classifier. The sacrifice of all non-trivial representational power in order to gain stable learning dynamics is just not worth it. +",4,4.0,ICLR2021 +uSXag5Z679O,3,YTWGvpFOQD-,YTWGvpFOQD-,New baselines for differentially private vision tasks using handcrafted feature extractors,"This article is about a topical issue: performance degradation of of deep learning models trained with differential privacy (DP). Clipping of the gradients and addition of the noise, required to obtain DP guarantees, blur the models such that for moderate privacy guarantees (eps~7.0) CIFAR-10 test accuracy baseline is currently ~66% (Papernot et al., 2020b). + +Lower bounds for this degradation have been shown theoretically (e.g. Bassily et al., 2014), and there has recently been also work on circumventing this issue, see e.g. + +Kairouz, P., Ribero, M., Rush, K. and Thakurta, A., 2020. Dimension Independence in Unconstrained Private ERM via Adaptive Preconditioning. arXiv preprint arXiv:2008.06570, +Yingxue Zhou, Zhiwei Steven Wu, and Arindam Banerjee. Bypassing the ambient dimension: Private sgd with gradient subspace identification. arXiv preprint arXiv:2007.03813, 2020 +(these references are not included in the paper). + +Although this paper does not introduce fundamentally anything new for DP learning (DP-SGD + Rényi DP accountant for obtaining eps,delta-guarantees are used), it does clearly beat the state-of-the-art for small epsilon values (eps up to 3.0) for MNIST, Fashion-MNIST and CIFAR-10. This is obtained by using so called Scattering Networks (Oyallon and Mallat, 2015), which have the property of converging very fast without privacy. This phenomenon is transferred to DP learning and thus high accuracies for shorter DP-SGD runs (i.e. smaller epsilons) are obtained. + +As expected, these 'handcrafted data-independent feature extractors' of Scatter Networks cannot beat CNN+DP-SGD when more private data is available, or when features can be extracted from public image data. + +All in all, although I think the gist of the paper is simply combining these handcrafted feature extractors (ScatterNets) and DP-SGD, it does improve the baseline for DP CIFAR-10 for small / moderate eps-values (up to 3.0) and does provide new ideas / questions on how to improve DP learning (e.g. by accelerated convergence) also outside of image domain (handcrafted feature extraction for non-vision tasks). + +The paper is very well written. A tiny remark: +You write ""Gaussian noise of variance sigma^2 C^2 is added to the mean gradient."" +Notice that sigma^2 C^2 - noise is added to the summed gradients, and sigma^2 C^2 / B^2 to the mean. + +",7,4.0,ICLR2021 +Skx9wBpHFB,1,Skl3SkSKDr,Skl3SkSKDr,Official Blind Review #2,"The paper is extremely interesting, solid and very well written. The idea is simple but nonetheless developed in a smart and effective fashion. The underlying theory is solid, even if some choices should have been discussed more deeply (e.g. the chosen loss function). 
Introduction and references are adequate, and the paper is readable by a quite broad audience, despite the detailed technical sections. The main issue related to the manuscript is very narrow target of the experimental part, limited to the isomers of a given compound - it would have been interesting to check its potentialities in generating more different structures and distance matrices, and thus to compare its effectiveness versus alternative generative approaches.",8,,ICLR2020 +rklqd5V53m,3,r1gVqsA9tQ,r1gVqsA9tQ,Review,"The paper proposes a GAN variant, called ChainGAN, which expresses the generator as a ""base generator"" -- which maps the noise vector to a rough model sample -- followed by a sequence of ""editors"" -- which progressively refine the sample. Each component of the generator is trained independently to fool its own separate discriminator, without backpropagating through the entire chain of editors. The proposed ChainGAN model is trained on MNIST, CIFAR10, and CelebA. The paper presents model samples for all three datasets, as well as Inception scores for CIFAR10. + +I find the proposed idea simple and elegant but the evaluation lacking, and as such I’m a bit hesitant to outright recommend accepting the paper: + +- Evaluation is not very extensive or detailed. Inception scores are shown only for CIFAR10 and using two base generator architectures. The Inception score has known limitations, and I would have expected the authors to also provide FID scores. The main takeaway is also not articulated very clearly. As far as I can tell it appears to be that ChainGAN allows to achieve similar performance with less tunable parameters, but Table 1 shows mixed results, where ChainGAN outperforms the baseline DCGAN architecture using fewer parameters but underperforms the baseline ResNet architecture. +- The way the experimental section is organized made it difficult for me to find my way around. For example, subsection titles are hard to locate due to the fact that figures and tables were placed immediately underneath them. Overall when the flow of the text is interrupted by a figure, it’s hard to locate where to resume reading. +- There is a connection to be made with other sequential generation approaches (not to be confused with sequence generation) such as LAPGAN, DRAW, and Unrolled GANs. Discussing the relationship to those approaches would in my opinion add more depth to the paper.",4,4.0,ICLR2019 +rklhGYNC5B,2,HkxCcJHtPr,HkxCcJHtPr,Official Blind Review #4,"The format of the paper does not meet the requirement of ICLR. Due to this, I will give a 3. I suggest the authors to change it as soon as possible. + +Besides that, the main idea of the paper is to regularize the training of a neural network to reduce the entropy of its activations. There are extensive experiments in the paper. + +The paper introduce two kinds of method to regularize the entropy. The first method is a soft version of the original entropy, and the second is the compressibility loss. After adding the regularization, the performance drop of the compressed network is reduced. The experiment performance is promising. + +I think the method is straightforward and reasonable with only a few questions: +1. Why do you quantize the weight? Seems it's not necessary because the paper only address activation quantization. +2. What will happen if the weights are quantized to lower bits? For example, 4bit? +2. How about adding the regularization to weights? 
+",6,,ICLR2020 +ryFW0bhlM,2,rJXMpikCZ,rJXMpikCZ,Good basic idea with several weaknesses in the technical exposition and the experiments,"This is a paper about learning vector representations for the nodes of a graph. These embeddings can be used in downstream tasks the most common of which is node classification. + +Several existing approaches have been proposed in recent years. The authors provide a fair and almost comprehensive discussion of state of the art approaches. There are a couple of exception that have already been mentioned in a comment from Thomas Kipf and Michael Bronstein. A more precise discussion of the differences between existing approaches (especially MoNets) should be a crucial addition to the paper. You provide such a comparison in your answer to Michael's comment. To me, the comparison makes sense but it also shows that the ideas presented here are less novel than they might initially seem. The proposed method introduces two forms of (simple) attention. Nothing groundbreaking here but still interesting enough and well explained. It might also be a good idea to compare your method to something like LLE (locally linear embedding). LLE also learns a weight for each of neighbors of a node and computes the embedding as a weighted average of the neighbor embeddings according to these weights. Your approach is different since it is learned end-to-end (not in two separate steps) and because it is applicable to arbitrary graphs (not just graphs where every node has exactly k neighbors as in LLE). Still, something to relate to. + +Please take a look at the comment by Fabian Jansen. I think he is on to something. It seems that the attention weight (from i to j) in the end is only a normalization operation that doesn't take the embedding of node i into account. + +There are two issues with the experiments. + +First, you don't report results on Pubmed because your method didn't scale. Considering that Pubmed has less than 20,000 nodes this shows a clear weakness of your approach. You write (in an answer to a comment) that it *should* be parallelizable but somehow you didn't make it work. We have to, however, evaluate the approach on what it is able to do at the moment. Having a complexity that is quadratic in the number of nodes is terrible and one of the major reasons learning with graphs has moved from kernels to neural approaches. While it is great that you acknowledge this openly as a weakness, it is currently not possible to claim that your method scales to even moderately sized graphs. + +Second, the experimental set-up on the Cora and Citeseer data sets should be properly randomized. As Thomas pointed out, for graph data the variance can be quite high. For some split the method might perform really well and less well for others. In your answer titled ""Requested clarifications"" to a different comment you provide numbers randomized over 10 runs. Did you randomize the parameter initialization only or also the the train/val/test splits? If you did the latter, this seems reasonable. In Kipf et al.'s GCN paper this is what was done (not over 100 splits as some other commenter claimed. The average over 100 runs pertained to the ICA method only.) 
",5,4.0,ICLR2018 +BJleU5H-qB,2,Bye8hREtvB,Bye8hREtvB,Official Blind Review #1,"Motivated by the observation that powerful deep autoregressive models such as PixelCNNs lack the ability to produce semantically meaningful latent embeddings and generate visually appealing interpolated images by latent representation manipulations, this paper proposes using Fisher scores projected to a reasonably low-dimensional space as latent embeddings for image manipulations. A decoder based on a CNN, a Conditional RealNVP, or a Conditional Pyramid PixelCNN is used to decode high-dimensional images from these projected Fisher score. Experiments with different autoregressive and decoder architectures are conducted on MNIST and CelebA datasets are conducted. + +Pros: + +This paper is well-written overall and the method is clearly presented. + + +Cons: + +1) It is well-known that the latent activations of deep autoregressive models don’t contain much semantically meaningful information. It is very obvious that either a CNN decoder, a conditional RealNVP decoder, or a conditional Pyramid PixelCNN decoder conditioned on projected Fisher scores will produce better images because the Fisher scores simply contain much more information about the images than the latent activations. When the $\alpha$ is small, the learned decoder will function similarly to the original pixelCNN, therefore, latent activations produce smaller FID scores than projected Fisher scores for small $\alpha$’s. These results are not surprising. Detailed explanations should be added here. + +2) The comparisons to baselines are unfair. As mentioned in 1), it’s obvious that Fisher scores contain more information than latent activations for deep autoregressive models and are better suited for manipulations. Fair comparisons should be performed against other latent variable models such as flow models and VAEs with more interesting tasks, which will make the paper much stronger. + +3) In Figure 3, how is the reconstruction error calculated? It’s squared error per pixel per image? + +4) On pp. 8, for semantic manipulations, some quantitative evaluations will strengthen this part. + +In summary, this paper proposes a novel method based on projected Fisher scores for performing semantically meaningful image manipulations under the framework of deep autoregressive models. However, the experiments are not well-designed and the results are unconvincing. I like the idea proposed in the paper and strongly encourage the authors to seriously address the raised questions regarding experiments and comparisons. + +------------------ +After Rebuttal: + +I took back what I said. It's not that obvious that the ""latent activations of deep autoregressive models don’t contain much semantically meaningful information"". But the latent activations are indeed a weak baseline considering that PixelCNN is so powerful a generator. If the autoregressive generator is powerful enough, the latent activations can theoretically encode nothing. I have spent a lot of time reviewing this paper and related papers, the technical explanation about the hidden activation calculation of PixelCNN used in this paper is unclear and lacking (please use equations not just words). 
+ +Related paper: The PixelVAE paper ( https://openreview.net/pdf?id=BJKYvt5lg ) explains that PixelCNN doesn't learn a good hidden representation for downstream tasks + +Another paper combining VAE and PixelCNN also mentions this point: + +ECML 2018: http://www.ecmlpkdd2018.org/wp-content/uploads/2018/09/455.pdf + +Please also check the related arguments about PixcelCNN (and the ""Unconditional Decoder"" results) in Variational Lossy Autoencoder (https://arxiv.org/pdf/1611.02731.pdf ) + +As I mentioned in the response to the authors' rebuttal, training a separate powerful conditional generative model from some useful condition information (Fisher scores) is feasible to capture the global information in the condition, which is obvious to me. This separate powerful decoder has nothing to do with PixelCNN, which is the major reason that I vote reject. +",3,,ICLR2020 +BJhWQCWVl,1,Byj72udxe,Byj72udxe,A nice approach for rare words / context biasing for LM,"This work is an extension of previous works on pointer models, that mixes its outputs with standard softmax outputs. +The idea is appealing in general for context biasing and the specific approach appears quite simple. + +The idea is novel to some extent, as previous paper had already tried to combine pointer-based and standard models, +but not as a mixture model, as in this paper. + +The paper is clearly written and the results seem promising. +The new dataset the authors created (WikiText) also seems of high interest. + +A comment regarding notation: +The symbol p_ptr is used in two different ways in eq. 3 and eq. 5. : p_ptr(w) vs. p_ptr(y_i|x_i) +This is confusing as these are two different domains: for eq 3. the domain is a *set* of words and for eq. 5 the domain is a *list* of context words. +It would be helpful to use different symbol for the two objects. + +",8,4.0,ICLR2017 +j7YczQTgqTp,4,eNdiU_DbM9,eNdiU_DbM9,Review,"########################################################################## + +Summary: + + Prediction sets are used to quantify the uncertainty of classification. The naive approach which include the labels until a pre-specified coverage probability is satisfied often leads to large prediction sets. Adaptive Prediction Sets (APS) can output prediction sets with desired coverage but set sizes are still not satisfyingly small and the results are unstable, especially when many probability estimations fall into the tail of the distribution. + +In order to make the prediction stable and sets as narrow as possible under pre-specified coverage probability, this paper extends APS to Regularized Adaptive Prediction Sets (RAPS) by penalizing those class with small probabilities beyond k many classes already included, which leads to a small prediction size. The regularization is an interesting idea in terms of minimizing prediction sets, which is different from previous works where most of them directly minimize a quantity related to the cardinality of prediction sets or intervals. Empirically, compared with other set-valued classifiers extracting information from the same base model CNNs, the proposed method outperforms significantly in terms of set sizes when fixing pre-specified coverage. Moreover, this work shows adaptiveness: it tries to allow large prediction size for difficult instances and small prediction size for easy instances. + +########################################################################## + +Reasons for score: + +Overall, I vote for accepting. 
I think the method is well motivated and the solution is simple and portable (can be applied to many base methods). However, there could be more discussions on several aspects of the problem. + + +########################################################################## + +Pros: + +1. Studies an important problem. + +2. The proposed method is easy to implement and can be applied to general scores or be used to improve base conformal prediction methods. + +3. Very impressive empirical performance. + + ########################################################################## + +Cons: + +1. Theoretically the ""optimal"" set-valued classifier is based on P(Y=k | X=x). In this sense, the naive approach can be viewed as a plug-in approach when the score is an estimate of P(Y=k | X=x). When regularization is applied, something must be lost. This is as much like in lasso for high-dimensional regression, a penalty function makes the coefficient estimate biased (to trade for sparisity). It is unclear what is lost here with regularization. Is the solution no longer ""Fisher consistent"" in a sense? + +2. More to the point: it seems that the proposed method is cut out for problems in which there are MANY classes. I wonder whether it will perform just as well for traditional problems in which there are only a few classes (like in the medical field.) + +3. Choose a good value for k_reg and lambda seems to be critical. How sensitive is the result to k_reg? Is there any general theory or guidelines about tuning parameter lambda? In the experiments, the validation (calibration) data sets have huge sample size, which may be common in image data domains, but can be unrealistic for broader applications domains. I wonder if the good performance is largely relying on the large validation (calibration) sample size. + +########################################################################## + +Questions during rebuttal period: + +The goal of narrowing prediction set size is achieved with the help of regularization, which does not directly try to minimize the cardinality of the prediction set. Can we theoretically prove it is asymptotically optimal? Any comparison to these direct approaches? There is a literature called high-quality prediction interval which directly minimizes the prediction size. + ++ Tim Pearce, Mohamed Zaki, Alexandra Brintrup, and Andy Neely. High-quality prediction intervals for deep learning: A distribution-free, ensembled approach. arXiv preprint arXiv:1802.07167, 2018. + +Any comments on the relation of the uncertainty set approach with the classification with rejection/abstain methods? + ++ Zhang, C., Wang, W. and Qiao, X. (2018), “On Reject and Refine Options in Multicategory Classification,” Journal of the American Statistical Association, 113 (522), pp. 730–745. ++ Ramaswamy HG, Tewari A, Agarwal S. Consistent algorithms for multiclass classification with an abstain option. Electronic Journal of Statistics. 2018;12(1):530-54. + +######################################################################### + +Small comments: + +Most of the figures and tables are far away from the descriptions which makes it hard to read. +",7,4.0,ICLR2021 +HkxCRSeCFB,3,B1lyZpEYvH,B1lyZpEYvH,Official Blind Review #2,"This work addresses the problem of aspect-based sentiment analysis *with* rationale extraction that can serve as an explanation (interpretation) of the decision. 
+Compared to previous work this paper models the problem as multi-aspect classification, therefore having one model for all aspects instead of one for each aspect. This is obviously useful, but in addition to having a smaller total model they show that they also have (slightly) better prediction AND AT THE SAME TIME ""better"" explanations (which is always hard to evaluate). + +This is a useful paper, and I can clearly see it used. However, it might not get many ICLR attendants excited. + +I found the paper very hard to read, having to re-read many parts to understand exactly what is the goal of each section. The row names of the table 1-3 do not correspond to the names in the text which makes the understanding even harder. Also, I think that some plot that shows accuracy vs ""interpretability"", with model size as a third dimension (eg surface of the bubble) would convey the take-home of the paper much clearer than the many tables with small differences between them. + +A minor comment: the SVM has a huge parameter space, which is weird. Are you using *all* bigrams? A fairer comparison would be to use more n-grams, but keep only the most frequent ones.",6,,ICLR2020 +HyVYatu4e,2,SygvTcYee,SygvTcYee,Nice idea but not fully fleshed out,"This paper proposes an extension of the MAC method in which subproblems are trained on a distributed cluster arranged in a circular configuration. The basic idea of MAC is to decouple the optimization between parameters and the outputs of sub-pieces of the model (auxiliary coordinates); optimization alternates between updating the coordinates given the parameters and optimizing the parameters given the outputs. In the circular configuration. Because each update is independent, they can be massively parallelized. + +This paper would greatly benefit from more concrete examples of the sub-problems and how they decompose. For instance, can this be applied effectively for deep convolutional networks, recurrent models, etc? From a practical perspective, there's not much impact for this paper beyond showing that this particular decoupling scheme works better than others. + +There also seem to be a few ideas worth comparing, at least: +- Circular vs. parameter server configurations +- Decoupled sub-problems vs. parallel SGD + +Parallel SGD also has the benefit that it's extremely easy to implement on top of NN toolboxes, so this has to work a lot better to be practically useful. + +Also, it's a bit hard to understand what exactly is being passed around from round to round, and what the trade-offs would be in a deep feed-forward network. Assuming you have one sub-problem for every hidden unit, then it seems like: + +1. In the W step, different bits of the NN walk their way around the cluster, taking SGD steps w.r.t. the coordinates stored on each machine. This means passing around the parameter vector for each hidden unit. +2. Then there's a synchronization step to gather the parameters from each submodel, requiring a traversal of the circular structure. +3. Then each machine updates it's coordinates based on the complete model for a slice of the data. This would mean, for a feed-forward network, producing the intermediate activations of each layer for each data point. + +So for something comparable to parallel SGD, you could do the following: put a mini-batch of size B on each machine with ParMAC, compared to running such mini-batches in parallel. 
Completing steps 1-2-3 above would then be roughly equivalent to one synchronized PS type implementation step (distribute model to workers, get P gradients back, update model.) + + It would be really helpful to see how this compares in practice. It's hard for me to understand intuitively why the proposed method is theoretically any better than parallel SGD (except for the issue of non-smooth function optimization); the decoupling also can fundamentally change the problem since you're not doing back-propagation directly anymore, so that seems like it would conflate things as well and it's not necessarily going to just work for other types of architectures.",6,4.0,ICLR2017 +mlTIpDngMw0,3,mCLVeEpplNE,mCLVeEpplNE,"Compelling interpretability methodological work but minor flaws in motivation, lack of discussion about practical limitations","The authors did a fantastic job of answering questions, revising their manuscript in accord with reviewer feedback (Sec 3.4 title), and even adding new experimental results based on reviewer suggestions (mid-training hierarchy) and reflecting best practices in interpretability research. I was really impressed by their nimbleness and responsiveness. I will raise my score to a 7: I think this is a very solid paper and excellent research effort around a nascent idea. In particular, I think its impact is limited by + +- its close coupling to naturally hierarchical problems, e.g., multi-class classification with a taxonomy +- its close coupling to image data and tasks +- its heuristic nature: fully train neural net, infer hierarchy via clustering, retrain neural net, then map a priori labels onto inferred hierarchy + +The ""10"" version of this paper (maybe future work?) would propose a way to infer the hierarchy on the fly and show how to apply it onto other kinds of data and problems with different structures. + +----- + +This submission proposes a modification of neural networks that replaces the ""final linear layer with a decision tree."" The term ""decision tree"" is applied somewhat loosely to a hierarchical neural architecture akin to a hierarchical softmax. In the current work (as I understand it), this hierarchy is induced from a pre-trained multi-output, e.g., multiclass, neural network via a hierarchical clustering and subsequent averaging of the output weights. At inference time, path probabilties can be computed based on the chain rule. Predictions can be made based on either a greedy traversal of the tree (choosing the most likely child at each step, a la hierarchical softmax) or by choosing the most probable leaf, which requires computing all path probabilities. Empirical results across three standard image datasets are suggestive, if not conclusive, and the paper concludes with some interesting, albeit cursory, examples of potential ""interpretability"" applications. + +The submission summarizes its contributions at the end of Section 1 as follows: +1. It proposes a tree-structured loss to augment supervised neural network training (predominantly for multiclass classification problems). +2. It describes a heuristic to induce a hierarchy in the output weights of a pre-trained multi-output neural network, enabling decision tree-like inference and provide evidence it is more effective than other approaches for inducing hierarchies. +3. It presents simple case studies of how the induced hierarchy can be used for traditional ""interpretability"" tasks, like debugging and generating explanations. 
+ +I appreciate the idea at the center of this paper -- adding simple hierarchical structure to a multi-output neural network, with the aim of increased interpretability -- but I feel the work as it is presented is nascent and the manuscript itself is flawed. I lean toward rejection at the moment, but I could be persuaded to change my mind by some combination of solid revisions, convincing author response, or vociferous advocacy from other reviewers. + +I will briefly extol the paper's strengths before providing a longer discussion of what I consider to be its key weakness. First, I really like the last sentence in the paper: + +""This challenges the conventional supposition of a dichotomy between accuracy and interpretability, paving the way for jointly accurate and interpretable models in real-world deployments."" + +Weaknesses in the evaluation of its interpretability claims aside, I agree with this statement. I think the case studies presented do provide evidence of improved interpretability alongside small accuracy improvements. I think this paper does succeed in demonstrating that accuracy and interpretability are not necessarily competing objectives, at least for certain tasks (multiclass classification of images). + +A laundry list of other strengths: + +- The motivation is strong (modulo weakness discussed below): there is a growing need to provide human-understandable insights into decisions made by complex machine learning models. +- The proposed approach is simple and elegant, easy to implement, and empirically effective. I'm quite impressed that the proposed tree loss appears to improve accuracy (!) on multiple tasks. +- I also think this paper lays groundwork for a direction of research that the community could continue to build on. + +I think that the manuscript's largest flaw, ironically, regards interpretability, its primary motivation. The work's central claim is that the tree-structured decision layer delivers improved interpretability with comparable or slightly improved accuracy. In its discussion of this claim, the manuscript provides no precise definition of ""interpretable,"" making it difficult to verify the claim qualitatively or quantitatively. Section 5 presents a vignette of case studies, but the discussion of each is quite limited. In particular, none of the use cases is fully motivated or placed in the context of previous research on interpretability definitions [1][2]. The cursory presentation of results for each do the results a disservice by making it difficult for the reader to recognize and assess their significance. + +To quote the introduction from Lipton's _The Mythos of Model Interpretability_ [1], + +""Despite the absence of a definition, papers frequently make claims about the interpretability of various models. From this, we might conclude that either: (i) the definition of interpretability is universally agreed upon, but no one has managed to set it in writing, or (ii) the term interpretability is ill-defined, and thus claims regarding interpretability of various models may exhibit a quasi-scientific character."" + +I believe the paper would be strengthened by focusing on one use case, e.g., debugging or human trust, using the ~1 page dedicated to Section 5 to motivate it more fully and to present the results in detail. 
If the primary use case is generalization or debugging, then I suggest designing a quantitative analysis so defend against claims of cherry picking the best results (a common problem in presenting ""example"" interpretability results). + +Section 5.4 includes a quantitative evaluation, but I question whether mere human preference is evidence of ""human trust."" More recent research on trust appear to use more elaborate studies in which trust is measured by subjects' rate of success in performing a particular task aided by the machine learning model [3]. + +I want to caveat the above: I really appreciate this line of work and think it has value. There is an ongoing discussion in our community about rewarding good ideas, rather than punishing imperfect or incomplete execution. I also acknowledge that I am far from an expert in the latest interpretability research. Nonetheless, my understanding is that interpretability researchers have grown more skeptical of interpretability claims about new methods absent a rigorous framework (definitional and/or experimental) for evaluating those claims. + +When I read this paper, I find it hard to escape the conclusion that its interpretability claims rest on the presupposition that trees are naturally more interpretable (and further that readers will accept this dogma). I disagree with this assertion (see below), but even if it were generally true, I still think the paper would be strengthened by adding a more rigorous discussion and analysis of its claims. Propose a definition or criterion (see [1][2] for ideas), ideally one that could be assessed qualitatively and evaluated empirically, then apply it. + +Regarding the claim about trees in Section 5: ""The interpretability of a decision tree is well-established when input features are easily understood (e.g. tabular data in medicine or finance)."" + +I would dispute that this is ""well-established"" for anything but the simplest decision tree models, with a single tree consisting of a small number of splits using a handful of features, which are rare in realistic settings. The most commonly used tree-structured models (gradient boosted decision trees and random forests) are not readily interpretable, even for tabular data and especially for high dimensional inputs. This has made research like SHAP [4] of great interest to practitioners. + +What is more, even for tabular data, the neural decision trees described in this paper are (to my understanding) basically a cascade of linear classifiers, with split each having access to all features at once. This does do not lend itself to the same kind of ""interpretation"" one gets for classic decision trees that use one feature per split. With even modestly deep hierarchies, the resulting ""explanations"" would rapidly become quite complex. + +I see one other weakness in the proposed method itself: as I understand things, it requires access to a pretrained neural network. At the very least, one needs pre-existing output weights to cluster in order to induce a hierarchy -- and the induced hierarchy is a necessary component in the presented results. This isn't a fatal flaw -- learning a hierarchy on the fly could be left for future work. Nonetheless, it limits the work's usefulness and potential impact. 
+ +What is more, I don't think the manuscript is sufficiently clear about this requirement: on my first pass through the paper, I came away with the impression that there was a way to learn the hierarchy while training the neural net -- the inclusion of a section entitled ""Training with Tree Supervision Loss"" seems to imply this. I suggest revising the text to make it crystal clear that it is not possible to use the tree loss to train a neural net from scratch -- at least, not without a predefined hierachy (perhaps from a previous training run or prior knowledge). + +I will now summarize the improvements I suggest for strengthening the manuscript: +1. Focus on one definition of interpretability and then analyze central claims through that lens. Introduce it early in the paper (introduction) and then dedicate Section 5 to it, rather than trying to cover lots of use cases superficially. +2. Make the limitations of the proposed approach VERY clear. In particular, you need a predefined hierarchy to train with the tree loss, and you need pretrained neural net (or pre-existing weights, at least) to induce a hierarchy based on clustering. +3. If the intention is for this approach to be used exclusively for finetuning or adapting an existing neural network, then this should be made clear in the text. Consider renaming Section 3.3. +4. Justify (or reword) statements like ""the interpretability of a decision tree is well-established"" or ""neural features are visually interpretable"" (a single reference does not suffice...the Olah distill survey draws no such definitive conclusions). + +I have a few questions: +- One thing that is not clear: when training with tree loss, are weights shared across nodes? In particular, the weight vector for an inner node is the average of its descendent leaf node weight vectors. When training with tree loss, do we then treat that inner node weight vector as a set of independent parameters with separate updates? Or do we continue to treat it as a sum of leaf parameters, so that leaf and inner node updates affect the same parameters, as in an RNN or recursive network. +- Is there possibly a heuristic that could approximate ""learning the hierarchy?"" For example, train with a basic loss for enough iterations until the output weights start to converge, then pause training to induce the hierarchy and then resume training with the tree loss. Part of the heuristic could be guidance about how to detect when the output weights have sufficiently converged. +- What are the key differences between this approach and a hierarchical softmax? My understanding is that they're basically equivalent at inference time (except maybe traditional hierarchical softmax uses hard decisions?). What about during training? Is it maybe the use of negative sampling for hierarchical softmax? +- How would this approach perform for extremely high dimensional output spaces -- one of the primary motivations for hierarchical softmax? I imagine that for some output cardinality, ""soft"" inference becomes computationally infeasible. + +[1] Lipton. The Mythos of Model Interpretability. https://arxiv.org/abs/1606.03490. +[2] Doshi-Velez and Kim. Towards A Rigorous Science of Interpretable Machine Learning. https://arxiv.org/abs/1702.08608. +[3] Poursabzi-Sangdeh, et al. Manipulating and Measuring Model Interpretability. https://arxiv.org/abs/1802.07810. +[4] Lundberg, et al. From local explanations to global understanding with explainable AI for trees. 
https://www.nature.com/articles/s42256-019-0138-9.",7,4.0,ICLR2021 +yQNWw3LkjoI,1,30EvkP2aQLD,30EvkP2aQLD,Review,"It's an interesting paper showing provably efficient batch RL is impossible when only two assumptions holding: a) Realizability, and b) Feature Coverage. By providing counter-examples, which is quite persuasive as the State/Action Space is only polynomially dependent in H, the sample complexity could be exponentially dependent in H w.h.p. + +However, I still have doubts about the Assumption Feature Coverage. Different than Assumption Concentrability [Chen & Jiang(2019), Xie & Jiang(2020a)], Feature Coverage only assumes that feature covariance matrix has minimum eigenvalue 1/d. My doubts then follow: +1) Is Asp Concentrability a more stringent asp than Asp Feature Coverage (from a mathematical perspective)? If so, why we would concern Asp Feature Coverage while Chen & Jiang(2019) already proves Asp Concentrability is necessary for efficient batch RL? +2) Can Asp Feature Coverage really represent exploratory of the batch dataset? In RL's perspective, exploratory is important empirically and theoretically. However, in the hard instance, the important states $s_h^*$ are even not in the support of the dataset (maybe also breaks the Asp Concentrability?), though I understand it's necessary to obtain the exponential lower bound. +3) Theorem 5.1 confuses me a little. I understand the main goal is to illustrate Error Amplification. But why we would concern upper bound instead of the lower bound? Or Sec5 aims to say that Error Amplification is unavoidable under Realizability and Feature Coverage? + +In conclusion, I argue that this paper is well-written and answers the question perfectly with certain assumptions. I have also checked the proof and there are no significant errors. Though I have some doubts about whether certain assumptions are reasonable enough in RL, I still stay positive about this paper. My doubts may be incorrect, but I'm looking forward to the discussion. +",7,4.0,ICLR2021 +hBcSXDthu8R,2,SRzz6RtOdKR,SRzz6RtOdKR,This paper proposes a reweighed loss function for robust regression model training against label noise. The weight value of each training instance is determined by prior knowledge about the noise generation process. Results confirm empirically the merit of the algorithmic design,"In this paper, a reweighting technique is proposed to suppress the impact of heteroscedastic label noise in regression model training. The objective function of the regression model training process is composed of a weighted combination of instance-wise training loss. The instance-wise weight is determined by the estimated noise variance based on prior information of the label generation process. The weighting formulation is inspired by the best possible estimator of noisy measurements reaching the Cramer-Rao bound. + +The paper is clearly written. It explains well the problem definition and the methodological formulation. However, we think the innovation in this work is limited. The downside of this paper is as follows: +1. It is a strong and usually impractical assumption to know a priori knowledge of label noise in the regression model training process. Especially, in the proposed method, the estimate of the noise variance needs to be accurate enough, so as to help suppress the noise impact accordingly. It is better to incorporate jointly the learning of noise distribution and the regression/classification +model, so as to optimize the tolerance against the data-dependent noise. 
+2. It only considers the noisy learning process for regression. However, in classification scenarios, noise (with respect to labels) is usually presented in the form of label flipping. The proposed reweighing technique is not directly applicable in that case. Please refer to the following work for further reading: +Learning with Noisy Labels, Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep Ravikumar, and Ambuj Tewari, NIPS 2013. +",4,5.0,ICLR2021 +BJdG429gz,3,BkisuzWRW,BkisuzWRW,Why is this not simply doing RL in the real world and then using the learnt policy (with no exploration)?,"One of the main problems with imitation learning in general is the expense of expert demonstration. The authors here propose a method for sidestepping this issue by using the random exploration of an agent to learn generalizable skills which can then be applied without any specific pretraining on any new task. + +The proposed method has at its core a method for learning a parametric skill function (PSF) that takes as input a description of the initial state, goal state, parameters of the skill and outputs a sequence of actions (could be of varying length) which take the agent from initial state to goal state. + +The skill function uses a RNN as function approximator and minimizes the sum of two losses i.e. the state mismatch loss over the trajectory (using an explicitly learnt forward model) and the action mismatch loss (using a model-free action prediction module) . This is hard to do in practice due to jointly learning both the forward model as well as the state mismatches. So first they are separately learnt and then fine-tuned together. + +In order to decide when to stop, an independent goal detector is trained which was found to be better than adding a 'goal-reached' action to the PSF. + +Experiments on two domains are presented. 1. Visual navigation where images of start and goal states are given as input. 2. Robotic knot-tying with a loose rope where visual input of the initial and final rope states are given as input. + +Comments: + +- In the visual navigation task no numbers are presented on the comparison to slam-based techniques used as baselines although it is mentioned that it will be revisited. + +- In the rope knot-tying task no slam-based or other classical baselines are mentioned. + +- My main concern is that I am really trying to place this paper with respect to doing reinforcement learning first (either in simulation or in the real world itself, on-policy or off-policy) and then just using the learnt policy on test tasks. Or in other words why should we call this zero-shot imitation instead of simply reinforcement learnt policy being learnt and then used. The nice part of doing RL is that it provides ways of actively controlling the exploration. See this pretty relevant paper which attempts the same task and also claims to have the target state generalization ability. + +Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning by Zhu et al. + +I am genuinely curious and would love the authors' comments on this. It should help make it clearer in the paper as well. + +Update: + +After evaluating the response from the authors and ensuing discussion as well as the other reviews and their corresponding discussion, I am revising my rating for this paper up. 
This will be an interesting paper to have at the conference and will spur more ideas and follow-on work.",7,5.0,ICLR2018 +BkxIVj-92m,2,S1gBgnR9Y7,S1gBgnR9Y7,An empirical study with little analysis,"Edit: changed ""Clarity"" + +[Relevance] Is this paper relevant to the ICLR audience? yes + +[Significance] Are the results significant? no + +[Novelty] Are the problems or approaches novel? no + +[Soundness] Is the paper technically sound? okay + +[Evaluation] Are claims well-supported by theoretical analysis or experimental results? marginal + +[Clarity] Is the paper well-organized and clearly written? no + +Confidence: 3/5 + +Seen submission posted elsewhere: No + +Detailed comments: + +In this work, the authors compare several state-of-the-art approaches for high-resolution microscopy analysis to predicting coarse labels for the outcomes of pharmacological assays. They also propose a new convolutional architecture for the same problem. An empirical comparison on a large dataset suggests that end-to-end systems outperform those which first perform a cell segmentation step; the predictive performance (AUC) of almost all the end-to-end systems is statistically indistinguishable. + +=== Major comments + +The paper is primarily written as though its main contribution is as an empirical evaluation of different microscopy analysis approaches. Recently, there have been a large number of proposed approaches, and I believe a neutral evaluation of these approaches on datasets other than those used by the respective authors would be a meaningful contribution. However, the current paper has two major shortcomings that prevent it from fulfilling such a place. + +First, the authors propose a novel approach and include it in the evaluation. This undercuts claims of neutrality. (Minor comments about the proposed approach are given below.) + +Second, the discussion of the results of the empirical evaluation is restricted almost solely to repeating in text the what the tables already show. Further, the discussion focuses only on the “top line” numbers, with the exception of a deep look at the Gametocytocidal compounds screen. It would be helpful to instead (or additionally) identify meaningful trends, supported by the data acquired during the experiments. For example: (1) Do the end-to-end systems perform well on the same assays? (2) Would a simple ensemble approach improve things? if they perform well on different assays, then that suggests it might. (3) What are the characteristics of the assays on which the CNN-based approaches perform well or poorly (i.e., how representative is Figure 5)? (4) What happens when the FNN-based approach outperforms the CNN-based ones? in particular, what happens in A13? (5) How sensitive are the approaches to the number of labeled examples of each assay type? (6) Are there particular compounds which seem particularly informative for different assays? + +A second major concern is whether the binarized version of this problem (i.e., assay result prediction) is of interest to practitioners. In many contexts, quantitative information is also important (“how much of a response do we see?”). While one could imagine the rough qualitative predictions (“do we see a response?”) shown here as an initial filtering step, it is hard to believe that the approach proposed here would replace other more informative analysis approaches. + +=== Minor comments + +Are individual images from the same sample image always in only the training, validation, or testing set? 
that is, are there cases where some of the individual images from a particular sample image are in the training set, while others from that sample image are in the testing set? + +I did not find the dataset construction description very clear. Does each row in the final, 10 574 x 209 matrix correspond to a single image? Does each image correspond to a single row? For example, it seems as though multiple rows may correspond to the same image (up to four? the three pChEMBL thresholds as well as the activity comment). What is the order in which the filtering and augmenting happens? It would be very helpful to provide a coherent, pipeline description of this (say, in an appendix). + +Do all the images in the dataset come from the same microscope (and cell line) at the same resolution, zoom, etc.? If so, it is unclear how well this approach may work for images which are more heterogeneous. There are not very many datasets of the size described (I believe, at least) available. This may significantly limit the practical impact of this work. + +How many epochs are required for convergence of the different architectures? For example, MIL-net has significantly fewer parameters than the others; does it converge on the validation set faster? + +=== Typos, etc. + +The references are not consistently formatted. + +“not loosing” -> “not losing” +“doesn’t” -> “does not” +",3,3.0,ICLR2019 +FeamT628qnp,4,3LujMJM9EMp,3LujMJM9EMp,Good way to estimate mutual information,Seems like the most direct way to estimate mutual information using a classifier. I like this work because it is much more straight-forward than the prior work such as MINE. It shows sufficient performance on the experiments shown.,7,2.0,ICLR2021 +HkX303_ez,2,BJQPG5lR-,BJQPG5lR-,"Simple idea, not explored thoroughly enough","UPDATED COMMENT +I've improved my score to 6 to reflect the authors' revisions to the paper and their response to my and R2's comments. I still think the work is somewhat incremental, but they have done a good job of exploring the idea (which is nice). + +ORIGINAL REVIEW BELOW + +The paper introduces an architecture that linearly interpolates between ResNets and vanilla deep nets (without skip connections). The skip connections are penalized by Lagrange multipliers that are gradually phased out during training. The resulting architecture outperforms vanilla deep nets and sometimes approaches the performance of ResNets. + +It’s a nice, simple idea. However, I don’t think it’s sufficient for acceptance. Unfortunately, this seems to be a simple idea that doesn't work as well as the simpler idea (ResNets) that inspired it. Moreover, the experiments are weak in two senses: (i) there are lots of obvious open questions that should have been explored and closed, see below, and (ii) the results just aren’t that good. + +Comments: + +1. Why force the Lag. multipliers to 1 at the end of training? It seems easy enough to treat the alphas as just more parameters to optimize with gradient descent. I would expect the resulting architecture to perform at least as well as variable action nets. If not, I’d be curious as to why. + +2.Similarly, it’s not obvious that initializing the multipliers at 0.5 is the best choice. The “looks linear” initialization proposed in “The shattered gradients problem” (Balduzzi et al) implies that alpha=0 may work better. Did the authors try any values besides 0.5? + +3. The final paragraph of the paper discusses extending the approach to architectures with skip-connections. 
Firstly, it’s not clear to me what this would add, since the method is already interpolating in some sense between vanilla and resnets. Secondly, why not just do it? + +",6,4.0,ICLR2018 +H1e4b3x9hQ,3,BkfhZnC9t7,BkfhZnC9t7,Review,"This paper presents an approach to address the task on zero-shot learning for speech recognition, which consist of learning an acoustic model without any resources for a given language. The universal phonetic model is proposed, which learns phone attributes (instead of phone label), which allows to do prediction on any phone set, i.e. on any language. The model is evaluated on 20 languages and is shown to improve over a baseline trained only on English. + +The proposed UPM approach is novel and significant: being able to learn a more abstract representation for phones which is language-independent is a very promising lead to handle the problem of ASR on languages with low or no resources available. + +However, the results are the weak point of the paper. While the results demonstrate the viability of the approach, the gain between the baseline performance and the UPM model is quite small, and it's still far from being usable in practice. + +To improve the paper, the authors should discuss the future work, i.e. what are the next steps to improve the model. + +Overall, the paper is significant and can pave the way for a new category of approaches to tackle zero-shot learning for speech recognition. Even if the results are not great, as a first step they are completely acceptable, so I recommend to accept the paper. + +Revision: +The approach of using robust features is interesting and promising, as well as the idea of training on multiple languages. Overall, the authors response addressed most of the issues, therefore I am not changing my rating.",7,4.0,ICLR2019 +ByggukyLhQ,1,HJx4KjRqYQ,HJx4KjRqYQ,"simple idea that might be useful, but unnecessarily complicated exposition","This paper proposes training latent variable models (as in VAE decoders) by running HMC to approximate the posterior of the latents, and then estimating model parameters by maximizing the complete data log-likelihood. This is not a new idea by itself and is used e.g. as a baseline in Kingma and Welling's original VAE paper. The novelty in this paper is that it proposes tuning the parameters of the HMC inference algo by maximizing the likelihood achieved by the final sample in the MCMC chain. This seems to work well in practice and might be a useful method, but it is not clear under what conditions it should work. + +The paper is written in an unnecessarily complicated and formal way. On first read it seems like the proposed method has much more formal justification than it really has. The discussion up to section 3.5 makes it seem as if there is some new kind of tractable variational bound (the ERLBO) that is optimized, but in practice the actual objective in equation 16 is simply the likelihood at the final sample of the MCMC chain, that is Monte Carlo EM as e.g. used by Kingma & Welling, 2013 as a baseline. The propositions and theorems seem to apply to an idealized setting, but not to the actual algorithm that is used. They could have been put in an appendix, or even a reference to the exisiting literature on HMC would have sufficed. + +The experiments do not clearly demonstrate that the method is much better than previous methods from the literature, although it is much more expensive. 
(The reported settings require 150 likelihood evaluations per example per minibatch update, versus 1 likelihood evaluation for a VAE). Also see my previous comments about evaluation in this paper's thread. + +- Please explain why tuning the HMC algo by maximizing eq 16 should work. I don't think it is a method that generally would work, e.g. if the initial sample z0 ~ q(z|x) is drawn from a data dependent encoder as in HVI (Salimans et al) then I would expect the step size of the HMC to simply go to zero as the encoder gets good. However in your case this does not happen as the initial sample is unconditional from x. Are there general guidelines or guarantees we can conclude from this? + +- The authors write ""Because MPFs are equivalent to ergodic Markov chains, the density obtained at the output of an MPF, that is, qL, will converge to the stationary distribution π as L increases."" + +This is true for the idealized flow in continuous time, but HMC with finite step size does generally NOT converge to the correct distribution. This is why practical use of HMC includes a Metropolis-Hastings correction step. You omit this step in your algorithm, with the justification that we don't care about asymptotic convergence in this case. Fair enough, but then you should also omit all statements in the paper that claim that your method converges to the correct posterior in the limit. E.g. the writing makes it seem like Proposition 2 and Theorem 1 apply to your algorithm, but it in fact they do not apply for finite step size. Maybe the statements are still correct if we take the limit with L->inf and the stepsize delta->0 at a certain rate? This is not obvious to me. + +In practice, you learn the stepsize delta. Do we have any guarantees this will make delta go to zero at the right rate as we increase the number of steps L? I.e. is this statement from your abstract true? -> ""we propose a novel method which is scalable, like mean-field VI, and, due to its theoretical foundation in ergodic theory, is also asymptotically accurate"". (convergence of uncorrected HMC only holds in the idealized case with step size -> 0)",4,5.0,ICLR2019 +H1xFygmyz,1,BJB7fkWR-,BJB7fkWR-,"This paper contains interesting ideas, but it is not ready for publication.","In this paper, the authors propose a new approach for learning underlying structure of visually distinct games. + +The proposed approach combines convolutional layers for processing input images, Asynchronous Advantage Actor Critic for deep reinforcement learning task and adversarial approach to force the embedding representation to be independent of the visual representation of games. + +The network architecture is suitably described and seems reasonable to learn simultaneously similar games, which are visually distinct. However, the authors do not explain how this architecture can be used to do the domain adaptation. +Indeed, if some games have been learnt by the proposed algorithm, the authors do not precise what modules have to be retrained to learn a new game. This is a critical issue, because the experiments show that there is no gain in terms of performance to learn a shared embedding manifold (see DA-DRL versus baseline in figure 5). +If there is a gain to learn a shared embedding manifold, which is plausible, this gain should be evaluated between a baseline, that learns separately the games, and an algorithm, that learns incrementally the games. +Moreover, in the experimental setting, the games are not similar but simply the same. 
+ +My opinion is that this paper is not ready for publication. The interesting issues are referred to future works. +",3,3.0,ICLR2018 +H1gbR6xTtB,1,SkeJPertPS,SkeJPertPS,Official Blind Review #3,"Summary: +This paper introduces a method for domain adaptation, where each domain has noisy examples. Their method is based on a decision tree in which the data at each node are split into equal sizes while maximizing the +information gain. They also proposed a way to reduce domain alignment. Their method is tested on several noisy domain adaptation settings and performs better than other baseline methods. + +Pros: +Their idea to utilize a decision tree for domain adaptation sounds novel. +Experiments indicate the effectiveness of their method. + +Cons: +This paper is not well-written and has many unclear parts. +1, The presentation of the problem set is unclear throughout this paper. In the abstract, they mentioned that they tackle the situation where both source and target domains contain noisy examples. However, they did not define the exact problem setting in any section. I could not understand what kind of problem setting motivated their method, which makes it hard to understand their method. +2, How they actually optimized the model is also unclear. From Eq 1~4, it is hard to grasp how they trained the model. +3, In open-set domain adaptation, simply minimizing domain-distance can harm the performance. How does the method avoid this issue? It was also unclear. +4, Experimental setting seems to be wrong and unclear. In Openset1, they say that ""The labels from 1 to 10 of both source and target domains are marked as the known class, and all data with label 11∼20 in the source domain and label 21∼31 in the +target domain are used as one unknown class"". However, Saito et al. (2018) used 21-31 classes in the target domain as one unknown class. In addition, ""According to Saito et al. (2018) the target data of the unknown class is not used in training, "", they used the 21-31 classes for training in an unsupervised way. How is this method used to detect unknown class? Is there any threshold value set for it? +5, The experimental setting is unclear. In 4.4, "", we use only 10% of training samples"", does it mean 10 % training source examples or target examples? This setting is also unclear. + +From the cons written above, this paper has too many unclear parts in the experiments and method section. I cannot say the result is reproducible given the content of the paper and the result is a reliable one. They need to present more carefully designed experiments. +",3,,ICLR2020 +RKei0M_bnpk,2,MY5iHZ0IZXl,MY5iHZ0IZXl,"Potentially interesting, but not clear what the question is, why this method, or what answers we get","This paper analyzes functioning of BERT by identifying gradient-based influence paths to track flow of influence between model inputs and outputs. Such analysis using influence paths has been established in prior work, but the present paper expands on this work with definition of ""multi-partite"" patterns, and by introducing a method for identifying strongly influential paths in the model. The authors evaluate on existing datasets for studying subject-verb agreement and reflexive anaphora, and they report measures of concentration of the influence flows, and performance of the model after compression based on the identified influence paths. 
+ +Overall, I think there are certainly interesting analyses that could come out of this work, but the current paper does not provide a clear enough contribution to be ready for publication.  + +The reported results don't give us much to go on in terms of interpreting what we have learned from these analyses. The authors choose two measures: ""concentration"" of the influence paths, and drop in accuracy when compressing the model based on those paths. I'm not sure what we should be taking away from the concentration numbers, and the authors don't provide any substantive discussion of why this matters. There is also no meaningful baseline against which to compare these numbers. The second measure, accuracy of the compressed model, is a bit easier to interpret, in that it allows us to verify the extent to which critical information flow is indeed occurring within the identified paths, but while this verification serves to increase confidence in the method, it's not clear what it tells us about the functioning of the BERT model. (The authors also do not give a clear statement of what task these accuracies are in fact for, though I assume by default that they are referring to accuracy in assigning higher probability to the grammatical option over the ungrammatical option in the selected Marvin & Linzen datasets.) There is also no discussion of how we should interpret results from the attention-based versus the embedding-based influence paths. + +Zooming out further, there isn't a clearly identifiable question being asked in this paper, and in particular there is no clear connection between the analysis method being used and the particular linguistic phenomena embodied by the chosen evaluation data. What question is being asked about the model's handling of these linguistic phenomena, and why is this method appropriate for asking it? What output of the analysis will be used as an answer to the key questions? These connections should be defined clearly and from the start. As it is, I'm unclear on what these results tell us about the chosen linguistic phenomena (or, conversely, why this set of phenomena was chosen to showcase utility of this method). The measures reported have some variation between sentence types -- but what does it mean that there is higher positive embedding influence concentration for sentences with subject relative clauses, or higher negative embedding influence concentration for within sentence complement sentences? What is the meaning of the accuracies dropping more for subject relative clauses and number agreement among the embedding influence paths, and all accuracies seemingly dropping for the attention influence paths?   + +In Section 4.2 there are a couple of observations about the influence paths that make some concrete connections to potential questions about syntactic dependencies in the chosen datasets. However, because no formal connection between these things has been established prior to this point (no clear question, no clear linking hypothesis between the analysis and the phenomena) this set of observations comes a bit out of the blue, and leaves us still without clear takeaways. 
In sum, I think that there is potential for interesting insights about these phenomena to come out of this analysis method, but the paper would benefit from much more clearly defined questions and linking assumptions.",6,4.0,ICLR2021 +YM69dKGw_Xv,2,nXSDybDWV3,nXSDybDWV3,"A novel library for Stein VI, but the paper lacks details of the examples and evaluation","### Summary +This paper introduces EinStein VI: a lightweight composable library for Stein Variational Inference (Stein VI). The library is built on top of NumPyro and can take advantage of many of NumPyro's capabilities. It supports recent techniques associated with Stein VI, as well as novel features. The paper provides examples of using the EinStein VI library on different probabilistic models. + +### Strengths +I'm not aware of other PPLs that support Stein Variational Inference. EinStein VI can provide an easier way to compare different Stein VI algorithms, and make research in the area easily reproducible.   + +### Concerns +The paper states that it provides examples that demonstrate that EinStein VI's interface is easy-to-use and performs well on pathological examples and realistic models. While it is true that there are several examples described, in my opinion there are not enough details to support the claims that EinStein VI is easy to use and performs well. A concrete comparison between EinStein VI and other methods is missing. It would have been helpful to have, for example, some concrete numbers (e.g. time taken to do inference, posterior predictive checks, posterior mean convergence plots, etc) that showcase why it is useful to use Stein VI for those examples, as opposed to other, already existing methods. + +Another concern is that it is difficult to judge from the paper what the difference to (standard) NumPyro is. There is only a high-level explanation of the examples in the paper, so it's hard to imagine what the actual code looks like. Most importantly, I would have liked to see a comparison between EinStein VI code and what the code would have looked like without EinStein VI. + +### Reasons for score +Unfortunately, there is not enough to go on in this paper, which is why I recommend reject. There is no strong evidence to support either the usability of the system (through elaborate examples and contrasting EinStein VI to other systems) or its performance (through experiments). This paper will be much stronger, and will have a better chance of reaching more people, if it includes either 1) more elaborate code examples that demonstrate that using EinStein is indeed better and easier than vanilla NumPyro, or 2) experiments comparing different Stein VI techniques to other inference algorithms, as evidence that a dedicated Stein VI library is indeed empowering our inference toolkit. + +However, I do appreciate that writing a paper about tools / libraries is difficult, as the contribution of tools is typically a longer-term improvement in the workflow of developing new methods and techniques. I am open to increasing my score during rebuttal, depending on the answers of the questions listed below. + +### Questions for the authors +Why has Stein VI not been implemented in PPL systems previously? Is it a matter of timing, or is there something particularly challenging about integrating Stein VI into a PPL? + +The paper mentions ""compositionality"" several times. I was a little confused about what you mean by that: can you explain, perhaps with an example? 
+ +The paper mentions novel features (second to last paragraph page 8): can you elaborate? + +The paper shows an example of using NeuTra in combination with Stein VI. Can you elaborate on the kind of problems that NeuTra won't be able to handle on its own? What about more lightweight approaches that can be applied in the context of probabilistic programming, such as ""Automatic Reparameterisation of Probabilistic Programs"" (Gorinova, Maria I., Dave Moore, and Matthew D. Hoffman. ICML 2020)? When will we see benefits of *both* applying a reparameterization that improves the posterior geometry, *and* using a more sophisticated inference algorithm like Stein VI? + +### Suggestions for improvement and typos that have not affected the score I gave to the paper +Perhaps the most important change that would improve the paper is adding more concrete examples that would showcase the importance of using EinStein VI as opposed to simply NumPyro / other libraries. It would be nice to see a model where Stein VI gives us better inference results than a range of other algorithms / techniques and compare the code to what the user would have to write otherwise to achieve the same results. The examples of composing Stein VI with reparameterization / marginalization in NumPyro can be improved by comparing the results to Stein VI without reparameterization / marginalization and to other inference algorithms with reparameterization / marginalization. + +Typos: +* last line of the abstract should be 500 000 as opposed to 500.000. +* URL in footnote 3 does not lead to the correct page +",4,4.0,ICLR2021 +SJlojG-Z6X,3,HJGtFoC5Fm,HJGtFoC5Fm,Theorem 2.1 -2.2 are interesting,"Overall I found that the paper does not clearly compare the results to existing work. There are some new results, but some of the results stated as theorems are immediate consequence of existing work and a more detailed discussion and comparison is warranted. I will first give detailed comments on the establishing the relationship to existing work and then summarize my evaluation. + +———— +Detailed comments on contributions and relationships to existing work. + +A. Theorem 2.1 establishes the limit of the regularized solutions as the maximum margin separator. +This result is a generalization the analogous results for linear models Theorem 3 in Rosset et al. (2004) “Boosting as a regularized path to maximum margin separator” and Thm 2.1 in Rosset Zhu Hastie “margin maximizing loss functions” (the later paper missing from references, and that paper generalizes the earlier result for multi-class cross entropy loss). +Main difference from earlier work: +1. extends the results for linear models to any homogeneous function +2. (minor) the previous results by Rosset et al. were stated only for lp norms, but this is a minor generalization since the earlier work didn’t at any point use the lp-ness of the norm and immediately extends for any norms. + +Secondly, Theorem 2.2 also gives a bound on deviation of margin when the regularization is not driven all the way to 0. I do think this theorem would be differently stated by making the explicitly showing dependence of suboptimal margin \gamma’ on lambda and the sub optimality constant of loss. This way, one can derive 2.1 as a special case and also reason about what level of sub-optimality of loss can be tolerated. + +B. Theorem 3.1 derives generalization bounds of learned parameters in terms of l2 margin. 
+—this and many similar results connecting generalization to margins have already been studied in the literature (Neyshabur et al. 2015b for example covers a larger family of norms than just l2 norm). Specially an analogous bound for l1 margin can also be found in these work which can be used in the discussions that follow. + +C. Theorem 3.2: This result to my knowledge is new, but also pretty immediate from definition of margin. The proof essentially follows by showing that having more hidden units can only increase the margin since the margin is maximized over a larger set of parameters. + +D. Comparison to kernel machines: Theorem 3.3 seems to be the paraphrasing of corollary 1 in Neyshabur et al (2014). But the authors claim that the Theorem 3.3 also holds when “the regularizer is small”. I do not understand what the authors are referring to here or how the result is different form existing work. Please clarify + +----------- +In summary, The 2.1-2.2 on extension of the connection between regularized solution and maximum margin solution to general homogeneous models and to non-asymptotic regimes +-- this is in my opinion key contribution of the paper and an important result. But there is not much new technique in terms of proof here +",5,4.0,ICLR2019 +HkeJCJAY2m,1,SygONjRqKm,SygONjRqKm,This paper presents a new way to construct the context vector in seq2seq networks.,"This paper presents a new way to construct the context vector in seq2seq networks. Specifically, the context vector in the proposed model is treated as a latent variable with a normal prior. The posterior is ""defined"" as a mixture of the latent presentations of the input sequence and the mixing weight is the attention. To do inference, maximising VAE-like ""ELBO"" is used. The proposed model is reported to achieve better performance in the seq2seq tasks including document summarisation and video caption. + +My comments are as follows: + +1. Although the authors claim a full statistical treatment of the context vector, the proposed model is not a proper statistical setting to me. I do not mean the setting is wrong, but it is really weird to me. The paper treats the context vector as a latent variable, which is used to generate the target sentence (take document summarisation task for example). The proper setting is to use the target sentence to do inference of the posterior of the context vector. However, the paper directly defines the format the ""posterior"", which is a mixture of the hidden representation of the source sentences, ignoring the target sentence. Therefore, I do not think this is the true posterior of the context vector. Moreover, seriously speaking, the ""ELBO"" of the proposed model is not a real ELBO. It is a training object function looked like an ELBO. Therefore, the difference of the proposed context vector with the standard one is instead of using a fixed hidden representation, the proposed one uses a stochastic one drawn from a Gaussian, the covariance of which is from a neural network. + +2. To me, it is less intuitive to analyse why the performance gain comes from, given the proposed context vector works so better than the previous ones. I am expecting the authors to give some intuitive explanations and empirical results on this. + +3. Minor comments on clarity: The abstract says both the mean and the covariance are modelled with neural networks, but in the paper, it seems that only the covariance is from a neural network. 
+",5,4.0,ICLR2019 +r1g9TUss3m,3,SkVRTj0cYQ,SkVRTj0cYQ,Differentially private variant of the federated learning framework,"The paper revisits the federated learning framework from McMahan in the context of differential privacy. The general concern with the vanilla federated learning framework is that it is susceptible to differencing attacks. To that end, the paper proposes to make the each of the interaction in the server-side component of the gradient descent to be differentially private w.r.t. the client contributions. This is simply done by adding noise (appropriately scaled) to the gradient updates. + +My main concern is that the paper just described differentially private SGD, in the language of federated learning. I could not find any novelty in the approach. Furthermore, just using the vanilla moment's accountant to track privacy depletion in the federated setting is not totally correct. The moment's accountant framework in Abadi et al. uses the ""secrecy of the sample"" property to boost the privacy guarantee in a particular iteration. However, in the federated setting, the boost via secrecy of the sample does not hold immediately. One requirement of the secrecy of the sample theorem is that the sampled client has to be hidden. However, in the federated setting, even if one does not know what information a client sends to the servery, one can always observe if the client is sending *any* information. For a detailed discussion on this issue see https://arxiv.org/abs/1808.06651 .",4,4.0,ICLR2019 +SJlzds52FS,1,B1gdkxHFDH,B1gdkxHFDH,Official Blind Review #3,"This paper proposes a new definition of algorithmic fairness that is based on the idea of individual fairness. They then present an algorithm that will provably find an ML model that satisfies the fairness constraint (if such a model exists in the search space). One needed ingredient for the fairness constraint is a distance function (or ""metric"") in the input space that captures the fact that some features should be irrelevant to the classification task. That is, under this distance function, input that differ only in sensitive attributes like race or gender should be close-by. The idea of the fairness constraint is that by perturbing the inputs (while keeping them close with respect to the distance function), the loss of the model cannot be significantly increased. Thus, this fairness constraint is very much related to robustness. + +--- + +Overall, I like the basic idea of the paper but I found the presentation lacking. + +I do think their idea for a fairness constraint is very interesting, but it gets too bogged down in the details of the mathematical theory. They mention Dwork et al. at the beginning but don't really compare it to their idea in detail, even though I think there would be a lot of interesting things to say about this. For example, the definition by Dwork et al. seems to imply that some labels in the training set might be incorrect, whereas the definition in this paper does not seem to imply that (which I think is a good thing). + +The main problem in section 2 is that the choice of distance function is barely discussed although that's what's most important to make the result fair. For all the mathematical rigor in section 2, the paragraph that is arguing that the defined constraint encourages fairness is somewhat weak. Here a comparison to other fairness definitions and an in-depth discussion of the distance function would help. 
+ +(In general I felt that this part was more trying to impress the reader than trying to explain, but I will try to not hold it against this paper.) + +As it is, I feel the paper cannot be completely understood without reading the appendix. + +There is also this sentence at the bottom of page 5: ""A small gap implies the investigator cannot significantly increase the loss by moving samples from $P_*$ to comparable samples."" This should have been at the beginning of section 2 in order to motivate the derivation. + +In the experiments, I'm not sure how useful the result of the word embedding experiment really is. Either someone is interested in the sentiment associated with names, in which case your method renders the predicted sentiments useless or someone is not interested in the sentiment associated with names and your method doesn't even have any effect. + +Final point: while I like the idea of the balanced TPR, I think the name is a bit misleading because, for example, in the binary case it is the average of the TPR and the TNR. Did you invent this terminology? If so, might I suggest another name like balanced accuracy? + +I would change the score (upwards) if the following things are addressed: + +- make it easier to understand the main point of the paper +- make more of a comparison to Dwork et al. or other fairness definitions +- fix the following minor mistakes + +Minor comments: + +- page 2, beginning of section 2: you use the word ""regulator"" here once but everywhere else you use ""investigator"" +- equation 2.1: as far as I can tell $M$ is not defined anywhere; you might mean $\Delta (\mathcal{Z})$ +- page 3, sentence before Eq 2.3: what does the $\#$ symbol mean? +- page 3, sentence before Eq 2.3: what is $T$? is it $T_\lambda$? +- Algorithm 2: what is the difference between $\lambda^*_t$ and $\hat{\lambda}_t$? +- page 7: you used a backslash between ""90%"" and ""10%"" and ""train"" and ""test"". That would traditionally be a normal slash. +- in appendix B: the explanation for what $P_{ran(A)}$ means should be closer to the first usage +- in the references, you list one paper twice (the one by Zhang et al.) + +EDIT: changed the score after looking at the revised version",6,,ICLR2020 +wtmokxTtFy8,2,TgSVWXw22FQ,TgSVWXw22FQ,Previous concerns not fully addressed,"This paper proposes a zero-shot voice style transfer (VST) algorithms that explicitly controls the disentanglement between content information and style information. Experiments show that the proposed algorithm can achieve significant improvement over the existing state-of-the-art VST algorithms. There are two major strengths of this paper. First, it motivates the algorithm design from an information-theoretic perspective. Second, the performance improvement is significant. + +However, since it is a resubmission from a previous machine learning conference, the previous concerns regarding limited novelty are not fully addressed. More specifically, if we view the proposed algorithm entirely from a technical perspective, there are two major innovations over AutoVC: 1) The style embedding is trained together with the content embedding, instead of being pre-trained; 2) The introduction of I3. The latter does not seem fully justified. + +First, it is shown in Table 4 that without I3, the drop in performance is not obvious. The authors ascribe this to that I1 and I2 already suffice to train the good model. Does it mean that the introduction of I3 is not as important an innovation as co-training? 
+ +Second and more importantly, in all the experiments, the proposed system retains AutoVC’s physical bottleneck design, which was the key to disentangling style in AutoVC. In order to justify that I3 is a better disentangling mechanism than AutoVC, it is necessary to perform an ablation study where the bottleneck is widened and see if I3 still guarantees disentanglement, without which it is hard to justify the value of I3. + +Besides the concern regarding novelty, there are a few other concerns. There lacks a back-to-back comparison in Figure 2. What do the embeddings look like for AdaINVC and AutoVC? Also, Figure 2 only shows that the content embedding does not include style information. It would also be helpful to show the style embedding does not include content information by showing the content embedding and style embedding cluster with respect to different phones. + +To sum, without strong supporting evidence for the novel design in IDE-VC, it is hard to judge the contribution of this paper. I would look forward to more thorough evaluations in the rebuttal.",6,5.0,ICLR2021 +4Q5lyjLuerz,1,O6LPudowNQm,O6LPudowNQm,Synthetic dataset generator for inequality statements over ordered fields,"The paper describes a synthetic dataset generator for inequality +statements over ordered fields. The reason is to provide a test of +generalization ability of models in interactive theorem proving tasks +with Lean examples given as motivation. A lightweight syntax tree +based prover is used by the machine learning agents to attempt solving +the generated statements. GNN experiments with various masures show +some generalization ability. + +I appreciate the developed inequality theorem generator and the +provide API. They indeed could be useful for further investigations. + +I am however still not convinced that the considered measures +correspond well to the generalization ability that the paper wants to +show. For me the task is very simple in comparison with the claims. +I am happy about the one given generalization example, but other than +this what I can read from the paper is agents can solve test problems +similar to those seen during training. As such, I do not think that the +research done in the paper supports the conclusions the authors draw. +The same holds for the out-of-distribution generalization where again +the experiments show a small generalization capability but I am not +convinced that this approach can in general lead to generalization. + +I appreciate the addition of the Monte Carlo tree search to the base +model and the consideration of some models (even if still short). +",6,2.0,ICLR2021 +ryegpfeaYS,2,rJehVyrKwH,rJehVyrKwH,Official Blind Review #2,"This paper suggests a quantization approach for neural networks, based on the Product Quantization (PQ) algorithm which has been successful in quantization for similarity search. The basic idea is to quantize the weights of a neuron/single layer with a variant of PQ, which is modified to optimize the quantization error of inner products of sample inputs with the weights, rather than the weights themselves. This is cast as a weighted variant of k-means. The inner product is more directly related to the network output (though still does not account for non-linear neuron activations) and thus is expected to yield better downstream performance, and only requires introducing unlabeled input samples into the quantization process. This approach is built into a pipeline that gradually quantizes the entire network. 
+ +Overall, I support the paper and recommend acceptance. PQ is known to be successful for quantization in other contexts, and the specialization suggested here for neural networks is natural and well-motivated. The method can be expected to perform well empirically, which the experiments verify, and to have potential impact. + +Questions: +1. Can you comment on the quantization time of the suggested method? Repeatedly solving the EM steps can add up to quite an overhead. Does it pose a difficulty? How does it compare to other methods? +2. Can you elaborate on the issue of non-linearity? It is mentioned only briefly in the conclusion. What is the difficulty in incorporating it? Is it in solving equation (4)? And perhaps, how do you expect it to effect the results?",8,,ICLR2020 +r1gpVv_gM,1,SJxE3jlA-,SJxE3jlA-,"Interesting similarity function, but insufficient evidence of generality","There are a number of attempts to add episodic memory to RL agents. A common approach is to use some sort of recurrent model with a model-free agent. This work follows this approach using what could be considered a memory network with a identity embedding function and tests on 'Concentration', a game which requires matching pairs of cards. They find their model outperforms a DNC and LSTM baselines. + +The primary novelty is the use of an explicitly masked similarity function (with learned mask) and the concentration task, which requires more memory than, for example, common tasks adapted from the psychology literature such as the Morris watermaze or T-maze (although in the supervised setting tasks such as Omniglot are quite similar). + +This work is well-communicated and cites relevant prior work. The author's should also be commended for agreeing to release their code on publication. + +The primary weakness of this work its lack of novelty and lack of evidence of generalization of the approach, which limits its significance. The model introduced is a slight variant of memory networks. Additionally, the single task the model is tested on appears custom-designed to favor the model (see next paragraph). While the analysis of the weakness of cosine similarity is interesting, memory networks which compute separate embeddings for the 'label' (content-based label for retrieval) and memory content don't appear to suffer from the same issue as the DNC. They can store only retrieval-relevant content in the label and thus avoid issues with normalization. + +The observation vector is stored directly in memory without passing through an embedding function, which in general seems quite limiting. However, in the constructed task the labels are low-dimensional, random vectors and there is no noise in the labels (i.e. two cards with the same label are labelled identically, rather the similarly). The author's mention avoiding naturalistic labels such as omniglot characters (closer to the real version of concentration) due to the possibility the agent might memorise the finite set of labels, however by choosing a large dataset and using a non-overlapping set of examples for the test set this probably could be avoided and would provide a more naturalistic test set. + +The comparison with the DNC also seems designed to favor their model. DNC has write-gates, which might be relevant in a task with many irrelevant observations, but in this task are clearly going to impair learning. A memory network seems the more appropriate comparison. Its not clear why the DNC model used two different DNCs for computing the policy and value. 
+ +To demonstrate their model is of more general interest it would be necessary to try on a wider range of more naturalistic tasks and a comparison with model-free agents augmented with memory networks. Simply showing that a customized model can outperform on a single custom, synthetic task is insufficient to demonstrate that these changes are of wider interest. + +Minor issues: +- colorblind seems an odd description for agents which cannot perceive the card face. Why not just 'blind'? colorblind would seem to imply partial perception of the card face. + +- the observations of the environment are defined explicitly, but not the action space.",4,5.0,ICLR2018 +HJxqzYDt2Q,1,r1gEqiC9FX,r1gEqiC9FX,Review,"The authors propose a new weight re-parameterization technique called Equi-normalization (ENorm) inspired by the Sinkhorn-Knopp algorithm. The authors show that the proposed method preserve functionally equivalent property in respect of the output of the functions (Linear, Conv, and Max-Pool) and show also that ENorm converges to the global optimum through the optimization. The experimental results show that ENorm performs better than baseline methods on CIFAR-10 and ImageNet datasets. + +pros) +(+) The authors provide a theoretical ground. +(+) The theoretical analysis of the convergence of the proposed algorithm is well provided. +(+) The computational overhead reduced by the proposed method compared with BN and GN looks good. + +cons) +(-) There is no comparison with other weight reparameterization methods such as Weight Normalization, Normalization propagation, Instance Normalization, or Layer Normalization. +(-) The evidence why functionally equivalence is connected to the performance or generalization ability is not clarified. +(-) The experimental results cannot consistently show the effectiveness of the proposed method in test accuracy. In Table 4, the proposed method outperforms BN, but In Table 2 and 3, BN is mostly better than the proposed method. +(-) The batch size shown in Table 2 and 3 may be intended to show the batch-independent property of the proposed method, but BN is also doing well in those tables. Therefore, Table 2 and 3 are not adequate to show the batch-independent property. +(-) The proposed method should evaluate with deeper networks (e.g., ResNet-50, ResNet101, or DenseNet-169) to support the superiority over BN and GN. +(-) Adjusting c does not seem to be promising. In Table 2 and 3, ENorm-1 is better than ENorm-1.2, and also in Table 4, only the result of ENorm-1 is provided. The authors should do a parameter study with c to make all the experiments more convincing. + +Comments) +- The experimental settings are not consistent. The authors should provide the reason why they set those settings or should include some studies about the parameters (for example about the paramter c). +- Section 3.7 is not clear to me. How's the performance going on when adjusting c < 1? +- It is better for the authors to provide the Sinkhorn-Knopp algorithm (SK algorithm), which gave them an inspiration for this work, for better readability. +- Why eq.(4) is necessary? For iterative optimization? If so, the authors should incorporate a detailed explanation about this in the corresponding section. +- The authors should provide a detailed description of the parameter c. It is not clear why c is necessary, and please make sure the overall derivation does not need to be modified due to the emergence of c. +- It seems that the authors could compact the paper by highlighting key ideas. 
+- Typo: Annex A (on p.5). + +The paper is written well and provides a sound theoretical analysis to show the main idea, but unfortunately, the experimental results do not seem to support the effectiveness of the proposed method.",5,3.0,ICLR2019 +C6aX6eJwQkp,1,kE3vd639uRW,kE3vd639uRW,"Simple yet effective method, but need more analysis / evidence","This paper proposes a simple downscaling / upscaling method LiftPool, inspired by Lifting Scheme from signal processing. Compared with traditional max / avg pooling used in CNNs, LiftPool decomposes input signal into a downscaled approximation and difference component, which can be used for classification (by summing both components) or segmentation (by skip-layer connections). Experiments on both tasks show proposed method yields better performance, and additional analysis on stability indicates LiftPool is more robust than other pooling methods. + +Pros: +- A simple but effective method by transferring knowledge from signal processing to vision/ML. +- Pooling is a fundamental block in CNN architecture and this work would benefit a wide range of research and applications +- Consistently better experiment results compared with other commonly used pooling methods, by a significant margin +- Paper is relatively clear written and easy to follow + +Cons: +- My main concern with this paper is lacking measurements over # of params, FLOPs and latency. Unlike traditional pooling methods which are usually parameter-free and (relatively) fast to compute, LiftPool does need (from authors) ""convolution operators followed by non-linear relu operators"" to simulate the filters in $\mathcal{P}$ and $\mathcal{U}$. The implementation details of these conv operators are missing (except the kernel size is 5, from ablation study). It is unclear whether the performance boost of proposed method is from the effectiveness of LiftPool, or from added capacity of network with more parameters and computations. + + +Minor comments: +- Abstract: ""upsampling a down-scaled feature map loses much information"": this is not necessarily true. Downscaling could lose information while upscaling alone usually don't bring more information. Moreover, CNNs could still ""memorize"" spatial information in its depth channel (consider a space-to-depth that reduces spatial resolution but still preserves all information). + +- Both $\mathcal{P}$ and $\mathcal{U}$ should be real-valued and using conv + relu might limit filter responses to non-negative values? If multiple conv layers are used and the last one do not have any activation, please specify. + +- Eq 4/5 proposes additional loss to help training LiftPool but its effect is not backed up by experiments. + +- Figure 3: it might be beneficial to show the original high res feature map together with baseline pooling results (e.g. max / avg pooled) to demonstrate information preserved by LiftPool. + +- Experiments conducted are more towards lightweight backbones (ResNet18/50 and MobilenetV2). Would LiftPool also work well with larger architectures? This is specially important given the extra params and computes LiftPool brings. + +- On segmentation experiments: It is well-known that bringing low level features in decoders of modern segmentation networks could boost model performance. How would LiftPool work with architectures that have decoder with skip-connections? The results on DeeplabV3plus looks good, but it is more from transferability (pretrained on image classification using liftpool). 
+ +- On deeplabv3plus setup: it would be great to clarify more details on this model, e.g. decoder and output stride used in training and validation. + +Final review: + +The authors addressed most of my concerns. Just a nit that it could be helpful to add the total FLOPs into the table (together with # of params) just for completeness. ",7,4.0,ICLR2021 +S1ZMS80Qx,1,r1R5Z19le,r1R5Z19le,,"This work presents an embedding approach for semi-supervised learning with neural nets, in the presence of little labeled data. The intuition is to learn a metric embedding that forms “clusters” with the following desiderata: two labeled examples from the class should have a smaller distance in this embedding compared to any example from another class & a given unlabeled example embedding will be closer to all of the embeddings of *some* label (i.e. that a given unlabeled example will be “matched” to one cluster). The paper formulates these intuitions as two differentiable losses and does gradient descent on their sum. + +It’s unclear to me how different is this work from the sum of Hoffer & Ailon (2015) (which is eq. 3) and Grandvalet & Bengio (2004) (seems to be related to eq. 4). Would be nice if the authors not only cited the previous work but summarized the actual differences. +In Section 5.1, the authors say that Szegedy et al. (2015) use random noise in the targets -- is that actually true? I think only soft targets are used (which are not noisy). + +Does the choice of \lambda_{1,2} make a difference? + +How is k for k-NN actually chosen? Is there a validation set? + +Figure 1 would benefit from showing where the labeled examples were at the beginning of training (relative to each other / rest of the data). + +The submission seems overall OK, but somewhat light on actual data-driven or theoretical insights. I would’ve liked experiments showing the influence of data set sizes at the very least, and ablation experiments that showed the influence of each of the corresponding losses. + + +",6,4.0,ICLR2017 +S1xvCfDD6X,2,HJz6tiCqYm,HJz6tiCqYm,An important benchmark for measuring the robustness of computer vision models,"This paper introduces new benchmarks for measuring the robustness of computer vision models to various image corruptions. In contrast with the popular notion of “adversarial robustness”, instead of measuring robustness to small, worst-case perturbations this benchmark measures robustness in the average case, where the corruptions are larger and more likely to be encountered at deployment time. The first benchmark “Imagenet-C” consists of 15 commonly occurring image corruptions, ranging from additive noise, simulated weather corruptions, to digital corruptions arising from compression artifacts. Each corruption type has several levels of severity and overall corruption score is measured by improved robustness over a baseline model (in this case AlexNet). The second benchmark “Imagenet-P” measures the consistency of model predictions in a sequence of slightly perturbed image frames. These image sequences are produced by gradually varying an image corruption (e.g. gradually blurring an image). The stability of model predictions is measured by changes in the order of the top-5 predictions of the model. More stable models should not change their prediction to minute distortions in the image. Extensive experiments are run to benchmark recent architecture developments on this new benchmark. 
It’s found that more recent architectures are more robust on this benchmark, although this gained robustness is largely due to the architectures being more accurate overall. Some techniques for increasing model robustness are explored, including a recent adversarial defense “Adversarial Logit Pairing”, this method was shown to greatly increase robustness on the proposed benchmark. The authors recommend future work benchmark performance on this suite of common corruptions without training on this corruptions directly, and cite prior work which has found that training on one corruption type typically does not generalize to other corruption types. Thus the benchmark is a method for measuring model performance to “unknown” corruptions which should be expected during test time. + +In my opinion this is an important contribution which could change how we measure the robustness of our models. Adversarial robustness is a closely related and popular metric but it is extremely difficult to measure and reported values of adversarial robustness are continuously being falsified [1,2,3]. In contrast, this benchmark provides a standardized and computationally tractable benchmark for measuring the robustness of neural networks to image corruptions. The proposed image corruptions are also more realistic, and better model the types of corruptions computer vision models are likely to encounter during deployment. I hope that future papers will consider this benchmark when measuring and improving neural network robustness. It remains to be seen how difficult the proposed benchmark will be, but the authors perform experiments on a number of baselines and show that it is non-trivial and interesting. At a minimum, solving this benchmark is a necessary step towards robust vision classifiers. + +Although I agree with the author’s recommendation that future works not train on all of the Imagenet-C corruptions, I think it might be more realistic to allow training on a subset of the corruptions. The reason why I mention this is it’s unclear whether or not adversarial training should be considered as performing data augmentation on some of these corruptions, it certainly is doing some form of data augmentation. Concurrent work [4] has run experiments on a resnet-50 for Imagenet and found that Gaussian data augmentation with large enough sigma (e.g. sigma = .4 when image pixels are on a [0,1] scale) does improve robustness to pepper noise and Gaussian blurring, with improvements comparable to that of adversarial training. Have the authors tried Gaussian data augmentation to see if it improves robustness to the other corruptions? I think this is an important baseline to compare with adversarial training or ALP. + +Few specific comments/typos: + +Page 2 “l infinity perturbations on small images” + +The (Stone, 1982) reference is interesting, but it’s not clear to me that their main result has implications for adversarial robustness. Can the authors clarify how to map the L_p norm in function space of ||T_n - T(theta) || to the traditional notion of adversarial robustness? + +1. https://arxiv.org/pdf/1705.07263.pdf +2. https://arxiv.org/pdf/1802.00420.pdf +3. https://arxiv.org/pdf/1607.04311.pdf +4. 
https://openreview.net/forum?id=S1xoy3CcYX¬eId=BklKxJBF57",9,5.0,ICLR2019 +SJghIrYk9r,1,r1ecqn4YwB,r1ecqn4YwB,Official Blind Review #1,"This goal of this paper is to present a strong empirical result showing that a ""pure"" machine learning based method can outperform all known methods on some of the most challenging time series forecasting benchmarks (TOURISM, M3 and especially M4). Since I am not from the field of forecasting, I can not be sure of this, but from my understanding these benchmark datasets are indeed challenging and the cited references back up the claims of the paper related to these datasets being important in the field. + +On the most challenging dataset (M4), the best known performing method combines RNNs with a traditional smoothing algorithm. The model proposed in this paper outperforms it without being combined with any classical approach, though it does utilize ensembling. +The experimental setup is sound in my opinion, and the result appears to be of high potential significance. +However, despite trying to go through Section 3 multiple times, the exact model architecture is not clear to me. Due to this reason, my current decision is a weak rejection since the model is a central contribution of the paper. I will be happy to increase my score if the authors can make the model description crystal clear. + +Even though my expertise is deep neural network architectures, I find it hard to follow the descriptions in Sec. 3. I faced the most difficulty understanding section 3.1, which obviously made the rest of subsections even harder to follow. Here are my main points of confusion: + +- One big issue is that the paper uses an illustration (Fig. 1) to explain the architecture instead of equations, but then uses symbols in the main text that do not appear on Fig. 1 at all such as g_theta. Is the ""FC Stack (4 layers)"" g_theta? + +- Where are (uppercase) phi functions? I could infer that these are the ""FC"" blocks but they should be labeled. + +- The Figure has the symbols g^b_theta and g^f_theta that do not appear in the text description. What exactly do they do? And is the theta that parameterizes each of them the same theta that parameterizes g_theta? How is this possible if g_theta is the ""FC Stack (4 layers)""? + +- The description in second and third paragraph of Section 3.1 is very confusing and unclear. It should be replaced or augmented with equations using clearly defined symbols that match Figure 1. + +- More confusion stems from the use of the term ""parameters"" in (I believe) a different context than is used in neural networks, where ""parameters"" refers to connection weights. But here parameters are outputs of some functions, so either they are not connection weights or this is a fast-weight style architecture where outputs are weights [1], in which this should be made clear. + +- Design of the doubly residual architecture in Section 3.2 makes sense to me at a high level, but I feel it is still very hard to clearly understand and implement it. Again, use of equations to clearly define the computation would be very helpful. + + +[1] Schmidhuber, Jürgen. ""Learning to control fast-weight memories: An alternative to dynamic recurrent networks."" Neural Computation 4.1 (1992): 131-139. + + +--- Update after rebuttal --- + +I am happy to see the paper greatly improved by the authors in their updates. My concerns related to the presentation of the model have been addressed, and I find the architecture much easier to understand. 
I also appreciate the detailed supplementary material, which is likely to help readers interested in the area. Related to the areas I work in, I noticed the following missing references:

Densenets: Lang, K.J. and Witbrock, M.J., 1988, June. Learning to tell two spirals apart. In Proceedings of the 1988 connectionist models summer school (No. 1989, pp. 52-59).

Metalearning: Schmidhuber, J., 1987. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook (Doctoral dissertation, Technische Universität München).

While improving papers is generally the objective of the rebuttal phase, I suggest that the authors not treat this as an opportunity to submit unpolished papers in the first phase. That said, I have increased my rating to reflect my satisfaction with the current version of the paper.
",8,,ICLR2020
S1EEVuwlG,1,SyELrEeAb,SyELrEeAb,Interesting idea but needs more experiments and justification since it is a vast field and not all aspects of the problem are accounted for.,"In this paper, the authors propose to use the so-called implicit model to tackle the Genome-Wide Association (GWA) problem. The model can be viewed as a variant of a Structural Equation Model. Overall, the paper is interesting and relatively well written, but some important details are missing and considerably more experiments are needed to show the effectiveness of the approach.

* How do the authors decide that a variant is associated with the phenotype (y)? More specifically, what is the distribution under the null hypothesis? Section D.3 in the appendix does not explain the hypothesis testing part well. This method models $x$ (genetic), $y$ (phenotype), and $z$ (confounder) but does not have a latent variable for the association. For example, there is no latent indicator variable (e.g., Spike-Slab models [1]) for each variant. Did they do hypothesis testing separately after fitting the model? If so, this has a double-dipping problem, because the data is used once to fit the model and again to perform statistical inference.

* In GWAS, a method resulting in high power with control of false positives (FP) is favored. In traditional univariate GWAS, the false positive rate is controlled by the genome-wide significance level (7e-8), Bonferroni correction, or other FP control approaches. Why does Table 1 not report FP? I need Table 1 to report the following: what is the power of this method when the FPR is controlled (False Positive Rate < 0.05)? Also, the ROC curve for FPR < 0.05 should be reported for all methods.

* I believe the authors did a good job in terms of surveying the available models for GWA, from marginal regression to mixed effect models, etc. The authors account for typical confounders such as cryptic relatedness, which I liked. However, I recommend the authors be cautious about calling the association detected by their method ""a Causal Association."" There is a large body of research on understanding the causal effect of genetic variants, and this paper (and this venue) is not addressing it. There are several ways for an associated variant to be non-causal, and this paper does not even scratch the surface of that. For example, in many studies, discovering the causal SNPs means finding a genetic variant among the SNPs in LD with each other (so-called fine mapping). The LD-pruning procedure proposed in this paper does not help for that purpose.

* This approach jointly models the genetic variants and the phenotype (y).
Let us assume that one can directly maximize the ML (ELBO maximizes a lower bound of ML). The objective function is disproportionally influenced by the genetic variants (x) than y because M is very large ( $\prod_{m=1}^M p(w) p(x|z,w,\phi) >> p(z) p(y|x,z,\theta) $ ). Effectively, the model focuses on the genetic variants, not by the disease. This is why multi-variate GWAS focuses on the conditional p(y|x,z) and not p(y,x,z). Nothing was shown in the paper that this focusing on p(y,x,z) is advantageous to p(y|x,z). + +* In this paper, the authors use deep neural networks to model the general functional causal models. Since estimation of the causal effects is generally unidentifiable (Sprites 1993), I think using a general functional causal model with confounder modeling would have a larger chance to weaken the causal effects because the confounder part can also explain part of the causal influences. Is there a theoretical guarantee for the proposed method? Practically, how did the authors control the model complexity to avoid trivial solutions? + +Minor +------- +* The idea of representing (conditional) densities by neural networks was proposed in the generative adversarial networks (GAN). In this paper, the authors represent the functional causal models by neural networks, which is very related to the representation used in GANs. The only difference is that GAN does not specify a causal interpretation. I suggest the authors add a short discussion of the relations to GAN. + +* Previous methods on causal discovery rely on restricted functional causal models for identifiability results. They also use Gaussian process or multi-layer perceptron to model the functions implicitly, which can be consider as neural networks with one hidden layer. The sentence “These models typically focus on the task of causal discovery, and they assume fixed nonlinearities or smoothness which we relax using neural networks.” in the related work section is not appropriate. + +[1] Scalable Variational Inference for Bayesian Variable Selection in Regression, and Its Accuracy in Genetic Association Studies",5,5.0,ICLR2018 +W5g80NDOPK2,3,FGqiDsBUKL0,FGqiDsBUKL0,[Official Review] ,"#### Summary #### +This paper studies an interesting inverse-graphics problem. It proposed a novel method to learn 3D shape reconstruction using pre-trained 2D image generative adversarial networks. Given an image containing one single object of interest, it first predicts the graphics code (e.g., viewpoint, lighting, depth, and albedo) by minimizing the reconstruction error using a differentiable renderer. The next step is to render many pseudo samples by randomization in the viewpoint and lighting space, while keeping the predicted depth and albedo fixed. A pre-trained 2D image GAN is further used to project the pseudo samples to the learned data manifold through GAN-Inversion. Finally, these projected samples are added to the set for the next round optimization. Experimental evaluations have been conducted on several categories including face, car, building, and horse. + + +#### Comments #### +Overall, this is a very interesting paper with good presentations, promising experimental results, and solid quantitative comparisons with the previous work. Reviewer would like to point out the potential weakness of the paper as follows. 
+ +W1: Though impressed by the results (especially the proposed method works for horse and building), reviewer suspects the paper only works in a very simplified setting: (1) the GAN was previously trained on a large amount of 2D images of a single category with many variations in identity, viewpoint, and lighting; (2) the initialization (or step 1 in Section 3.1) step seems very critical to the overall performance; and (3) viewpoint and lightning randomization seems have to be hand-tuned. Reviewer would like to see the discussions on the underlying assumptions more explicitly. In addition, reviewer would like to know how does the method generalize to “dirty” data: people with sunglasses, people with noticeable earrings, people partially occluded by wavy long hair, and people with a side view (looks like the input has to be a frontal face image). Same question applies to those non-convex shapes: a convertible car or a car with the window open. Reviewer suspects the method in the current form cannot handle them well. + +W2: Some important experimental settings are neither presented nor clarified. For example, it is not clear what is the difference between “Ours (3D)” and “Ours (GAN)”, which should be clarified. For image editing (see Figure 6 and Figure 8), reviewer sees a noticeable change in background color sometimes (e.g., second row in Figure 6). It would be good to give a very detailed explanation of the image editing process (e.g., what’s the input and output format in each stage). As ellipsoid was used to initialize the face shape, reviewer would like to know what was the initialization for other categories such as building and car (see Figure 10). + +W3: It would be good to report the time spent on the computation and optimization and how it is compared to the baselines in Table 1 and Table 2. This is very important metric to report as a fair comparison to the previous work. + +== Post-rebuttal Comments == + +I am raising my score from 7 to 8, as author responses addressed my comments well (especially answer to W1 and the Figure 13) than expected.",8,5.0,ICLR2021 +Q_EAkaGLa2G,1,QnzSSoqmAvB,QnzSSoqmAvB, ICLR 2021 Conference Paper2090 AnonReviewer3,"This paper presents an algorithm NDMZ that extends MuZero to non-deterministic, two-player, zero-sum games of perfect information. The new algorithm borrows the idea from non-deterministic MCTS and the theory of extensive-form games. The empirical studies show a competitive performance of MuZero agains AlphaGoZero, despite MuZero lacks a perfect simulator the game. + +Comments: + +This paper is generally well-written and clear. The empirical results demonstrate that NDMZ can achieve a similar performance as AlphaGoZero, which is very interesting provided that MuZero does not get access to a perfect game simulator. + +1. In Fig 2. with 12-5-6 Nannon, it shows that it takes AlphaGoZero a few rounds to learn and after 100 rounds of training, it does not perform well against the optimal policy (with a winning rate about 30%), which implies that AlphaGoZero does not converge to an optimal policy. 
The reviewer is not familiar with AlphaGoZero's performance in these games and is wondering whether this is expected.",7,1.0,ICLR2021 +4HOnX-eOiUn,3,1Q-CqRjUzf,1Q-CqRjUzf,Reasonable breadth of empirical analysis but experimental design leads to potentially misleading results.,"This work studies the ‘churn’ (disagreement between predictions of two replicates) caused by different sources of variation in the training procedure and proposes solutions to reduce it. One solution is to use minimum entropy regularization to increase prediction confidences and the second solution is to force model agreement via co-distillation. + +=============================== + +Pros: +1. The paper is well written and clear. +2. Studies dissect many components, e.g. churn caused different sources of variation, ablation study of the proposed co-distillation+entropy. +3. Results are compared to reasonable baselines. + +Cons: +1. I find the problem statement unconvincing. How much retraining from scratch affects generalization? Beside variability on the test set accuracy, is there any evidence that the fluctuations observed are representative of variations of the true risk (on all data distribution)? Or is it only noise due to the small size of the test set? +2. Modification of two independent variables (dependent and independent variables in planning of experimental procedures) in the experiments of Table 1 likely make the results misleading. (More on this below) + +=============================== + +Reasons for score: + +I would vote for a weak reject. The breadth of the empirical analysis is sufficient but subtle details (as further explained below) make the results misleading for Table 1. + +=============================== + +Additional observations + +Modification of two independent variables (dependent and independent variables in planning of experimental procedures) in the experiments of Table 1 make the results potentially misleading. The churn depends to some degree on the level of accuracy, and data augmentation significantly affects accuracy as well. Indeed, removing data augmentation leads to a drastic drop of 4% accuracy. In the same way, but less drastic, the random data order from one epoch to another affects accuracy. When modifying data augmentation or data order, it is not possible to determine whether the change of churn is strictly due to data augmentation/data order or to accuracy drop. One way of avoiding this confounding effect would be to conduct hyperparameter optimization in a way to enforce a given level of accuracy (ex: 88%). We could then compare with augmentation at 88% accuracy vs no augmentation at 88% accuracy. If for instance a sub-optimal learning rate with data augmentation yielding 88% accuracy leads to the same level of churn than a good learning rate with no data augmentation yielding 88% accuracy, then we could conclude the effect on churn is mainly due to accuracy itself. I would nevertheless assume data augmentation to reduce churn indeed as it can be seen as increasing the dataset size, which reduces the level of noise. Therefore, to measure the effect of accuracy alone on churn I would also run two experiments where I optimize the learning rate to find accuracies of 84% and 88% (both without data augmentation) to see the relation between accuracy and churn on equal dataset size. + +On the same topic, the authors should avoid removing altogether data augmentation when they want to remove its effect of variation. They should rather seed it. 
I understand it is more effort as I recently went through the process, but it is possible. This way they would study the effect of varying data augmentation vs fixing it across replicates without losing its regularization effect. The same applies for data order. It is possible to seed data order without losing the randomization from one epoch to another. + +In section 2.3, the authors say that ‘Even when all other aspects of training are held constant (rightmost column), model weights diverge within 100 steps (across runs) and the final churn is significant.’ I am not familiar with TensorFlow, but I know ResNet implementations based on the cudnn backend require the deterministic operators for perfect reproducibility. ResNet is trainable in a deterministic way using PyTorch for instance. It turns out I have also studied these sources of noise and the residual variance resulting from this noise (numerical noise due to operation order on GPU) is significantly smaller than the one caused by data sampling, weighs init, data ordering or data augmentation. The results last column on table 1, where churn is only caused by numerical noise, suggests that the large increase in churn when removing data augmentation is mainly due to the decrease in accuracy. When only numerical noise is present there is smaller variance in results so I would expect churn to be smaller. The fact that it increases in Table 1 suggests to me that we are indeed observing a confounding effect where the main cause is the drop of accuracy rather that the removal of data augmentation. + + +Bold results in table 2 are misleading. Many of them are not significantly different yet only one result per column is in bold. All top results that are not significantly different should be in bold. + +The co-distillitation + entropy procedure proposed in this paper introduces 2 new hyperparameters. The optimization of these hyper-parameters can lead to misleading results if hyperparemeters of baselines are not optimized accordingly with similar budgets. The experimental section should report these optimization procedures so that we can assert the reliability of the results. Also, were the alpha and beta optimized to provide better accuracy of lower churn in Table 2? + +=============================== + +Typos, minor comments, questions + +Intro in section 2 should restate that the definition builds open the work of Cormier et al 2016. + +In the co-distillation process, how do you choose which model to retain to compute the churn? As I understand it the churn is computed on 2 models that are trained independently with other co-distillation ‘siblings’. Do you pick randomly within the co-distillation pairs? + +Page 1, first paragraph: a a novel -> a novel +Page 5, Intuitively,encouraging -> Intuitively, encouraging + +=============================== + +Post-Rebuttal + +I thank the authors for the detailed answer. In light of the response of the authors and the other reviews, I still recommend rejecting the paper, with a rating of 5. + +Some comments based on the rebuttal: + +I find the data in table 6 to support my point on confounding variables. The churn with data augmentation fixed on GPU is systematically higher than with random data augmentation on TPU when model initialization is random. Note that the main differences here beside the fact that data augmentation is fixed or random, is that the accuracy is lower by 1.5%-2.5%. We see again an increase in churn related to a decrease in accuracy. 
Just as when the data augmentation was removed altogether. The authors say that accuracy change itself isn't predictive of churn because we could make a training perfectly reproducible with lower accuracy thus leading to 0 churn, but the same argument would hold for removing a fixed data augmentation from a perfectly reproducible training. When not fixing the whole training process, two different interventions leading to equivalent accuracy decrease could lead to equivalent churn. This for instance would be a direct consequence of a binomial modelisation of the model performance as a function of test set size and model average accuracy (and by the way which models fairly well test accuracy variation for the datasets-architectures in this paper). The lower the accuracy, the higher the variance. + +> Table 2. We boldfaced the results in table 2 with the best mean performance, which we believe is a standard practice. + +It is unfortunately common practice indeed, but it is a bad practice. Results that are so close that random fluctuations could explain the difference should not be considered as different.",5,4.0,ICLR2021 +BkKlz-GNe,3,H1Heentlx,H1Heentlx,,"This paper considers the case where multiple views of data are learned through a probabilistic deep neural network formulation. This makes the model non-linear (unlike e.g. CCA) but makes inference difficult. Therefore, the VAE framework is invoked for inference. + +In [Ref 1] the authors show that maximum likelihood estimation based on their linear latent model leads to the canonical correlation directions. But in the non-linear case with DNNs it's not clear (at least with the present analysis) what the solution is wrt to the canonical directions. There's no such analysis in the paper, hence I find it a stretch to refer to this model as a CCA type of model. In contrast, e.g. DCCA / DCCAE are taking the canonical correlation between features into account inside the objective and provide interpretations. + +[Ref 1] F. R. Bach and M. I. Jordan. A probabilistic interpretation of canonical correlation analysis. Technical Report 688, 2005. + +There is also a significant body of very related work on non-linear multi-view models which is not discussed in this paper. For example, there's been probabilistic non-linear multi-view models [Ref 2, 3], also extended to the Bayesian case with common/private spaces [Ref 4] and the variational / deep learning case [Ref 5]. + +[Ref 2] Ek et al. Gaussian process latent variable models for human pose estimation. MLMI, 2007. +[Ref 3] Shon et al. Learning shared latent structure for image synthesis and robotic imitation. NIPS, 2006. +[Ref 4] Damianou et al. Manifold relevance determination. ICML, 2012. +[Ref 5] Damianou and Lawrence. Deep Gaussian processes. AISTATS, 2013. + +I can see the utility of this model as bringing together two elements: multi-view modeling and VAEs. This seems like an obvious idea but to the best of my knowledge it hasn't been done before and is actually a potentially very useful model. + +However, the question is, what is the proper way of extending VAE to multiple views? The paper didn't convince me that VAE can work well with multiple views using the shown straightforward construction. 
Specifically, VCCA doesn't seem to promote the state of the art in terms of results (it actually is overall below the SOA), while the VCCA-private seems a quite ill-posed model: the dimensionalities d have to be manually tuned with exhaustive search; further, the actual model does not provide a consinstent way of encouraging the private and common variables to avoid learning redundant information. Relying only on dropout for this seems a quite ad-hoc solution (in fact, from Fig. 4 (ver2) it seems that the dropout rate is quite crucial). Perhaps good performance might be achieved with a lot of tuning (which might be why the FLICKR results got better in ver2 without changing the model), but it seems quite difficult to optimize for the above reasons. From a purely experimental point of view, VCCA-private doesn't seem to promote the SOA either. Of course one wouldn't expect any new published paper to beat all previous baselines, but it seems that extension of VAE to multiple views is a very interesting idea which deserves some more investigation of how to do it efficiently. + +Another issue is the approximate posterior being parameterized only from one of the views. This makes the model less useful as a generic multi-view model, since it will misbehave in tasks other than classification. But if classification is the main objective, then one should compare to a proper classification model, e.g. a feedforward neural network. + +The plots of Fig. 8 are very nice. Overall, the paper convinced me that there is merit in attaching multiple views to VAE. However, it didn't convince me a) that the proposed way to achieve this is practical b) that there is a connection to CCA (other than being a method for multiple views). The bottom line is that, although the paper is interesting, it needs a little more work. +",5,4.0,ICLR2017 +B1guFCaotH,2,SygcSlHFvS,SygcSlHFvS,Official Blind Review #2,"This paper proposes to provide a detailed study on the explainability of link prediction (LP) models by utilizing a recent interpretation of word embeddings. More specifically, the authors categorize the relations in KG into three categories (R, S, C) using the correlation between the semantic relation between two words and the geometric relationship between their embeddings. The authors utilize this categorization to provide a better understanding of LP models’ performance through several experiments. + +This paper reads well and the results appear sound. I personally believe that works on better understanding KGC models are a very essential direction which is mostly ignored in this field of study. Moreover, the provided experiments support the authors’ intuition and arguments. + +As for the drawbacks, I find the technical novelty of the paper is somewhat limited, as the proposed method consists of a mostly straightforward combination of existing methods. Further, I believe this work needs more experimental results and decisive conclusions identifying future directions to achieve better performance on link prediction. My concerns are as follows: + +• I am wondering about the reason for omitting Max/Avg path for two of the relations in WN18RR? Further, the average of 15.2 for the shortest path between entities with “also_see” relation appears to be a mistake? +• Was there any specific reason in choosing WN18RR and NELL-995 KGs for the experiments? +• It would be interesting to see the length of paths between entities for train and test data separately. 
+• I suggest providing a statistical significance evaluation for each experiment to better validate the conclusions. +• I find the provided study in section 4.2 very similar to the triple classification task in KGs. Can you elaborate on the differences and potential advantages of your setting? +• I am wondering how you identified the “Other True” triples for WN18RR KG in section 4.2 experiments? + +On overall, although I find the proposed study very interesting and enlightening, I believe that the paper needs more experimental results and decisive conclusions. +",6,,ICLR2020 +ikdGRvylMJy,4,GbCkSfstOIA,GbCkSfstOIA,A paper with strong claims but lack of support and formalism in the technique ,"[Summary] In this work, authors dealt with the problem of improving classification performance in the setting of semi-supervised learning. Authors exploit inherent knowledge on the samples distribution via clustering to improve the latent space. The work heavily relies on well-studied concepts including the Davies-Bouldin index and maximum margin clustering. + + +[Cons] +-- The general idea to improve the semi-supervised classification task by exploiting knowledge from the data through clustering constraints. + + +[Pros] +-- The lack of a formalist to describe the technique turns down the paper. This also makes unclear the level of novelty. +-- The results are ultimately less informative than one would like and the experimental setting is limited. This leads to the results being no more than case studies which demonstrates only limited advantages in both qualitative and quantitative terms. + + + +Detailed comments for authors: + + +-- [Novelty] The novelty in the proposed technique is unclear as it leverages on already well-studied principles including Davies-Bouldin index and maximum margin clustering. Authors need to strongly detail the main advantages and key insights of the technique. As it is described so far, it is hard to appreciate the level of contribution. + +-- [Lack of formalism] A major drawback that turns down the paper is the lack of formalism to describe the proposed technique (which also makes unclear the level of novelty). That is, the core part of the technique is (6) and previous expressions (1)-(5) are drawn from already well-studied concepts (Davies-Bouldin index and maximum margin clustering). One would expect that the authors provide further details (6) such as explicitly define the optimisation process. Not only that, but several expressions are not properly set -- for example (5) authors needs to set the explicit definition rather that the text ‘k pairs single link distance’ as there are several works using the principles of maximum margin clustering, the reader needs to see the specifics of each component in the technique (even (5) draw from a very well-studied principle yet it is not defined properly). Overall, it is not enough to set (6) but authors should reduce (1)-(5) which are well-established expressions and detail (6) along with important parts such as the optimisation. + +-- A major drawback of the components are not considered. For example, it is well-known that the Davies-Bouldin index can report good results, however, it does not imply the best information retrieval. Why then chose Davies-Bouldin index? There are several works such as [*] that can be used to generalise the latent space well whilst avoiding the disadvantage of the authors proposed components and affect positively the outcome from SSL [**]. + +[*]Caron, M., Bojanowski, P., Joulin, A., & Douze, M. 
(2018). Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 132-149). +[**] Sellars, Philip, Angelica Aviles-Rivero, and Carola Bibiane Schönlieb. ""Two cycle learning: Clustering based regularisation for deep semi-supervised classification."" arXiv preprint arXiv:2001.05317 (2020). + + +-- Experimental setting. The results are ultimately less informative than one would like and the experimental setting is limited. This leads to the results being no more than case studies which demonstrates only limited advantages in both qualitative and quantitative terms. +-- Authors open the paper with a big statement “the comparison results confirm our model’s outstanding performance over semi-supervised learning” However, authors use very small and not well-representative datasets such as CIFAR-100, ImageNet (or even a smaller version mini-ImageNet). Likewise the comparison setting is limited, there are no comparisons with recent other techniques such as FixMatch, UDA. Deep Label propagation etc. Otherwise the strong claims of the authors are unsupported. +-- The findings are less ultimate than one would like. The results from the tables are far from being well-explained. +",4,5.0,ICLR2021 +ryxIsgyc27,2,S1xiOjC9F7,S1xiOjC9F7, Interesting application but inadequate experiments,"The authors introduce a Graph Matching Network for retrieval and matching of graph structured objects. The proposed methods demonstrates improvements compared to baseline methods. However, I have have three main concerns: +1) Unconvining experiments. + a) Experiments in Sec4.1. The experiments seem not convincing. Firstly, no details of dataset split is given. Secondly, I am suspicious the proposed model is overfitted, although proposed GSL models seem to bring some improvements on the WL kernel method. As shown in Tab.3, performance of GSL models dramatically decreases when adapting to graphs with more nodes or edges. Besides, performance of the proposed GSLs also drops when adapting to different combines of k_p and k_n as pointed in Sec.B.1. However, the baseline WL kernel method demonstrates favourable generalization ability. + + b)Experiments in Sec4.2. Only holding out 10% data into the testing set is not a good experiment setting and easily results in overfitting. The authors are suggested to hold more data out for testing. Besides, I wonder the generalization ability of the proposed model. The authors are suggested to test on the small unrar dataset mentioned in Sec.B.2 with the proposed model trained on the ffmpeg dataset in Sec4.2. + +2) Generalization ability. The proposed model seems sensitive to the size and edge density of the graphs. The authors is suggested to add experiments mentioned in (1). + +3) Inference time and model size. Although the proposed model seems to achieve increasing improvements with the increasing propagation layers. I wonder the cost of inference time and model size compared to baselines methods. ",6,4.0,ICLR2019 +Bkg6qVsa2X,2,SJekyhCctQ,SJekyhCctQ,Innovative perturbation-based learning strategy leads to very impressive performance in adversarial detection,"This paper proposes a method for the detection of adversarial examples via what the authors term ""neural fingerprinting"" (NeuralFP). Essentially, a reference collection of perturbations are applied to the training data so as to learn the effects on the classification decision. 
The premise here is that on average, normal examples from a common class would have similar changes in the classification decision when reference perturbations are applied, whereas adversarial examples (particularly those off the local submanifold) may have a markedly different set of changes from what was expected for the targeted class. These reference perturbations as well as the anticipated output perturbations together form the ""fingerprints"". + +To measure the difference between observed outputs and fingerprints, the average (squared?) Euclidean distance is used. Given a fixed set of input fingerprints (presumably chosen so as to provide coverage of the range of possible perturbation directions), the authors use the distance formula as a regression loss (""fingerprint loss"") to train the choice of output fingerprints. Although the authors do not explicitly state it this way, this secondary training objective encourages a K-Means style clustering of output perturbations where the output fingerprints serve as cluster representatives. + +This learning formulation is to my mind is both very innovative and extremely effective, as demonstrated by the authors' experimental results. Their experiments show superlative performance (near perfect detection!) against essentially the full range of state-of-the-art attacks. They give careful attention to the mode of attack, and show excellent performance even for adaptive white-box attacks, in which existing attack methods are given the opportunity to minimize the fingerprint loss. + +The presentation of the paper is excellent - clear, well-motivated, and detailed, with careful attention given to experimental concerns such as the choice of perturbation directions (the recommendation is to choose them at random), and the number of fingerprints to pick. + +Overall, the reported results are so good, and the approach so convincing, that one wonders what the weaknesses of the approach might be (if any). Questions that do come to mind are: +* Can an adversarial strategy can be developed that could execute a successful attack while minimizing the fingerprint loss. +* Another issue is whether the NeuralFP would work on more challenging data sets where the classes are highly fragmented - at what rate would the benefits of NeuralFP fade as the classification performance degrades? +* What happens to performance if the perturbation directions are chosen so as to better conform with the local sub-manifolds... would fewer perturbations be required? (It would seem that reducing the number of perturbations needed could have a significant effect on training time.) + +Overall, this is a very strong and important result, fully deserving of acceptance. + +P.S. Two sets of typos that need attention: +* In Equation 3, the Euclidean norm is taken. In Equation 5, the squared Euclidean norm is taken. Presumably, one of these is a typo. Which? +* In the definition of delta-min and delta-max in the first paragraph of Section 2.2, y-hat should be w-hat. + +",9,4.0,ICLR2019 +H1wc2j2lM,3,B1mvVm-C-,B1mvVm-C-,Interesting paper,"Thank you for the submission. It was an interesting read. Here are a few comments: + +I think when talking about modelling the dynamics of the world, it is natural to discuss world models and model based RL, which also tries to explicitly take advantage of the separation between the dynamics of the world and the reward scheme. Granted, most world model also try to predict the reward. 
I’m not sure there is something specific I’m proposing here, I do understand the value of the formulation given in the work, I just find it strange that model based RL is not mention at all in the paper. + +I think reading the paper, it should be much clearer how the embedding is computed for Atari, and how this choice was made. Going through the paper I’m not sure I know how this latent space is constructed. This however should be quite important. The goal function tries to predict states in this latent space. So the simpler the structure of this latent space, the easier it should be to train a goal function, and hence quickly adapt to the current reward scheme. + +In complex environments learning the PATH network is far from easy. I.e. random walks will not expose the model to most states of the environment (and dynamics). Curiosity-driven RL can be quite inefficient at exploring the space. If the focus is transfer, one could argue that another way of training the PATH net could be by training jointly the PATH net and goal net, with the intend of then transferring to another reward scheme. + +A3C is known to be quite high variance. I think there are a lot of little details that don’t seem that explicit to me. How many seeds are run for each curve (are the results an average over multiple seeds). What hyper-parameters are used. What is the variance between the seeds. I feel that while the proposed solution is very intuitive, and probably works as described, the paper does not do a great job at properly comparing with baseline and make sure the results are solid. In particular looking at Riverraid-new is the advantage you have there significant? How does the game do on the original task? + +The plots could also use a bit of help. Lines should be thicker. Even when zooming, distinguishing between colors is not easy. Because there are more than two lines in some plots, it can also hurt people that can’t distinguish colors easily. + +",6,3.0,ICLR2018 +rJg_hF_TnQ,3,rJgvf3RcFQ,rJgvf3RcFQ,"Review for the paper: ""On Inductive Biases in Deep Reinforcement Learning""","This paper focuses on deep reinforcement learning methods and discusses the presence of inductive biases in the existingRL algorithm. Specifically, they discuss biases that take the form of domain knowledge or hyper-parameter tuning. The authors state that such biases rise the tradeoff between generality and performance wherein strong biases can lead to efficient performance but deteriorate generalization across domains. Further, it motivates that most inductive biases has a cost associated to it and hence it is important to study and analyze the effect of such biases. + +To support their insights, the authors investigate the performance of well known actor-critic model in the Atari environment after replacing domain specific heuristics with the adaptive components. The author considers two ways of injecting biases: i) sculpting agents objective and ii) sculpting agent's environment. They show empirical evidence that replacing carefully designed heuristics to induce biases with more adaptive counterparts preserves performance and generalizes without additional fine tuning. + +The paper focuses on an important concept and problem of inductive biases in deep reinforcement learning techniques. +Analysis of such biases and methods to use them judiciously is an interesting future direction. The paper covers a lot of related work in terms of various algorithms and corresponding biases. 
+However, this paper only discusses such concepts at high level and provides short empirical evidences in a single environment to support their arguments. Further, both the heuristics used in practice and the adaptive counterparts that the paper uses to replace those heuristics are all available in existing approaches and there is no novel contribution in that direction too. +Finally, the adaptive methods based on parallel environment and RNNs have several limitation, as per author's own admission. + +Overall, the paper does not have any novel technical contributions or theoretical analysis on the effect of such inductive biases which makes it very weak. Further, there is nothing surprising about the author's claims and many of the outcomes from the analysis are expected. The authors are recommended to consider this task more rigorously and provide stronger and concrete analysis on the effects of inductive biases on variety of algorithms and variety of environments. + + + +",3,4.0,ICLR2019 +B1g97ORFtB,2,rJel41BtDH,rJel41BtDH,Official Blind Review #3,"This paper proposes to combine pseudo-labelling with MixUp to tackle the semi-supervised classification problem. My problem is that ""MixMatch: A Holistic Approach to Semi-Supervised Learning"" by Berthelot et al. is very similar with just a few differences on the pseudo-labelling part. Could you stress more the difference between your paper and their paper ? Because I might be wrong about it. + +Pros: +* Good results on C10 +* A clear related work section that divides the existing works in pseudo labelling vs consistency +* Interesting results about the effects of using different architectures. I also like the ablation study. + +Weaknesses: +* Usually, SVHN is also among the tested datasets +* The pseudo labelling part is a bit unclear.For example, do you just refresh the pseudo-labels at the end of each epoch ? +* minor: a typo with ""and important role"" + +If there was not an existing paper already using MixUp, I would have leaned towards acceptance. You can still motivate the differences with the MixMatch paper.",3,,ICLR2020 +r1lsg0nptH,2,rkgMkCEtPB,rkgMkCEtPB,Official Blind Review #1,"This paper is exploring the importance of the inner loop in MAML. It shows that using the inner loop only for the classifier head (ANIL) results are comparable to MAML. It also shows that using no inner loop at all (NIL) is okay for test time but not for training time. + +It is indeed interesting to understand the effect of the inner loop. But, as the authors noted (“Our work is complementary to methods extending MAML, and our simplification and insights could be applied to such extensions also”), for it to be useful I’d like to see whether these insights can be extended to SOTA models. MAML is less than 50% accuracy on 1-shot mini-imagenet while current SOTS models achieve 60-65%. + +The NIL experiment that shows low performance when no inner loop is used in training time doesn’t make sense. This is basically the same as the nearest-neighbour family of methods, e.g. ProtoNet (Snell et al., 2017), which have been shown to perform similarly to (or even better than) MAML. + + +After rebuttal: +I do think it's important to also have that kind of analysis works. My main concern is with how ANIL and NIL are introduced as new algorithms and not just an ablation of the MAML method. Presented as new algorithms I tend to compare them against the leaderboard where they are very far from the top. I am keeping my previous rating. 
",3,,ICLR2020 +By3oLYeNe,2,Skq89Scxx,Skq89Scxx,,"This an interesting investigation into learning rate schedules, bringing in the idea of restarts, often overlooked in deep learning. The paper does a thorough study on non-trivial datasets, and while the outcomes are not fully conclusive, the results are very good and the approach is novel enough to warrant publication. + +I thank the authors for revising the paper based on my concerns. + +Typos: +- “flesh” -> “flush”",7,4.0,ICLR2017 +CvfwPhx8Jky,2,-6vS_4Kfz0,-6vS_4Kfz0,Official Blind Review #2,"The paper proposes Evolutionary Graph Reinforcement Learning to solve the memory placement problem. Main ideas are using GNN as the network architecture for reinforcement learning agents that look for more informed priors for evolutionary algorithms. Overall novelty of the paper comes from the neat combination of RL, EA, and GNN, and applying it to memory placement (ML for Systems). + +The paper indeed tackles an important problem that can affect the overall performance and efficiency of the hardware. I believe the reorganization of various off-the-shelf ML techniques to solve real problems in the systems domain marks a large contribution, hence the positive overall rating. + +One of the main drawbacks of the paper is that the paper only tests on a single type/configuration of hardware. While this is fine to some extent, this makes it hard to get confirmation about the generality of the overall method considering the large variance of the speedup. + +Another related question comes from how this work relates to the optimizations of the dataflows [1,2]. As it is difficult to evaluate the overall memory communication without considering the order of operations, etc. the work in turn neglects the big question and focuses on only the partial view of the problem. It would provide a nice reference point if some of these points are discussed in the paper. + +Last question comes from the baselines. While the previous works on tensor optimizations [3,4] are very closely related and many of the ideas provide a good comparison point, these have not been discussed nor cited. For example, I guess AutoTVM's way of approximating the search space using TreeGRU or XGBoost can help. Also, Chameleon's way of sampling the examples using adaptive sampling may provide an interesting reference point in terms of reduction of number of samples. + +Overall, I have enjoyed reading the paper and I find the ideas in the paper interesting. I am currently weakly pro for the paper, and look forward to the authors' response :) + +Questions +1. Could you provide and overview of the NNP-I's architecture in the appendix? Also, possibly ablation studies over different configurations of the hardware. +2. What are the relative communication speed of SRAM, LLC, and DRAM? +3. It is well discussed in the computer architecture community that the memory communication is very much determined by the dataflows of the architecture. How are the results affected by these dataflows? +4. How does the work compare to the methods described in [3,4]? 
+ +[1] ""Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks"", ISCA 2016 + +[2] ""Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators"", ASPLOS 2020 + +[3] ""Learning to optimize tensor programs"", NeurIPS 2018 + +[4] ""Chameleon: Adaptive code optimization for expedited deep neural network compilation"", ICLR 2020",7,4.0,ICLR2021 +sBDKCvUC9oz,2,1wtC_X12XXC,1wtC_X12XXC,"Interesting idea, few major concerns ","## Second Review + +Thanks for taking all my comments seriously. After carefully reading other reviews and quick modifications introduced by authors, I believe this work is richer and has shown some potential towards building scalable and robust alternative to BP. Thanks for including angles as suggested, this further supports hypothesis proposed in this work. I really appreciate results using CNNs and cross entropy loss, therefore I increase my score to 7. It seems that other reviewers do not appreciate that training a network using hebbian like updates and without BP requires some nontrivial engineering tricks and theoretical considerations which are now well described in this paper (updated manuscript). It is difficult to match BP gradients, and many popular alternatives including FA, DFA, DTP, EP struggle to match BP performance when tested on complex benchmarks (Cifar, imagenet, etc) with complex architectures (CNNs, RNNs, etc). Beside this given approach is robust even when backward weights are trainable. However I agree storing backward and forward synapses challenges the bio-plausibility of approach, which I think can be handled if local representation are handled differently. Nonetheless, changing few strong words (optimal BP, optimal gradients) and derivation, supports major hypothesis proposed in this work. A better justification on non-local updates as raised by other reviewers is required to further strengthen this work. But i liked the results with activity relaxation and how close gradients are with respect to BP. Combining current approach with other bio-inspired approaches might solve some key aspects of current learning algorithm. +Current approach is still heavily dependent on backprop, and partially gets closer to bioplausible approaches (mainly hebbian like update rule). Testing this on deeper architectures (Resnet, student-teacher etc) might further strengthen your work. + +## Minor comments + +Figure 4 c) change angls--> angles +In appendix, change caption of fig 5a) from MNOST to MNIST, I believe it is a typo + +Add results with FA or DFA with fixed and learnable weights, will further support robustness and closeness of BP claim in this work. + +## Summary +In this work, authors designed a bio-inspired approach known as activation relaxation (AR), utilizing local information for training deep neural network. Unlike prior work on bio-inspired learning, AR only utilizes single neuron to compute its gradient helpful for neural circuitry. AR is seen to be derived from postulating a dynamical relaxation phase in which the neural activities are tracing out a dynamical system. This modification is hypothesis to be close to backprop gradients getting ride of weight transport problem. Authors show accuracy as a metric to validate their hypothesis by conducting experiments on small scale dataset such as mnist and fmnist. + +## Review +Most of the paper is clear, but experiment section needs more work in approving the hypothesis w.r.t AR. + +1] AR converges close to backprop gradients. 
+Accuracy is not a valid measure to show robustness and calculate approximately with backprop gradients. As shown by FA (lillicrap) , DFA (nokland), LRA (ororbia), the updates carried by there model roughly lies within 45 degree compared with backprop. It is better to show model update angles w.r.t to BP, DFA and other bio-inspired approaches. + +2] is AR robust? What happens to the model at various initialization and how does it behave when tested with various model hyper-parameter changes? + +## Experimental section is missing major chunk of information. +2.1] Did you perform grid search on model hyper-parameters, if so, what are those? Did you experiment with various activation functions and if so, what are those? Are results consistent whenever MSE is replaced with CE? If so report your findings. +2.2] What happens to the network when BP gradients are weak resulting into poor information, can AR come out of that low saddle point and converge to better local minima? Few experiments related to better convergence are shown by ororbia, they experimented with various init to show their model can converge to better local minima whenever backprop had issue in converging. +2.3] Did you perform multiple trails on your experiments? If so, you should report standard error in your paper. + +3] AR w.r.t fixed error weights vs AR w.r.t learnable error weights. +How does local learning approach, help in improving the model plasticity when you have these two scenarios? Paper mentions few lines on those, but detailed experiments should be conducted to validate the statement stable and robust performance. Also provide information on how error weights are initialized, and ranges experimented w.r.t backward or error weights. + +4] Comparison with other bio-inspired approaches. +Current manuscript does not show any comparison with other bio-inspired approaches (DFA, FA, DTP, DTP-sigma, LRA, Weight mirroring). If goal is to show model is robust, close to backprop updates and gradients, then it is better to show comparison with these approaches or at least show angle to understand closeness w.r.t BP updates. + +5] Performance on large scale dataset and CNNs. +It is been argued that bio-inspired approaches (DFA, FA, DTP) struggle to match backprop performance when evaluated on large scale dataset with deep CNNs[Bartunov 18] . However recently weight mirroring [Akrout 19] and LRA[Ororbia and Mali 19] have shown that they can come closer to backprop in terms of performance on large scale datasets. But in this current manuscript, there are no results w.r.t CNNs and updates w.r.t. filters when AR is deployed on such challenging visual recognition tasks. + +[Bartunov 18] Bartunov, S., Santoro, A., Richards, B., Marris, L., Hinton, G.E. and Lillicrap, T., 2018. Assessing the scalability of biologically-motivated deep learning algorithms and architectures. In Advances in Neural Information Processing Systems (pp. 9368-9378). + +[Ororbia 18] Ororbia, A.G., Mali, A., Kifer, D. and Giles, C.L., 2018. Deep credit assignment by aligning local representations. arXiv preprint arXiv:1803.01834. + +[Akrout 19] Akrout, M., Wilson, C., Humphreys, P., Lillicrap, T. and Tweed, D.B., 2019. Deep learning without weight transport. In Advances in neural information processing systems (pp. 976-984). + +[Ororbia and Mali 20] Ororbia, A., Mali, A., Kifer, D. and Giles, C.L., 2020. Reducing the Computational Burden of Deep Learning with Recursive Local Representation Alignment. arXiv preprint arXiv:2002.03911. 
+ + +",7,4.0,ICLR2021 +8ZoH8MCEUWt,3,qbH974jKUVy,qbH974jKUVy,"Interesting and relevant analysis, but the conclusions aren't clear enough yet"," +Summary +--- +A large body of work creates disentangled representations to improve +combinatorial generalization. This paper distinguishes between 4 types of +generalization and shows that existing unsupervised disentanglement approaches +generalize worse to some and better to others. + +(introduction) +There are 3 types of combinatorial generalization. Each requires a learner to +generalize to a set of instances where 0, 1, or 2 dimensions have been completely held out. +Previous work has not distinguished between these kinds of generalization when +testing how disentangled representations generalize. This work does that to +understand the relationship between disentanglement and combinatorial +generalization in a more fine grained manner. + +(approach) +Throughout this paper, beta-VAE and a recent variant are trained with varrying levels of +disentanglement (controlled by beta) to reconstruct d-sprites images. +These images contain simple shapes and are generated using 5 ground truth +latent factors. The ground truth latent factors allow disentanglement to be +measured (using Eastwood and Williams 2018), essentially by checking whether +the ground truth latent factors are linearly separable in the learned latent space. + +(experiment - plain reconstruction) +* reconstruction error differs for different types of combinatorial generalization (holding out fewer dimensions is easier) +* reconstruction error is not highly correlated with disentanglement + +(experiment - compositional reconstruction) +Instead of reconstructing the input, a version of the input with one attribute changed is generated. +* generation error differs for different types of combinatorial generalization (holding out fewer dimensions is easier) + +(conclusion) +Usually disentanglement is encouraged to achieve combinatorial generalization, but this paper presents a simple experiment where it doesn't do that. + + + +Strengths +--- + +The central claim of the paper may help clarify the disentanglement literature. + +It seems very useful to taxonomize generalization in this way. + +The writing and motivation is generally very clear. The figures are easy to understand and help demonstrate the narrative. + +This paper aims to characterize an existing line of work in detail rather than proposing a new approach/dataset/etc. I like work of this nature and would like to see more like it. + + + +Weaknesses +--- + + +1. The relationship between disentanglement and generalization is clearly or quantitatively demonstrated: + +The most interesting claim in this paper is that disentanglement is not necessarily correlated with combinatorial generalization, but this claim is not clearly supported by the data. + +* The main support comes from table 1. Here higher D-score does not necessarily mean lower test NLL. This observation should be made quantitative, probably just by measuring correllation between D-score and test NLL. + +* Table 2 seems to contradict this claim. In that case higher D-score does mean lower test NLL. + + +2. The taxonomy of generalization is a bit too specific to be useful and a bit incoherent: + +The difference between ""Interpolation"" and ""Recombination to element"" generalization +is not clear to me. Each of the purple and red cubes in figure 1a represents +a combinations of rotation, shape, and translation factors. 
+It may be that it makes a difference when some dimensions are categorial +and others are continuous, as in the Interpolation example, but this doesn't +seem to really solve the factor because continuous latent variables +are still latent variables. I see some vague intuition behind this distinction, +but the paper does correctly identify the precise distinction. + +Furthermore, this taxonomy of generalization seems limited to me. +It seems like ""Recombination to element"", ""Recombination to range"", and ""Extrapolation"" +just hold out a different number of dimensions (e.g., ""none"", ""rotation"", and ""shape and rotation"", respectively). +This begs the question of what happens when there are 4 generative dimensions? +Is generalization when 3 of those are held out also called ""Extrapolation""? + +I think more work needs to be done to create a taxonomy which precisely and clearly generalizes +to N latent factors and creates a more coherent distinction between combinatorial and +non-combinatorial generalization. +However, I think it's possible to create a better taxonomy and that it +will probably be very useful to do so. + + +3. The paper should test the idea more thoroughly, on more datasets and on more disentanglement approaches. For example, it could include other datasets or tasks with different ground truth factors of variation (e.g., 3D chairs [1]). It could also include more disentanglement approaches like [2]. + + +[1]: M. Aubry, D. Maturana, A. Efros, B. Russell, and J. Sivic. Seeing 3d chairs: exemplar part-based 2d-3d +alignment using a large dataset of cad models. In CVPR, 2014. +[2]: Esmaeili, B. et al. “Structured Disentangled Representations.” AISTATS (2019). + + + + + +Comments / Suggestions +--- + +Describe the disentanglement metric in more detail. From the beginning disentanglement is treated differently from combinatorial generalization. It's not immediately clear what disentanglement is that makes it different and why that's interesting to study. For example, initially one might think that beta-VAE is inherently disentangled. + +Can this taxonomy of generalization be generalized to continuous domains? For example, can it be generalized to any (typically continuous) hidden layer a neural net learns? + + + +Preliminary Evaluation +--- + +Clarity - The presentation is quite clear. +Quality - The claims are not quite well enough supported. The experiments that were run don't support a clear conclusion and more experiments should have been run to support a more general conclusion. +Novelty - I don't think anyone has catalogued the performance of disentanglement methods in terms of a generalization taxonomy. +Significance - This paper might help clarify the disentanglement literature and more broadly help people think about combinatorial generalization. + +I like this paper because of its clarity, novelty, and significance. However, I think the quality concerns are significant enough that it shouldn't be accepted at this stage. 
+ +Final Evaluation (Post Rebuttal) +--- +The author response and accompanying paper revision clearly and effectively addressed each of the 3 main weaknesses I pointed out, so I raised my rating.",7,4.0,ICLR2021 +H1gkDZaeM,3,SkFqf0lAZ,SkFqf0lAZ,Nice contribution to memory augmented recurrent neural network ,"The main contribution of this paper are: +(a) a proposed extension to continuous stack model to allow multiple pop operation, +(b) on a language model task, they demonstrate that their model gives better perplexity than comparable LSTM and attention model, and +(c) on a syntactic task (non-local subject-verb agreement), again, they demonstrate better performance than comparable LSTM and attention model. + +Additionally, the paper provides a nice introduction to the topic and casts the current models into three categories -- the sequential memory access, the random memory access and the stack memory access models. + +Their analysis in section (3.4) using the Venn diagram and illustrative figures in (3), (4) and (5) provide useful insight into the performance of the model.",8,5.0,ICLR2018 +BygJ8bHctB,1,HyxLRTVKPH,HyxLRTVKPH,Official Blind Review #3,"This work presents a simple technique for tuning the learning rate for Neural Network training when under a ""budget"" -- the budget here is specified as a fixed number of epochs that is expected to be a small fraction of the total number of epochs required to achieve maximum accuracy. The main contribution of this paper is in showing that a simpler linear decay schedule that goes to zero at the end of the proposed budget achieves good performance. The paper proposes a framework called budget-aware schedule which represents any learning rate schedule where the ratio of learning rate at time `'t' base learning rate is only a function of the ratio of 't' to total budget 'T'. In this family of schedules, the paper shows that a simple linear decay works best for all budgets. In the appendix, the authors compare their proposed schedule with adaptive techniques and show that under a given budget, it outperforms latets adaptive techniques like adabound, amsgrad, etc. + +Pros: +1. This paper presents a simple technique for a problem that is impactful namely performing training under a small budget presumably as an approximation during neural architecture search or hyperparameter tuning. The technique is empirically shown to be effective for many computer vision benchmarks. +2. The paper presents extensive experimental results comparing linear decay with other budget-aware schedules. The accuracy comparisons are performed under different budgets as well as for neural architecture ranking while selecting architecture with budgeted training. +3. Overall, I think this paper can be generally useful for many practitioners. + +Cons: +1. The paper makes claims around the phenomena of gradient magnitude vanishing as well as its effectiveness. E.g. in section 5, authors state ""We call this “vanishing gradient” phenomenon budgeted convergence. This correlation suggests that decaying schedules to near-zero rates (and using BAC) may be more effective than early stopping."". This is not clear from the paper as the paper merely shows gradient magnitude decreasing with learning rate. This claim appear like an overreach to me. +2. The key motivating use cases for budget-aware training is providing approximations for problems like neural architecture search and hyper parameter tuning. 
However, for these use cases, the paper does not perform extensive comparisons for commonly used algorithms like Adam. Why? + +nits: +1. In section 2, various -> varies +2. Right above equation 1, budge -> budget +",6,,ICLR2020 +4qMNkRZoZWj,1,lvRTC669EY_,lvRTC669EY_,The paper is interesting but it could be improved,"This paper proposes to use reward randomization to explore the policy space in multi-agent games. The idea is that in most of multi-agent games multiple Nash Equilibriums exist with different payoffs. The goal is to find the NE that provides the highest payoff. +Policy Gradient and its variants, which have obtained a lot of practical successes, in general fail to find the NE with the highest payoff. +A first approach could be to re-start PG with different initializations for finding different NEs and then selects the best one. +In contrast, the authors propose to randomize the reward structure for exploring different policies. Then the policies are optimized on different reward structures with PG. The policy that leads to the highest payoff is selected and then optimized with PG on the original structure of rewards. +The authors provide some theoretical results to show that reward randomization has a highest probability to find the best NE than random initializations of PG. +The authors also propose to use reward randomization for learning an agent against different type of opponents. +The experiments are done on three games and show the interest of their approach in comparison with several baselines. + +The paper is well written, proposes interesting ideas supported by analytical and experimental results. However the reviewer has some remarks, concerns and questions. + +Concerning Theorem 1, O(\epsilon) for a probability is not a strong result: it can be higher than 1. After looking the proof, the reviewer thinks that it seems possible to provide the right expression of the probability of finding the high payoff NE. + +Concerning Theorem 2, the proof is quite informal and the reviewer is not sure that it is correct. In particular, it is not clear if the same condition than in Theorem 1 is necessary: a-b = \epsilon (d-c). In the statement it seems not because a,b,c,d are uniformly sampled and there is no \epsilon in the statement, but the remark stating that RR necessitates at most O(\log1/\epsilon) times to achieves 1-\epsilon suggests that it does. +Moreover the reviewer thinks that the proposed analysis (statements of Theorem 1 and 2) will be more convincing if the number of starts, needed by the two approaches for finding w.h.p the high payoff NE, is compared (as you did in the remark). + +In Algorithm 2, the authors write that the policy \pi’_2 is drawn from \Pi_2, but in the experiments section 5.3, the authors explain that \Pi_2 is carefully built, meaning that the policies in \Pi are chosen to be effective. This step is not in Algorithm 2, which is still correct, but this suggests that if \Pi_2 is not well chosen Algorithm 2 does not work. +This leads to my main concern. The rewards seem to be uniformly sampled with the constraint that their sum is no more than C_{max}. However, with this kind of uniform sampling the set of games used for exploring policies contains a lot of games that does not respect the constraints induced by the original game M. For instance in stag-hunt we have a \geq b \geq d > c. Using uniform sampling most of the induced games do not respect this reward structure. So it can lead to inefficient policy. 
For instance if a < b and a < d an efficient policy is to not track the stag and to hunt the hare. The reviewer understands that the diversity of rewards allows the diversity of obtained policies, but the reviewer is wondering if sampling the rewards with respect to the reward constraints of the game is not enough to obtain the diversity of policies. At a minimum, it could be interesting for the reader to have this reasonable baseline. By the way, may be this baseline allows Algorithm 2 working without carefully choosing the set of policies \Pi. + +Overall, the paper is interesting, but the reviewer thinks that it could be better. The reviewer can change his mind if his concerns are answered. + + +___________________________ + + +After the rebuttal I raised my score. +",6,4.0,ICLR2021 +r1l4JK8ijQ,1,H1z-PsR5KX,H1z-PsR5KX,"Well-written paper applying a method for finding individual influential neurons to MT, but insight is ultimately limited","The authors propose a number of methods to identify individual important neurons in a machine translation system. The crucial assumption, drawn from the computer vision literature, is that important neurons are going to be correlated across related models (e.g. models that are trained on different subsets of the data). This hypothesis is validated to some extent: erasing the neurons that scored highly on these measures reduced BLEU score substantially. However, it turns out that most of the activation of the important neurons can be explained using sentence position. Supervised classification experiments on the important neurons revealed neurons that tracked properties such as the span of parentheses or word classes (e.g., auxiliary verbs, plural nouns, etc). + +Strengths: +* The paper is very well written and provides solid intuitions for the methods proposed. +* The methods seem promising, and the degree of localist representation is striking. +* The methods may be able to address the question of *how* localist the representations are (though no numerical measure of localism is proposed). +* There is a correlation between the neuron importance metrics proposed in the paper and the effect on BLEU score of erasing those neurons from the network (of course, it’s not clear what particular linguistic properties are affected by this erasure - the decrease BLEU may reflect inability to track specific word tokens more than any higher-level linguistic property). + +Weaknesses: +* It wasn't clear to me why the neurons that track particular properties (e.g., being inside a parentheses) couldn't be identified using a supervised classifier to begin with, without first identifying ""important"" neurons using the unsupervised methods proposed in the paper. The unsupervised methods do show their strength in the more exploratory visualization-based analyses -- as the authors point out (bottom of p. 6), the neuron that activates on numbers but only at the beginning of the sentence does not correspond to a plausible a-priori hypothesis. Still, most of the insight in the paper seems to be derived from the supervised experiments. +* The particular linguistic properties that are being investigated in the classification experiments are fairly limited. Are there neurons that track syntactic dependencies, for example? +* I wasn't sure how the GMMs (Gaussian mixture models) for predicting linguistic properties from neuron activations were set up. 
+* It's nice to see that individual neurons function as knobs that can change the gender or tense of the output (with varying accuracy). At the same time, I was unable to follow the authors' argument that this technique could be used to reduce gender bias in MT. +* I wasn't sure what insight was gained from the SVCCA analyses -- this method seems to be a bit of a distraction given the general focus on localist vs. distributed representation. In general, I didn’t come away with an understanding of the pros and cons of each of the methods.",6,4.0,ICLR2019 +SkeKSvde9r,2,HkxBJT4YvB,HkxBJT4YvB,Official Blind Review #2,"The paper proposes a new way of estimating treatment effects from observational data, that decouples (disentangles) the observed covariates X into three sets: covariates that contributed to the selection of the treatment T, covariates that cause the outcome Y and covariates that do both. The authors show that by leveraging this additional structure they can improve upon existing methods in both ITE and ATE
 + +The main contributions of the paper are: +* Highlighting the importance of differentiating between treatment and outcome inducing factors and proposing an algorithm to detect the two +* Creating a joint optimisation model that contains the factual loss, the cross entropy (treatment) loss and the imbalance loss + +Overall, I like the paper quite a lot, I find it well-written and clearly motivated with a very nice experimental section that it is designed around understanding the behaviour of the proposed model. + +In terms of suggestions, I think it will be very interesting to link the approaches using invariant causal representations with existing work in the Counterfactual Risk Minimization [1] literature and to mutualise the experimental setup. + +[1] Swaminathan, Adith, and Thorsten Joachims. ""Counterfactual risk minimization: Learning from logged bandit feedback."" International Conference on Machine Learning. 2015.",8,,ICLR2020 +HJeCA06d3X,1,r1g1LoAcFm,r1g1LoAcFm,Using label structure to address class imbalance,"This is a clear and well written paper that attempts to improve our ability to predict in the setting of massive multi-label data which, as the authors highlight, is an increasingly import problem in biology and healthcare. + +Strengths: +The idea of using the hierarchical structure of the labels is innovative and well-motivated. The experimental design and description of the methods is excellent. + +Weaknesses: +Overall the results are not consistently strong and there is a key baseline missing. The approach only seems help in the ""rare label, small data"" regime, which limits the applicability of the method but is still worthy of consideration. + +My biggest reservation is that the authors did not include a baseline where the classes are reweighted according to their frequency. Multilabel binary cross-entropy is very easy to modify to incorporate class weights (e.g. upweight the minority class for each label) and without this baseline I am unable to discern how well the method works relative to this simple baseline. + +One more dataset would also strengthen the results, and since I am suggesting more work I will also try to be helpful and be specific. Predicting mesh terms from abstracts would qualify as a massive multilabel task and there is plenty of public data available here: https://www.nlm.nih.gov/databases/download/pubmed_medline.html + +Finally, there is one relevant paper that the authors may wish to consider in their review section: https://www.biorxiv.org/content/early/2018/07/10/365965",4,4.0,ICLR2019 +lE_MDmxKeB,3,h8q8iZi-ks,h8q8iZi-ks,An interesting idea and limited contribution,"########################################################################## + +Summary: + +This paper propose a method for leveraging additional annotation by using an auxiliary network that modulates activations of the main network. The method proposed achieves significant improvements over a strong baseline on two datasets. + +########################################################################## + +Reasons for score: + +ves: ++ This paper tackle the problem of out-of-distribution generalization which is crucial for machine learning applications. ++ The idea of leveraging additional information to enhance domain shift generalization is interesting and make sense to me. ++ The proposed conditional networks is practical. ++ Overall, the paper is well written and is easy to follow. + +cons: +- My main concern about the paper is the novelty. 
The paper seems the incremental work based on Conditional Batch Normalization, which limits the contribution. Overall, the novelty and contribution of this work are marginal. + +########################################################################## + + + ",4,3.0,ICLR2021 +r1ezQrs3tB,2,ByxtHCVKwB,ByxtHCVKwB,Official Blind Review #1,"The paper presents a machine-learning based heuristic for solving traveling +salesman problems. In particular, MCTS is used to explore a large neighbourhood. +The authors present their approach and evaluate it empirically. + +The presented approach is interesting; a few details could be described in more +detail and motivated better (for example how the particular functional form for +estimating the potential Z of an edge was chosen). but in general the paper is +well-written. + +The main part where the paper falls short is the experimental evaluation. The +authors state that the reference algorithms were executed on different +platforms, even though at least some of them are publicly available and the +authors could have run them themselves. In their own experimental setup, the +authors overload the machines by solving instances on each hyper-threaded +logical core instead of the physical cores for no apparent reason. Running on +logical cores like this leads to significantly longer runtimes. This +experimental setup is changed for instance set 2 for no apparent reason. + +The authors claim to improve on the optimal solution that concorde finds, +confirmed by a non-peer-reviewed paper, without providing a justification -- if +this is due to rounding errors, are the found tours the same and just the length +computation is flawed? Or are the tours different? + +Tables 1 and 2 present results in completely different formats. This makes it +unnecessarily hard to compare results. In particular, Table 2 presents no run +times. + +Finally, the instances used to evaluate the approach seem relatively easy. +TSPlib contains many more instances that are more challenging to solve, with +hundreds to thousands of cities. Even on the relatively small instances, the +presented approach is often an order of magnitude slower than the exact solver +concorde -- why would I want to use the presented approach in a practical +setting? + +In summary, I feel that the paper cannot be accepted in its current form. +",1,,ICLR2020 +afIQKoSffg,2,SzjyTIc5qMP,SzjyTIc5qMP,Interesting new direction,"This paper proposes a novel learning framework called information lattice learning. It is formulated as an optimization problem that finds decomposed hierarchical representations that are efficient in explaining data using a two-phased approach. ILL generalizes Shannon's information lattice and authors demonstrate ILL can be applied to learning music theory from scores and chemical laws from molecular data. This paper is proposing a new research direction and I believe it is worth to be presented. One concern I have is the complexity and scalability of the proposed algorithm. + +Authors emphasize ""small data"", but I don't see why the proposed approach cannot be applied to ""large data"". In page 15, authors mention the worst case complexity of O(2^N). Does it mean the proposed approach works only for ""simple"" examples such as discovering music theory and chemical laws considered in this paper? Can authors elaborate more on the complexity and the scalability issues of their algorithm? Did authors only consider ""small data"" regime due to the scalability problem? 
+ +The definition of signal seems very general and it can even include pmf's. How can we enforce restrictions on signals such as probability simplex? + +Can authors comment on how to make a deep learning version of the proposed framework? Say, hierarchical info GAN, hierarchical VAE, etc.? + +It would be interesting to compare their work with existing unsupervised deep learning algorithms that attempt to find disentangled representations. +",7,3.0,ICLR2021 +Bye9YWnphm,3,Skl6k209Ym,Skl6k209Ym,Sound empirical study,"The authors propose a deep learning method based on image alignment to perform one-shot classification and open-set recognition. The proposed model is an extension of Matching Networks [Vinyals et al., 2016] where a different image embedding is adopted and a pixel-wise alignment step between test and reference image is added to the architecture. + +The work relies on two strong assumptions: (i) to consider each point mapping as independent, and (ii) to consider the correct alignment much more likely than the incorrect ones. The manuscript doesn’t report arguments in favour of these assumptions. The motivation is partially covered by your statement “marginalizing over all possible matching is intractable”, nevertheless an explanation of why it is reasonable to introduce these assumptions is not clearly stated. + +The self-regularization allows the model to have a performance improvement, and it is considered one of the contribution of this work. Nevertheless the manuscript doesn’t provide a detailed explanation on how the self regularization is designed. For example it is not clear whether the 10% and 20% pixel sampling is applied also during self regularization. + +The model is computationally very expensive and force the use of only 10% of the target image pixels and 20% of the reference images’ pixels. The complexity is intrinsic of the pixel-wise alignment formulation, but in any case this approximation is a relevant approximation that is never justified. The use of hyper column descriptors is an effective workaround to achieve good performance even though this approximation. The discussion is neglecting to argue this aspect. + +One motivation for proposing an alignment-based matching is a better explanation of results. The tacit assumption of the authors is that a classifier driven by a point-wise alignment may improve the interpretation. The random uniformly distributed subsampling of pixels makes the model less interpretable.It may occur for example as shown in figure 3 where the model finds some points that for human interpretation are not relevant and at the same time these points are matched with points that have some semantic meaning. +",6,2.0,ICLR2019 +HkeTmSckqS,3,BJxSI1SKDH,BJxSI1SKDH,Official Blind Review #3,"This paper proposes to incorporate a latent model (in the form of a variational auto-encoder) in the decoding process of neural machine translation. The motivation is to capture morphology. Experiments on three language-pairs (English to Arabic, Czech, and Turkish) show promising improvements in translation accuracy (BLEU). + +I am quite excited by this paper. So far there are not that many successful demonstrations of VAE in neural machine translation. The method is sound and interesting. The results show convincing improvements; some practioners may argue that reported BLEU gain (e.g. 0.8) is not impressive, but I think for a new model like this it is worthy. 
+ +Some suggestions or questions: + +- One alternative evaluation metric that might be interesting is to lemmatize both translation outputs and reference, then compute BLEU. This will help distinguish whether the proposed method is improving by getting the morphological inflections correct, or whether it is improving across the board on various word types. + +- Table 3 is interesting. If there is space, I would suggest more analysis along those directions, i.e. investigating what morphology is learned, what is in the latent spaces. + +- Do you think the results will vary depending on decoder layer depth? I wonder if different kinds of latent spaces will be learned with different depth. + +- Also a related question is how about varying the source input BPE merge operation? Again, it seems like these design choices might affect results, especially when dealing with morphology. +",8,,ICLR2020 +BkebP2PthQ,2,rJgTciR9tm,rJgTciR9tm,"Preliminary, with 1 promising experiment, but unclear and vague","The paper proposes a method to learn the conditional distribution of a random variable in order to minimize and maximize certain mutual information terms. Interestingly, the proposed method can be applied to sentiment prediction and outperforms a 2018 method based on SVM. + +Overall, the ideas seem intriguing, and the results seem promising, but I really cannot understand what the paper is saying, and I think the paper would be much stronger if it was written more clearly (to make individual sentences more clear, but also to make the broader picture more clear). Not only is the writing hard to understand (some sentences lack a verb!), but it is vague, and the notion of a ""complex system"" is never defined. It seems that the technique can be applied to any (potentially non-stationary) Markov process? + +Additionally, due to the lack of clarity in the writing and lack of mathematical rigor, Theorem 1 does not seem to be true as stated. I think this is an issue of stating the assumptions, and not due to a mistake in the derivation. Right now, the actual conclusion of theorem 1 is not even clear to me. + +Quality: poor/unclear +Clarity: very poor +Originality: unclear, perhaps high? Not clear how related it is to the methods of Tishby et al. +Significance: unclear, as clarity was poor, and there was minimal discussion of alternative methods. + +Specific points: + +- Eq (2), the first term is included because it is for the ""information compression task"", but I do not understand that. Where is the actual compression? This is not traditional compression (turning a large vector into a smaller vector), but more like turning one PDF into a PDF with lower entropy? + +- This paper seems to fall into the subfield of system identification (at which I am not an expert), so I'd expect to see some related literature in the field. The only compared method was the IF method of Tishby et al. from 18 years ago (and the current work seems to be a generalization of that). + +- Equation (4): what exactly is the object being minimized? Is it a PDF/probability measure? Is it an *instance* of a random variable? If it is a PDF, is it the PDF of B_k | X_{k-1} ? + +- The statement of Theorem 1 is either too vague or wrong. To say ""The solution... is given by"" makes it sound like you are giving equations that define a unique solution. Perhaps you mean, ""Any solution ... must necessarily satisfy..."" ? And that is not clearly true without more work. 
You are basically saying that any minimizer must be a stationary point of the objective (since you are not assuming convexity). It seems everything is differentiable? How do you know solutions even exist -- what if it is unbounded? In that case, these are not necessary conditions. + +- Lemma 1: ""The iterative procedure... is convergent."" The iterative procedure was never defined, so I don't even know what to make of this. + +- Section 3.2: ""As proved by prior work, the optimum solution obtained by a stochastic transformation that is jointly Gaussian with bottleneck's input."" I do not know what you are trying to say here. There's no predicate. + +- Section 4 wasn't that interesting to me yet, since it was abstract and it seemed possible that you make a model to fit your framework well. But section 5 is much better, since you apply it to a real problem. However, what you are actually solving in section 5 is unclear. The entire setup is poorly described, so I am very confused. + +",4,2.0,ICLR2019 +HJcItTz4e,3,H1zJ-v5xl,H1zJ-v5xl,Review,"This paper introduces the Quasi-Recurrent Neural Network (QRNN) that dramatically limits the computational burden of the temporal transitions in +sequence data. Briefly (and slightly inaccurately) model starts with the LSTM structure but removes all but the diagonal elements to the transition +matrices. It also generalizes the connections from lower layers to upper layers to general convolutions in time (the standard LSTM can be though of as a convolution with a receptive field of 1 time-step). + +As discussed by the authors, the model is related to a number of other recent modifications of RNNs, in particular ByteNet and strongly-typed RNNs (T-RNN). In light of these existing models, the novelty of the QRNN is somewhat diminished, however in my opinion their is still sufficient novelty to justify publication. + +The authors present a reasonably solid set of empirical results that support the claims of the paper. It does indeed seem that this particular modification of the LSTM warrants attention from others. + +While I feel that the contribution is somewhat incremental, I recommend acceptance. +",6,4.0,ICLR2017 +hjlzbveTWb,2,k9EHBqXDEOX,k9EHBqXDEOX,Theoretical analysis for A3C with 1-step TD,"This paper revisits the A3C algorithm with TD(0) for the critic update to provide better theoretical analysis of A3C. A3C-TD(0) achieves linear speedup and it also matches our intuition. To show the empirical results, the authors provide convergence results of A3C-TD(0) with Markovian sampling in synthetic environments and speedup of A3C-TD(0) in CartPole and Seaquest. +In this paper, the theoretical and experimental results show that using multiple workers in parallel improves learning speed without loss of sample efficiency. I think it is a valuable research direction. +However, A3C-TD(0) is limited compared to A3C because A3C-TD(0) does not use multi-step TD or TD(\lambda). Moreover, the authors use only two gym environments, which seems insufficient. 
+",6,3.0,ICLR2021 +Bke0ohlmoQ,1,BJfIVjAcKm,BJfIVjAcKm,"Good paper, but the value of ""verification"" over ""certification"" is not clear","The paper presents several ways to regularize plain ReLU networks to optimize 3 things + +- the adversarial robustness, defined as the fraction of examples for which adversarial perturbation exists +- the provable adversarial robustness, defined as the fraction of examples for which some method can show that there exists no adversarial example within a certain time budget +- the verification speed, i.e. the amount of time it takes some method to verify whether there is an adversarial example or not + +Overall, the ideas are sound and the analysis is solid. My main concern is the comparison between the authors method and the 'certification' methods, both conceptually and regarding performance. + +The authors note that their method falls under 'verification', whereas many competing methods fall under 'certification'. They point to two advantages of verification over certification: (1) the ability to provide true negatives, i.e. prove that an adversarial example exists when it does, and (2) certification requires that 'models must be trained and optimized for a specific certification method'. However, neither argument convinces me regarding the utility of the authors method. + +Regarding (2): The authors method also requires training the network in a specific way (with RS loss), and it is only compatible with verifiers that care about ReLU stability. + +Regarding (1): It is not clear that this would be helpful at all. Is it really that much better if method A has 80% proven robustness and 20% proven non-robustness versus method B that has 80% proven robustness and 20% unknown? One could make the case that method B is actually even better. + +So overall, I think one has to compare the authors method and the certification methods head-to-head. And in table 3, where this is done, Dvijotham comes out on top 2 out of 2 times and Wong comes out on top 2 out of 4 times. That does not seem convincing. Also, what about the performance numbers form other papers discussed in section 2? + +------- + +Other issues: + +At first glance, the fact that the paper only deals with (small) plain ReLU networks seems to be a huge downside. While I'm not familiar with the verification / certification literature, from reading the paper, I suspect that all the other verification / certification methods also only deal with that or highly similar architectures. However, I will defer to the other reviewers if this is not the case. + +To expand upon my comment above, I think the paper should discuss true adversarial accuracy on top of provable adversarial robustness. Looking at table 1, for instance, for rows 2, 3 and 4, it seems that the verifier used much less than 120 seconds on average. Does that mean the verifier finished for all test examples? And wouldn't that mean that the verifier determined for each test example exactly whether an adversarial example existed or not? In that case, I would write ""true adversarial accuracy"" instead of ""provable adversarial accuracy"" as column header. If the verifiers did not finish, I would include in the paper for how many examples the result was ""adverarial example exists"" and for how many the result was ""timeout"". I would also include that information in table 3, and I would also include proving / certification times there. 
+ +Based on the paper, I'm not quite sure whether the idea of training with L1 regularization and/or small weight pruning and/or ReLU pruning for the purpose of improving robustness / verifiability was an original idea of this paper. In either case, this should be made clear. Also, the paper seems to use networks with adversarial training, small weight pruning, L1 and ReLU pruning as its baseline in most cases (all figures except table 1). If some of these techniques are original contributions, this might not be an appropriate baseline to use, even if it is a strong baselines. + +Why are most experiments presented outside of the ""experiments"" section? This seems to be bad presentation. + +I would include all test set accuracy values instead of writing ""its almost as high"". Also, in table 3, it appears as if using RS loss DOES in fact reduce test error significantly, at least for CIFAR. Why is that? + +While, again, I'm not familiar with the background work on verification / certification, it appears to me from reading this paper that all known verification algorithms perform terribly and are restricted to a narrow range of network architectures. If that is the case, one has to wonder whether that line of research should be encouraged to continue. + +-------- + +Minor issues: + +- ""our focus will be on the most common architecture for state-of-the-art models: k-layer fully-connected feed-forward DNN classifiers"" Citation needed. Otherwise, I would suggest removing this statement. +- ""such models can be viewed as a function f(.,W)"" - you also need to include the bias in the formula I think +- ""convolutional layers can be represented as fully-connected layers"". I think what you mean is ""convolutional layers can be represented as matrix multiplication"" +- could you make the difference between co-design and co-training more clear? +- The paper could include in the appendix a section outlining the verification method of Tjeng",5,3.0,ICLR2019 +5L9tBzINyVP,4,48goXfYCVFX,48goXfYCVFX,Promising application paper about ingredient recommendation.,"The paper studies a promising task of interpretable food ingredients recommendation - there has been a growing interest in modeling recipes. The idea of leveraging KG to improve the interpretability/faithfulness of recipe-related ML tasks seems like a contribution to the community. In particular, the author proposes a method to learn pair specific relational representations for one-to-one (i.e. ingredient to ingredient) and many-to-one (ingredient-set to ingredient) food pairing tasks. + +Pros: + +The task itself is an interesting application; meanwhile, the task is non-trivial as the ingredient pairing is complicated and affected by various factors. + +It uses recipes instead of interaction history to recommend complementary ingredients. + +It proposes a method based on the memory network. In particular, it first embeds the preselected ingredient set vector and a candidate ingredient vector, then sums them and feeds into the memory network. The output of the memory network is called the relational vector, which is added with the input embedding and then put into a scoring function. The training follows the standard ranking problem which aims to optimize the triplet loss. + +KB triples of (ingredient, attribute, attribute value) are further represented as external KB embeddings to augment the memory network. 
+ +Conduct a qualitative analysis of the attention weights on the KB embedding and show that IRRM is able to capture some level of relation between ingredients. + + +Cons: + +The experimental section seems to miss important baselines models. Most of the baseline models are non-neural network-based methods. Also, they are mostly based on interaction history. It would be convincing to adding several neural model baselines that use recipe text as training inputs. Otherwise, it's hard to see which component brings improvement. + +The author proposes two new evaluation tasks to show the model's performance - both of them fall into a category of ingredients completion. The tasks seem a bit simple and less beneficial in a real scenario. There are more practical and challenging alternatives that could be used for better evaluation. For example, predict the complementary ingredients given the recipe name or recipe steps. + +The improvement of IRRM on Hit@10 on the Recipe Completion Task seems marginal (i.e. it even underperforms NMF). Is there any reason? + +One other concern is that - most contribution comes from the novel task. It may have a limited scope of the audience at ICLR as it's more like an industrial track application paper, though from the application perspective the paper may be impactful. Other top venues in the field of data mining and recommender systems seem like a better fit. +",5,4.0,ICLR2021 +rk_Zn-G4x,2,S1Jhfftgx,S1Jhfftgx,Not very convincing,"This paper proposes a way of enforcing constraints (or penalizing violations of those constraints) on outputs in structured prediction problems, while keeping inference unconstrained. The idea is to tweak the neural network parameters to make those output constraints hold. The underlying model is that of structured prediction energy networks (SPENs), recently proposed by Belanger et al. + +Overall, I didn't find the approach very convincing and the paper has a few problems regarding the empirical evaluation. There's also some imprecisions throughout. The proposed approach (secs 6 and 7) looks more like a ""little hack"" to try to make it vaguely similar to Lagrangian relaxation methods than something that is theoretically well motivated. + +Before eq. 6: ""an exponential number of dual variables"" -- why exponential? it's not one dual variable per output. + +From the clarification questions: +- The accuracy reported in Table 1 needs to be explained. +- for the parsing experiments it would be good to report the usual F1 metric of parseval, and to compare with state of the art systems. +- should use the standard training/dev/test splits of the Penn Treebank. +The reported conversion rate in Table 1 does not tell us how many violations are left by the unconstrained decoder to start with. It would be good to know what happens in highly structured problems where these violations are frequent, since these are the problems where the proposed approach could be more beneficial. + + +Minor comments/typos: +- sec.1: ""there are"" -> there is? +- sec 1: ""We find that out method is able to completely satisfy constraints on 81% of the outputs."" -> at this point, without specifying the problem, the model, and the constraints, this means very little. How many constrains does the unconstrained method satisfies? +- sec 2 (last paragraph): ""For RNNs, each output depends on hidden states that are functions of previous output values"" -- this is not very accurate, as it doesn't hold for general RNNs, but only for those (e.g. 
RNN decoders in language modeling) where the outputs are fed back to the input in the next time frame. +- sec 3: ""A major advantage of neural networks is that once trained, inference is extremely efficient."" -- advantage over what? also, this is not necessarily true, depends on the network and on its size. +- sec 3: ""our goal is take advantage"" -> to take advantage +- last paragraph of sec 6: ""the larger model affords us"" -> offers? +",3,4.0,ICLR2017 +SyxKniRnYS,2,SyedHyBFwS,SyedHyBFwS,Official Blind Review #3,"In this paper the authors present a new way to use autoregressive modeling to generate images pixel by pixel where each pixel is generated by modeling the difference between the current pixel value and the preexistent ones. In order to achieve that, the authors propose a copy and adjustment mechanism that select an existing pixel, and then adjust its sub-pixel (channel values) to generate the new pixel. The proposed model is demonstrated with a suite of experiments in classic image generation benchmark. The authors also demonstrate the use of their technique in Image to Image translation. +Overall, although the paper explain clearly the intuition and the motivation of the proposed technique, I think that the paper in its present state have low novelty, weak related work analysis review and insufficient experiments to support a publication at ICLR. + + + +**Novelty, contribution and related work** +The authors should highlight better their main contribution novelty of the proposed method compared to their baseline. + + +**Result and conducted experiments** +the correctness of the proposed approach is not proved by the conducted experiment in fact: +The experiments do not provide the details of the used architecture compared to your baseline. +In Table 1 you report the results using your technique on several computer vision tasks (generation, colorization and super-resolution) but you're not comparing with the SoA of each of these tasks. +The results reported in Tables 1 and 2 are not convincing when compared to existing approaches (using only CIFAR10 in Table2). +There are so many missing details specially to validate Image-To-image translation +Figure 3 is confusing and not clear + +**Minor comments** +In references section : (Kingma & Dhariwal, 2018) is not in a proper format (nips 2018) +Bad quality of illustrations and images +Be coherent with the position of captions (figure 3)",3,,ICLR2020 +Zy418x0vOS,1,01olnfLIbD,01olnfLIbD,Important empirical work demonstrating real threat of poisoning attack on large-scale CNNs.,"This paper presents a scalable data poisoning attack algorithm focusing on targeted attacks. The technique is based on gradient matching, where the intuition is to design the poisoning patterns such that their effect on the gradient of the training loss mimics the gradient as if the targeted test image is included in the training data. + +The paper presents both theoretical intuitions behind the algorithm, as well as empirical reduction and simplification to make the algorithm scalable to ImageNet and applicable to even a black-box attack against the Google Cloud AutoML toolkit. + +The algorithm proposed in this paper is practical and general, making it a realistic poisoning threat to modern deep learning systems. The presentation is clear and the theoretical justification is intuitive and easy to understand. + +Overall, I think this paper is a good contribution to the study of the large-scale poisoning attack. 
+ +Minor typo: + In proof of Prop 1, you need the angle between the two gradients to be almost always smaller than 90 degrees, not 180 degrees.",7,4.0,ICLR2021 +HJzp02rMM,3,SJvrXqvaZ,SJvrXqvaZ,Ok but not good enough,"Clarity +The paper is clear in general. + +Originality +The novelty of the method is limited. The proposed method is a simple extension of L. Pinto et al. by replacing TRPO with A3C. No evidence is provided to show the proposed method is competitive with the original TRPO version. + +Significance +- The empirical results on the hardware are valuable. +- The simulated results are very limited. The neural networks used in the simulation have only one hidden layer. The method is tested on the Pendulum domain. + +Pros: +- Real hardware results are provided. + +Cons: +- Limited simulation results. +- Lacking technical novelty. +",4,4.0,ICLR2018 +r1eZ-Hw8nX,1,BJgQB20qFQ,BJgQB20qFQ,An application of tree and DAG LSTMs with important details missing from the draft,"The paper proposes to plan by taking an initial plan and improving it. The authors claim that 1) this will achieve results faster than planning from scratch and 2) will lead to better results than using quick, local heuristics. However, when starting with an initial solution there is always the danger of the final solution being overly biased by the initial solution. The authors do not address this adequately. They show how to apply tree and DAG-based LSTMs to job scheduling and shortening expressions. Since they are simply using previously proposed LSTM variants, I do not see much contribution here. The experiments show some gains on randomly generated datasets. More importantly, details are missing such as the definitions of SP and RS from section 4.4.",5,3.0,ICLR2019 +DkLzaRbXwz2,2,IjIzIOkK2D6,IjIzIOkK2D6,"We carefully review the motivation, approach, and empirical results."," +This work proposes an efficient graph neural architecture search to address the problem of automatically designing GNN architecture for any graph-based task. Comparing with the existing NAS approaches for GNNs, the authors improves the search efficiency from the following three components: (1) a slim search space only consisting of the node aggregator, layer aggregator and skip connection; (2) a one-shot search algorithm, which is proposed in the previous NAS work; and (3) a transfer learning strategy, which searches architectures for large graphs via sampling proxy graphs. However, the current performance improvement over the human-designed models is marginal, which diminishes their research contribution. + +The paper organization is clear, but some expressions should be improved. The details are listed as below. +Typos: In the Abstract, state-of-the-art should be abbreviated as SOTA, not SOAT. +Typos: $L_\theta(Z)$ after Equation (4) is not defined. Should it be $L_W (Z)$ as used in Equation (4)? +Clarity: The explanation before Equation (5) is a bit confused, which should be re-organized. There is grammar error (the absence of sentence subject) in the first sentence: “however, in this work, to make use of the differentiable nature of Lθ(Z), and design a differentiable search method to optimize Eq. (4).” +Clarity: The notations related to variable $Z_{i, j}$, i.e., $Z_{i, j}^T$ and $Z_{i, j}^k$, are not defined well. What is the difference between the super-scripts: T and k? + +The pros of this work are summarized in terms of three components used in EGAN, which improves the search efficiency. 
The experiment results show that their framework greatly reduce time, comparing with the GraphNAS, Bayesian search and random search. + +Major questions: +(1) In Introduction: we doubt that designing proper GNN architectures will take tedious efforts. As far as I know, the architecture parameters of the human-designed models do not require extensive tuning efforts on the testing benchmark datasets. Furthermore, most of the architecture parameters could be shared and used among the testing datasets to achieve the competitive performances. +(2) It is unclear for the second challenge: the one-shot methods cannot be directly applied to the aforementioned dummy search space. There are some one-shot models with the parameter sharing strategy used for searching the hidden embedding size. +(3) In Section 3.1, why is the dummy search space very large? The search space seems only to include the aggregators and hidden dimensions. It might be much smaller than the search space of CNNs. +(4) Their search space assigns skip connections between the intermediate layers and the final layer, which is contradictory to the common case where the skip connections could be applied among the intermediate layers. As shown in [1], the skip connections may exist between any two layers. Could you provide reasons on the design of skip connection limitation? +(5) In the node and graph classification of the experimental section, the performance improvement over the human-designed is marginal. This would not justify the motivation of applying NAS to search graph neural networks. The authors should provide more discussions on the contribution of this work in terms of research and industrial applications. +(6) The marginal performance improvement might result from the search space. Currently, the authors’ search space is based on the traditional message passing approaches. They should consider more the recent developments in GNNs to further improve the performance. +(7) The selection of baselines is unfair. The search space contains the skip connection components based on the JK-Network. However, authors excluded the important baseline in [2], which could achieve the comparable performance on dataset Citeseer and Reddit. For the graph classification task, authors also excluded a lot of pooling methods, such as the Graph-u-Net [3], which achieves the better performance than the proposed approach. + +[1] Rong, Yu, et al. ""Dropedge: Towards deep graph convolutional networks on node classification."" International Conference on Learning Representations. 2019. +[2] Xu, Keyulu, et al. ""Representation learning on graphs with jumping knowledge networks."" arXiv preprint arXiv:1806.03536 (2018). +[3] Gao, Hongyang, and Shuiwang Ji. ""Graph u-nets."" arXiv preprint arXiv:1905.05178 (2019) +",3,5.0,ICLR2021 +rke32mHAYr,3,rylvAA4YDB,rylvAA4YDB,Official Blind Review #2,"This paper proposes a new neural network architecture for dealing with graphs dealing with the lack of order of the nodes. The first step called the graph isomorphic layer compute features invariant to the order of nodes by extracting sub-graphs and cosidering all possible permutation of these subgraphs. There is no training involved here as no parameter is learned. Indeed the only learning part is in the so-called classification component which is a (standard) fully connected layer. In my opinion, any classification algorithm could be used on the features extracted from the graphs. +Experiments are then given for the graph classification. 
I do not understand results of Table 1 as the accuracies reported for MUTAG and PTC in Xu et al with GIN are much higher than the numbers here.",1,,ICLR2020 +H1eUT3mRtS,3,B1eWbxStPH,B1eWbxStPH,Official Blind Review #1,"Strength: +-- The paper is well written and easy to follow +-- The authors proposed a new approach called directional message passing to model the angles between atoms, which is missing in existing graph neural networks for molecule representation learning +-- The proposed approach are effective on some targets. + +Weakness: +-- From the point of view of graph neural networks, the novelty of the proposed techniques is marginal +-- The performance of the proposed method are only better than existing methods on some of the targets. + +This paper studied learning the graph representation of molecules by considering the angles between atoms. The authors proposed a specific type of graph neural network called directional message passing. Experimental results on the QM9 data set prove the effectiveness of the proposed approach over existing sota algorithms such as Schnet for some of the targets. + +Overall, the paper studies a very important problem, which aims to learn the representation of molecules. Modeling the angles of atoms is indeed less explored in existing literature. From the view of graph neural networks, the proposed technique is not that new since edge embedding has already been studied in existing literature. But for the technique could be particular useful for molecule representation learning, especially with the BESSEL FUNCTIONS. One question is that the schnet also leverages the coordinates of the atoms, which may also implicitly model the angles between the edges. In terms of the experiments, the proposed approach does not seem that strong, only achieving the best performance on 5 targets out of 12. + +Overall, I feel this paper is on the borderline. Now I will give weak accept and will revise my score according to other reviews and comments. ",6,,ICLR2020 +jM2Q0KpXWqt,2,EoVmlONgI9e,EoVmlONgI9e,"Important problem, but unsure if this bias makes sense","This paper tackles the problem of exploration in multi-agent RL, +formulated as a Dec-POMDP. The authors propose to shape the reward +using the output of a classifier that tries to determine which agent +saw a particular observation. Moreover, the authors propose some +regularization schemes to ""break ties"" early on in the training. Then, +it is shown that the proposed reward shaping term can be integrated +into two popular MARL algorithms that use centralized training: MAAC +and QMIX. The integration into MAAC is relatively straightforward +because MAAC uses independent critics, whereas the integration into +QMIX is more involved due to the Q-function mixing step. + +I believe this is a very important problem being tackled by the +authors. Even in single-agent RL, exploration is a major issue; this +only gets even more apparent in the multi-agent setting. I believe +that the paper was well-written, and the Introduction made it very +clear what exactly the paper's contributions were, which I greatly +appreciate. + +Unfortunately, I'm not sure that I believe that the proposed bias is +truly useful. I understand the intuition: we want agents to +""specialize"" their observations, such that it is easy to predict which +agent is receiving any particular observation. However, won't this be +counterproductive in most practically interesting domains? 
For +example, consider a team of robots working together in a factory or a +household; they will constantly be changing their environment as they +take actions toward operating the factory equipment or cleaning the +kitchen, which means the observations they receive will always be +changing. But the proposed reward shaping mechanism in this paper +would ""fight"" against this progress, because it would encourage the +agents to engage in trivial behavior just to be able to see the same +observations over and over. In my mind, it would be important to +consider the dynamics of how the classifier p(i | o_i) is changing +over time. + +Another option could be to simply encourage the agents to learn +different policies, maybe measured via KL-divergence of P_{agent 1}(a +| s) and P_{agent 2}(a | s). I believe that this bias would be +sufficient to solve the example presented in Figure 1? Does this bias +seem reasonable, and if so, how does it compare against EOI? +Basically, more broadly, I would have liked to see experimental +comparisons that convince me that EOI is the *correct* bias to use +versus other natural biases, whereas the current experiments only seem +to compare against non-shaping methods like ROMA and HL. + +Some other questions: + +1. How significant is the fact that you are ignoring second-order +effects in solving the bi-level optimization in Appendix A.2? Have the +authors conducted preliminary experiments to prove that it doesn't +make much difference? I understand that this practice is standard in +methods like MAML, but it would be nice to verify that it doesn't +matter in this setting either. + +2. Looking at the shaped reward computation, r + alpha * p(i | o_i), +it seems like if the classifier were naive and simply outputted a +uniform distribution, you would still be giving positive intrinsic +reward in that case. Might it make more sense instead to consider the +divergence between p(i | o_i) and the uniform distribution?",5,4.0,ICLR2021 +6edoeUPGvl,3,5i4vRgoZauw,5i4vRgoZauw,Interesting yet unconvincing ideas about modeling the primate visual system with DNNs,"The study starts from the fact that DNNs have been around and popular for a while for modeling the visual system, but that they are not realistic because they are trained via supervised learning approaches with a very large number of parameters and that this is not a feasible model of the development in the visual system. + +In general, although the manuscript presents some interesting ideas, it makes many assumptions without providing clear bases for these assumptions (e.g. compressing the weights of a pretrained network to sample new weights is posed as a realistic approximation of the infant visual brain) and lacks a theoretical foundation for the claims and experiments that are presented. The authors acknowledge that this study is intended as a proof of principle, but given the arbitrary nature of the choices made, I do not see the added significant value of the results. + +While DNNs are indeed commonly used as models of the primate visual system, in my view, the current study is addressing a somewhat inconsequential problem. This is because to the best of my knowledge, no neuroscientist is claiming that a deep neural network is a complete and accurate model of the (development of the) primate visual system. Furthermore, it is well-known and acknowledged that deep neural networks are not biologically plausible models of (how learning occurs in) the brain. 
They are currently one of the best computational tools to use to study the sensory (and especially the visual) nervous systems, and that is all that they are. It is not clearly explained why it is necessary to claim that the learning in these models and the development of the brain has to be similar for them to be good models of vision. Of course, we should thrive for better and more accurate models of the brain, but in my view the current study does not serve to this goal. + +In section 4 authors describe an initialization protocol for the network weights which involve compressing a trained model’s weights into clusters and then sampling from these clusters. What is not clear is why the authors assume that this can be a valid model of the infant visual system. At this point their approach sounds like arbitrarily selecting a set of criteria to make the networks perform worse than fully trained networks, and then training them. I could be missing something, but I do not see the relevance or necessity of an approach such as the presented one. A main concern is that no theoretical basis has been established in the paper besides some superficial ideas. For instance, why would an infant brain be made up of a DNN with connections whose weights are initialized with the method authors came up with? + +Much of the methodological details are only included in the appendix. I found it rather odd to not find any information about, for example, the proposed weight initialization method in the paper. + +It is not clear to me what is presented in Figure 1 and why. Why are the authors showing how models from another paper trains? + +Another concern is that nowhere in the results seems to be a test for significance. The improvements of the results could be a coincidence, since the results are heavily dependent on one experiment.",3,4.0,ICLR2021 +IrsjyoIypTt,2,HP-tcf48fT,HP-tcf48fT,"This paper proposes GLSEARCH, a Graph Neural Network based model for MCS detection, which aims to search for the maximum common subgraphs between two input graphs. ","The motivation of this paper is clear and interesting, as it’s important to explore the maximum common subgraph in biochemical domain. In this paper, the authors conduct a lot of experiments to demonstrate the effectiveness of the proposed method. Despite of this, the presentation of this paper requires improvement because many important details are missing, which makes it hard to follow. The time-complexity analysis might also be crucial to demonstrate the superiority of the proposed method over other baselines in terms of searching time. + +Strengths: +1. The motivation of this paper is clear and interesting. +2. The authors conduct many experiments to demonstrate the effectiveness of the proposed method. +Weakness: +1. The last paragraph in Section 2.2 (the notion of bidomain) is hard to follow. It’s not clear what is k, and how bidomain partitions the nodes to get V’_{k,1} and V’_{k,2}, which from two different graphs G_1 and G_2. Many details regarding the notations are missing when a new equation is introduced, e.g. r_t in Factoring Out Action subsection. These missing details make it hard to follow. +2. In Figure 1, what’s the difference between 01 and 10? +3. In equation 2, what operation does INTERACT stand for? +4. The authors mention the maximum common subgraph detection problem is NP-hard, so it’s important to provide the time complexity of the proposed algorithms. 
However, in this paper, the authors do not provide any time complexity analysis or report the running time of the proposed method. In addition, the authors mention that MCSP and MCSP+RL adopt a heuristic node pair selection policy while the proposed method is a ""learn-to-search"" algorithm. It might be interesting to see whether the proposed method greatly reduces the search time compared with state-of-the-art algorithms as the number of nodes increases.",5,4.0,ICLR2021
sytKzhZ_Wd_,4,s0Chrsstpv2,s0Chrsstpv2,Sound approach with new insights but some design choices don't seem to fit to closed-loop control,"__Summary__
The paper targets the problem of closed-loop 6D robotic grasping with a parallel gripper based on RGB-D in-hand-camera images. The policy takes an aggregated point cloud (computed from the image history) as input (using a PointNet++) and outputs the pose transformation of the gripper. There are several contributions in the specifics of the proposed method. The policy is pretrained using behavioral cloning and DAGGER on known object models, where the expert is composed of a grasp pose sampler and the OMG grasp trajectory planner. Subsequently, the pretrained policy is improved using TD3. Actor and critic networks are each regularized via a loss for solving the auxiliary task of (independently from each other) predicting the final grasping pose. 
As the goal poses are only available for the expert demonstrations goals are added for the policy roll-out in hindsight, similar to hindsight experience replay, which does not require access to an object model. The approach is evaluated in simulation for grasping YCB and ShapeNet objects with a Franka Emika Panda robot. The evaluation covers ablations of several algorithmical and architectural choices. + + +__Strong points__ +- The paper is well-written +- Thorough and interesting ablations +- Sound approach + +__Weak points__ +- State-representation might get invalidated when an object is moved, nullifying the biggest advantage of closed-loop grasping +- no comparisons with competing methods (e.g. open-loop 6D grasping + OMG) +- no evaluations in a cluttered environment +- no real robot experiment +- no code + + +__Recommendation__ +I recommend accepting the paper because the system looks sound and promising and is well explained. Some parts, like regularization based on goal pose prediction, are potentially useful also for very different approaches. + +__Supporting Arguments__ +Overall the approach seems sound and the success rates of around 90% seem good for closed-loop 6D grasping of unknown objects. The ablations that disentangle the contributions of the different design choices are valuable and I was delighted to also find a comparison to the more common and arguably more intuitive approach of using the goal predictions as input to the networks. As the approach and the results are presently sufficiently clear, the paper provides a valuable contribution and should thus be accepted. + +Still, there are also several weak points. +- The experiments don't seem to show the benefit of closed-loop grasping by only considering an uncluttered environment and not performing comparisons with open-loop approaches (e.g., [1] or [2]). +- Some design choices do not seem to fit well to closed-loop grasping, which is mainly important for dynamic scenes, e.g. for re-grasping slipping objects. However, in such cases the accumulated point clouds can quickly become invalid, and, furthermore, lower-level control actions might be more appropriate than pose-deltas. +- The approach is not evaluated on a real robot. For all I can tell, the simulation does not account for errors in control, calibration, forward kinematics that are encountered in practice. +- The supplementary did not contain source code. I hope that the authors consider publishing the code after acceptance. + +__Questions__ +1) How could the proposed method deal with changes in the object position (accumulated point cloud becomes invalid) +2) Did you consider sharing some features between the actor and critic, e.g. parts of the PointNet +3) I assume that the gripper is always closed at the end of a finite-horizon episode, is that correct? +4) I guess the foreground mask is obtained based on the z value of the transformed point cloud. The paper should be more specific here. +5) Please elaborate: ""We also empirically observe that the auxiliary task with the bootstrap errors caused by value overestimation."" + +__Additional Feedback__ +If I understand correctly (based on Appendix A) the method uses TD3 instead of ""vanilla"" DDPG. If this is the case, the main part should also be more concrete. +It's not clear to me what Figure 5 is supposed to show. What does contact-rich mean in this context? Is it about contacts before closing the gripper? 
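To make weak point 1 and question 1 more concrete, here is a minimal sketch of how I picture the hindsight relabeling interacting with the aggregated-cloud state (hypothetical names, purely illustrative, not the authors' implementation): once the object moves mid-episode, the stale points remain in the aggregate that every relabeled transition is built from.

from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class Step:
    frame_points: np.ndarray   # wrist-camera points observed at this step (N x 3)
    action: np.ndarray         # relative gripper pose change
    gripper_pose: np.ndarray   # absolute gripper pose reached after the action

def relabel_rollout(rollout: List[Step]) -> List[Tuple[np.ndarray, np.ndarray, np.ndarray]]:
    # hindsight goal: the gripper pose that was actually reached at episode end
    achieved_goal = rollout[-1].gripper_pose
    relabeled, cloud = [], np.empty((0, 3))
    for step in rollout:
        # the state is the cloud aggregated over all past frames; points from
        # frames recorded before an object moved are never removed
        cloud = np.concatenate([cloud, step.frame_points], axis=0)
        relabeled.append((cloud.copy(), step.action, achieved_goal))
    return relabeled
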
+ +Typos: +extra space in Figure 1: ""known objects ."" +extra ""the"" in Section 5.1: ""indicates the the aggregated point cloud"" + +__References__ +[1] ten Pas, Andreas, et al. ""Grasp pose detection in point clouds."" The International Journal of Robotics Research 36.13-14 (2017): 1455-1473. +[2] Mousavian, Arsalan, Clemens Eppner, and Dieter Fox. ""6-dof graspnet: Variational grasp generation for object manipulation."" Proceedings of the IEEE International Conference on Computer Vision. 2019.",7,4.0,ICLR2021 +rhWkihm_zDE,4,yZBuYjD8Gd,yZBuYjD8Gd,"The exploration is useful, and would like to see a corresponding framework to be used in the community. ","This paper mainly studied how the negative samples can affect the model performance in supervised learning CIO works. Through the experiments, this work has a few interesting findings, including the majority of negative samples are not important for the model learning, only a small subset of hard samples determine the model importance. These hard examples are also closely related with positive samples (more semantically similar). We can see from experiments that it's very important to fairly treat negative samples in supervised learning tasks. However, there is no frameworks proposed to help improve the learning representation or speed up the training task.  In general, the readers are more interested in the solutions after realizing the importance of negative samples treatment during the experiments. It would be necessary to include the corresponding solutions by automatically setup these negatives samples in CID related task.",5,3.0,ICLR2021 +Bke1qCDAtB,2,B1xu6yStPH,B1xu6yStPH,Official Blind Review #3,"A Simple method to detect adversarial examples, but needs more work. + +#Summary: +The paper proposed a method that utilizes the model’s explainability to detect adversarial images whose explanations that are not consistent with the predicted class. The explainability is generated by SHAP, which uses Shapley values to identify relative contributions of each input to a class decision. It designs two detection methods: EXAID familiar, which is aimed to detect the known attacks and EXAID unknown, which is against unknown attacks. Both of the two methods are evaluated on perturbed test data which are generated by FGSM, PGD and CW attack with perturbations of different magnitudes. Qualitative results also show that the proposed method can effectively detect adversaries, especially when the perturbation is relatively small. + +#Strength +The method is easy to implement and using the idea of interpretation for detecting adversarial examples seems interesting. + +Good results are demonstrated compared with other comparators. + +#Weakness +The idea of this paper is based on the interpretation method of DNN. However, it has been shown that these interpretation methods are not reliable and easy to be manipulated [1][2]. Therefore, although the method is simple to design, it also brings other security concerns. +Unfortunately, the paper does not address these issues. In addition, the comparators listed in the experiments are not state-of-art or common baselines. It is either not clear why authors modified the existing method and develop their own “unsupervised” version. +In the experiments, many details are omitted. For example, how is the “noise level” defined? Are they based on L1, L2 or L-inf perturbation? For PGD attack, how many iterations does the generation run and what is the step size? 
How many effective adversarial examples are generated for training and testing? And all the experiments are conducted in a relatively small dataset, it is also suggested to do experiments on large datasets, e.g. Imagenet. +In the evaluation part, it looks strange to me why the EXAID familiar performs worse than EXAID unknown in evaluating FGSM attack on SVHN since the EXAID familiar is trained using FGSM attack. + +#Presentation +I think the authors used a wrong template to generate the article. The font looks strange and the headnote indicates it is prepared for ICLR2020. The paper contains many typos and even the title contains a misspelling. Poor coverage of citations. There are more works for detecting adversarial examples that are published, e.g. [3][4][5]. On the other hand, the paper does not have the literature review for work related to the model interpretation. + +Overall, I think the paper is not good enough for publication at ICLR. +[1] Dombrowski, Ann-Kathrin, et al. ""Explanations can be manipulated and geometry is to blame."" arXiv preprint arXiv:1906.07983 (2019). +[2] Ghorbani, Amirata, Abubakar Abid, and James Zou. ""Interpretation of neural networks is fragile."" Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019. +[3] Meng, Dongyu, and Hao Chen. ""Magnet: a two-pronged defense against adversarial examples."" In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 135-147. ACM, 2017. +[4] Liao, Fangzhou, Ming Liang, Yinpeng Dong, Tianyu Pang, Xiaolin Hu, and Jun Zhu. ""Defense against adversarial attacks using high-level representation guided denoiser."" In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1778-1787. 2018. +[5] Ma, Shiqing, Yingqi Liu, Guanhong Tao, Wen-Chuan Lee, and Xiangyu Zhang. ""NIC: Detecting Adversarial Samples with Neural Network Invariant Checking."" In NDSS. 2019. +",3,,ICLR2020 +SygLtq7ZpQ,2,rk4Wf30qKQ,rk4Wf30qKQ,Simple yet effective attacks to infer model architectures; more clarification would help,"This paper performs cache side-channel attacks to extract attributes of a victim model, and infer its architecture accordingly. In their threat model, the attacker could launch a co-located process on the same host machine, and use the same DL framework as the victim model. Their evaluation shows that: (1) their attacks can extract the model attributes pretty well, including the number of different types of layers; (2) using these attributes, they train a decision tree classifier among 13 CNN architectures, and show that they can achieve a nearly perfect classification accuracy. They also evaluate some defense strategies against their attacks. + +Model extraction attack under a black-box setting is an important topic, and I am convinced that their threat model is a good step towards real-world attacks. As for the novelty, although Yan et al. also evaluate cache side-channel attacks, that paper was released pretty shortly before ICLR deadline, thus I would consider this work as an independent contribution at its submission. + +I have several questions and comments about this paper: + +- One difference of the evaluation setup between this paper and Yan et al. is that in Yan et al., they are trying to infer more detailed hyper-parameters of the architecture (e.g., the number of neurons, the dimensions of each layer, the connections), but within a family of architectures (i.e., VGG or ResNet). 
On the other hand, in this paper, the authors extract higher-level attributes such as the number of different layers and activation functions, and predict the model family (from 5 options) or the concrete model architecture (from 13 options). While I think inferring the model family type is also an interesting problem, this setup is still a little contrived. Would the classifier predict the family of a model correctly if it is not included in the training set, say, could it predict ResNet32 as R (ResNet)? + +- In Table 3, it looks like the errors in the captured computation sequences show some patterns. Are these error types consistent across different runs? Could you provide some explanation of these errors? + +- In Table 5, my understanding is that we need to compare the avg errors to the numbers in Table 2. In this case, the errors seem to be even larger than the sum of the attribute values. Is this observation correct? If so, could you discuss what attributes are most wrongly captured, and show some examples? + +- It would be beneficial to provide a more detailed comparison between this work and Yan et al., e.g., whether the technique proposed in this work could be also extended to infer more fine-grained attributes of a model, and go beyond a classification among a pre-defined set of architectures. + +- The paper needs some editing to fix some typos. For example, in Table 5, the captions of Time (Baseline) and Time (+TinyNet) should be changed, and it looks confusing at the first glance. +",6,4.0,ICLR2019 +VFp_2BEPAt,1,mo3Uqtnvz_,mo3Uqtnvz_,Major concern is about the experimental results,"This paper proposes a one-shot based multi-scale features generation and utilization framework for object detection. This method searches the network stride for features generation and detection heads location for features utilization. + + Pros: +Stride and detection heads location are important factors, and automated learning of them is desirable. +This method improves the performance based on several baselines. +Searching the detection heads location is a new idea, to my knowledge. + +Cons: +However, there are some concerns about this paper. +1. This paper seems like an improved version of the SpineNet. Searching for strides instead of permutation and utilizing a one-shot method instead of reinforcement learning is not very novel. +2. If we treat this paper as the improved version of SpineNet, the improvement of the performance is strange. SpineNet improves the performance from 37 to 42.7 for ResNet50. Why jointly searching the backbone features generation and FPN features utilization only improves 1.2 mAP in this paper? +3. The comparison with other state-of-the-art methods is unfair. To my knowledge, the results for DetNas (42.0) and CR-NAS (40.2) are based on 1x schedule while this method is trained from scratch for 6x schedule. +4. It would be good to evaluate on another dataset. +",5,4.0,ICLR2021 +r1ET9Ncgf,3,Hk3ddfWRW,Hk3ddfWRW,review,"This paper focuses on imitation learning with intentions sampled +from a multi-modal distribution. The papers encode the mode as a hidden +variable in a stochastic neural network and suggest stepping around posterior +inference over this hidden variable (which is generally required to +do efficient maximum likelihood) with a biased importance +sampling estimator. Lastly, they incorporate attention for large visual inputs. + +The unimodal claim for distribution without randomness is weak. The distribution +could be replaced with a normalizing flow. 
The use of a latent variable +in this setting makes intuitive sense, but I don't think multimodality motivates it. + +Moreover, it really felt like the biased importance sampling approach should be +compared to a formal inference scheme. I can see how it adds value over sampling +from the prior, but it's unclear if it has value over a modern approximate inference +scheme like a black box variational inference algorithm or stochastic gradient MCMC. + +How important is using the pretrained weights from the deterministic RNN? + +Finally, I'd also be curious about how much added value you get from having +access to extra rollouts. +",6,4.0,ICLR2018 +ryg2HGdc27,3,BJ4BVhRcYX,BJ4BVhRcYX,"I think the method proposed in this paper might be reasonable. But I do not suggest acceptance, unless the author can improve the writing and include more experimental results.","In this paper, the authors propose a method for pruning the convolutional filters. This method first separates the filters into clusters based on similarities defined with both Activation Maximization (AM) and back-propagation gradients. Then pruning is conducted based on the clustering results, and the contribution index that is calculated based on backward-propagation gradients. The proposed method is compared with a baseline method in the experiments. + +I consider the proposed method as novel, since I do not know any filter pruning methods that adopt a similar strategy. Based on my understanding of the proposed method, it might be useful in convolutional filter pruning. + +It seems that ""interpretable"" might not be the most proper word to summarize the method. It looks like that the key concept of this paper, including smilarity defined in Equation (3), and the contribution index defined in Equation (7) are not directly relevant to interpretability. Therefore, I would consider change the title of the paper, for example, to ""Convolutional Filter Pruning Based on Functionality "". + +In terms of writing, I have difficulty understanding some details about the method. + +In filter clustering, how can one run k-means based on pair-wise similarity matrix $S_D$? Do you run kernel k-means, or you apply PCA to $S_D$ before k-means? What is the criterion of choosing the number of clusters in the process of grid search? + +Are filter level pruning, are cluster level pruning and layer level pruning three pruning strategies in the algorithm? It seems to me that you just apply one pruning strategy based on the clusters and contribution index, as shown in Figure 3. + +In the subsubsection ""Cluster Level Pruning"", by ""cluster volume size"", denoted with$length(C^l_c)$, do you mean the size of cluster, i.e., the number of elements in each cluster? This is the first time I see the term ""volume size"". I assume the adaptive pruning rate, denoted by $R_{clt}^{(c,l)}$, is a fraction. But it looks to me that $length(C^l_c)$ is an integer. So how can it be true that $R_{clt}^{(c,l)} = length(C^l_c)$? + +In the subsubsection ""Layer Level Pruning"", how is the value of $r$ determined? + +The authors have conducted several experiments. These experiments help me understand the advantages of the proposed method. However, in the experiments, the proposed method is compared to only one baseline method. In recent years, a large number of convolutional filter pruning methods have been proposed, as mentioned in the related work section. I am not convinced that the proposed method is one of the best methods among all these existing methods. 
I would suggest the authors provide more experimental comparison, or explain why comparing with these existing methods is irrelevant. + +Since the proposed method is heuristic, I would also like the authors to illustrate that each component of the method is important, via experiment. How would the performance of the proposed method be affected, if we define the similarity $S_D$ in Equation (3) using only $V$ or $\gamma$, rather than both $V$ and $\gamma$? How would the performance of the proposed method be affected, if we prune randomly, rather than prune based on the contribution index? + +In summary, I think the method proposed in this paper might be reasonable. But I do not suggest acceptance, unless the author can improve the writing and include more experimental results. + +",4,4.0,ICLR2019 +jnnQE1UDxes,4,MDsQkFP1Aw,MDsQkFP1Aw,Some thoughts,"**Pros** + +Audio-visual sound source separation is an impotant task. The paper pushes the boundary from specific domains (e.g. speakers, musics, etc) to generalized open-domain, which is crucial and far from trivial. + +The authors introduced a new, large-scale, open-domain dataset for on-screen audio-visual separation. The videos span 2500 hours, 55 of which are verified by human labelers. The dataset will definitely be very useful for the community as it is way more diverse than before. + +--- +**Cons** + +*Related work* + +To my understanding, Owens and Efros (2018) did not assume fixed number of speakers. While they validate their method under such setting, there is actually no limitation in their model that prevents them to have multiple on-screen sources. Therefore, I'm not sure about the first contribution, except the ""open-domain"" part. + +*Model* + +In terms model architecture, (maybe I have missed something but) I didn't see much novelty in the current state. To me, the proposed model is simply a composition of multiple exisiting modules from previous work. Please note that I'm not saying building upon the sucess of previous efforts are wrong by any means. I just had the feeling that the authors are piecing various building blocks together w/o providing much intuition. Maybe there is some novelty lying within the current design. For instance, the authors may have developed a novel routing/module drawing inspiration from a certain observation; the combination of xxx and yyy is based on deliberate choice. It is, however, not clear to me at this point, at least the writing does not reflect it. + +Furthermore, if the network is the key player in this paper, the authors shall provide more evidence. While the authors do show conduct some ablation studies on the losses and data, there aren't any analysis regarding the importance of each module (e.g. how critical is the attention design?). It is thus difficult for readers to understand which part of the network is crucial for the success, and what is the major novelty within the architecture. The current form provides very litte intuition and take away. + +*Objectives* + +Eq. 4 and Eq. 6 looks very similar to me. Aren't they equivalent if the **A** in Eq. 6 is the same as **A** that minimizes Eq. 2, since the assignment in Eq. 4 comes from Eq. 2? On the other side, if the two **A** are different, what's the intuition of exploiting different A for different loss? + + +*Writing/Presentation* + +The flow of the model architecture section can be improved. The authors did not provide any high-level context. Instead, they simply dig into the "" details"" of each module. 
I'm not aware of the connections among while reading the text. Instead, I need to constantly check the figure and infer these. + +I also don't know what is the input/output of the model and what representations they are using. Shoudn't these be explained at the very beginning? These are not explicitly defined and I need to infer them myself. For instance, is the output of masking network a M x T soft mask with values lying within [0, 1]? do they exploit waveform (fig. 2) or spectrogram (fig. 1) for audio? I figured/inferred a lot these out after I moved to the experiment section. But from two cents, these are related to the model. + +*Experiments* + +Currently the authors only evaluate the model on their own dataset. How does the model work on existing datasets? For instance, AudioSet, MUSIC, FAIR-Play, etc. It seems that there is nothing preventing them from applying their model to those datasets. Without these results, it would be hard to justify if the proposed ""open-domain model"" can generalize to ""a specific domain."" I think at least direct inference (generalization) or train from scratch is required. + +Furthremore, the authors did not compare with any baseline. It seems to me that quite a few prior art [Owens and Efros (2018), Hang et al (2018), etc] can serve as baselines with minor modification. Take Owens and Efros (2018) for example. While they may not be able to decompose each sound source within the on-screen mixture, one can still leverage it to evaluate the on/off-screen separation. The authors thus shall be able to report SI-SNR too. Otherwise its very difficult for people to do an apple-to-apple comparison of this work and prior work. + +The authors should report more performance at more percentiles. The most illustrative way is to show the cumulative plot - how many % of data have error less than x. + +Is there an intuition of why only pre-training part of the model? Why not pre-train the separation network too? + +--- +**Minor comments** + +How do the authors define the diversity (Sec. 5.1) of the videos? Do they make use of the tags provided by YFCC100M? Also, whats the statistics of those filtered data? Would be great to provide more details so that we know its indeed covering a wide range of semantic categories. + +Some relevant literatures are missing. For instance, [1] also associates the visual information with speeching signal. The subjectives (eg person, dog, birds) in the paper can be viewed as on-screen audio, while prepositions can be seen as off-screen voice. There are definitely a lot more on this direction, but this paper pop out my head right away. + +[1] Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input. ECCV 2018. + + +",6,4.0,ICLR2021 +SJ5ROhbNx,3,r1te3Fqel,r1te3Fqel,Review,"SYNOPSIS: The paper proposes a new neural network-based model for reading comprehension (reading a passage of text and answering questions based on the passage). It is similar in spirit to several other recent models, with the main exception that it is able to predict answers of different lengths, as opposed to single words/tokens/entities. The authors compare their model on the Stanford Question Answering Dataset (SQuAD), and show improvements over the baselines, while apparently lagging quite far behind the current state of the art reported on the SQuAD leaderboard. + +THOUGHTS: The main novelty of the method is to be able to identify phrases of different lengths as possible answers to the question. 
However, both approaches considered -- using a POS pattern trie tree to filter out word sequences with POS tags matching those of answers in the training set, and brute-force enumeration of all phrases up to length N -- seem somewhat orthogonal to the idea of ""learning end-to-end "" an answer chunk extraction model. Furthermore, as other reviews have pointed out, it seems that the linguistic features actually contribute a lot to the final accuracy (Table 3). One could argue that these are easy to obtain using standard taggers, but it takes away even more from the idea of an ""end-to-end trained"" system. + +The paper is generally well written, but there are several crucial sections in parts describing the model where it was really hard for me to follow the descriptions. In particular, the attention mechanism seems fairly standard to me in a seq2seq sense (i.e. there is nothing architecturally novel about it, as is for instance the case with the Gated Attentive Reader). I may be missing something, but even after the clarification round I still don't understand how it is novel compared to standard attention used in for instance seq2seq models. + +Finally, although the method is shown to outperform the baseline method reported in the original paper introducing the SQuAD dataset, it currently seems to be 12th (out of 15 systems) on the leaderboard (https://rajpurkar.github.io/SQuAD-explorer/). Of course, it may be that further training and hyperparameter optimizations may improve these results. + +Therefore, given the lack of model novelty (based on my understanding), and the lack of strong results (based on the leaderboard), I don't feel the paper is ready in its current form to be accepted to the conference. + +Note: The GRU citation should be (Cho et al., 2014), not (Bengio et al., 2015).",4,3.0,ICLR2017 +B1xkXk7B5H,3,SygWvAVFPr,SygWvAVFPr,Official Blind Review #2,"This works applies neural module network to reading comprehension that requires symbolic reasoning. There are two main contributions: (1) the authors designed a set of differentiable neural modules for different operations (for example, arithmetics, sorting, and counting) that is required to perform reasoning over a paragraph of text. These modules can be compositionally combined to perform complex reasoning. And the parameters of each module (which can be viewed as executor of each operation) are learned jointly with the parser that generates thee program composed of those modules. (2) To overcome the challenge of weak supervision, the authors proposed to use auxiliary loss (information extraction loss, parser supervision, intermediate output supervision). The model is evaluated on a subset of DROP, and outperforms the state-of-the-art models. Ablation studies supported the importance of the auxiliary losses. + +Strength: + +(1) The problem of applying symbolic reasoning over text is important and very challenging. This work has explored a promising direction that applies NMN, which achieved good results in VQA, to QA tasks that requires reasoning, specifically, a subset of the DROP dataset. + +(2) The result, although preliminary, seems promising. The design of the modules seems intuitive and the introduction of auxiliary tasks to alleviate the problem of weak supervision is well motivated and works reasonably well. + +I am leaning towards rejection because: + +(1) The main concern is that the paper, in its current form, seems incomplete. 
It is understandable that the type of datasets that requires reasoning is not very common nowadays, so only DROP is used for evaluation. However, the current evaluation is only on a subset of DROP, which seems unsatisfying. + +The paper argues that ""Our model possesses diverse but limited reasoning capability; hence, we try to automatically extract questions in the scope of our model based on their first n-gram"". However, results on the full dataset seems necessary for evaluating the potential of NMN approach over text. Even if the result is negative, it is still good to know the cause of the failure. For example, does the difficulty come from unstable training or does it come from insufficient coverage of the modules. + +(2) There are several modules introduced in the paper, but there isn't much analysis of them during the experiments. For example, what are some good and bad samples that uses each type of operations. + +(3) Since the modules are learned jointly with the parser, it is good to check whether the learned modules are indeed performing the intended operation instead of just adding more capacity to the model. For example, it might help to show a few examples that demonstrates the ""compare-num-lt"" is actually performing the comparisons. This can support the interpretability claim of the proposed model. + + +Minor issues: + +The complexities of some modules seem large. For example, ""compare-num-lt"" needs to enumerate all the pairs of numbers, which is quadratic. And the complexity of ""find-max-num"" depends on the choice of n, which could be large (although it is chosen to be 3 in this work). + +It is stated that ""Our model performs significantly better than the baseline with less training data, showing the efficacy of explicitly modeling compositionality."" However, the comparison with MTMSN using less training data seems a bit unfair since the proposed model is given more supervision (question parse supervision and intermediate module output supervision). Maybe a better argument is that by explicitly modeling compositionality, it is easier to add such extra supervisions than black box models like MTMSN. + +For ""count"", why is the attention scaled using values [1, 2, 5, 10] first? + +In summary, I do like the main idea and the paper has merits, but it requires more evaluation and analysis to be accepted. I am willing to increase my score if more contents are added and I look forward to seeing it in a more complete form. + +=================================== + +Update after author response: + +Thanks for the clarification and adding the content, I have updated my score accordingly. However, I still believe the impact of this paper will be much larger if the evaluation can be more complete, e.g., evaluating over the full DROP dataset or even include some other datasets. In the current form, it looks borderline. + +Selecting a subset (~22.7% of DROP dataset) based on the design of the proposed model (""heuristically chosen based on their first n-gram such that they are covered by our designed modules""), and compare to other models, which can actually handle a broader set of questions, only on the selected subset seems incomplete and raises concerns about how generally applicable the proposed model is. For example, since the proposed model is handling some types of questions better, it would be good to show that it can be combined with other models to get a better overall result. 
",6,,ICLR2020 +eOZZwHJVEPX,4,Oe2XI-Aft-k,Oe2XI-Aft-k,The good results may be just brought by inadequate attack evaluation.,"The paper proposes a two-stage defense method to improve the adversarial robustness over different perturbation types. Specifically, it first builds a hierarchical binary classifier to differentiable the perturbation types and then uses the result to guide to its corresponding defense models. It first proves the different types of perturbations could be separable and the adversary could be weakened to fool the binary classifier. It shows their methods achieve a clear improvement in the experiments. + +Pros: +1. The paper is good-written and easy to follow. +2. The proposed idea is interesting. +3. The experiment is detailed and comprehensive. + +Cons: +1. There is a major problem in their method. In the last sentence of section 5.2, it says uses the soft relaxation only in generating the adversarial example, but not for inference. It clearly caused the gradient making problem in the adversarial attack later on to test the robustness. The gradient is blocked before reaching into the binary classifier so that the adversarial attack fails, which I think it is not truly improving the model's robustness. +2. In my opinion, this method is just a dynamic voting based model ensemble. Just take the binary classifier as a voting procedure. Therefore, the traditional adversarial attacks won't work in general. I would suggest using the soft relaxation in the inference as well for the adversarial attack. +3. Also, the assumption that different norm adversarial examples could be clearly separable might be wrong. You could find the adversarial examples that satisfy both l1, l2 and l_inf constraint by just choosing the \epsilon for every norm differently. +",4,4.0,ICLR2021 +9evPd1ch91R,1,PdauS7wZBfC,PdauS7wZBfC,The paper is not clear enough,"#### Summary of the paper +In their paper, the authors demonstrate that Predictive Coding (PC) is a local approximation of back-propagation and could then be interpreted with Hebbian learning rule (a neuro-plausible learning rule). This result has been first demonstrated by [1] with MLP network (on the MNIST dataset) and the presented paper extend this finding to CNNs (on CIFAR10, CIFAR100 and SVHN), RNN and LSTM. + +#### Pros +* The authors provided experimental evidences on a wide variety of networks' type and databases. +* The link between neuro-plausible learning rule and back propagation is interesting. +* The paper is well situated in the literature. +* The authors are providing the code for clean reproducibility + +#### Cons +* The mathematical definition and notation of the paper are not rigorous enough. It makes the paper unclear and hard to follow. +* Some crucial points would have deserved in-depth discussion and are just ignored (see below) +* The paper is not well enough motivated: what’s the point of such a local approximation beside the neuro-plausibility (faster ? Consume less resources ? …) +* The demonstration seems to include only one kind of loss function, which does not match the claim of the paper + +#### Recommendation +Given the limited impact and the lack of clarity of the paper, I would tend to reject the article. + +#### Detailed comments: +* The gaussian parametrization used by the authors constrains the comparison between PC and backprop to networks with L2 loss function. 
One cannot claim to approximate arbitrary computational graph if one demonstrates the approximation on a specific loss function (which is known to poorly perform on classification problem). So could your framework be generalized to more effective loss function like cross entropy ? If yes, what would be the underlying probabilistic hypothesis ? This should be included in the paper, as it will strongly strengthen your claim. + +* The PC framework proposed by the authors propose a solution to the ’non locality’ of the back propagation (to be a bio-plausible mechanism). However the authors also raised the weight transport problem. On my understanding the proposed framework is still suffering from weight transport as the backward connection weights are the transpose of the feedforward one (due to the derivation of the forward operator). The paper would deserve an in-depth discussion concerning this point. + +* What is the computational advantage of local approximations ? Is it saving computational resources (computational time, memory…) ? A comparison of the algorithmic complexity between PC and back-propagation would be valuable to support your claim. In the discussion, the authors mention that their framework, being substantially more expensive than back-prop network, could be deeply parallelized across layer. The authors should provide experimental or theoretical evidences that such parallelization is enough to mitigate the higher number of inference steps (i.e. 100-200) needed by their PC framework. + +* The concept of ‘fixed-prediction assumption’ introduced by authors in the paper considers that each (v_i) are fixed to their feedforward value. Then what is the point of the Eq. 2, as you already know the value of the activity vector? On my mind this is here a crucial point, as this is dealing with the core principle of PC : an inner-loop (i.e. the expectation step) that find the most likely v_i, and an outer loop (i.e. the maximization step) that update the parameters. I have the intuition that this problem arises because the authors are tackling a discriminative problem (i.e. finding a mapping between inputs and labels) using a generative model (PC is a generative generative model as described by [2, 3]). Can you please clarify this point ? + +* What is the testing procedure of your network ? Is it a simple feedforward pass (which I suspect) or is it an Expectation-Maximization scheme. Your algorithm 1 shows only the training procedure (as you need the label information to perform the computation). If this is a simple feedforward pass, what would be the advantages of the inference process (more robustness ? Better prediction ?) + +## Typos and suggestions to improve the paper : +* The authors state (3rd paragraph page 3) that they are considering a generative model. If it is the case, the formula p(v_i) = product(p(v_i | parent(v_i)) is inaccurate as the authors forgot the prior p(v_N). +* Eq 1 : The derivation between the first and the second line of Eq.1 has to be demonstrated or referenced (at least in annex)… +* The authors consider the posterior is a marginal probability (see eq1, and subsequent paragraph). In general the posterior is a conditional probability (this specific point makes your equation 1 hard to grasp because readers are not making the link with the classical ELBO, i.e. the negative free energy). In general, the probabilistic notations are not rigorous enough, and it makes the rest of the mathematical derivation complicated to follow. 
+* The authors should reorganize the Annex to make sure it follows the reading order (Appendix D is cited first) +* Caption of Figure 1 : backwawrds —> backwards +* Page 3, 2 lines below eq 1 : as as —> as +* Page 4, 4 lines below eq 3 : forwards —> forward +* Figure 3, which is on my mind the most import one, is shown but not cited in the core text. + +[1] Whittington, J. C., & Bogacz, R. (2017). An approximation of the error backpropagation algorithm in a predictive coding network with local hebbian synaptic plasticity. Neural computation, 29(5), 1229-1262. + +[2] Rao, Rajesh PN, and Dana H. Ballard. ""Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects."" Nature neuroscience 2.1 (1999): 79-87. + +[3]Friston, Karl. ""A theory of cortical responses."" Philosophical transactions of the Royal Society B: Biological sciences360.1456 (2005): 815-836.",4,4.0,ICLR2021 +Bkg4y5FCYS,2,ByeGzlrKwH,ByeGzlrKwH,Official Blind Review #1,"The paper presents novel theoretical results on generalization bounds via compression. Similar ideas in the last few years appeared, but only bounds on a compressed network were obtained. In contrast, the current submission gives a bound on the original (uncompressed) network in terms of the complexity of the compressed network class. + +Overall, the paper seems to be well-written. I appreciate that the outlines of the proofs are included in the main text, which helps the reader follow the ideas. The result is novel and quite interesting. The new bounds seem to be still quite far from giving tight generalization theory, but I believe the paper provides some nice theoretical results for other researchers to improve upon. I think the paper could be improved immensely by some empirical analysis of the rank of compressed standard vision networks and rank of activation covariance matrices. There are also some citation issues (see detailed comments below). + +Citation issues: +In the introduction, paragraph 2, the authors cite Neyshabur et al. 2019 for the observation that networks generalize well despite being overparameterized. It seems like an odd choice. Why is Barttlet’s ‘99 paper [“Size of the weights…”] not cited? Or at least Neyshabur et al. 2015? +Then the authors mention that classical learning theory cannot explain the phenomena mentioned above, and classical theory “.. suggests ” that overparameterized models cause overfitting…”. The authors need to be more precise and add citations (I am assuming that the authors are talking about VC bounds for worst-case ERM generalization). +In the third paragraph, where the authors talk about norm-based bounds being lose, it seems that Nagarajan and Kolter 2019 should be cited (not only at the end), as well as Dzigaite and Roy 2017 (they look into the looseness of path-norm and margin-based bounds). + +Could the authors comment more on how the bound in Theorem 2 is superior to VC dimension bound and whether conditions under which the bound is tight are realistic for standard compressed vision networks. Having weight matrices to be close to rank 1 seems unrealistic.I would like to see some sort of empirical evidence if the authors believe that this is the case. And for larger ranks, the bound seems to be close to VC bound. + +In general, I found the notation a bit hard to follow and had to constantly be looking through the paper to find the definitions of various quantities. 
Having three different r’s, multiple mu’s with dots, bars, stars, etc., was definitely confusing and required extra attention to detail. + +Other minor comments: + +In section 2, marginal distributions over x and y are introduced. Are those used in the main text? +Is that a definition of \mu with the dot on top in assumption 5, or is this mu with the dot defined earlier? Using notation := would make it clearer whether the quantity is being defined. +In Section 3, “The main difference from the…” paragraph, there is \Psi(\dot r) used. Where is that defined?",8,,ICLR2020 +BklpyIz0YB,3,Skgb5h4KPH,Skgb5h4KPH,Official Blind Review #2,"This paper mainly focuses on experimental results on real data to verify the so-called Frequency Principle: DNNs often fit target functions from low to high frequencies during the training process. Some theoretical analyses are also provided to backup the empirical observation. + +This paper is very well-written. The methods are explained very clearly, and the logic is easy to follow. However I think this paper also has some weak points, as listed below: + +(1) The frequency principle lacks a rigorous definition. In Section 2, the authors provide very inspiring explanations and examples, however no rigorous definitions are given. Is there a way to directly quantitatively define the response frequency? + +(2) In Section 3.1, it is not explained why the frequencies are calculated based on the samples. Probably I have missed something, but based on the description, can’t the frequencies be directly calculated on a grid along the 1-dimensional subspace defined by the mean and first principle component of the data? In other words, even after training the predictor function with real data samples for several steps, the frequency of a predictor function should still be its own property and should be independent of the distribution of data inputs. In fact, are the vectors $n^{-1/2} (cos( 2\pi k x_{p_1,1}), cos( 2\pi k x_{p_1,2}), \ldots, cos( 2\pi k x_{p_1,n}) )$ for different $k$ even orthonormal vectors? I think unless $x_{p_1,i}$ follows certain specific distributions, these vectors are not even close to orthonormal. Therefore using them to calculate the frequencies is very weird. + +(3) In fact, are Sections 3,4 and 6 studying the same kind of frequency? It is not very clear due to the vague definitions. + +Because of these concerns, I think this paper is on the borderline. For now I tend to recommend a weak reject. +",3,,ICLR2020 +m4IbVV1-2jV,2,F8whUO8HNbP,F8whUO8HNbP,"Official Reivew of ""Contrastive Syn-to-Real Generalization""","Training on synthetic data and generalizing to real test data is an important task that can be particularly beneficial for label or data-scarce scenarios. The paper aims to achieve the best zero-shot generalization on the unseen target domain real images without having access to them during synthetic training. The experiments thoughtfully demonstrate both classification and segmentation, as well as ablation studies. Visualizations and interpretations are presented in addition. The visual interpretation of feature diversity in Sec. 2 is another plus. Overall this paper is good both conceptually and experimentally. It is also well written in general. + +My main concern is that the current baseline comparisons in experiment are not fully consistent nor satisfactory. As observed from Table 5, by absolute numbers (not relative margin), the proposed method is roughly aligned with ASG and IBN-Net, but lags much behind (Yue et al., 2019). 
That was partially explained by the authors, by saying that Yue et al. (2019) required ImageNet images during synthetic training and also implicitly leveraged ImageNet labels as auxiliary domains. However, if looking more closely, even the baseline ResNet-50 mIoU sees a big gap between Yue et al. (2019), and the proposed method as well as the two others. It is unclear and unconvincing to me why the same baseline can perform so differently among those methods, and I think this might potentially undermine the experiment reliability/reproducibility and deserves more clarification from the authors. + +One more nitpick is that this paper shows no figure in experiments. Given there is some extra space, the authors may want to visualize some classification and segmentation results, displaying both success and failure cases. + +Typos: + +a novel framework that leverage -> should be “leverages” + +the diversity of learned feature embedding play -> should be “plays” +",6,4.0,ICLR2021 +HydYVVQVx,1,rJxDkvqee,rJxDkvqee,well-done domain adaptation,"this proposes a multi-view learning approach for learning representations for acoustic sequences. they investigate the use of bidirectional LSTM with contrastive losses. experiments show improvement over the previous work. + +although I have no expertise in speech processing, I am in favor of accepting this paper because of following contributions: +- investigating the use of fairly known architecture on a new domain. +- providing novel objectives specific to the domain +- setting up new benchmarks designed for evaluating multi-view models + +I hope authors open-source their implementation so that people can replicate results, compare their work, and improve on this work.",6,3.0,ICLR2017 +QY9bYuNLk8O,3,NX1He-aFO_F,NX1He-aFO_F,Interesting approach but not sufficient empirical and theoretical evidence to confirm the effectiveness of the approach,"The paper explores an alternative loss function for fitting critic in Reinforcement Learning. Instead of using the standard mean squared loss between critic predictions and value estimates, the authors propose to use a loss function that also incorporates a variance term. The authors dub the approach AVEC. The authors combine their approach with popular RL algorithms such as SAC and PPO and evaluated on the standard benchmarks for continuous control. + +Although the paper demonstrates interesting empirical results, I think that the current experimental evaluation has a number of flaws that prevent me from recommending this paper for acceptance. The paper provides basic motivation but it is lacking thorough theoretical investigation of the phenomena. Also the proposed loss is biased in the stochastic mini batch optimization due to the expectation under the squared term that is not addressed in the paper either. Finally, I have major concerns regarding the experimental evaluation. The set of OpenAI mujoco tasks is different from commonly used tasks in literature. In particular, Hopper and Walker2d, which are used in the vast majority of the literature, are ignored in table 1 and figure 2. This fact raises major concerns regarding generality of the approach. + +In conclusion, the paper presents interesting results on some tasks for continuous control. However, the paper requires more thorough experimental evaluation to confirm the statements. Also a deeper theoretical analysis will greatly benefit this work. 
I strongly encourage the authors to continuous working this approach and revise the paper to improve the theoretical and empirical analysis. This paper presents a very interesting idea but in the current form it is not ready for acceptance.",5,3.0,ICLR2021 +SJ45Qm8Zz,1,SywXXwJAb,SywXXwJAb,Interesting theoretical connections,"The paper makes a striking connection between two apparently unrelated problems: the problem of designing neural networks to handle a certain type of correlation and the problem of designing a structure to represent wave-function with quantum entanglement. In the wave-function context, the Schmidt decomposition of the wave function is an inner product of tensors. Thus, the mathematical glue connecting the neural networks and quantum entanglement is shown to be tensor networks, which can represent higher order tensors through inner product of lower-order tensors. + +The main technical contribution in the paper is to map convolutional networks with product pooling function (called ConvACs) to a tensor network. Given this mapping, the authors exploit results in tensor networks (in particular the quantum max-flow min-cut theorem) to calculate the rank of the matricized tensor between a pair of vertex sets using the (appropriately defined) min-cut. + +The connection has potential to yield fruitful new results, however, the potential is not manifested (yet) in the paper. The main application in deep convolutional networks proposed by the paper is to model how much correlation between certain partition of input variables can be captured by a given convolutional network design. However, it is unclear how to use Theorem 1 to design neural networks that capture a certain correlation. + +A simple example is given in the experiment where the wider layers can be either early in the the neural network or at the later stages; demonstrating that one does better than the other in a certain regime. It seems that there is an obvious intuition that explains this phenomenon: wider base networks with large filters are better suited to the global task and narrow base networks that have more parameters later down have more local early filters suited to the local task. The experiments do not quite reveal the power of the proposed approach, and it is unclear how, if at all, the proposed approach can be applied to more complicated networks. + +In summary, this paper is of high theoretical interest and has potential for future applications.",7,3.0,ICLR2018 +lNnmoxxb_cF,1,gIHd-5X324,gIHd-5X324,Review,"Summary: + +This paper analyzes the distillation from the bias-variance perspective. Beyond this, the regularization samples affect the performance. Based on the observation, a novel weighted mechanism is proposed to distill knowledge from teacher networks. + +Strengths: + ++) This paper is clear and easy to follow, the organization is good. The bias-variance analysis, the regularization samples, the weighted soft labels all make sense. I felt comfortable when I was reading this paper. + ++) The analysis is clear and reasonable. The deduction seems correct. The figures are clear. + ++) The experiments are enough to examine the effectiveness of the proposed weighted distillation (see below). + ++) The code is submitted to contribute to the community. I appreciate the submission. + + +Weaknesses & Concerns: + +-) Sec. 3, first paragraph, $T(x, \tau)$ -> $\hat{y}^t = T(x, \tau)$, $S(x, \tau)$ -> $\hat{y}^s = S(x, \tau)$ to make it more clear. 
+ +-) ' For loss function, we set α = 2.25 for distillation on CIFAR and α = 2.5 for ImageNet via grid search.' How many $\alpha$s have been tested? What are the results? The main concern is that the grid search is costly in practice. Therefore, I appreciate the analysis in this paper that helps us understanding KD better. However, the grid searched hyper-parameters makes Sec. 4 costly in practice. + +Based on the quality of the paper, I select 6 as the initial score.",6,4.0,ICLR2021 +rJlYWZMYhm,2,Bkg6RiCqY7,Bkg6RiCqY7,"I find the justification for decoupled weight decay a little unconvincing, but the empirical results are solid","This review has been somewhat challenging to complete. As the authors write, this work has already been impactful and motivated a great deal of further research. The empirical evaluation is convincing and the results have been reproduced and further studied by others. A moderate amount of space in the paper (Section 3, Section 4.5) is used to refer to work motivated by the paper itself. While I do not take issue with this I believe it should be considered for the final decision (in the sense that disentangling the contributions of the authors and related work becomes tricky). With this said, I continue with my review. + +Paper summary: The authors observe that L2 regularization is not effective when using the Adam optimizer. By replacing L2 regularization with decoupled weight decay the authors are able to close the generalization gap between SGD and Adam and make Adam more robust to hyperparameter settings. The empirical evaluation is comprehensive and convincing. + +Detailed comments: + +1) The authors emphasize the fact that L2 regularization and weight decay are not the same for different optimizers and claim that this goes against the belief of some practitioners. In my experience, most practitioners would not be surprised by this observation itself. The second observation made by the authors, that L2 regularization is not effective in Adam, is the more interesting (and perhaps surprising) observation. + +2) I am not convinced of the importance of Proposition 3. In practice, adaptive methods will have a preconditioner which depends locally on the parameters. I understand the motivation from the previous paragraph but felt that the formal result added little. + +3) Section 3 introduced the Bayesian filtering perspective of stochastic optimization. The authors share the observation of Aitchison, 2018 that decoupled weight decay can be recovered in this framework. My interpretation is that this observation is important _because_ of the empirical observations in this paper and does not necessarily provide theoretical support for the approach. However, the last paragraph of Section 3 seems to utilize the Bernstein-von Mises theorem to promote the idea that with large datasets the prior distribution is unimportant (and is ignored). I am not sure that I follow this argument. For example, this claim seems to be completely independent of the optimization algorithm used and moreover Propositions 1,2, and 3 are independent of the data distribution. I suspect that this confusion is due to a misunderstanding on my part and would appreciate clarification from the authors. + +4) The empirical evaluation in this paper is very strong and these practical techniques have already been adopted by the community in addition to spurring novel research. 
The empirical observation broadly explores two directions: decoupled weight decay leads to separable hyperparameter search spaces (meaning optimization is less sensitive to hyperparameters), and decoupled weight decay gives improved generalization (and training performance). Both claims are explored throughly with strong evidence given for the improvement due to AdamW. + +Overall, I find this paper to be presented well and with convincing empirical results. I feel that the theoretical justification for decoupling weight decay are a little weak, and believe that other work is moving towards better explanations then the ones presented in this paper [1,2,3]. Despite this, I believe that this paper should be accepted. + + +Minor comments: + +- I find the notation in the paper confusing in general. x is used to denote weights, and w to denote hyperparameters (e.g. w' for L2 regularization scale and w for weight decay scale). I don't see why it wouldn't be preferable to use the more standard W for weights, x for inputs, and lambda for hparams. +- Figure 4: it is difficult to distinguish between Adam and SGDWR (especially left). + + + +Clarity: The paper is well written and clear. I find the notation confusing in places, but is consistent throughout. + +Originality: This paper presents original findings but occasionally relies on work motivated by itself to convince the reader of its importance. I do not think that this subtracts from the value of the work. + +Significance: The work is clearly significant. Even without knowing that practitioners have adopted the techniques presented in this work, the paper clearly distinguishes itself with strong empirical results.",7,4.0,ICLR2019 +mHBfhuse353,1,MBdafA3G9k,MBdafA3G9k,"Review of ""Visual Imitation with Reinforcement Learning using Recurrent Siamese Networks""","**Summary**: This paper studies the problem of visual imitation learning: given a video of an expert demonstration, take actions to reproduce that same behavior. The proposed method learns a distance metric on videos and uses that distance metric as a reward function for RL. Experiments show that this method does recover reasonable behaviors across a range of simulated robotic tasks. Compared with prior methods, the main contribution of this work is that the distance metric is parametrized and trained as a siamese network. + +**Novelty and Significance**: While the exact method seems novel, it is very similar to a number of prior methods, most notably GAIL. At a high-level the main difference is that this paper uses a siamese network to parametrize the discriminator, and employs a few types of data augmentation. If the experiments had shown that this architectural choice made a significant improvement in performance, and was useful inside a range of imitation learning frameworks (e.g., GAIL, AIRL, and Value Duce), then I think it'd represent a significant contribution. As is, the paper has not convincingly shown that this architectural choice is critical for significantly improved performance. + + +**Experiments:** +* I found it challenging to assess the experimental results without any quantitative comparisons with baselines (besides TCN). I'd highly recommend comparing against recent imitation learning methods. For example, some reasonable baselines would be + * GAIL or InfoGAIL + * Zero-shot visual imitation + * Imitating latent policies from observation + * Generative adversarial imitation from observation (Though this method is discussed, it's not compared against in the figures.) 
+ * Learning robust rewards with adversarial inverse reinforcement learning + * Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning + * Imitation learning via off-policy distribution matching +* The videos of the humanoid imitation behavior on the accompanying website actually look quite poor. The videos for other domains seemed to have been rendered incorrectly. +* I found it somewhat irksome that the method is entitled ""visual imitation"" but the learned policy doesn't take visual inputs (""Note that we experimented with using visual features as the state input for the policy as well; however, this resulted in poor policy quality."") + + + +**Clarity:** +* The introduction makes imitation learning sound like a new problem. I'd recommend clarifying the relationship with prior work earlier in the introduction. +* I found Fig 1 hard to follow, largely because it's unclear what the method is supposed to produce. One idea is to explicitly say something like: ""We aim to learn a distance function (left) and then use that distance function as a reward function for RL (right). The distance function is learned by ..."" +* I found Eq 4 hard to parse. It might be helpful to label each term (using \underbrace{}) with its semantic meaning. +* Reward Calculation: One thing that's unclear here is which trajectories are used for computing the reward function. I'm guessing that the agent's trajectory is compared to an expert trajectory, but I don't think this is ever stated explicitly. If so, it's unclear how to compute the reward function when multiple expert trajectories are given. + + +Overall, I give this paper a score of 4/10, primarily because of the lack of discussion and comparison with prior work. I would consider increasing my score if the paper compared against recent visual imitation learning methods (see above). The proposed method also seems quite similar to GAIL; if possible, it'd be great to formally explore the connections between these two methods. Finally, I'd recommend focusing the related work section to only discuss the most related works (see list above), but to discuss the exact differences between these prior methods and the proposed method in more depth. + +**Questions for discussion**: +* Precisely, what are the differences between this method and GAIL? (I want to make sure I didn't miss an important difference) +* For the experiments in ""3D Robot Video Imitation"", is the expert demonstration provided as an RGB video, a mocap trace, or something else? +* Where is the comparison with GAIfO shown? (""We have compared our method to GAIfO"") +* ""that takes into account demonstration ordering"" -- In a Markovian environment with a Markovian expert, shouldn't comparisons based on state-action pairs be sufficient? Is the assumption that the expert's motion is not Markovian? +* ""Mutual information loss (”Siamese Network triplet loss”)"" -- Eq 3 doesn't look like a mutual information objective. Can you explain precisely how maximizing Eq 3 leads to maximizing mutual information? + + +**Minor comments:** +* Abstract: No need to define ""GAIfO"" as an acronym in the abstract. Instead, just use ""GAIL without access to actions."" +* ""In nature ... their movements"" -- Please add citations. +* ""formulating ... is challenging"" -- How is this different from the large body of prior work on imitation/apprenticeship learning? 
+* ""The fundamental problem ..."" -- This is only true for *trajectory*-based imitation learning, not *transition*-based imitation learning (e.g., BC, GAIL, AIRL). +* ""re-use"" -> ""reuse"" +* ""Additionally, While"" -> ""Additionally, while"" +* ""Including, ..."" This sentence is missing a subject. +* ""Markov Decision Processes"" -> ""Markov decision processes"" +* When defining a trajectory, use \langle and \rangle. +* ""it often suffers from distribution mismatch"" -- Cite DAGGER, or something similar. +* ""behaviour Ng"" -- Use \citep{} for references that are not used as nouns. +* Eq 2: Can you define the distribution over $h$? (I assume it will be p(h | x)). +* Eq 3: Perhaps cite prior methods that use this contrastive margin loss (e.g., FaceNet) +* Define VIRL = ""Visual Imitation with RL"" in S3. It took me awhile to figure out what VIRL was when it was first mentioned in S4. +* ""is an active research area"" -- I agree, but using citations from '04 and '09 doesn't make this area sound particularly ""recent."" +* ""the goal is to"" -- The goal of what? Of all ""good distance functions""? +* ""state-based metrics ... image based inputs"" -- Aren't image-based metrics a special case of state-based metrics? For example, the [Ho & Ermon] citation for state-based metrics can be applied to images. +* ""Additional works..., none of these ..."" -> ""While additional works ..., none of these ..."" + +---------------------------- +**Update after author response**: Thanks to the authors for answering my questions and for incorporating feedback into the paper. Through discussion, I think we got to the crux of the method: distance functions seem to work better than classifiers for imitation learning. I think this is a really neat observation, and potentially quite important; the experiments definitely support this hypothesis. That said, I don't think the paper goes far enough in exploring this hypothesis, either experimentally or theoretically. There are a number of confounding variables, such as data, loss function, and architecture, which each will need to be accounted for. I therefore stand by my previous vote (4) to reject the paper. With more thorough experiments and ablations, I think this will make a fantastic submission to a future conference.",4,4.0,ICLR2021 +NBIGQVsBHLI,2,l-PrrQrK0QR,l-PrrQrK0QR,Paper present a novel approach for approximate compression of datasets using Kernel Ridge Regression and experimental results show efficacy of the approach in terms of reducing sample complexity.,"Paper present a novel approach for approximate compression of datasets using Kernel Ridge Regression, referred to as KIP. Paper is well written and technically sounds and experimental results show efficacy of the approach in terms of reducing sample complexity. It also provides an added benefit of corrupting input datasets without much loss test accuracy for privacy preserving use case learning. + +The definition of $\epsilon$-approximation is introduced in terms bounds on difference between the expected empirical loss for original and compressed dataset which in potentially also bounds generalization error as well. The KIP algorithm iterates over an initial support-set to finally converge over a support dataset that gives similar test performance. The idea of choosing a support set and growing sounds familiar to Nystorm method for kernel approximation provides intuition on why the approach might work. 
Also, selecting multiple base kernels means selecting multiple feature spaces, which naturally leads to an overall boost in performance. Results on MNIST and CIFAR show the efficacy of the approach.

A potential gap in this work is the trade-off between the compressed dataset size N and the test error for a given fixed $\epsilon$. Is there a way to choose N for a given $\epsilon$ and a given target test error? It may be that we end up choosing all points in the original dataset for a given approximation and test error. A characterization of this trade-off using generalization bounds would be helpful.",7,4.0,ICLR2021
SJbtlZ5gf,2,HyXNCZbCZ,HyXNCZbCZ,"The authors propose a hierarchical GAN variant of ALI, but offer little novelty or insights.","******
Please note the adjusted review score after revisions and clarifications of the authors.
The paper was improved significantly but still lacks novelty. For context, multi-layer VAEs were also not published unmodified as follow-up papers, since the objective is identical. Also, I would suggest the authors study the modified prior with marginal statistics and other means to understand not just 'that' their model performs better with the extra degree of freedom but also 'how' exactly it does it. The only evaluations are sampling from z1 and z2 for reconstruction, which shows that some structure is learned in z2, and the attribute classification task. However, more statistical understanding of the distributions of the extra layers/capacity of the model would be interesting.
******

The authors propose a hierarchical GAN setup, called HALI, in which they can learn multiple sets of latent variables. They utilize this in a deep generative model for image generation and manage to generate good-looking images, faithful reconstructions and good inpainting results.

At the heart of the technique lies the stacking of GANs, and the authors claim to be proposing a novel model here. First, Emily Denton et al. proposed a stacked version of GANs in ""Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks"", which goes uncited here and should be discussed, as it was the first work stacking GANs, even if it did so with layer-wise pretraining. Furthermore, the differences to another very similar work (StackGAN by Zhang et al.) are unclear and not well motivated. And third, the authors fail to cite 'Adversarial Message Passing' by Karaletsos (2016), which first introduced joint training of generative models with structure by hierarchical GANs and generalizes the theory to a particular form of inference for structured models with GANs in the loop. This cannot be called concurrent work, as it has been around for a year and has been seen and discussed at length in the community, yet the authors fail to acknowledge that their basic idea of a joint generative model and inference procedure is subsumed there. In addition, the authors do not offer any novel technical insights compared to that paper and fall short in positioning their paper in the broader context of approximate inference for generative models.

Given these failings, this paper has very little novelty and does not perform accurate attribution of credit to the community. Also, the authors propose particular one-off models and do not generalize this technique to an inference principle that could be reusable.
+ +As to its merits, the authors manage to get a particularly simple instance of a 'deep gan' working for image generation and show the empirical benefits in terms of image generation tasks. +In addition, they test their method on a semi-supervised task and show good performance, but with a lack of details. + +In conclusion, this paper needs to flesh out its contributions on the empirical side and position its exact contributions accordingly and improve the attribution.",5,5.0,ICLR2018 +Ske7FfJTFB,1,B1lgUkBFwr,B1lgUkBFwr,Official Blind Review #3,"The submission describes an approach for unsupervised domain adaptation in a setting where some parts of the target data are missing. + +Both UDA approaches as well as data completion approaches have a sizable research history, as laid out in the related work section (Section 5). The novelty here comes from the properties that a) domain adaptation and data imputation are handled in a joint manner, b) the missing data in the target domain is non-stochastic, and c) imputation is performed in a latent space. This maps to a fairly specific, but realistic enough set of real-world problems; the authors give an image recognition as well as an advertising prediction related problem as experimental examples. + +The submission is overall well written and easy to understand. I'd rate the novelty as medium (smart combination of existing methods), but the exemplary experimental evaluation elevates it to more than a systems paper. + +The method is described clearly in Section 3, and the joint training makes sense. I notice that not all hyperparameters ({lambda_adv, lambda_mse}, {lambda_1, lambda_2, lambda_3}) are truly needed. lambda_adv and lambda_1 could be canonically set to 1 for such a loss minimization problem, so why are the extraneous parameters included? + +In addition to Section 3, the experimental evaluation on two very different data sets in Section 4 is highly detailed and describes the insights clearly, both qualitatively and quantitatively. I'm happy that mean standard deviations are reported on an acceptable experiment sample set size. +Regarding the different approaches: I'm wondering whether the higher performance of the ADV approach over OT (or the parameter hunger of OT over ADV) is only due to the tuning of the network architectures, or whether this is due to the approximations described in B.1. +The ablation study in Section 4.4 is interesting w.r.t. the trade-off it shows between stable, consistent, ""average"" results from an MSE loss term, vs. high-variance (and on average better) results when a choice of mode is forced using an adversarial loss term. + +Minor comments: +- In Table 2, I am not sure what the first row ('Naive') refers to. As far as I can tell, it is not referenced in the text. +- I would move Section 5 (related work) to right after the introduction, as is common in conference papers and makes for smoother reading. +- Section 5.2: type ""impainting"" -> ""inpainting"" +- Appendix, section 'Pre-processing': It seems to me that there is a clear assumption made that the target set is balanced, since training happens with a balanced source set. Is this realistic in practical scenarios? There is work on DA with unequal class distributions between domains. 
+ +In summary, I can clearly recommend this submission for publication.",8,,ICLR2020 +8cb-qlgG7YQ,1,g4E6SAAvACo,g4E6SAAvACo,Interesting albeit under-explored proof-of-concept for training-free NAS.,"Summary: +The authors propose a training-free way of estimating the performance of a deep net architecture after training using correlations between linearizations of the network at initialization for different augmentations of the same image. This estimate is used as signal to construct NAS algorithms that do not require training deep nets and is evaluated on two datasets. Although I have significant concerns about the practicality of the method, I believe it establishes a sufficiently distinct direction for NAS research that could merit acceptance. + +Strengths: +1. To my knowledge the paper is the first to implement training-free NAS. +2. The method is simple and easy to implement. +3. The method achieves decent performance on CIFAR in dramatically less time than previous approaches. +4. What seems like a complete codebase is provided (although see Question 1 below). +5. The paper is clear and easy-to-follow. + +Weaknesses: +1. The justification for the actual score used is weak (see Questions 2-3 below). +2. The method seems limited to vision data and networks with ReLU activations. +3. Performance on the ImageNet subset of NAS-Bench-201, the only evaluation using non-CIFAR data, is poor. +4. Limited exploratory and benchmark evaluations (see Questions 4-7 below). + +Questions: +1. I was not able to run the provided code (search.py) using the instructions provided; could the authors provide a dependency list? +2. As the score used is non-obvious and has no mathematical basis, it seems likely some trial-and-error was used to find it; is this the case, and if so what sorts of rules were tried that did not work. +3. Why not just use the correlations between gradients rather than using the indicator function to obtain some linearization? +4. What is the effect of beta on performance? +5. Do other types of data augmentation work? +6. For NAS-Bench-201, why was N>100 (e.g. N=1000) not tried? There is clearly room to improve and going that high still leaves the NASWOT algorithm by far the fastest. +7. How does NASWOT perform on larger search spaces such as DARTS (Liu et al., 2019)? + +Notes: +1. In two locations in the paper (3rd para of intro, 3rd para of background) the authors suggest that Li & Talwalkar (2019) show that WS inhibits architecture search and/or struggles against random search; however, that paper also shows that combining WS with random search is outperforms the latter. +2. “Moreover, popular search spaces have been shown to be over-engineered, exhibiting little variety in their trained networks (Yang et al., 2020).” - is there evidence that NAS-Bench-101 and NAS-Bench-201 do not also suffer from this? Both were released before the publication of Yang et al. (2020). +3. “Given a neural network with rectified linear units, we can, at each unit in each layer, identify a binary indicator as to whether the unit is inactive (the value is negative and hence is multiplied by zero) or active (in which case its value is multiplied by one).” - is the proposed method dependent on ReLU activations being used? +4. Figure 4: what is the small circle that appears either above or below many of the box-and-whisker points? +5. Table 2: it is standard to report the optimal in the entire search space, not just for N=10/100. 
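One more note, to make Question 3 above concrete: the comparison I have in mind is roughly the sketch below. This is only a toy illustration with a plain ReLU MLP; none of the shapes, names or similarity measures are taken from the paper or its code, and the actual score is of course computed over the search-space cells rather than this stand-in network.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, 10))
x = torch.randn(8, 32)  # stand-in for a mini-batch of augmented inputs

# (a) Binary ReLU activation codes, my reading of the indicator-based score.
codes = []
h = x
for layer in net:
    h = layer(h)
    if isinstance(layer, nn.ReLU):
        codes.append((h > 0).float())
codes = torch.cat(codes, dim=1)
code_sim = codes @ codes.t() + (1 - codes) @ (1 - codes).t()  # agreements between binary codes

# (b) The alternative I am asking about: correlation between per-input gradients.
grads = []
for i in range(x.size(0)):
    net.zero_grad()
    net(x[i:i + 1]).sum().backward()
    grads.append(torch.cat([p.grad.flatten().clone() for p in net.parameters()]))
G = torch.stack(grads)
grad_sim = torch.nn.functional.cosine_similarity(G.unsqueeze(1), G.unsqueeze(0), dim=-1)

print(code_sim.shape, grad_sim.shape)  # both (8, 8) similarity matrices
```

If ranking architectures by (a) and by (b) gives very similar orderings, that would suggest the indicator trick is mainly a cheap proxy for gradient similarity; if not, the difference itself would be interesting to report.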
+ +# Post-response update +Thank you to the authors for very helpful clarifications. This paper provides a reasonable start for a new potential direction in NAS research and so may be worth presenting at the conference, but the justification and applicability of the method is somewhat limited. I therefore stand by my original assessment.",6,4.0,ICLR2021 +SygATEK7pQ,3,HyxGB2AcY7,HyxGB2AcY7,"Novel idea for exploration in RL, good empirical results, can benefit from more clarity and evidence","Summary: + +The paper proposes the novel idea of using contingency awareness (i.e. the agent’s understanding of the environment dynamics, its perception that some aspects of the environment are under its control and ability to locate itself within the state space) to aid exploration in sparse-reward reinforcement learning tasks. They obtain great results on hard exploration Atari games and a new SOTA on Montezuma’s Revenge (compared to methods which are also not using any external data). They use an inverse dynamics model with attention, (trained with self-supervision) to predict the agent’s actions between consecutive states. This allows them to approximate the agent’s position in 2D environments, which is then used as part of the state representation to encourage efficient exploration. One of the main strengths of this method is the fact that it achieves good performance on challenging tasks without the expert demonstrations or environment simulators. I also liked the discussion part of the paper and the fact that it emphasizes some of the limitations and avenues for future work. + +Pros: +Good empirical results on challenging Atari tasks (including SOTA on Montezuma’s Revenge without extra supervision or information) +Tackles a long-standing problem in RL: efficient exploration in sparse reward environments +Novel idea, which opens up new research directions +Comparison experiments with competitive baselines + +Cons: +The choice of extra loss functions is not very well motivated +Some parts of the paper are not very clear + +Main Comments: +Motivation of Extra Loss Terms: It is not very clear how each of the losses (eq 5) will help mitigate all the issues mentioned in the paragraph above. I suggest providing more detailed explanations to motivate these choices. In particular, why are you not including an entropy regularization loss for the policy to mitigate the third problem identified? This has been previously shown to aid exploration. I also did not see how the second issue mentioned is mitigated by any of the proposed extra loss terms. +Request for Ablation Studies: It would be useful to gain a better understanding of how important is each of the losses used in equation 5, so I suggest doing some ablation studies. +Cell Loss Confusion: Last paragraph of section 3.1: is there a typo in the formulation of the per cell cross-entropy losses? Is alpha supposed to be the action a? Otherwise, this part is confusing, so please explain the reasoning and what supervision signal you used. +State Representation: Section 3.2 can be improved by adding more details. For example, it is not explained at all what the function psi(s) contains and how it makes use of the estimated agent location. I would suggest moving some of the details in section 4.2 (such as the context representation and what psi contains) earlier in the text (perhaps in section 3.2). + + +Minor Comments: +Plots: It would be helpful to give more details about the plots. I suggest labeling the axes. 
Is the x-axis number of frames, steps or episodes? How many runs are used to compute the mean? What do the light and dark colors represent? What smoothing process did you use to obtain these curves if any? Figure 2, why is there such a large drop in performance on Montezuma’s Revenge after 80M? Something similar seems to happen in PrivateEye, but much earlier in training and the agent never recovers. +Tables: I would suggest reporting results in the tables for more than 3 seeds given that these algorithms tend to have rather high variance. Or at least, provide the values for the variance. +Appendix A, Algorithm 1: I believe this can be written more clearly. In particular, it would be good to specify the loss functions that you are optimizing. There seems to be some mismatch between the notation of the losses in the algorithm and the paper. It would also help to define alpha, c, psi etc. +Footnote on page 4: you may consider using a different variable instead of c_t to avoid confusion with c (used to refer to the context representation). +Appendix D, Algorithm 2: is there a reason for which you aren’t assigning the embeddings to the closest cluster instead of any cluster that is within some range? + + +References: +The related work section on exploration and intrinsic motivation could be improved by adding more references such as: +Gregor et al. 2016, Variational Intrinsic Control +Achiam et al. 2018, Variational Option Discovery Algorithms +Fu et al. 2017, EX2: Exploration with Exemplar Models for Deep Reinforcement Learning +Sukhbaatar et al. 2018, Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play +Eysenbach et al. 2018, Diversity is all you need: learning skills without a reward function + + +Final Decision: + +This paper presents a novel way for efficiently exploring environments with sparse rewards. +However, the authors use additional loss terms (to obtain these results) that are not very well motivated. I believe the paper can be improved by including some ablation experiments and making some parts of the paper more clear, so I would like to see these additions in next iterations of the paper. + +Given the novelty, empirical results, and comparisons with competitive baselines, I am inclined to recommend it for acceptance. +",6,3.0,ICLR2019 +HkxD5X1s3m,2,BygRNn0qYX,BygRNn0qYX,"Interesting idea, missing related work, missing results discussion and overall poor presentation","The paper explores the very interesting and relevant problem of universal node representation. It points out that although powerful models for representation learning on graphs exists, most existing works require to pre-define a pairwise node similarity or to specify model parameters. Hence, the authors propose a novel model that doesn’t require to pre-define neighbors nor to specify the dependence form between each node and its neighbors. + + +Pros: +- This work studies the important question of universal node embedding model that require minimal user-defined specifications. +- It proposes an original and novel solution to achieve universal node embedding based on partially permutation invariant function. +- Provides theoretical guarantee. + +Cons: +- Some recent works on structural node embedding are directly related to this work but missing in the related work section: struc2vec [1] and GraphWave [2]. +- In the experiment section, it would be necessary to provide the values of the tuned hyper-parameters for each model for reproducibility. 
+- The results are not really analysed nor discussed beyond noticing that P^2IR performs better than other models in most cases. For instance, the authors don't discuss the complexity of the different models, or don’t give intuition as to whether the improvements are significant. +- It would be relevant to include some node embeddings models (such as [1,2]) in the baseline methods as they have been shown to outperform node2vec/deepwalk in some classification tasks. + +minor comments on the text: +- on page 2, WFS instead of BFS +- on page 5, please spell out 'NN function' +- on page 6, in the last equation characterizing the mapping of node v, it is not clear why the subscript k in phi_k is there. (similarly for eq. (3) and (4) and subsequent mention of phi). +- on page 7, Table 1 is useless. + +1. Ribeiro, L. F., Saverese, P. H., and Figueiredo, D. R. (2017). Struc2vec: Learning node representations from structural identity +2. Donnat, C., Zitnik, M., Hallac, D., and Leskovec, J. (2018). Learning structural node embeddings via diffusion wavelets. +",5,3.0,ICLR2019 +8gbL-gVatQ8,2,TETmEkko7e5,TETmEkko7e5,"Interesting setting for explanations of sequential decision making, but too many assumptions at play."," +The authors propose a method of explainable AI for inscrutable blackbox models. The explanations build on a set of user-defined primitives, independently trained on the blackbox representation (e.g., visual frames of an Atari game), and use an increasingly popular method of providing contrastive explanations. Two forms of foil-based responses are provided: (1) indication of action failure from the planning perspective (preconditions unsatisfied); and (2) an explanation of relative sub-optimality that highlights key aspects of action costs that the user may be unaware of. + +High-level concepts, particularly those tied to a symbolic description of the world dynamics, is an extremely compelling basis for explanation. It helps build a well-grounded intuition with human users/observers of autonomous systems and arguably is the best way to convey explanations that describe behaviour of an inherently sequential nature. + +In addition to the form of explanation primitives, the algorithms are intuitive, and the probabilistic inference seems to be sound. + +My concerns with the paper fall into two main categories: the lack of substantial contributions (particularly as related to representation learning) and the strong assumptions placed on the setting. + +One of the most significant missed opportunities in this work is to focus on introducing new concepts. Especially given that human studies were conducted, and the setting was identified when algorithms fail, and new or revised concepts are required. Assuming highly accurate binary classifiers for each concept is relatively extreme, and it's only one such overly strong assumption. + +Other very strong assumptions include: + +(a) The state is memoryless/Markovian: every concept can be determined by looking exclusively at the current frame. This isn't the case in many settings where some memory of the previous actions is required. + +(b) The distribution of a fluent across the state space is independent to the distribution of other fluents: this is rarely the case in planning-like domains, and the types of explanations introduced in this work build on planning-like domains a great deal. 
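To spell out why (b) worries me, a two-concept toy example is enough; the concept names and probabilities below are invented purely for illustration and do not come from the paper or its domains.

```python
import random

random.seed(0)

# Two coupled binary concepts in a Montezuma-like domain: the door is only
# ever observed open after the agent has picked up the key.
states = []
for _ in range(10000):
    has_key = random.random() < 0.4
    door_open = has_key and random.random() < 0.5  # depends on has_key
    states.append((has_key, door_open))

p_key = sum(k for k, _ in states) / len(states)
p_door = sum(d for _, d in states) / len(states)
p_joint = sum(k and d for k, d in states) / len(states)

print(f'P(key)P(door) = {p_key * p_door:.3f}  vs  P(key, door) = {p_joint:.3f}')
```

The product of marginals badly misestimates the joint here, so a sampler that draws each fluent independently will keep proposing states that never occur, which is presumably exactly how the sampling budget gets exhausted.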
+ +(c) There is only one failed precondition: this might be an alright assumption to make given (b), but similarly, I find it unlikely that many domains would have this property. + +As pointed out by the authors, assumptions (b) and (c) cause Algorithm 2 to exhaust the entire sampling budget before failing, and I don't believe they are safe assumptions to make. + +On the topic of evaluation, there are two further issues. One is the scope of the evaluation (only two domains and a seemingly small number of subjects to test with), and this reduces the significance of the paper's contribution. Another issue is the choice of comparison for H3. The Saliency map is built using different information than that surfaced using the proposed approaches. This makes it challenging to adopt the experiment's conclusion as written since it is also testing the quality of the saliency map in addition to the comparative nature of the explanations. + +I am leaning towards rejecting the paper due to the number of assumptions placed on the approach. Combined with the limited evaluation setting and scope of work, the contributions to the field of learning representations seem limited. + +Ultimately, my hesitation in recommending acceptance comes from the contribution being on the low-side for the ICLR community. The authors identify key elements that would change this impression -- e.g., refining the concepts when the algorithms fail to find an explanation (italics on pg 5) -- but these do not play a central role in the proposed work. + +The H1/H2 results are, in some sense, evident that the proposed explanations are preferred (19/20 for H1). While an important element (it would be very surprising if this weren't the case), they don't serve as a sufficient contribution in their own right. The H3 comparison seems to be somewhat contrived since they come from different sources -- a more accurate comparison would be to engineer the saliency overlay based on the domain-knowledge known. I.e., reflecting the precondition-based information directly. Without that, you conflate both the choice of focus and ability to highlight that choice. + + +Questions for the authors: + +1. How do you remove the (seemingly strong) assumption that the distribution of fluents across the state space is independent among the fluents? Alternatively, why can we expect this to be a reasonable assumption to make? + +2. How would you remove the dependence on the fully observable / Markovian assumption on the blackbox output that is used for concept classification? I.e., when the full state cannot be discerned by looking at the screen alone. + + +Other minor points of improvement for the paper: + +o Mind the notation used for your Goal set (near the end of page 2, you are using a different syntax than the one introduced. + +o Defn 3 seems to have a random bracket at the end. +",5,4.0,ICLR2021 +rylJdHwn2Q,3,HkezXnA9YX,HkezXnA9YX,"Interesting, but please add more experiments like this","The paper explores how well different visual reasoning models can learn systematic generalization on a simple binary task. They create a simple synthetic dataset, involving asking if particular types of objects are in a spatial relation to others. To test generalization, they lower the ratio of observed combinations of objects in the training data. The authors show the result that tree structured neural module networks generalize very well, but other strong visual reasoning approaches do not. They also explore whether appropriate structures can be learned. 
I think this is a very interesting area to explore, and the paper is clearly written and presented. + +As the authors admit, the main result is not especially surprising. I think everyone agrees that we can design models that show particular kinds of generalization by carefully building inductive bias into the architecture, and that it's easy to make these work on the right toy data. However, on less restricted data, more general architectures seem to show better generalization (even if it is not systematic). What I really want this paper to explore is when and why this happens. Even on synthetic data, when do or don't we see generalization (systematic or otherwise) from NMNs/MAC/FiLM? MAC in particular seems to have an inductive bias that might make some forms of systematic generalization possible. It might be the case that their version of NMN can only really do well on this specific task, which would be less interesting. + +All the models show very high training accuracy, even if they do not show systematic generalization. That suggests that from the point of view of training, there are many equally good solutions, which suggests a number of interesting questions. If you did large numbers of training runs, would the models occasionally find the right solution? Could you somehow test for if a given trained model will show systematic generalization? Is there any way to help the models find the ""right"" (or better) solutions - e.g. adding regularization, or changing the model size? + +Overall, I do think the paper has makes a contribution in experimentally showing a setting where tree-structured NMNs can show better systematic generalization than other visual reasoning approaches. However, I feel like the main result is a bit too predictable, and for acceptance I'd like to see a much more detailed exploration of the questions around systematic generalization. + +",4,4.0,ICLR2019 +RvnyFY9bjY,3,PXDdWQDBsCG,PXDdWQDBsCG,"important topic, some concerns about the proposed methods","This paper studies how to incorporate shape (particularly depth map) into CNN for more robust models. The study focuses on image classification. Specifically, this paper proposes two depth-map-based defense: 1) Edge-guided Adversarial Training (EAT), which use depth map as an additional input 2) GAN-based Shape Defense (GSD), which learns a generator from depth map to reconstructed images, which is then used as net input. Experiments on 10 datasets shows the effectiveness of the proposed two defenses against white-box attacks including FGSM and PGD40. To further demonstrate the effectiveness, the authors also conduct some other experiments: 1) the proposed EAT goes well with two fast AT algorithms; 2) the proposed algorithm can also be used to defend backdoor attack; 3) edge makes CNN more robust to common image corruptions. + +I think the topic that tries to explore and understand the connection between shape and CNN is very important and somewhat under-studied. This paper provides some interesting empirical results and insights to the community. + +1. The assumption of this paper is that edge map does not change much under adversarial attacks. I think this relies on two things: A) we use tradition non-deep-net based edge detector like canny edge; B) the adversarial perturbation is pixel-based perturbation. I am curious to see how the proposed algorithm work with B), but for now, I will focus on A) below. 
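Before getting to A), one side remark on how the stability assumption itself could be quantified. The sketch below is only illustrative: the epsilon, the Canny thresholds and the random stand-in image are my own choices, and a real check would use the actual attacks and datasets from the paper.

```python
import numpy as np
import cv2

rng = np.random.default_rng(0)
# Stand-in for a clean grayscale image; replace with a real test image.
img = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)

# FGSM-style pixel perturbation with an L_inf budget of 8/255 (my choice).
eps = 8
delta = rng.choice([-eps, eps], size=img.shape).astype(np.int16)
adv = np.clip(img.astype(np.int16) + delta, 0, 255).astype(np.uint8)

edges_clean = cv2.Canny(img, 100, 200) > 0
edges_adv = cv2.Canny(adv, 100, 200) > 0

# Overlap (IoU) of the two edge maps: how much of the edge map survives.
iou = (edges_clean & edges_adv).sum() / max((edges_clean | edges_adv).sum(), 1)
print(f'edge-map IoU under an eps={eps} perturbation: {iou:.2f}')
```

Reporting this kind of edge-map overlap across attack strengths would turn assumption 1 into a measurable quantity rather than an intuition.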
+ +For A), for harder and more realistic datasets, canny may not work well and we may resort to deep-net-based edge detector (like HED). I think the author also mentions this for cifar10 and tinyimagenet, which is only 32x32 and 64x64. This problem likely becomes more severe when we deal with larger real images. But, if we use deep-net-based edge detector, we now break the assumption that edge map does not change much under adversarial attacks. Since it is deep net and so it is fragile. So I am not sure how the proposed method work with deep-net-based edge detectors, on a somewhat more realistic image datasets. + +2. For GSD, as mentioned by the authors, it is similar to the two GAN/VAE-based baselines. I am curious to see how it compares with them? + +3. Also for GSD, considering scaling up to more realistic images with more visual patterns, it would be super hard to learn a mapping from pure depth to rgb, since it is ill-posed and under-determined. The two baseline methods do not have this under-determination because their input is rgb images. Also, the learned mapper would be very correlated with and overfitting to the training data, which hinders generalization. + +4. How does CW attack perform against the proposed defense? + +5. A minor thing about reference. For the fast AT algorithms, to my best knowledge, I think there is a third one ""Bilateral Adversarial Training: Towards Fast Training of More Robust Models Against Adversarial Attacks"" published in ICCV 2019. + + +",5,4.0,ICLR2021 +xFK60pknP5S,1,nXSDybDWV3,nXSDybDWV3,Needs comparison to existing work ,"Unfortunately the authors link directly to the code, and the code is not anonymous. This might be a desk-reject as this is not a double blind review. + +This work is a description of a library for developing variational inference algorithms using the ELBO-within-Stein framework developed in Nalisnick et al. (2017). The library is evaluated on on Neal's funnel and two moons, and on a polyphonic music dataset. + +Comments + +- Nalisnick et al was published in 2017. I assume this was a typo on the authors' part. + +- Table A in the Appendix, describing different kernels, should include a column with computational and memory requirements for each kernel if they differ. This can affect the scalability. + +- The work describes LDA but does not evaluate it. It would be helpful to include held-out log likelihood numbers on a standard topic modeling dataset such as 20 newsgroups. This would help people compare to prior work. + +- Similarly, the library is evaluated by fitting to a standard polyphonic music dataset. Please report these numbers in a table, alongside a reasonable approach using standard variational inference and Stein VI (using the library) side-by-side. For example, the numbers here are much better, and use standard variational inference with the KL divergence: https://papers.nips.cc/paper/6039-sequential-neural-models-with-stochastic-layers.pdf (Stein Variational Inference can be difficult to understand, as can be Pyro, which is built on jax/pytorch, and the library developed here is built on top of all of these moving parts. Before embarking on using the library, a machine learning researcher should be very convinced that all this additional effort is worth it. Benchmarking this new library against existing work is important and will go a long way toward justifying its existence.) + +- The references are very poorly formatted. Please clean up. 
+",3,4.0,ICLR2021 +SJVDtoHef,1,SJdCUMZAW,SJdCUMZAW,Neither very innovative nor very strong evaluations,"I already reviewed this paper for R:SS 2017. There were no significant updates in this version, see my largely identical detailed comment in ""Official Comment"" + +Quality +====== +The proposed approaches make sense but it is unclear how task specific they are. + +Clarity +===== +The paper reads well. The authors cram 4 ideas into one paper which comes at the cost of clarity of each of them. + +Originality +========= +The ideas on their own are rather incremental. + +Significance +========== +It is unclear how widely applicable the ideas (and there combination) are an whether they would transfer to a real robot experiment. As pointed out above the ideas are not really groundbreaking on their own. + +Pros and Cons (from the RSS AC which sums up my thoughts nicely) +============ ++ The paper presents and evaluates a collection of approaches to speed learning of policies for manipulation tasks. ++ Improving the data efficiency of learning algorithms and enabling learning across multiple robots is important for practical use in robot manipulation. ++ The multi-stage structure of manipulation is nicely exploited in reward shaping and distribution of starting states for training. + +- The techniques of asynchronous update and multiple replay steps may have limited novelty, building closely on previous work and applying it to this new problem. +- The contribution on reward shaping would benefit from a more detailed description and investigation. +- There is concern that results may be specific to the chosen task. +- Experiments using real robots are needed for practical evaluation. +",4,4.0,ICLR2018 +7sfcPvzqSoo,2,kE3vd639uRW,kE3vd639uRW,"Review of ""LiftPool: Bidirectional ConvNet Pooling""","This paper presents a new pooling layer that is based on the Lifting Scheme from signal processing. It motivates this approach with the desire for reversible pooling functions for certain tasks. The benefits of this reversibility are demonstrated on a semantic segmentation task. As a drop-in replacement for the pooling lawyer in various neural-network backbones, it also outperforms many other pooling layers on classification tasks (ImageNet). + +I thought this paper had great collection of analyses: flexibility (choice of pooling band), effectiveness across kernal sizes, generalizability across various backbones, and robustness to corruptions and perturbations. + +*I recommend to accept*. While this may be just another pooling layer, it seems a quite well motivated pooling layer coming from a particular need for reversibility. + +Some highlights: +- very clear writing +- very well situated in historical and contemporary literature +- exactly the experiments that I would want to see +- the observation that the sub-band that represents vertical details constributes more to classification accuracy than other sub-bands; curious to know whether this holds outside of VGG13 on CIFAR-100. + +Some points for improvement: +- the presentation of the lifting scheme and the use of the LL/HL/HH/HH notation is perhaps a little non-intuitive for anyone without previous exposure; you could make it clearer that not all bands would be necessarily used during pooling, but that the information would be retained for reversing the operation. 
+- on p.7 you say ""sift-invariance""; I think you mean ""shift-invariance"" +- Instead of saying how you believe your findings will stimulate people to think about problems (""These findings may stimulate researchers to rethink"", ""We believe such findings will stimulate one to think""), it would be better to perhaps make a claim that needs to be evaluated in the future, or to point out exactly what is left unknown or surprising by your findings. +- Figure 7 is not very clear. Think about people with colour blindness. Also, consider two separate graphs side-by-side or one above the other. It isn't clear what the shaded red area is meant to indicate and how it relates to what is ""redistributed."" I see what you're trying to show, and I actually just think it is a quite difficult thing to visualize, so maybe Figure 7 is the best you can get to, but I would brainstorm some more on this. +- p 12. ""Visualiztion"" +- p 12. the closing quotation marks around ""high frequency"" and ""low frequency"" go the wrong direction +- Figure 9 caption: the hyphenation in LiftUpPool should be customized; it breaks the word at a weird spot +- Figure 8 consider using various line types instead of colors in order to better accomodate people with color blindness +- Throughout: inconsistent hyphenation ""downsizing"" vs ""down-sizing"", ""up-sampling"" vs ""upsampling"" +- Bibliography: I think you need to force capitalization in some of the titles. See e.g. ""pytorch"" and ""Mobilenetv2"" + +---------------- + +Question: When you ""combine all the sub-bands by summing them up"", do you literally just add up the values from the corresponding indices across each sub-band so that you're still reducing the dimensionality? I am a little surprised that this doesn't *reduce* performance. Can you say more about this? Why does this work? + +Question: Why did you only compare against three baseline pooling methods for the corruptions and perturbations instead of the full gamut as in Table 4? + +Question: do you have error bars for your experiments? Or did you run them only once each? For which results did you run your own experiments vs reporting numbers from previous literature (particularily in Table 4)? + +---------------- +My main uncertainty (why I am not giving this a 5 for confidence) is that I cannot be sure this hasn't been proposed in the past, but it is hard to prove a negative. I can say is that this does appear novel to me. + +I am also uncertain about the evaluation on the semantic segmentation task. I am familiar with this problem and the evaluations seem reasonable, but I cannot be sure whether the choices of comparator methods are the strongest alternatives.",8,4.0,ICLR2021 +S1lrWRQvtH,1,BJl8ZlHFwr,BJl8ZlHFwr,Official Blind Review #1,"The paper presents a novel approach for (generalized) Zero-shot learning (GZSL). As showing in the numerical experiments on some real data, the method demonstrates the significant improvement on the accuracy of prediction comparing to some state-of-the-art methods. +The main key of the method is using Variational Inference, variational autoencoders. The authors have taken into account the modality of the data through reparametrize the distributions, especially the inside class invariant modality and class separability. Moreover, the authors also propose to take into account a kind of biasness domain into the learning procedure, which details in adding a regularization of the domain discriminator into the objective function. 
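For my own bookkeeping, the objective appears to have roughly the following shape; this is my schematic reconstruction for fixing intuition, not the authors' notation, with $a$ the class semantic/attribute vector, $z$ the latent code and $D$ the domain discriminator:

$$\mathcal{L} \;=\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z, a)\big] \;-\; \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z \mid a)\big) \;-\; \lambda \, \mathcal{R}_{\mathrm{dom}}\big(D(z)\big),$$

where the first two terms are the modality-aware ELBO with a class-conditioned (replaced) prior and the last term is the domain-discriminator regularizer that is supposed to reduce the bias towards seen classes. If this reading is roughly right, stating explicitly what $D$ is trained on (latent codes, reconstructions, or both) would make the optimization of the full objective much easier to follow.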
+ +The paper is nicely written, espcially with a clear formal introduction to the problem of GZSL. + +However, I have some questions: +1) Does the test set has some labels? How do you know your method works well? I can not find where you have defined a kind of loss so that we can compare the predicted labels \hat{y}_j ? (In Section 2.) +2) How do you learn the ""replaced prior"" in equation (4) ? +3) It is not enough detail on how do you optimize the objective (8) ? a detail explain algorithm would make the paper significant, indeed. +4) In Table1, would MCMVAE in the last row be MCMVAE-D ? + +Final, I expect the authors will make their codes available for the readers.",6,,ICLR2020 +SyxXUGsc3Q,3,H1eSS3CcKX,H1eSS3CcKX,An improvement to relaxed sort operators; some even-harder experiments,"This work builds on a sum(top k) identity to derive a pathwise differentiable sampler of 'unimodal row stochastic' matrices. The Plackett-Luce family has a tractable density (an improvement over previous works) and is (as developed here) efficient to sample. + +[OpenReview did not save my draft, so I now attempt to recover it from memory.] + +Questions: +- How much of the improvement is attributable to the lower dimension of the parameterization? (e.g. all Sinkhorn varients have N^2 params; this has N params) Is there any reduction in gradient variance due to using fewer gumbel samples? +- More details needed on the kNN loss (uniform vs inv distance wt? which one?); and the experiment overall: what k got used in the end? +- The temperature setting is basically a bias-variance tradeoff (see Fig 5). How non-discrete are the permutation-like matrices ultimately used in the experiments? While the gradients are unbiased for the relaxed sort operator, they are still biased if our final model is a true sort. Would be nice to quantify this difference, or at least mention it. + +Quality: +Good quality; approach is well-founded and more efficient than extant solutions. Fairly detailed summaries of experiments in appendices (except kNN). Neat way to reduce the parameter count from N^2 to N. + +I have not thoroughly evaluated the proofs in appendix. + +Clarity: +The approach is presented well, existing techniques are compared in both prose and as baselines. Appendix provides code for maximal clarity. + +Originality: +First approach I've seen that reduces parameter count for permutation matrices like this. And with tractable density. Very neat and original approach. + +Significance: +More scalable than existing approaches (e.g: only need N gumbel samples instead of N^2), yields better results. + +I look forward to seeing this integrated into future work, as envisioned (e.g. beam search)",8,4.0,ICLR2019 +w15-fGETE_,1,vcKVhY7AZqK,vcKVhY7AZqK,Review,"The paper claims that existing measures of complexity such as entropy are not suitable for measuring task complexity since they focus on the complexity of X rather than the predictive relationship from X to Y. The paper argues that mutual information is not useful for comparing different learning tasks, as two tasks (MNIST vs Fashion-MNIST) can have similar MI but intuitively different complexity (second-to-last paragraph in related work). The paper proposes a measure for task complexity based on the number of queries required to predict a label of an input. The form of the queries is not specified, and the provided examples include half-space queries, single feature (decision-tree-like) queries, or high level semantic queries. 
The proposed method instead considers a query generator E, then encoding X as the answers to the sequence of queries generated by E, and predicting Y from the answers. The complexity of a task is related to the number of equivalence classes induced by the input X and the query generator E. + +I think the general premise is interesting, although I am not familiar with the related work (eg Achille, Tran) so I can't comment on how novel or different the idea of this paper is. Although the paper claims to have the first subjective notion of task complexity, to me this seems like more of a drawback since measures of complexity should be as standardized as possible, so as to allow comparison between different works. It may have been useful to cement down the three versions of complexity described on page 2, so future works can use them directly (for example provide details of measuring visual semantic complexity; and if it relies on extracting latent features of a neural net, i also wonder how close/different this is to [Achille]). + +Also, the paper claims the drawback of Kolmogorov complexity is that it is not easily computable, but the methods described to compute the paper's proposed complexity are also highly involved, and requires multiple layers of approximations. It would have been helpful to have a more in-depth discussion of how (if at all) the proposed method is easier computationally compared to the other methods. + +In proposition 1, epsilon seems not to have been motivated yet? The task complexity in Eq2 did not seem to have any notion of error, so why is there a probability of misclassification in prop 1? + +The notation in Equation 1 looks off to me, since the RHS conditions on the set of all x's with the same encoding. Don't we want something to the effect of: p(y|x) = p(y|x') for all x, x' such that A_q(x) = A_q(x') \forall q ? + +The prefix-free constraint seems like it can be avoided if we just include q_STOP in the code? This seems more natural to me, and you can avoid the extra notation and sentences of explanation on the bottom of page 3. + +Condition 2 in Prop3: we are assuming that y is categorical, and p_{Y|X=x} is a distribution over the labels in Y? Or does p_{Y|X=x} have all its probability mass on a single label? I assume it is the latter case, but writing it in terms of two inputs disagreeing \forall labels y is confusing when they are only assigned to one label each (in fact, shouldn't it be \exists y where x and x' disagree, instead of \forall y?). + +I also didn't quite understand how the conditional inference network takes in {q, A_q(x)}_{1:k}. Isn't this not a fixed length sequence, in fact we don't even know the length of it beforehand since we decide when to stop at runtime based on the stopping condition? So in Fig2, how are the \Psi's able to take in variable-length sequences? + +I'm curious about the exact cost of each iteration of the information pursuit algorithm. Given p(A_q(x), y | B) in Eq.9, do you compute p(A_q(x) | B) and p(y | B) exactly to get the mutual information, or do some sample-based approach? How many values can A_q(x) take on? And we need to do this for every possible query q \in Q in order to get argmax_q, so if A_q(x) can take on m values, are you doing O( |Q| m ) number of queries of the distribution p? + +On the point of mutual information not being directly useful to predict difficulty in mapping X to Y, it seem that this paper ""A Theory of Usable Information under Computational Constraints"" Xu et al. 
[ICLR 2020] is very relevant. For example under a limited computational model, perhaps MNIST will have higher mutual information than Fashion-MNIST.",5,3.0,ICLR2021 +BkxGc0P1qS,2,HkeAepVKDH,HkeAepVKDH,Official Blind Review #2,"This paper propose to study the quantization of GANs parameters. They show that standard methods to quantize the weights of neural networks fails when doing extreme quantization (1 or 2-bit quantization). They show that when using low-bit representation, some of the bits are used to model extremal values of the weights which are irrelevant and lead to numerical instability. To fix this issue they propose a new method based on Expectation-Maximization to quantify the weights of the neural networks. They then show experimentally that this enables them to quantize the weights of neural networks to low bit representation without a complete drop of performance and remaining stable. + +I'm overall in favour of accepting this work. The paper is well motivated, the authors clearly show the benefits of the proposed approach compared to other approach when using extreme quantization. + +Main argument: ++ Great overview of previous methods and why they fail when applying extreme quantization ++ Great study of the influence of the sensitivity to the number of bits used for quantization +- It would have been nice if the author had provided standard deviation for the results by running each method several times. In particular figure 2.c seem to show that they might be a lot of variance in the results when using low bit quantization. +- I feel some details are missing or at least lack some precision. For example are the networks pre-trained with full precision in all experiments ? if so can you precise it in section 3.1 also ? +- The proposed approach seem very similar in spirit to vector quantization, can the author contrast their method to vector quantization ? +- In equation (7) doesn't the constant C also depend on alpha and beta ? +- In section 5.1 do you also use the two phase training described in section 4.2 ? +- Figure 4.c seems to indicate that quantize the generator only is no more a problem ? Can you explain why this figure is very different from figure 2.c +- In table 3 how is the number of bits chosen, did you try several different values and report the best performance ? + +Minor: +- Some of the notations are a bit confusing. You call X the tensor of x, I think it would be more clear to say that X is the domain of x. +- I'm surprised by the results in section 3.1, wouldn't the issue described in this section when training standard neural networks ? wasn't this known before ? +- There is some typos in the text",6,,ICLR2020 +H1lRMzRpYr,1,BJgd81SYwr,BJgd81SYwr,Official Blind Review #3,"The paper proposes learning to add input-dependent noise to improve the generalization of MAML-style meta-learning algorithm. The proposed method is evaluated on OmniGlot and miniImageNet. The paper reports improvements upon MAML, MAML with meta-learned parameter-wise learning rates, as well as a few regularization methods that are based on input/hidden state perturbations (Mixup, Variational Information Bottleneck). An ablation study also compares the proposed meta-dropout algorithm with a number of modifications, such as a fixed noise, input-independent noise, etc. It is furthermore shown that meta-dropout somewhat improves the model’s robustness against an adversarial attack. 
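To check my understanding of the mechanism before the comments below: the perturbation seems to amount to something like the following sketch, where a small side network produces input-dependent scales for multiplicative Gaussian noise on the activations. All names and shapes here are hypothetical; as I note later, the paper does not actually describe the architecture of the noise generator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLayer(nn.Module):
    # My schematic reading of input-dependent noise: phi maps the incoming
    # features to per-unit noise scales, used only during inner-loop adaptation.
    def __init__(self, dim):
        super().__init__()
        self.layer = nn.Linear(dim, dim)
        self.phi = nn.Linear(dim, dim)  # meta-learned noise generator (hypothetical)

    def forward(self, h, perturb=True):
        out = torch.relu(self.layer(h))
        if perturb:
            scale = F.softplus(self.phi(h))
            out = out * (1.0 + scale * torch.randn_like(out))  # multiplicative noise
        return out

layer = NoisyLayer(16)
print(layer(torch.randn(4, 16)).shape)  # torch.Size([4, 16])
```

Under this reading, the Fixed Gaussian ablation corresponds to removing phi entirely, which is exactly why its competitive 5-shot results deserve a clearer explanation.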
+ +The paper is somewhat incremental considering that Li et al, (2017) and Balaji et al, (2018) have already proposed meta-learning parameter-wise learning rates and parameter-wise regularization coefficient respectively. One difference from the methods above is that in the proposed method noise is controlled by the input. The ablation however shows that in 5-shot classification case simply adding non-trainable noise works quite well. + +It seems like the choice of the particular method for adding the noise was performed using the test set. If it’s true, this is methodologically wrong: model selection should be performed on a development set (or meta-development) set. Futhermore, Table 2 contains some results named “Add.”, which I guess stands for additive noise. I did not find an explanation of what is the specific method for adding noise used in this case. Such additive noise is also missing from ablation experiments. + +Overall, it seems that paper falls short of clearly proving that back-propagating through MAML to the noise parameters is helpful. The “Deterministic Meta-Dropout” performs better than baseline MAML, and arguably, meaning that some part of the improvement upon MAML can be due to the architectural differences and not due to noise. “Independent Gaussian” and “Weight Gaussian” baselines perform worse than non-trainable noise (“Fixed Gaussian”). Learning the variance for the noise is shown to be detrimental. There is just too much confusion in the results, the improvements are not very robust. + +The paper writing is okay, but there are serious issues. I am not sure I understand the argument in Section 3.2 that meta-dropout performs variational inference. It seems like Equation 7 is wrong because y_i is missing from the second argument of the KL divergence term. The transition to Equation 8 is therefore also wrong, and as far as I can understand, the whole argument breaks down. Line 7 in Algorithm 1 in Appendix A (which by the way should really be in the main text) does not make sense. + +Other issues: +- the second sentence of the abstract is not implied by the first, the usage of “thus” does not seem appropriate +- the intro should probably mention L1 and L2 regularization as well +- in Section 3.1 there is a forward reference to Equation 5, makes understanding the text quite hard +- “meta-droput”, “robustenss”: typos in many places +- Figure 4 visualization is not clear. +- the architectural change required to add noise is not explained in the paper (i.e. what is \phi and how it’s used) +- no comparison to meta-learned L1 regularization +- a baseline is missing in which \phi is treated as a part of \theta and trained with vanilla MAML +",3,,ICLR2020 +ryeBokZ4qB,1,ryloogSKDS,ryloogSKDS,Official Blind Review #2513,"The paper proposes a Brigham loss (based on the Brigham distribution) to model the uncertainty of orientations (an important factor for pose estimation and other tasks). This distribution has the necessary characteristics required to represent orientation uncertainty using quaternions (one way to represent object orientation in 3D) such as antipodal symmetry. The authors propose various additions such as using precomputed lookup tables to represent a simplified version of the normalization constant (to make it computationally tractable), and the use of Expected Absolute Angular Deviation (EAAD) to make the uncertainty of the Bingham distribution more interpretable. 
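For reference, the Bingham density on unit quaternions has the standard form (this is the textbook parameterization, not copied from the paper):

$$p(\mathbf{x}; M, Z) \;=\; \frac{1}{F(Z)}\,\exp\!\big(\mathbf{x}^{\top} M Z M^{\top} \mathbf{x}\big), \qquad \mathbf{x} \in S^{3},$$

where $M$ is orthogonal, $Z$ is a diagonal matrix of concentration parameters, and $F(Z)$ is the normalization constant that, as I understand it, the authors approximate with precomputed lookup tables. Since the exponent is quadratic in $\mathbf{x}$, we get $p(\mathbf{x}) = p(-\mathbf{x})$, which is exactly the antipodal symmetry needed for quaternions, because $q$ and $-q$ represent the same rotation.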
+ ++Uncertainty quantification of neural networks is an important problem that I believe should gain more attention so I am happy to see papers such as this one. ++Various experiments on multiple datasets show the efficacy of the method as well as out performing or showing comparable results to state-of-the-art + +-In the caption for Table 1 the author’s write: “the high likelihood and lower difference between EAAD and MAAD indicate that the Bingham loss better captures the underlying noise.” How much difference between EAAD and MAAD is considered significant and why? + +-In section 4.5 they write “While Von Mises performs better on the MAAD, we observe that there is a larger difference between the MAAD and EAAD values for the Von Mises distribution than the Bingham distribution. This indicates that the uncertainty estimates of the Von Mises distribution may be overconfident.” Same question as above. What amount of difference between MAAD and EAAD is considered significant and why? +",6,,ICLR2020 +H1xSQUW2tS,3,S1lOTC4tDS,S1lOTC4tDS,Official Blind Review #1,"This paper introduced a latent space model for reinforcement learning in vision-based control tasks. It first learns a latent dynamics model, in which the transition model and the reward model can be learned on the latent state representations. Using the learned latent state representations, it used an actor-critic model to learn a reactive policy to optimize the agent's behaviors in long-horizon continuous control tasks. The method is applied to vision-based continuous control in 20 tasks in the Deepmind control suite. + +Pros: +1. The method used a latent dynamics model, which avoids reconstruction of the future images during inference. +2. The learned actor-critic model replaced online planning, where actions can be evaluated in a more efficient manner. +3. The model achieved better performances in challenging control tasks compared to previous latent space planning methods, such as PlaNet. + +Cons: +1. The work has limited novelty: the learning of the world model (recurrent state-space model) closely follows the prior work of PlaNet. In contrast to PlaNet, the difference is that this work learns an actor-critic model in place of online planning with the cross entropy method. However, I found the contribution of the actor-critic model is insufficient and requires additional experimental validation (see below). + +2. Since the actor-critic model is the novel component in this model (propagating gradients through the learned dynamics), I would like to see additional analysis and baseline comparisons of this method to previous actor-critic policy learning methods, such as DDPG and SAC training on the (fixed) latent state representations, and recent work of MVE or STEVE that use the learned dynamics to accelerate policy learning with multi-step updates. + +3. The world model is fixed while learning the action and value models, meaning that reinforcement learning of the actor-critic model cannot be used to improve the latent state model. It'd be interesting to see how optimization of the actions would lead to better state representations by propagating gradients from the actor-critic model to the world model. + +Typos: +Reward prediction along --> Reward prediction alone +this limitation in latenby?",6,,ICLR2020 +rkxd5xEcFB,1,r1geR1BKPr,r1geR1BKPr,Official Blind Review #3,"The authors derive the influence function of models that are first pre-trained and then fine-tuned. 
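As background (my recap of the standard formulation, not the paper's notation): in the usual single-stage setting, the influence of a training point $z$ on the loss at a test point $z_{\text{test}}$ is approximated, up to sign and scaling conventions, by

$$\mathcal{I}(z, z_{\text{test}}) \;=\; -\,\nabla_{\theta} L(z_{\text{test}}, \hat{\theta})^{\top} H_{\hat{\theta}}^{-1}\, \nabla_{\theta} L(z, \hat{\theta}),$$

where $H_{\hat{\theta}}$ is the Hessian of the training objective at the fitted parameters. The question here is how this changes when $\hat{\theta}$ is produced by pre-training followed by fine-tuning.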
This extends influence functions beyond the standard supervised setting that they have been primarily considered in. To do so, the authors make two methodological contributions: 1) working through the calculus for the pre-training setting and deriving a corresponding efficient algorithm, and 2) adding $L_2$ regularization to approximate the effect of fine-tuning for a limited number of gradient steps. + +I believe that these are useful technical contributions that will help to broaden the applicability of influence functions beyond the standard supervised setting. For that reason, I recommend a weak accept. I have some questions and reservations about the current paper: + +1) Does pretraining actually help in the MNIST/CIFAR settings considered? These seem to be non-standard pretraining settings. More generally, can we relate influence to some objective measure that we care about (say test accuracy), for example by showing that removing the top X% of influential pretraining data hurts test accuracy as much as predicted? Minor: section 4.2 also seems non-standard. Are the exact same bird vs. frog examples being used for both pretraining and finetuning? + +2) In what situations might we want to examine the influence of pretraining data, and can we design experiments that show those situations? For example, perhaps we're wondering if different types of sentences in the one-billion-word dataset might be more or less useful. Can we verify those claims using these multi-stage influence functions? It is otherwise difficult to assess the utility of the qualitative results (e.g., Figure 3 and Appendix C). + +3) It'd be helpful to get a better understanding of the technical contributions of this paper. Specifically, +a. What is the impact of $\alpha$ in equation 12 and how does it interact with the number of fine-tuning steps taken? +b. If the Hessian has negative eigenvalues, we can still take $H^{-1}b$ by solving CG with $H^2$, but what does this correspond to? Is the influence equation well defined (or the Taylor approximation justified) if $H$ is not positive definite? + +",6,,ICLR2020 +mSmn9VQQ8Bu,2,P0p33rgyoE,P0p33rgyoE,A weakly motivated extension of VIC with implicit options,"## Summary + +The paper points out a limitation of the implicit option version of the Variational Intrinsic Control (VIC) [Gregor et al., 2016] algorithm in the form of a bias in stochastic environments. Two algorithms are proposed that fix the limitation: the first requiring the size of state space to be known and the second which does not make such assumptions. Experiments on simple discrete state environments demonstrate that the original VIC algorithm works well only on deterministic environments whereas the proposed fix works well on the stochastic environments as well. + +## Strengths +- The paper provides a sound theoretical analysis of the limitation of the VIC implicit-option algorithm, the proposed fix and a practical algorithm (*Algorithm 2*). +- A clear distinction is presented with respect to prior work. The differences between the proposed algorithm and VIC shows that the intrinsic reward now has an added term which depends on an approximate model of the transition probability distribution. + +## Weaknesses +- The paper focuses on a very specific and narrow topic without providing much stand-alone motivation for the same. It implicitly borrows motivation from prior work (VIC, etc) without providing its own. 
The rigorous mathematical derivations are simply re-deriving the VIC mutual information bounds with a new added term and with some extra details on how to do it with a gaussian mixture model. Overall, the paper seems like a minor extension of prior work. +- Considering the experiments on partially observed environments presented in the VIC paper, this paper chooses a much simpler set of discrete environments for empirical analysis instead of stepping up to more complicated environments which would have strengthened both the motivation for fixing the bias of VIC and the empirical evidence of the GMM algorithm (*Algorithm 2*). +- The paper's contributions boil down to Eqn 6 (the bias in VIC) and the two proposed algorithms. However, the difficulty in reading the mathematical notations and expressions severely handicaps the reader's ability to carefully understand these contributions. Some suggestions on improving the notation are provided below. + +## Feedback to authors +- The paper introduces extremely dense notation. The frequent overloading of symbols or use of similar looking symbols (e.g. $p, p^p, \rho$) makes it quite difficult for the reader to parse each expression. I would recommend usage of longer variable names, e.g.: replace p -> gen, for generative model and replace q -> inf for inference models. Phrases are easier to parse than single character symbols. Also, colorizing certain important symbols can help -- especially for important distinctions such as the true probability distributions vs their estimates. + +------- +## Post-rebuttal update + +Having read through all reviews and the author's response, I am updating my assessment in light of the responses and new experiments. I agree with the authors that the derivation has theoretical value and is not a simple re-derivation of VIC. The new experiments and visualizations have been helpful (I am happy with the author's responses to R3), but the overall clarity of the paper is still lacking due to the dense mathematical notation. In light of this, I am increasing my score from 4 -> 6, slightly leaning towards acceptance.",6,4.0,ICLR2021 +ih3OXjn8Ipc,1,WdOCkf4aCM,WdOCkf4aCM,CDTs are not demonstrated to be interpretable,"Edit: + +I have read the authors' response and the other reviews. I still believe that this paper is not ready for acceptance. + +Summary: + +The authors propose to use Cascading Decision Trees (CDTs) to express the policy for an RL agent. The authors describe CDTs and evaluate their use in imitating a trained expert as well as representing a policy during training. + + +Reasons for score: + +The interpretability of CDTs is not convincingly demonstrated by the examples provided. CDTs do perform better than the tested SDTs, but the experiments are insufficient to conclude that CDTs perform well compared to other models which are less interpretable than SDTs. + + +Pros: + +-CDTs are explained well. + +-CDTs are shown to solve basic RL environments which are commonly used for evaluation in XRL work. + + +Cons: + +-The authors argue that linear partitions are superior to axis-aligned partitions based on the number of parameters. However, the authors do not consider the number nor complexity of operations required for a ""forward pass"" through the model. A fairer evaluation of explainability would include the average or worst-case number of operations required to select an action (e.g., 4 multiplications, 4 additions, and 2 comparisons in the worst case for Figure 1a). 
By using a single D tree for all F trees, the number of parameters is reduced, but the length of any given path is still long (with many operations). + +-The authors claim that CDTs are more ""explainable"" than SDTs, but this is not sufficiently demonstrated. One advantage of DTs over MLPs is that all splitting operations occur on the original features. If the original features are interpretable, then the partitioning process operates on meaningful features. When learned features are used (as in CDTs), this property is lost. + +-Between the leaf of a feature learning tree and an internal node in the decision tree, the input is multiplied by a set of weights and put through a non-linearity. When this is performed several times in sequence, this begins to resemble a MLP. The authors do acknowledge this potential problem (as motivation for not evaluating hierarchical CDTs), but this concern also applies (to a lesser degree) to the ""single F, single D"" case. + +-The authors use ""heuristic agents"" as experts in their experiments. This does not follow the procedure established by prior work, and this is at odds with the motivation of CDTs as useful for explaining RL agents. + +-The experimental evaluation is lacking in a number of ways: + +--The authors should report policy performance (in imitation learning experiments). Accuracy is a useful metric, but not sufficient on its own. A model can have a high accuracy without learning a well-performing policy. Also, results are reported for a different set of configurations in Table 2 as compared to Table 1 (without any justification). This suggests that discretization of CDTs must be performed in a more nuanced way than otherwise stated in this work. + +--The authors do not compare to VIPER in the imitation learning experiments they perform though VIPER was shown to yield higher-performing policies than standard classification-based learning. + +--In the RL experiments, the authors compare to a MLP. They find that CDT has a similar (but sometimes worse) final performance despite having fewer parameters. However, the authors should also compare to a smaller MLP, ideally with the same number of parameters as CDT. Without this comparison, conclusions cannot be drawn about the ""parameter vs performance"" benefits of CDT. + +-State normalization (based on a ""well-trained policy"") is not standard and generally not feasible when applying RL. It is unclear why this was done. + +-The second paragraph on page 8 attempts to explain a learned MountainCar-v0 model. The lack of certainty and vagueness of the insights (""kind of like an estimated future position or previous position, and makes action decisions based on that"") suggests that additional work is required to make CDTs interpretable. Why are ""future position"" and ""previous position"" both options for this explanation? The environment has two features (position and velocity), so discovering that the agent selects actions based on intermediate features derived from position is a given. Ideally, the authors select a way to measure interpretability and perform a quantitative evaluation. + + +Questions During Rebuttal Period: + +Please address and clarify the ""Cons"" above. + + +Minor Comments: + +-The method's motivation is best placed in the main paper, not in the Appendix (page 3, footnote 2). + +-The figures would be more helpful if they appeared on the same page as the corresponding text. 
+ +-The caption for Table 1 is not clear with respect to which CDT accuracies are for which discretization schemes. + +-The authors note that discretization decreases performance and ""claim that this is a general drawback for tree-based methods in XRL..."" However, this is only applicable to soft DTs. This should be made clear (e.g., VIPER does not have this drawback). + +-The y-limits of Figures 5a and 5b should match so that performances can be more readily compared across plots (given that CDT and SDT are not plotted within one figure). The same applies for Figures 5c and 5d. + +-The paper would benefit from another editing pass for grammar. +Some Typos: + +-Abstract: ""trees (DDTs) have [been] demonstrated to achieve"" + +-Introduction: ""are generally lack[ing] interpretability""; ""In this paper, [w]e propose"" + +",4,5.0,ICLR2021 +SJgAEEpDhQ,1,HkfYOoCcYX,HkfYOoCcYX,"Hard to read, but idea is interesting","Summary: + +This paper addresses the computational aspects of Viterbi-based encoding for neural networks. + +In usual Viterbi codes, input messages are encoded via a convolution with a codeword, and then decoded using a trellis. Now consider a codebook with n convolutional codes, of rate 1/k. Then a vector of length n is represented by inputing a message of length k and receiving n encoded bits. Then the memory footprint (in terms of messages) is reduced by rate k/n. This is the format that will be used to encode the row indices in a matrix, with n columns. (The value of each nonzero is stored separately.) However, it is clear that not all messages are possible, only those in the ""range space"" of my codes. (This part is previous work Lee 2018.) + +The ""Double Viterbi"" (new contribution) refers to the storage of the nonzero values themselves. A weakness of CSR and CSC (carried over to the previous work) is that since each row may have a different number of nonzeros, then finding the value of any particular nonzero requires going through the list to find the right corresponding nonzero, a sequential task. Instead, m new Viterbi decompressers are included, where each row becomes (s_1*codeword_1 + s_2*codeword2 + ...) cdot mask, and the new scalar are the results of the linear combinations of the codewords. + +Pros: + - I think the work addressed here is important, and though the details are hard to parse and the new contributions seemingly small, it is important enough for practical performance. + - The idea is theoretically sound and interesting. + +Cons: + - My biggest issue is that there is no clear evaluation of the runtime benefit of the second Viterbi decompressor. Compressability is evaluated, but that was already present in the previous work. Therefore the novel contribution of this paper over Lee 2018 is not clearly outlined. + - It is extremely hard to follow what exactly is going on; I believe a few illustrative examples would help make the paper much clearer; in fact the idea is not that abstract. + - Minor grammatical mistakes (missing ""a"" or ""the"" in front of some terms, suggest proofread.) + +",7,2.0,ICLR2019 +B1egi4UoYH,1,Hke0oa4KwS,Hke0oa4KwS,Official Blind Review #3,"Summary: This paper proposes an uncertainty measure called an implied loss. The authors suggest that it is a simple way to quantify the uncertainty of the model. It is suggested that ""Low implied loss (uncertainty) means a high probability of correct classification on the test set."". 
They suggest that the analysis of an implied loss justifies the maximum confidence value of softmax-cross entropy. They also extend to evaluate Top-k uncertainty (the uncertainty whether our prediction is in the Top-5 maximum values of our confidence score or not). + +======================================================== +Clarity: +I found that this paper content does not seem to be difficult mathematically, however, it is difficult to follow the paper and here I list several parts that can potentially be improved: + +1. INTRO: the sentence ""Our implied loss interpretation justifies both methods, since we demonstrate that ""both these quantities"" are uncertainty measures."". What is both quantities here, the maximum softmax probability and something? + +2. INTRO: first contribution, accurate estimates of the probability that the classification of the model on a test set is correct. What do you mean by this sentence? I couldn't see the accuracy of the estimation in Table 2. + +3. second contribution in INTRO: what is the meaning of ""a consistent manner"" here. If it means ""successfully"", is there a case that your method fail? It would be nice to be precise when describing the contribution. Figure 2 is more like a small example but can be considered difficult to convince to the readers about the capability of the proposed method. + +4. Figure 1 can be much improved. I found the caption is hard to understand. Loss (y-axis) may be written as Kullback-Leibler loss to be more precise in this context (if I understood correctly). +minor comment: Figure 1(a) [colon missing], since Figure 1(b) has "":"" right after. The authors made an effort to explain the color in the caption of Figure 1(b) but the explanation of (yellow) is missing. The explanation of how to train a network to get Figure 1(a) seems to be missing. It seems -log(f_(1)) should be -log(f^{sort}_(1)). Finally, I'm not sure what is P(Correct|x). Is this histogram suggested that when the U_1 is large and the P(Correct|x) is small, we get a correct prediction, or maybe this means the ratio of correct prediction under the value of loss (histogram)? I think it would help the reader to make it more concise. In the description in page 4 -\log(p_{max}) seems never define before, was it a typo? + +5.1 First page, last sentence: what do you mean by the current setting. Before this point there is no explanation about the problem setting, only we are interested in quantifying an uncertainty measure of the networks. +5.2 First page, last sentence: I'm not familiar with Bayes factors, is the last sentence your contribution or it's the finding of the existing work, if it's the latter one, it would be nice to cite them. However, I found this sentence a bit vague: Bayes factor more informative (not sure what is the definition of informative here) better than Brier score under the situation where prob of correct classification is high (is this means high accuracy on the test set?). + +6. PRIOR-WORK: Although the authors suggested many existing works, it would be highly useful if the authors discuss the relationship between existing works and their proposed work, e.g., where to put your work in the literature. And since they proposed an uncertainty function, it would be nice to see a few definitions of uncertainty existing works described (doesn't need to be mathematical formulation, I think just an intuitive explanation is sufficient). + +7. I am confused with the definition of an implied loss. 
It is first defined in page 3 with a fixed k (as 1) as a loss where the prediction is correct but in Eq. (4), it looks like a set with one element where y is the maximum prediction score, not a correct label. Then there is U_k(x), if k!=1, is this an implied loss? Although in (5) it is a real number, not a set anymore, I read this paper and took it as a ""yes U_k(x) is also an implied loss"". Also it would be better to define Kullback-Leibler here to be concise and kind to the readers. Then in def 4.1, it's mentioned that the uncertainty measure U_k(x) is an implied loss if the event has high expected loss, does that mean if the event has low expected loss, it is not an implied loss? My opinion is that the authors may use a definition environment and define precisely what is an implied loss. For example, given ""\ell: X x Y \to R, a correct label y \in Y, and an integer k <= K (number of classes), an implied loss is defined as"" to avoid confusion. + +8. Def 4.1: U(x) without subscript is undefined (perhaps U_k(x)?). What is an element of the set S^\eps_k? If it is a set then what is the meaning of the event S has a high expected loss? Does the definition of implied loss after Eq. (6) and the definition of implied loss in Eq. (5) identical when it is Kullback-Leibler loss? + +9. Theorem 4.2: S^\eps seems to be undefined without k. Moreover, how to interpret the bound in (7), it would be nicer to explain the bound after stating the theorem. It is only an inequality that says the left-hand side is smaller or equal to the right-hand side. And does (7) hold for any y? And how tight is the bound? + +10. Remark 4.3: what is e_k? what is k here if k in U is set to one? And the name of the remark, how to interpret (8) as ""Neural networks are always overconfident?"" Is this about neural networks or this apply to any function? + +11. Sec 5: Figure 6 is not in the main body but the appendix... If this a mistake (and the paper is supposed to be 9 pages without ref.) or it is supposed to be in the appendix? If It's in the appendix, it would be better to mention that this figure is in the appendix. + +12. Sec5: since Bayes factor is highly used in this paper to motivate the use of the measure, I don't think it is a good idea to put the explanation of Bayes factor in the appendix, i.e., it is impossible to understand this paper without knowing Bayes factor. + +13: Fig 6: I think it is better and kinder to use U_1, U_5 in the legend of the plot instead of f. What is the model entropy? + +14. Table 1: why there is ""-"" in CIFAR-10, it is better to clarify it in the paper (or maybe I missed it). I am not sure how to interpret the result, if the higher the better, does that mean the Loss is great?, I'm confused with the experimental results. + +15. Before the beginning of 6.2: Tables 7 and 6 are in the appendix and we should state clearly it is in the appendix. + +======================================================== +Comments: +This paper lacks of clarity and difficult to understand. Although it is claimed to be better than existing measure, I am not convinced about that despite many experiments were conducted unfortunately. 
+For the criticism of using the maximum confidence of the softmax score from the softmax-cross entropy loss may not quantify the uncertainty, it is known theoretically that the score of softmax-cross entropy corresponds to p(y|x) if our prediction function achieve the global minimizer this loss function and our function class to be considered is all measurable functions (Zhang, JMLR2004: Statistical Analysis of Some Multi-Category Large Margin Classification Method). For other losses, see Williamson+ JMLR2016: Composite Multiclass Losses. However, it may not be accurate empirically when we use a deep network as it is reported in Guo+, 2017. Thus, one direction is to do post-processing or finding a way to modify a network. For U_1(k), I feel that it should suffer from the same problem as using maximum confidence score of softmax. Extending to top-k may have a good point when discussing about uncertainty and I believe it is good to explore that direction. For experiments, I would like to know how many trials did the authors run the experiment? and it would be helpful to see the standard deviation of the reported value. I believe this paper can still be improved a lot. For these reasons, I vote to reject this paper this time. + +======================================================== +Minor comments: +there exists the writing convention of ""Top 5"", ""top 5"", ""top5"". It's better to pick one way to describe it if there is no reason to make it different. +======================================================== +After the rebuttal: +I have read the rebuttal. + +I appreciate the authors' effort to modify the paper. +Also, please let me state that I totally agree that the problem the authors try to solve is indeed important and relevant for using machine learning in the real-world. + +I feel that the structure of the modified version is better than the first version. It is a nice to include the explanation of the Bayes factor in the main body. Appendix C is also very useful for everyone to understand the Bayes factor. + +As the author requested, I have read through the whole revised paper (including the appendix) carefully . I am aware of the positive sides of this paper. For example, it is interesting that we can find wrongly labeled data. Utilizing the uncertainty information for several applications. However, I found that the paper still requires a lot of modifications. The authors have modified many of my concerns, but still several of them were not addressed. I also emphasized the comments for parts that are unrelated to clarity (please see below). For these reasons, I decide not to change my score. + +Below are my comments after reading the rebuttal (which some of them may be overlapped with the issues that were not addressed in the revised version). +============================ +General comments: + +1. Although the work is about tackling the empirical confidence estimation problem, Theorem 3.4 and Eq. (5) are providing insights about the population, not finite data points for empirical estimation. If we focus on the population case, it is known that the minimizer of the softmax cross-entropy loss must be a conditional probability $p(y|x)$. Thus, it is natural that as we minimize such a loss, the probability of correct classification must be high, since we can pick the best choice for classification, i.e., argmax of $p(y|x)$. 
But the most challenging part of the research in this direction is that, although the theory suggests we can get nice confidence information (in population), when we use the deep neural networks, the quality of confidence estimation can be very bad (overconfidence in empirical estimation) compared with the high accuracy we can achieve. As a result, a finer theory that can quantify the quality of confidence estimate for the finite sample case is highly needed, but I think Theorem 3.4 fails to capture this. I am aware that this theorem only concerns the KL loss. I believe that even only for KL-loss, the result can be significant if we discuss a finer theory for empirical estimation. + +2. Regarding the Remark 3.5 (Neural networks are always overconfident), in my understanding, the result has no relationship with neural networks at all since it is true regardless of the hypothesis class of interest (e.g., linear models, kernel models, deep networks). This is because it is simply the definition of a loss function (and the result in the paper focus on KL loss, but I believe we can derive for many other losses). We know that the quality of confidence estimation of simple models can be better although the accuracy is worse. Thus, if the objective is to visualize the problem of neural networks, Remark 3.5 does not seem to help and adding neural networks in the remark title can be misleading. Thus, the implication of Remark 3.5 is insufficient to state that Neural networks are always overconfident. + +3. What is the advantage of an implied loss? It seems the paper has two separate stories, the first one is implied loss (Sec. 3) and then move on to Bayes factor (Sec.4). Then, there is an adversarial detection problem using the gradient norm in the last experiment. From Sec. 4, the discussion about implied loss is very limited and if I understand correctly, the $-\log p_\max$ and $-log\sum p_{1:5}$ (the latter seems to require a superscript $sort$) are the implied loss, which does not seem to have the clear advantages over other methods. My impression is that the contributions of this paper are unclear and I do not know what is the main point of this paper. While the abstract dedicates most space to highlight the implied loss (nothing about Bayes factor), the conclusion dedicates most space to highlight the Bayes factor. Improving the connection between two parts may help to signify the contributions. Despite all that, I do like the idea of introducing the Bayes factor in this paper. +============================ +Clarity: + +1. Most importantly, I think the clear and solid definition of implied loss is missing. +According to Definition 3.1, ""The uncertainty measure $U_k(x)$ is an implied loss if the event $S^\epsilon_k$ has high expected loss"". I believe the implied loss is one of the most important contributions of this paper. I am not sure what does it means by ""if the event $S^\epsilon_k$ has high expected loss"" How can we define ""high expected loss""?. I tried to read the paper many times to understand what exactly is the implied loss, and what is the scope of implied loss (what is and what is not an implied loss). + +2. Following my first issue on clarity, what is the definition of uncertainty measure in this paper? According to the paper, it is defined roughly as $U_k$, whose statistics allow us to better estimate $p_k$. And in the abstract, it is emphasized that if uncertainty measure is an implied loss, then low uncertainty means a high probability of correct classification. 
Is there any uncertainty measure that is not implied loss? Also in the introduction, it seems that both softmax and the model entropy are uncertainty measures, are they implied losses? Is MC-dropout an uncertainty measure or even an implied loss in this context? + +3. Eq. (11) left side: I think it is useful to the expectation with respect to which variable, I believe it is $B$. As a result, is it a typo to have $Y_i$ instead of $B_i$ on the left-hand side of (11)? + +4. Example 3.3: If we set $k=1$ according to the definition of the implied loss at the last sentence of this example. Will it contradict $U_1(x)$ defined just right before that sentence? Because $y_w$ will become the $2$-nd ranked label, which I feel it is different from $U_1(x)$ defined in Example 3.3 + +5. I think it is better to clearly state that the figures/tables are in the appendix when referring to them from the main body. I saw the authors refer to Table 5 in Sec. 4.1, Figure 5 in Sec. 4.2 and Tables 6 and 7 in Sec. 4.3. All of them are in the appendix. And it seems some of them are highly needed. For example, the authors said that ""by fine graining the bins we can capture relatively small ... on the order of 20"" (before Sec. 4.3), then suggested the reader to see Figure 5. I feel Figure 5 must be included in the main body because it is hard to understand that without seeing the figure. + +6. I think all figures that have $f_{(1)}$ (Figures 1a, 1b, and 5) must have a superscript $sort$ for all of them. Otherwise, it is wrong. + +7. Kullback-Leibler loss is used extensively here without definition. It is important to clarify the clear definition of it. $U_1(x)$ also used extensively for the KL loss case and non-KL loss cases. This can make the paper hard to read. I suggest using $U^{\mathrm{KL}}_1(x)$ when referring to the uncertainty measure with respect to the KL-loss. + +8. I saw $-\log(p_\max), -\log(f_{(1)}), -\log(f_1^{sort}), U_1$. Are these all refer to the same thing? And I found that sometimes the argument $(x)$ is ignored in the paper sometimes it doesn't in a quite random way. If they are the same, it would be nice to unify them. + +9. The caption of figure 2(b) is uninformative. The authors may consider improving it. + +10. It would be very helpful to add the implication or interpretation of the theoretical results to help the readers understand the intuition of the proven results. For example, how does Eq. (3) implies that when small uncertainty implies high chance of correct classification. +============================ +Minor typos: +INTRO: ""uncertainly measures"" -> ""uncertainty measures"" +5.2: ""on-distribution"" -> ""in-distribution"" +Conclusion: logpmax -> write clearly with $$ should be better + +Minor comments in the appendix: +1. How does Appendix A related to implied loss or Bayes factor in this paper? Did I miss something? +2. Figure 5: + 2.1 y-axis: is it Bayes ratio or Bayes factor? It seems the Bayes ratio and Bayes factor to be a different thing. Even if it is the same in some literature, I think it's better to use the Bayes factor here for the consistency of this paper. + 2.2 Caption: ""entropy"" -> ""model entropy"". + 2.3 Caption: Is it mistakes or do you want to insist on using $U_1$ and $U_5$ in the caption but using different notions in the figure? (in figure, they are $-\log(f_{(1)})$ and $-\log(\sum f_{(1:5)})$). +3. Appendix C: Eq. 16 and Eq. 19: I believe there is a typo. I think it should be $BF(X|Y_1), BF(X|Y_2), BF(X|Y_3)$, respectively. +4. Appendix C: Eq. 
18-right: does it need to sum to 1 not 0.3+0.5+0.3 = 1.1? +",1,,ICLR2020 +g64i4_ZEJUs,3,WUNF4WVPvMy,WUNF4WVPvMy,This paper obtains the state-of-the-art rates with solid theoretical guarantees on accelerated Riemannian optimization.,"This paper proposes a global accelerated method on Riemannian manifolds with the same rates as accelerated methods in the Euclidean space up to log factors. Reductions have also been studied on Riemannian manifolds. + +Quality: I think this paper has high quality in theory. + +Clarity: I have no experience on Riemannian manifolds before. This paper reads difficult for me. I think this paper is too technical and some descriptions are not clear. + +Originality: There are a number of works that study the problem of first-order acceleration on Riemannian manifolds. This paper studies the special case of constant sectional curvature, i.e., the hyperbolic and spherical spaces. I am not sure whether there are literatures studying the optimization algorithms (either accelerated or non-accelerated) on the constant sectional curvature before. + +Significance: This paper gives the state-of-the-art rates in the special case of constant sectional curvature. I think it is significant. + +I have some comments. I have no experience on the optimizaton on Riemannian manifolds before. My comments may be too strict for the analysis on Riemannian manifolds. + +1. Previous literatures have studied the optimization on Riemannian manifolds of bounded sectional curvature, while this paper focuses on the special hyperbolic and spherical spaces, that have constant sectional curvature. Is there any literature focusing on the constant sectional curvature before? either accelerated or non-accelerated. What is the critical difference when transforming the analysis on the bounded sectional curvature to constant sectional curvature? Is it a straightforward extension, or very challenging? + +2. I am not sure whether each step of the proposed method needs more computations than the standard accelerated gradient method. For example, function f is a composition of F and h^{-1}, can \nabla f(x) be efficiently computed? The BinaryLineSearch needs to compute \Gamma_i^{-1} and x_{i+1}^{\lambda}, do they need more computations? +",7,1.0,ICLR2021 +rJlRuaviKB,1,SylzhkBtDB,SylzhkBtDB,Official Blind Review #3,"This paper studies how to improve the multi-task learning from both theoretical and experimental viewpoints. More specifically, they study an architecture where there is a shared model for all of the tasks and a separate module specific to each task. They show that data similarity of the tasks, measured by task covariance is an important element for the tasks to be constructive or destructive. They theoretically find a sufficient condition that guarantee one task can transfer positively to the other; i.e. a lower bound of the number of data points that one task has to have. Consequently, they propose an algorithm which is basically applying a covariance alignment method to the input. +The paper is well-written, and easy to follow. +Pros: +A new theoretical analysis for multi-task learning, which can give insight of how to improve it through data selection. +They empirically show that their algorithm improves the multi-task learning on average by 2.35%. + +Cons: +There is not much of novelty in the algorithm and architecture. Their method is very similar to domain adaptation but for multi-learning setting. +In the Theorem 2, they have assumed parameter c <= 1/3. 
They have not provided any insight of how much restrictive this assumption is. + +",6,,ICLR2020 +ClWDvhypd6k,3,8SP2-AiWttb,8SP2-AiWttb,Review report ,"This paper identified the issue of Imbalanced Gradient, verified through some recent defense methods. Motivated by such an issue, a marginal decomposition (MD) attack is proposed to offer a stronger robustness measure. In general, the paper is well written, and the studied problem is interesting. The MD perspective explains why label smoothing may provide insufficient robustness. + +My comments are listed below. + +1) In Sec. 3.2, why are the first K/2 iterations used to maximize the individual margin term and then update the entire loss? What will happen if the scheduling is the opposite: updating the entire loss at the first K/2 iterations, then individual terms?Some ablation studies or explanation should be provided. + +2) Does the proposed stronger attack offer a stronger min-max defense? Suppose that the ordinary PGD attack is replaced by an MD attack during min-max training, will it offer better overall robustness? The general question to ask is: In addition to root cause analysis on the ineffectiveness of some existing defense methods, what are the additional benefits of the newly proposed MD attack? + +3) Does it seem that the MD attack has to run over more iterations than the PGD attack, leading to extensive computation cost? + +4) What is the possible effect of the MD attack on the generated perturbation pattern? In the black-box setting, will the MD attack be more query-efficient than a commonly-used PGD black-box attack? + + +Post-rebuttal: + +I am mostly satisfied with the authors' response. After reading other reviewers' comments, I shared a similar concern on the marginal contribution. However, the newly added black-box result is a good addition to the paper. Thus, I keep my original rating toward the positive side. + +",6,4.0,ICLR2021 +4-bSS_H_j7,2,#NAME?,#NAME?,ICLR 2021 Conference Paper1592 AnonReviewer1,"# Summary # + +This paper proposes a new normalizing flow model for categorial data, where the typical dequantization is not applicable. + +We assume a categorical sample x has S variables, and each attribute x_i is a categorical variable. +We want to model the probability mass of the S variable categorical data, and devise an invertible map that can convert x into the continuous latent variable z. +For simplicity, we assume that each attribute x_i has its own latent continuous probability distribution p(z_i). +We expect an encoder, q(z_i|x_i), map categorical x into a continuous space where all categories are well partitioned. For that purpose, the paper proposes to formulate q(z_i|x_i) by a mixture of logistic distributions. + +A graph generative flow model is proposed as an application. +Existing flow models for graphs do not handle the categorical data in proper manners and are permutation-dependent. +The proposed categorical flow can develop a permutaion-invariant graph generative flow model. + +The proposed model performs better than the existing graph flow models in the molecular graph generation experiments. Typical invalid generation examples include isolated nodes. If we only focus on the largest sub-graphs, the proposed model can almost perfect graph generations. +The permutation-invariant nature of the proposed model results in a stable performance on the graph coloring problem, while the baseline RNN models are deeply affected by the order choices. 
+ +# Comment + +I found the mixture of logistic regression is a good idea. Figure 2 and 3 in appendix indicate that this formulation can pratitioin the latent space into categories. + +I have a few questions to confirm my understanding of the paper. + +Q1. The proposed categorical normalizing flow with K-logistic mixture provides an approximated invertible map for the true distribution of the categorical samples x. Is this correct? Namely, there is a non-zero KL divergence between the evidence P(X) and the marginalized ``likelihood'' q(X)?? + +Q2. The paper says ""we do not have a unknown KL divergence between approximate and true posterior constituting an ELBO"". Does this mean we can compute the KL(q||p) analytically for the categorical normalizing flow? + +Concerning the Q2. it is better if the final objective function to maximize/minimize, and the actual procedure for model training is clearly written in the main manuscript or the appendix. + +Concerning the molecular graph generation experiments, I'm interested in how the latent representations of the graphs are distributed in the space of Z. It is preferable if the paper can provide a visualization of the latent space for the actual molecular graph generation experiments, not the simulated ones of Figure 2 and 3. + +Presentaions of the experimental results in the main manuscript totally rely on the tables. However, current tables are not much effective to tell the significance of the proposed method. +Please consider visual presentations: we may use plots or bar graphs to compare several methods for example if the detailed numbers are not important. +The actual numbers can be moved to the appendix. + +# Evaluation points + +(+) A new approach to apply the normalizing flow. + +(+) Truly permutation-invariant NF for graph generation is great. + +(-) insufficient explanations for the optimization procedure + +(-) more visual results may improve impressions of the manuscript (especially for non-expert readers) +",6,3.0,ICLR2021 +BylUth6tqS,3,rkglZyHtvH,rkglZyHtvH,Official Blind Review #4,"The paper proposes to improve standard variational inference by increasing the flexibility of the variational posterior by introducing a finite set of auxiliary variables. Motivated by the limited expressivity of mean field variational inference the author suggests to iteratively refine a ‘proposal’ posterior by conditioning on a sequence of auxiliary variables with decreasing variance. The key requirement to set the variance of the auxiliary variables such that the integrating over them leaves the original model unchanged. As noted by the authors this is a variant of auxiliary variables introduced by Barber & Agakov. The motivation and theoretical sections seems sound and the experimental results are encouraging, however maybe not completely supporting the claim of new ‘state of the art’ on uncertainty prediction. + +Overall i find the motivation and theoretical contribution interesting. However I do not find the experimental section completely comprehensive why I currently think the paper is borderline acceptance. + +Comments +1) The computational demand using the method seems quite large by adding O(NumSamples * NumAuxiliary) additional computational cost on top of the standard VI. Here each factor M is quite large e.g. 200 epochs for CIFAR10 (if i understand the algorithm correctly?) 
+2) For the UCI experiments the comparison is only made against DeepEnsembels or other VI methods, however to the best of my knowledge MCMC methods are superior in this setting given the small dataset size? +3) The results on CIFAR10 do seem to demonstrate that the proposed method is superior to DeepEnsembles and standard VI in one particular setting where VI is only performed over a small subset of layers in a ResNet (why doesn’t it work for when doing VI on all the parameters?). However generally looking at the best obtained results of ~86% acc this is quite far from current best probabilistic models (see e.g. Heek2019 that gets 94% acc). Some of this can probably be attributed to differences in data-augmentation and model architecture however in general it makes it very hard to compare with other methods when the baselines are not competitive. + +Minor Comments: + In relation to comment 3) above I think you should reword the sentence “It sets a new state-of-the-art in uncertainty estimation at ResNet scale on CIFAR10” in the conclusion. + + +“In order to get independent samples from the variational posterior,we have to repeat the iterative refinement for each ensemble member”: Does this imply that if we want M samples we first have to optimize using the standard VI and then to M optimizations to get q_k(w)? + + +How sensitive is the method to sequence of variances for a? + +[Heek2019]: Bayesian Inference for Large Scale Image Classification +",6,,ICLR2020 +o422tncF9cr,2,WEHSlH5mOk,WEHSlH5mOk,"The proposed GTS appears to advance the current state-of-the-art in graph-based multiple (multivariate) time series forecasting. This is a problem of considerable importance and, as far as I am aware, simultaneously learning the graph structure and forecasting model is understudied topic. As for the several “Improvement points” I raised in this review, I believe that the authors will have the chance to address them in the rebuttal period.","Paper summary: + +This paper proposes an approach for time series forecasting that learns the graph structure among multiple (multivariate) time series simultaneously with the parameters of a Graph Neural Network (GNN). The problem is formulated as learning a probabilistic graphical model by optimizing the expectation over the graph distribution, which is parameterized by a neural network and encapsulated in a single differentiable objective. Empirical evidence suggests that the proposed GTS obtains superior forecasting performance to both deep and non-deep learning based, as well as graph and non-graph based, competitor forecasting models. In addition, GTS appears to be more computationally efficient compared to LDS, a recently proposed meta-learning graph-based approach. + +########################################################################## + +Strong points: +1. A time series forecasting model is proposed to automatically learn a graph structure among multiple time series and forecast them simultaneously using a GNN. + +2. The graph structure and the parameters of the GNN are learned simultaneously in a joint end-to-end framework. + +3. The graph structure is parameterized by neural networks rather than being treated as a (hyper)parameter, thus significantly reducing the training cost compared with the recently proposed bilevel optimization approach LDS. + +4. A structural prior-based regularization is incorporated in GTS. 
In case a “ground-truth” graph is provided upfront, this may serve as a healthy variation of such a graph for the purpose of more accurate forecast. + +5. Extensive experiments are conducted in which the proposed GTS is compared to a number of baselines, including a recently proposed graph structure learning approach, and deep or non-deep learning based (as well as graph or non-graph based) forecasting models. + +6. The experimental results demonstrate that GTS outperforms its competitor approaches in terms of forecasting accuracy and is more efficient that the recently proposed LDS. + +7. Generally, the paper is well written, while the notation is clear and easy to follow. + +########################################################################## + +Improvement points: +1. In section 3.4. (Comparison with NRI), the authors state that the “structural prior” $A^{a}$ offers a stronger preference on the existence/absence of each edge than a uniform distribution over all edges. This seems a bit unclear, thus I would encourage the authors to elaborate a bit more on this difference between GTS and NRI w.r.t. the structural prior. + +2. In the case of the PMU dataset, despite the fact that the grid topology is not provided, the authors still consider a certain structural prior by constructing a kNN graph among the PMUs. I am wondering whether the correlation between the series (mentioned briefly in Appendix A) is used for the graph construction or another distance/similarity metric is considered? + +3. Two variables are recorded by the 42 PMUs, however each node in the constructed graph (shown in Fig. 5) corresponds to one PMU. In case a single node corresponds to a single PMU, then I wonder how the similarity between two PMUs’ recordings is calculated across the two variables (voltage magnitude and current magnitude)? + +4. The authors construct the PMU dataset by extracting only one month of data. However, a single month of PMU data would not allow for capturing certain long-term seasonalities (for instance, the PMU recordings are typically impacted by outages that occur more frequently in certain seasons or periods in the year). Is this perhaps due to data unavailability? If that is not the case, I would ask the authors to clarify the reasoning behind the decision to extract the data for February 2017? + +5. In Tables 2 & 3 (Appendix C), some of the MAPE values obtained by GST are bolded even though the same percentages are reported for GTSv. In such cases, I would suggest the authors to either bold the MAPEs obtained by both GTS and GTSv, or present the MAPE values using more decimal places. + +6. There are several minor textual errors throughout the paper that can be easily addressed. Some of them are summarized as follows: +- The term “LDS” is initially used at the beginning of page 2, but is not defined earlier in the text. +- In the third paragraph on page 2, “computation is expensive” should be replaced by “its computation is expensive”. +- I am wondering whether the training loss $L$ should be used instead of the validation loss $F$ in Eq. (2)? If so, correct accordingly, otherwise disregard this comment. +- In the next-to-last paragraph on page 2, consider replacing “it is better scaled” with “it scales better”. +- In the last paragraph of the Related Work section, “of node classification tasks” should be replaced by “on node classification tasks”. +- At the beginning of section 3, the term “NRI” is used, but is not defined earlier in the text. 
+- On page 5, consider replacing the abbreviation “ELBO” with “evidence lower bound (ELBO)”. +- In the first paragraph on page 7, “treating it a (hyper)parameter as in LDS” could be replaced with “treating it *as* a (hyper)parameter *which is the case* in LDS”. +- In the second paragraph on page 8, replace “regularization $\lambda$” with “regularization strength $\lambda$”. In the same sentence, consider adding “the” before both “forecasting error” and “cross entropy”. + +########################################################################## + +Questions during rebuttal period: +Please address the aforementioned remarks/questions. +",7,4.0,ICLR2021 +pCrX_TCzYsL,2,vY0bnzBBvtr,vY0bnzBBvtr,"Review for ""Provably More Efficient Q-Learning in the One-Sided-Feedback/Full-Feedback Settings""","This paper proposes Q-learning based algorithms called Elimination-Based Half-Q-Learning (HQL) and Full-Q-Learning (FQL). In the one-sided-feedback setting, the proposed algorithm improves the regret bounds over existing methods in terms of the dependency on the size of state-action space. Numerical experiments are provided to show the performance of the algorithm. + +Overall, I vote for rejecting. The detailed comments are as follows: + +Pros: +- By incorporating domain-specific structures into Q-learning algorithms, the author developed new algorithms tailored to one-sided-feedback/full-feedback models. The algorithm improves the regret bound in terms of the dependency on the state-action space. + +Cons: +- Although the algorithm improves the regret bounds with respect to the state-action space, the time complexity grows linearly with S and A (in Table 1). Therefore, I'm skeptical of the claim that ""the algorithms are barely hampered by even infinitely large state-action sets"". + +- The exposition in Section 4 could be improved. In the current version, there are only statements of two theorems. I think the authors should spend more space in Section 4 instead of Section 5. It would be helpful if authors can provide interpretation of the theorems and detailed comparison with existing results. And it would be nicer to provide examples that satisfy the assumptions made in Section 2. + +- I felt that the numerical experiment is not so convincing. For example, the episode length seems to be too small (H <= 5). In addition, since authors claim that the algorithms scale well to large state-action sets, it would be better to conduct the numerical experiment in that regime to show the efficiency of the algorithm. + +- The writing quality could be improved. There are several grammar mistakes and typos.",4,2.0,ICLR2021 +B1g6Ld0gaX,2,r1lohoCqY7,r1lohoCqY7,Unclear problem setting,"Quality/clarity: +- The problem setting description is neither formal nor intuitive which made it very hard for me to understand exactly the problem you are trying to solve. Starting with S and i: I guess S and i are both simply varying-length sequences in U. +- In general the intro should focus more on an intuitive (and/or formal) explanation of the problem setting, with some equations that explain the problem you want to work on. Right now it is too heavy on 'related work' (this is just my opinion). + +Originality/Significance: +I have certainly never seen a ML-based paper on this topic. The idea of 'learning' prior information about the heavy hitters seems original. + +Pros: +It seems like a creative and interesting place to use machine learning. the plots in Figure 5.2 seem promising. 
+ +Cons: +- The formalization in Paragraph 3 of the Intro is not very formal. I guess S and i are both simply varying-length sequences in U. +- In general the intro should focus more on an intuitive (and/or formal) explanation of the problem setting, with some equations that explain the problem you want to work on. Right now it is too heavy on 'related work' (this is just my opinion). + +-In describing Eqn 3 there are some weird remarks, e.g. ""N is the sum of all frequencies"". Do you mean that N is the total number of available frequencies? i.e. should it be |D|? It's not clear to me that the sum of frequencies would be bounded if D is not discrete. +- Your F and \tilde{f} are introduced as infinite series. Maybe they should be {f1, f2,..., fN}, i.e. N queries, each of which you are trying to be estimate. +- In general, you have to introduce the notation much more carefully. Your audience should not be expected to be experts in hashing for this venue!! 'C[1,...,B]' is informal abusive notation. You should clearly state using both mathematical notation AND using sentences what each symbol means. My understanding is that that h:U->b, is a function from universe U to natural number b, where b is an element from the discrete set {1,...,B}, to be used as an index for vector C. The algorithm maintains this vector C\in N^B (ie C is a B-length vector of natural numbers). In other words, h is mapping a varying-length sequence from U to an *index* of the vector C (a.k.a: a bin). Thus C[b] denotes the b-th element/bin of C, and C[h(i)] denotes the h(i)-th element. +- Still it is unclear where 'fj' comes from. You need to state in words eg ""C[b] contains the accumulation of all fj's such that h(j)=b; i.e. for each sequence j \in U, if the hash function h maps the sequence to bin b (ie $h(j)=b$), then we include the *corresponding frequency* in the sum."" +- What I don't understand is how fj is dependent on h. When you say ""at the end of the stream"", you mean that given S, we are analyzing the frequency of a series of sequences {i_1,...,i_N}? +- Sorry, it's just confusing and I didn't really understand ""Single Hash Function"" from Sec 3.2 until I started typing this out. +- The term ""sketch"" is used in Algorithm1, like 10, before 'sketch' is defined!! +-I'm not going to trudge through the proofs, because I don't think this is self-contained (and I'm clearly not an expert in the area). + +Conclusion: +Honestly, this paper is very difficult to follow. However to sum up the idea: you want to use deep learning techniques to learn some prior on the hash-estimation problem, in the form of a heavy-hitter oracle. It seems interesting and shows promising results, but the presentation has to be cleaned up for publication in a top ML venue. + + + +****** +Update after response: +The authors have provided improvements to the introduction of the problem setting, satisfying most of my complaints from before. I am raising my score accordingly, since the paper does present some novel results.",6,1.0,ICLR2019 +VrrNtIv9v4n,2,6YuRviF_FC-,6YuRviF_FC-,No machine learning contribution,"SUMMARY + +This work proposes to use an ensemble of very well-known machine learning models (mainly tree-based methods) to calibrate radio interferometric data from the KAT-7 telescope. This provides a more efficient alternative to the traiditional approach, which is based on the astrophysicists workforce. 
+ +REASONS FOR SCORE + +In my humble opinion, this work does not present a machine learning contribution of interest for the ICLR community. It applies standard and well-known approaches to a specific remote sensing problem. I would point the authors to different venues, such as the IEEE International Geoscience and Remote Sensing Symposium. + +Moreover, I would recommend the authors to compare their approach to some other methodologies used in the field. Right now, the experimental section only analyzes the results of the proposed approach, with no baselines. + +",2,5.0,ICLR2021 +SyxgP3dn2Q,3,S1E3Ko09F7,S1E3Ko09F7,A new method for computing Shapley values,"This paper proposes two methods for instance-wise feature importance scoring, which is the task of ranking the importance of each feature in a particular example (in contrast to class-wise or overall feature importance). The approach uses Shapley values, which are a principled way of measuring the contribution of a feature, and have been previously used in feature importance ranking. + +The difficulty with Shapley values is that they are extremely (exponentially) expensive to compute, and the contribution of this paper is to provide two efficient methods of computing approximate Shapley values when there is a known structure (a graph) relating the features to each other. + +The paper first introduces the L(ocal)-Shapley value, which arises by restricting the Shapley value to a neighbourhood of the feature of interest. The L-Shapley value is still expensive to compute for large neighbourhoods, but can be tractable for small neighbourhoods. + +The second approximation is the C(onnected)-Shapley value, which further restricts the L-Shapley computation to only consider connected subgraphs of local neighbourhoods. The justification for restricting to connected neighbourhoods is given through a connection to the Myerson value, which is somewhat obscure to me, since I am not familiar with the relevant literature. Nonetheless, it is clear that for the graphs of interest in this paper (chains and lattices) restricting to connected neighbourhoods is a substantial savings. + +I have understood the scores presented in Figures 2 and 3 as follows: + +
+",7,3.0,ICLR2019 +Hkl96tvtn7,1,Bkl2SjCcKQ,Bkl2SjCcKQ,"Research that has some interesting observations, but needs to take the next step to have impact on the field","The primary purpose of this paper, from what I understand, is to show that fake samples created with common generative adversarial network (GAN) implementations are easily identified using various statistical techniques. This can potentially be useful in helping to identify artificial samples in the real world. + +I think that the reviewers did an excellent job of probing into different statistical perspectives, such as looking at the continuity of the distribution of pixel intensities and the various higher moments of spectral data. I also must applaud the fact that they did not relegate themselves to image data, but branched out to speech and music data as well. + +One of the first findings is that, with MNIST and CIFAR, the pixel intensities of fake samples are noticeably different when viewed from the perspective of a Kolgomorov-Smirnov Test or Jensen-Shannon Divergence comparison. This is an interesting observation, but less so than it would be if compared to something such as a variational autoencoder (VAE), which fits a KL distribution explicitly. IWGAN and LSGAN are using different metrics in their loss functions (such as Wassertein and least squares), and thus the result is not surprising or novel. I think if the authors had somehow shown how they worked their metrics into IWGAN or LSGAN to achieve better results, this could have been interesting. + +Another observation the authors make is about the smoothness of the GAN distributions. This may not be so easily wrapped into the loss function, but it seems easily remedied as a post-processing step, or perhaps even a smoothing layer in the network itself. Nevertheless, this is an observation that I have not seen discussed in the literature so there is merit to at least noting the difference. It is confusing that on page 4, the authors state that they hypothesized that the smoothness was due to the pixel values themselves, and chose to alter the distribution of the original pixels in [0,1]. However, they state that in Figure 5, the smoothness remained ""as expected."" Did the authors misspeak here? + +I found the music and speech experiments very interesting. The authors note that the synthetic Bach chorales, for instance, introduce many transitions that are not seen in the training and testing set of real Bach chorales. This, again, is interesting to note, but not surprising as the authors are judging the synthetic chorales on criteria for which they were not explicitly optimized. I do not believe these observations to be paper-worthy by themselves. However, the authors I believe have a good start on creating papers in which they specifically address these issues, showing how they can create better synthetic samples by incorporating their observations. + +As to the writing style, there are many places where the writing is not quite clear. I would suggest getting an additional party to help proofread to avoid grammatical mistakes. I do not believe that the mistakes are so egregious as to impede understanding. However, it could distract from the importance on the authors future innovations if not corrected. + +One last note. The title of the paper is ""TequilaGAN: How to Easily Identify GAN Samples."" This makes it seem as if the authors were introducing another type of GAN, like LSGAN or DCGAN. However, they are not. 
As a matter of fact, nowhere else in the paper is the word ""TequilaGAN"" mentioned. This title seems a bit sensational and misleading. + +In the end, although I did find this paper to be an interesting read, I cannot recommend it for publication in ICLR. + +---- + +Edit - November 29, 2018: Increasing my rating from a 4 to a 5 after discussion with the authors. Though their insights are not unknown, I think the authors are right in the fact that this is not explicitly discussed, at least not in the peer-review research with which I am familiar. But I don't think this by itself merits an ICLR publication.",5,4.0,ICLR2019 +S1JBxOqlz,3,r1vccClCb,r1vccClCb,Review: neighbor-encoder -> neighbor encoder,"This paper presents a variant of auto-encoder that relaxes the decoder targets to be neighbors of a data point. Different from original auto-encoder, where data point x and the decoder output \hat{x} are forced to be close, the neighbor-encoder encourage the decoder output to be similar to the neighbors of the input data point. By considering the neighbor information, the decoder targets would have smaller intra-class distances, thus larger inter-class distances, which helps to learn better separated latent representation of data in terms of data clusters. The authors conduct experiments on several real but relative small-scale data sets, and demonstrate the improvements of learned latent representations by using neighbors as targets. + +The method of neighbor prediction is a simple and small modification of the original auto-encoder, but seems to provide a way to augment the targets such that intra-class distance of decoder targets can be tightened. Improvements in the conducted experiments seem significant compared to the most basic auto-encoder. + +Major issues: + +There are some unaddressed theoretical questions. The optimal solution to predict the set of neighbor points in mean-squared metric is to predict the average of those points, which is not well justified as the averaged image can easily fall off the data manifold. This may lead to a more blurry reconstruction when k increases, despite the intra-class targets are tight. It can also in turn harm the latent representation when euclidean neighbors are not actually similar (e.g. images in cifar10/imagenet that are not as simple as 10 digits). This seems to be a defect of the neighbor-encoder method and is not discussed in the paper. + +The data sets used in the experiments are relatively small and simple, larger-scale experiments should be conducted. The fluctuations in Figure 9 and 10 suggest the significant variances in the results. Also, more complicated data/images can decrease the actual similarities of euclidean neighbors, thus affecting the results. + +The baselines are weak. Only the most basic auto-encoder is compared, no additional variants or other data augmentation techniques are compared. It is possible other variants improve the basic auto-encoder in similar ways. + +Some results are not very well explained. It seems the performance increases monotonically as the number of neighbors increases (Figure 5, 9, 10). Will this continue or when will the performance decrease? I would expect it to decrease as the far away neighbors will be dissimilar. The authors can either attach the nearest neighbors figures or their statistics, and provide explanations on when and why the performance decrease is expected. + +Some notations are confusing and need to be improved. 
For example, X and Y are actually the same set of images, the separation is a bit confusing; y_i \in y in last paragraph of page 4 is incorrect, should use something like y_i in N(y).",4,4.0,ICLR2018 +jDZKUIgJFKo,2,uys9OcmXNtU,uys9OcmXNtU,This paper lacks clarity.,"After rebuttal: +I appreciate authors' detailed responses and an updated version of the paper. The new version is a lot clearer. After reading other reviews, I agree that the algorithmic novelty is limited, but the model is well-adapted for multi-horizon forecasting problem. Overall, I increase my score to 6. marginally above acceptance threshold. + +---------------------------------------------------------- +Summary: +This paper introduces a new model for multi-horizon forecasting. The proposed model is an extension of MQRNN with two new modules: task specific attention and decoder self-attention. The experiments on the large-scale demand forecasting dataset and other publicly available datasets show that the proposed model outperforms or is comparable with the CNN, RNN-based models as well as the transformer-based model. +-------------------------------------- +Pros: ++ Authors adapted the attention mechanism for a challenging problem (multi-horizon forecasting). ++ The proposed model is evaluated on a large-scale dataset as well as existing public datasets. ++ Additional experiments and Figure 4 on Appendix are helpful. +-------------------------------------- +Cons: +I found this paper lacks clarity. I list the issues below: + +1. Authors stated that 'Our horizon-specific attention mechanism can be viewed as a multi-headed attention mechanism. Each head corresponds to a different horizon.'. I don't think this is true. Multi-head attention is an ensemble of attentions from the same input to attend different positions by incorporating different representations. The purpose of the horizon-specific attention in this paper is to merge the encodings of multiple horizon. I ask authors to clarify it. + +2. Design choice: the main contribution of this paper is attention mechanisms (horizon-specific attention between encoder-decoder and decoder self-attention) for multi-horizon forecasting. +Questions: + 1. Since the proposed model architecture is similar to MQRNN [Wen et al. 2017], it should be the baseline. What is the reason MQCNN is used as a baseline? + 2. The proposed model performs better than MQRNN for public datasets. Which module in the model has a big role between horizon specific attention and decoder self-attention? Or is attention specifically handling better for multi-horizon forecasting problems than RNNs? Adding MQRNN as another baseline and the result without horizon specific attention for the large-scale demand forecasting would be helpful (in Table 3 and Figure 2). + +3. TFT [Lim2019] is an existing transformer-based model for Multi-horizon Time Series Forecasting. In my understanding, the major differences are horizon specific attention in the proposed model and a different design of decoder in TFT. Could you clarify the difference and similarities between these two models in the paper? +Also, authors stated that 'We were unable to compare to TFT (the prior state of the art on several public datasets) as it does not scale-up'. What does this mean? I assume TFT cannot be easily applied for large-scale demand forecasting. This needs more explanations what are the major difficulties to scale-up TFT for the problem. + +3. 
A description of experiment setup and model specification is missing (for both the proposed model and baseline). The results can not be easily reproduced. + - Hyper-parameters: the number of layers and hidden units, learning rate, optimizer, etc. + - Preprocessing if there is any. +-------------------------------------- +Minor comments: +- A description of P50 and P90 in Figure 2, 3, and Table 4 is missing. +- A description of LTSP in Section 4 is missing. +- The baseline (MQCNN) has no reference. ",6,3.0,ICLR2021 +S1lH_kUq2Q,2,BkE8NjCqYm,BkE8NjCqYm,it is not right to do analysis on test set.," +This work does extensive experiments on three different text generation tasks and shows the relationship between wider beam degradation and more and larger early discrepancies. This is an interesting observation but the reason behind the scene are still unclear to me. A lot of the statements in the paper lack of theoretical analysis. + +The proposed solutions addressing the beam discrepancies are effective, which further proves the relationship between beam size and early discrepancies. My questions/suggestions are as follows: +* It’s better to show the dataset statistics along with Fig1,3. So that readers know how much of test set have discrepancies in early steps. +* It is not right to conduct your analysis on the test set. You have to be very clear about which results are from test set or dev set. +* All the results with BLEU score must include the brevity penalty as well. It is very useful to analyze the length ratio changes between baseline, other methods, and your proposal. +* The example in Sec. 4.6 is unclear to me, maybe you could illustrate it more clearly. +* Your approaches eliminate the discrepancies along with the diversity with a wider beam. I am curious what if you only apply those constraints on early steps. +* I suggest comparing your proposal to the word reward model in [1] since it is also about improving beam search quality. Your threshold-based method is also kind of word reward method. +* In eq.2, what do you mean by sequence y \in V? y is a sequence, V just a set of vocabulary. What do you mean by P (y|x;{y_0..y_t}). Why the whole sequence y is conditioned on a prefix of y? + +[1] Huang et al, ""When to Finish? Optimal Beam Search for Neural Text Generation"" 2017",5,5.0,ICLR2019 +rJAqyzU4x,2,Syoiqwcxx,Syoiqwcxx,"Presents interesting cases of local minima in neural networks, but contains technical issues","This paper studies the error surface of deep rectifier networks, giving specific examples for which the error surface has local minima. Several experimental results show that learning can be trapped at apparent local minima by a variety of factors ranging from the nature of the dataset to the nature of the initializations. This paper develops a lot of good intuitions and useful examples of ways that training can go awry. + +Even though the examples constructed in this paper are contrived, this does not necessarily remove their theoretical importance. It is very useful to have simple examples where things go wrong. However the broader theoretical framing of the paper appears to be going after a strawman. + +“The underlying easiness of optimizing deep networks does not simply rest just in the emerging structures due to high dimensional spaces, but is rather tightly connected to the intrinsic characteristics of the data these models are run on.” I believe this perspective is already contained in several of the works cited as not belonging to this perspective. 
Choromanska et al., for instance, analyze Gaussian inputs, and so clearly make claims based on characteristics of the data the models are run on. More broadly, the loss function is determined jointly by the dataset and the model parameters, and so no account of the error surface can be separated from dataset properties. It is not clear to me what ‘emerging structures due to high dimensional spaces’ are, or what they could be, that would make them independent of the dataset and initial model parameters. The emerging structure of the error surface is necessarily related to the dataset and model parameters. + +Again, a key worry with this paper is that it is aiming at a strawman: replica methods characterize average behavior for infinite systems, so it is not surprising that specific finite sized systems might yield poor optimization landscapes. The paper seems surprised that training can be broken with a bad initialization, but initialization is known to be critical, even for linear networks: saddle points are not innocuous, with bad initializations dramatically slowing learning (e.g. Saxe et al. 2014). + +It seems like the proof of proposition 5 may have an error. Suppose cdf_b(0) = 0 and cdf_W(0)=1/2. We have P(learning fails) >= 1 - 1/2^{h^2(k-1)}, meaning that the probability of failure _increases_ as the number of hidden units increases. It seems like it should rather be (ignoring the bias) p(fails) >= 1 - [ 1 - p(w<0)^h^2]^{k-1}. In this case the limit as k-> infinity depends on how h scales with k, so it is no longer necessarily true that “one does not have a globally good behaviour of learning regardless of the model size.” + +The paper also appears to insufficiently distinguish between local minima and saddle points. Section 3.1 states it shows training being stuck in a local minimum, but this is based on training with a fixed budget of epochs. It is not possible to tell whether this result reflects a genuine local minimum or a saddle point based on simulation results. +It may also be the case that, while rectifiers suffer from genuine blind spots, sigmoid or soft rectifier nonlinearities may not. On the XOR problem with two hidden nodes, for instance, it was thought that were local minima but in fact there are none (e.g. L. Hamey, “Analysis of the error surface of the XOR network with two hidden nodes,” 1995). + +If the desire is simply to show that training does not converge for particular finite problems, much simpler counterexamples can be constructed and would suffice: set all hidden unit weights to zero, for instance. + +In the response to prereview questions, the authors write ‘If the “complete characterization” [of the error surface] was indeed universally valid, we would not be able to break the learning with the initialization’ but, as mentioned previously, the basic results for even deep linear networks show that a bad initialization (at or near a saddle point) will break learning. Again, it seems this paper is attacking a straw man along the lines of “nothing can possibly go wrong with neural network training.” No prior theoretical result claims this. + +The Figure 2 explanation seems counterintuitive to me. Simply scaling the input, if the weight matrices are initialized with zero biases, will not change the regions over which each ReLU activates. 
That is, this manipulation does not achieve the goal of concentrating “most of the data points in very few linear regions.” A far more likely explanation is that the much weaker scaling has not been compensated by the learning algorithm, but the algorithm would converge if run longer. The response notes that training has been conducted for an order of magnitude longer than required for the unscaled input to converge, but the scaling on the data is not one but five orders of magnitude—and indeed the training does converge without issue for scaling up to four orders of magnitude. The response notes that Adam should compensate for the scaling factor, but this depends on the details of the Adam implementation—the epsilon factor used to protect against division by zero, for example. + +This paper contains many interesting results, but a variety of small technical concerns remain. + +",5,4.0,ICLR2017 +rylBWoZNe,1,S1TER2oll,S1TER2oll,"interesting idea, reasonable experimental validation, some concerns about practicality","The paper proposes a method for optimising the shape of filters in convolutional neural network layers, i.e. the structure of their receptive fields. CNNs for images almost invariably feature small square filters (e.g. 3x3, 5x5, ...) and this paper provides an algorithm to optimise this aspect of the model architecture (which is often treated as fixed) based on data. It is argued that this is especially useful for data modalities where the assumption of locality breaks down, as in e.g. spectrograms, where correlations between harmonics are often relevant to the task at hand, but they are not local in frequency space. + +Improved performance is demonstrated on two tasks that are fairly non-standard, but I think that is fine given that the proposed approach probably isn't useful for the vast majority of popular benchmark datasets (e.g. MNIST, CIFAR-10), where the locality assumption holds and a square filter shape is probably close to optimal anyway. Fig. 1 is a nice demonstration of this. + +The paper spends quite a bit of space on a theoretical argument for the proposed method based on Gaussian complexity, which is interesting but maybe doesn't warrant quite so much detail. In contrast, section 3.3 (about how to deal with pooling) is quite handwavy in comparison. This is probably fine but the level of detail in the preceding sections makes it a bit suspicious. + +I'm also not 100% convinced that the theoretical argument is particularly relevant, because it seems to rely on some assumptions that are clearly untrue for practical CNNs, such as 1-norm weight constraints and the fact that it is probably okay to swap out the L1 norm for the L2 norm. + +I would also like to see a bit more discussion about Fig. 4, especially about the fact that some of the filter shapes end up having many fewer nonzeros than the algorithm enforces (e.g. 3 nonzeros for layers 6 and 7, whereas the maximum is 13). Of course this is a perfectly valid outcome as the algorithm doesn't force the solution to have an exact number of nonzeros, but surely the authors will agree that it is a bit surprising/unintuitive? The same figure also features an interesting phase transition between layers 1-4 and 5-8, with the former 4 layers having very similar, almost circular/square filter shapes, and the later having very different, spread out shapes. Some comments about why this happens would be welcome. 
+ +Regarding my question about computational performance, I still think that this warrants some discussion in the paper as well. For many new techniques, whether they end up being adopted mainly depends on the ratio between the amount of work that goes into implementing them, and the benefit they provide. I'm not convinced that the proposed approach is very practical. My intuition is that creating efficient implementations of various non-square convolutions for each new problem might end up not being worth the effort, but I could be wrong here. + + +Minor comments: + +- please have the manuscript proofread for spelling and grammar. + +- there is a bit of repetition in sections 2 and 3, e.g. the last paragraphs of sections 2.1 and 2.2 basically say the same thing, it would be good to consolidate this. + +- a few things mentioned in the paper that were unclear to me (""syllables"", ""exclude data that represent obvious noise"", choice of ""max nonzero elements"" parameter) have already been adequately addressed by the authors in their response to my questions, but it would be good to include these answers in the manuscript as well. + +- the comparison in Fig. 5 with L1 regularisation on the filter weights does not seem entirely fair, since the resulting shape would have to be encompassed in a 5x5 window whereas Fig. 4 shows that the filter shapes found by the algorithm often extend beyond that. I appreciate that training nets with very large square filters is problematic in many ways, but the claim ""L1 regularization cannot achieve the same effect as filter shaping"" is not really convincingly backed up by this experiment.",7,4.0,ICLR2017 +fG9hsUk20k,3,xCxXwTzx4L1,xCxXwTzx4L1,ChipNet: Budget-Aware Pruning with Heaviside Continuous Approximations,"This paper proposes a new deterministic pruning strategy that employs continuous Heaviside function and crispness loss to identify a sparse network out of an existing dense network. Experiments show its effectiveness and robustness. Generally, it is well-written and easy to follow. Some minor issues are shown below. + +1. Pseudo codes for the functions mentioned in Algorithm 1 should be provided to show them clearly. +2. The proposed algorithm is a deterministic algorithm and may fail in complex network pruning problems. An in-depth analysis of its limitation is needed. +3. It is recommended to use different labels for different algorithms in the figures. + +",7,4.0,ICLR2021 +HJxB2mDv6m,3,SJl8J30qFX,SJl8J30qFX,"well written with thorough experiments, but limited novelty and scope","Summary: +This paper incorporates Generalized Additive Models (GAMs) with model distillation to provide global explanations of neural nets (fully-connected nets as black-box in the paper). It is well written with detailed experiments of synthetic and real tabular data, and makes some contribution towards the interpretability of black-box models. However, it lacks novelty and is limited to tabular data as presented. + +Pros: +- The paper is well written. +- The experiments are detailed and thorough with both synthetic and real data. + +Cons: +- The novelty is limited. The core consists of GAMs well studied in the literature, e.g. Caruana et al 2015. Admittedly, this work also tries to incorporate model distillation to explain black-box models globally. The concept of student models approximating teacher models is not new either. The originality seems incremental in both directions. +- The scope is limited. The paper only presents applications in tabular data. 
Also, it would be better to experiment with black-box models beside simple fully-connected nets. +- The interpretability is not convincing. It is not sufficient to demonstrate the interpretability of the proposed method, or the expressive advantage of feature shapes. It is encouraged to include studies with human subjective to compare against other existing interpretable approaches. + +Specifics: +- With Figure 3, it is not convincing that the student model actually explains the teacher model, so the paper tries to elaborate more with Table 1. I think Table 1 also needs more details to help, such as the significance of error difference and '-' elements. +- Many figures are hard to read mostly because of font, color, and overlap. +",6,5.0,ICLR2019 +rJl5OkRLnm,2,SJlh2jR9FX,SJlh2jR9FX,the paper is technically flawed,"This paper is technically flawed. Here are three key equations from Section 2. The notations are simplified for textual presentation: d – p_data; d(y|x) – p_d(y|x); m(y|x) – p_theta(y|x) + +max E_x~d E_y~d(y|x) [ log m (y|x) ] (1) +max E_x~d { E_y~d(y|x) ) [ log d(y|x) ]} - E_y~d(y|x) [ log m (y|x) ]} (2) +max { E_y~d [ log (y) ] - E_y~d log E_x~d(x|y) [ m (y|x) ]} (3) + +First error is that the “max” in (2) and (3) should be “min”. I will assume this minor error is corrected in the following. +The equivalence between (1) and (2) is correct and well-known. The reason is that the first entropy term in (2) does not depend on model. The MAJOR ERROR is that (1) is NOT equivalent to (3). Instead, it is equivalent to the following: + + min { E_y~d [ log d (y) ] - E_y~d E_x~d(x|y) [ log m (y|x) ]} (3’) + +Notice the swap of “E_x” and “log”. By Jensen’s nequality, we have + + log E_x~d(x|y) m (y|x) ] > E_x~d(x|y) [ log m (y|x) + - E_y~d log E_x~d(x|y) [ m (y|x) ] < - E_y~d E_x~d(x|y) [ log m (y|x) ] + +So, minimizing (3) amounts to minimizing a lower bound of the correct objective (3’). It does not make sense at all. +",2,4.0,ICLR2019 +ryg6pPtQ5r,3,B1xmOgrFPS,B1xmOgrFPS,Official Blind Review #2,"This paper is about the task of object detection in the setting of few-shots dataset. The problem is addressed in the learning scheme of meta-learning paradigm: the proposed meta-rcnn trains the popular faster-rcnn on several tasks of few shots object detection while the RPN and the object classification networks are meta-learned among the tasks. Compared to previous work the paper introduces the meta learning framework and several changes to the faster rcnn detector. A prototype representation is derived from the standard RPN network and its proposed bounding box. An attention mechanism choose the object of interest and is used to train the final RPN and classification network. Experiments on the popular Pascal Voc 2007 and ImageNet-FSOD show that the proposed system have state of the art performance. + +The paper is very well written, easy to read and of excellent presentation. The introduction of the meta learning paradigm and its use to learn the RPN and classification networks are incremental in novelty but interesting. The experiments are solid and show state of the art performance. As a result I recommend this paper to be accepted. 
+ +Minor issues: +- in caption of Fig1: avialable -> available +- in 4.1: “Compared to other variants...” please add a reference to the specific methods you are comparing to.",8,,ICLR2020 +S1-ToUJWz,3,By5ugjyCb,By5ugjyCb,Review,"This paper presents a new idea to use PACT to quantize networks, and showed improved compression and comparable accuracy to the original network. The idea is interesting and novel that PACT has not been applied to compressing networks in the past. The results from this paper is also promising that it showed convincing compression results. + +The experiments in this paper is also solid and has done extensive experiments on state of the art datasets and networks. Results look promising too. + +Overall the paper is a descent one, but with limited novelty. I am a weak reject",5,4.0,ICLR2018 +lRdMrcaXwmR,1,NUCZeoVlAe,NUCZeoVlAe,Interesting hypothesis but results are not convincing enough,"This paper studies the top singular vector of the feature space learned by supervised and unsupervised deep learning models on CIFAR datasets. The hypothesis of converging feature spaces is interesting (converging both in terms of different models, and in terms of training epochs), but the conclusion from the current experiment results is overstretching. + +1. While the authors emphasize the convergence of subspaces, the P-vector defined in the paper is actually the top singular vector of the feature space, so it's actually about the convergence of the 1-dimensional principal subspace. A subspace refers to an arbitrary dimensional space in general. In the context of SVD, the literature often studies the top-$k$ dimensional subspace, which is represented by the $k$ top singular vectors, and the approximation error of the top-$k$ dimensional subspace: $E=\|X - U_k \Sigma_k V_k^T\|_F^2$, where $X$ would be the feature matrix in this paper, and $U_k, V_k$ are the first $k$ columns in the result of SVD. The authors didn't measure $E$, so the readers won't know how well the top-1 dimensional subspace represents the feature matrix. I recommend looking at $E$ as a function of $k$, and use some criteria to determine how closely you want the subspace to approximate the feature matrix. For example, we can say we want to keep the top-$k$ dimensional subspace such that $E < 0.1 \|X\|_F^2$. This way, you can rule out the possibility that the P-vector is a trivial vector that every model will converge to. +(As an analogy for a trivial vector, we can consider the top-1 eigenvector of the similarity matrix defined in the classical spectral clustering method called Normalized Cut. No matter how the edge weights in a graph is defined, the similarity matrix used in Normalized Cut always has an all-one vector as the top-1 eigenvector.) +And to measure the angle between general subspaces, many methods are available including classical ones (e.g. Åke Björck and Gene H. Golub, Numerical Methods for Computing Angles Between Linear Subspaces, 1973). + +2. This paper tries to emphasize the P-vectors found in the features from different deep learning models are very close (for example, ""no matter what type of DNN architectures or whether the labels have been used to train the models, the P-vectors of different models would converge to the same one""). Actually it seems the angle typically converges to 10 to 20 degrees. It may be better to lower the tone, or quantify better (compared to the angles obtained by ..., the angles between P-vectors are smaller). + +3. The data in Fig. 
7 looks quite noisy, though p-value shows statistical significance of the correlation. p-value can guide our findings but is not always meaningful. For example, comparing Fig.7(e) and Fig.7(l), we may argue the latter has a better correlation but the former has a much smaller p-value. It seems the very small p-value in Fig.7(e) results from some outliers. Intuitively I don't quite understand why the raw data and the features should have a correlated linear principal subspace, given that the neural network layers that generate the feature from the data are highly nonlinear. + +The only convincing data I found is in Table 1, which shows P-vectors can serve as an indicator of the model performance. But overall the readers would need more evidence as explained in #1 above.",3,4.0,ICLR2021 +rJFql_Nxz,1,rJk51gJRb,rJk51gJRb,,"This paper is outside of my area of expertise, so I'll just provide a light review: + +- the idea of assuming that the opponent will take the worst possible action is reasonable in widely used in classic search, so making value functions follow this intuition seems sensible, +- but somehow I wonder if this is really novel? Isn't there a whole body of literature on fictitious self-play, including need RL variants (e.g. Heinrich&Silver, 2016) that approaches things in a similar way? +- the results on Hex have some signal, but I don’t know how to calibrate them w.r.t. The state of the art on that game? A 40% win rate seems low, what do other published papers based on RL or search achieve? +",5,2.0,ICLR2018 +Hy80uyQVg,3,ryEGFD9gl,ryEGFD9gl,My thoughts,"The paper discusses sub modular sum-product networks as a tractable extension for classical sum-product networks. The proposed approach is evaluated on semantic segmentation tasks and some early promising results are provided. + +Summary: +——— +I think the paper presents a compelling technique for hierarchical reasoning in MRFs but the experimental results are not yet convincing. Moreover the writing is confusing at times. See below for details. + +Quality: I think some of the techniques could be described more carefully to better convey the intuition. +Clarity: Some of the derivations and intuitions could be explained in more detail. +Originality: The suggested idea is great. +Significance: Since the experimental setup is somewhat limited according to my opinion, significance is hard to judge at this point in time. + +Detailed comments: +——— +1. I think the clarity of the paper would benefit significantly from fixes to inaccuracies. E.g., \alpha-expansion and belief propagation are not `scene-understanding algorithms’ but rather approaches for optimizing energy functions. Computing the MAP state of an SSPN in time sub-linear in the network size seems counterintuitive because it means we are not allowed to visit all the nodes in the network. The term `deep probabilistic model’ should probably be defined. The paper states that InferSSPN computes `the approximate MAP state of the SSPN (equivalently, the optimal parse of the image)’ and I’m wondering how the `approximate MAP state' can be optimal. Etc. + +2. Albeit being formulated for scene understanding tasks, no experiments demonstrate the obtained results of the proposed technique. To assess the applicability of the proposed approach a more detailed analysis is required. More specifically, the technique is evaluated on a subset of images which makes comparison to any other approach impossible. 
According to my opinion, either a conclusive experimental evaluation using, e.g., IoU metric should be given in the paper, or a comparison to publicly available results is possible. + +3. To simplify the understanding of the paper a more intuitive high-level description is desirable. Maybe the authors can even provide an intuitive visualization of their approach.",5,4.0,ICLR2017 +Gkg5k0fRqZF,3,Oc-Aedbjq0,Oc-Aedbjq0,Some parameters need further investigations. ,"This work explores the model compression problem with a hyper-network to adaptively determine the preservation of inter-channel and inter-layers. To this end, they use a shared-GRU layer to explore the relations in consecutive layers and then FCs to generate the coefficient vectors which indicate the preserved rates within each layer. As a result, the inter-layer relation can be given from GRU and inter-channel relation can be given from FCs. In this paper, a learnable factor in each layer is also involved in the network optimization to better balance the classification performance and FLOPs regularization. Experimental results also validate the performance of the proposed method. + +Overall, I think the paper is easy to follow and the derivation is also clear to understand. Nonetheless, there are still some problems that need to be further explored, given as follows: + +1. This paper only investigates the channel pruning, but the title uses the ""model compression"". This might be not very specific since model compression serves as a general topic and consists of many approaches, such as channel pruning, quantization, knowledge distillation and tensor decomposition. + +2. This paper also uses FCs to generate the coefficient vectors (ideally 0-1 code) to indicate the importance for each channel. However, this practice has been investigated in other channel pruning papers and is thus not new. Just to name a few, + +[NeurIPS2019] AutoPrune - Automatic Network Pruning by Regularizing Auxiliary Parameters +[Pattern Recognition] AutoPruner: An End-to-End Trainable Filter Pruning Method for Efficient Deep Model Inference + +3. In subsection 4.4, the experiments on Figure (a)-(b) reveal that “changing $\lambda$ does not have a large impact on the final performance of a sub-network, and our method is not sensitive to it”. I think the reason why the method is not sensitive toward $\lambda$ should be further analyzed. As in the appendix (E) shows, in Eq.(8), the update of $\alpha$ contains $\lambda$, so $\alpha$ can actually be reviewed as the adaptive variable, and for different $\lambda$ during the network training, the value of will gradually converge to the corresponding scale which considers $\lambda$. + +4. I think there are some mistakes in analyzing the bias of FLOPs regularization. As stated in appendix (B), the relative magnitude of gradients w.r.t $\theta_k$ and $\theta_j$ can be approximately estimated. However, since the ratio value of $v_k$ and $v_i$ is 8, the conclusion about ""the gradient in the early layers are larger than that of the latter layers"" does not hold. I wonder whether it is a typo here. Furthermore, can you explain why the assumption “the magnitude of $\partial v_i \partial \theta_i$ is similar given different layers” holds herein? + +5. In subsection 4.4, “In Fig.4(c,d) should be changed into “In Fig.3(c,d)” instead. + +6. How to choose the hyper-parameter p? 
+ +",5,5.0,ICLR2021 +kWf6Pobr5mx,1,BnokSKnhC7F,BnokSKnhC7F,"Interesting idea, but clarity and experiments can be improved","Summary: + +This paper proposes a new reinforcement learning algorithm based on a max-Bellman operator which trains the policy to optimize the maximum reward achieved in a trajectory, i.e., $R(\tau) = \max_{t\geq 0} \gamma^{t}r_{t}$. The authors analyze how the newly proposed max-Bellman operator leads to an optimal policy. Experiments on a toy task and de novo drug design tasks show better performance compared to the considered baselines. I think the proposed idea is promising, timely, and impactful. However, I think (a) the problem setting should be clarified and (b) the empirical evaluations are below the standard of the ICLR conference. Especially, regarding (b), the experiments only compare with the Bellman operator under an MDP with cumulative rewards that do not align with the true objective. However, researchers already know how to design MDPs with cumulative rewards aligning with the true objective and this paper should consider this as their most important baseline. + +Pros: +- This paper tackles an important class of problems, where the agent aims to identify states (or solutions) that achieve the highest score. Examples include drug discovery, (mathematical) combinatorial optimization problems, and program synthesis. +- The proposed algorithm is fundamental and can be extended to many problems.
+ +Cons: +- This paper confusingly use the terminology ""reward function"" to indicate two meaning. First, it is used as a reward function associated with a Markov decision process (MDP) (taking $s_{t}$ and $a_{t}$ as inputs), where the agent is trained to optimize the cumulative summation or maximization of rewards over the trajectory. Second, it is used as a score function (taking molecule as input) that is associated with the given problem and independent of MDP (used for solving the problem). The distinction between two functions is very important and existing methods often shape rewards different from the scoring criteria. +- In the experiments, the authors only compare the Bellman and max-Bellman operators defined on an identical MDP. Especially, the MDP is designed so that only the max-Bellman operator aligns with optimization of the true objective of the problem. This dismisses how one can also design MDP so that the Bellman operator can optimize the true objective of the problem. \ +In drug design, [You et al. 2018, Shi et al., 2020] consider assigning the desired score of the drug as the reward only at the terminal state of the MDP. This allows the Bellman operator to properly solve the drug design problem without any modification. I note how this approach is briefly mentioned in the introduction. Authors claim that they fail to optimize for the very high reward molecules that may be encountered in the middle of the episode. While I partly agree with such a claim, this should be empirically verified to prove the empirical superiority of the max-Bellman operator over the Bellman operator. \ +In combinatorial optimization and program synthesis, e.g., [Chen et al. 2019], it is common to assign the difference of scores between intermediate states, i.e., $r_{t} = c_{t}- c_{t-1}$ where $c_{t}$ is the true objective of the problem evaluated at state $s_{t}$. This also alleviates the mismatch between the training of RL and the true objective of the problem. +- For the drug design experiment, I also suggest the authors provide a comparison on the maximum reward per episode during training, e.g., Figure 4 (b), to ablate the effect of the best max-Bellman operator. +- In molecule generation tasks, it is common to provide an illustration of the generated molecules to check whether if they are indeed helpful and synthesizable in real-life. This aspect is especially important since PGFS was initially proposed to constrain the drug design over molecules with a synthesizable structure. +- Given the fundamental nature of the proposed algorithms, I encourage the authors to provide more demonstrations on the superiority of the max-Bellman operator. Especially, I think it is important to compare with settings where the agents receive rewards at end of the episode, e.g., [You et al. 2018]. This paper only compares with the case where the cumulative summation of score function is maximized. +- In (No Ad, RT) column of Table 2, PGFS+MB is marked bold even though it achieves the same score as PGFS. + +[You et al. 2018] Graph convolutional policy network for goal-directed molecular graph generation + +[Chen et al. 2019] Learning to Perform Local Rewriting for Combinatorial Optimization + +[Shi et al. 2020] GraphAF: a flow-based autoregressive model for molecular graph generation + + +",4,3.0,ICLR2021 +BJxvbVG9h7,1,S1fUpoR5FQ,S1fUpoR5FQ,Review,"The authors introduce a class of quasi-hyperbolic algorithms that mix SGD with SGDM (or similar with Adam) and show improved empirical results. 
They also prove theoretical convergence of the methods and motivate the design well. The paper is well-written and contains the necessary references. However, I did feel that the authors could have better compared their method against the recent AggMom (Aggregated Momentum: Stability Through Passive Damping by Lucas et al.). It seems like there are a few similarities there. + +I enjoyed reading this paper and endorse it for acceptance. The theoretical results presented are easy to follow and state the assumptions clearly. I appreciated the fact that the authors aimed to keep the paper self-contained in its theory. The numerical experiments are thorough and fair. The authors test the algorithms on an extremely wide set of problems ranging from image recognition (including CIFAR and ImageNet), natural language processing (including the state-of-the-art machine translation model), and reinforcement learning (including MuJoCo). I have not seen such a wide comparison in any paper proposing training algorithms before. Further, the numerical experiments are well-designed and also fair. The hyperparameters are chosen carefully, and both training and validation errors are presented. I also appreciate that the authors made the code available during the reviewing phase. Out of curiosity, I ran the code on some of my workflows and found that there was some improvement in performance as well. + + +",8,3.0,ICLR2019 +k-TfifrIaM9,1,Ovp8dvB8IBH,Ovp8dvB8IBH,The idea makes sense but some concerns about the experiment details and theory,"-- POST REBUTTAL -- + +The authors addressed well most of my concerns. I increase my rating. However, the authors need to address all comments of the reviewers and also discuss all missing related works in the updated version. + +---------------------------------------------------------- +– Summary – + +The paper proposes a new method of leveraging the negative samples (out-of-distribution samples purposely generated from the training data distribution) in generative modeling and representation learning. The main idea aims to leverage the inductive bias of negative samples to constrain the learning of the model, e.g., these negative samples may tell some more information about the support of the data. The experimental results suggest that using these negative samples in GANs (studies with BigGAN model for conditional/unconditional image generation) and contrastive representation learning (study with CPC (Oord et al., 2018) on unsupervised learning on images and videos ) improves the performance of baselines. The paper also reports on improvements in image-image translation and anomaly detection. The paper also provides theorems to prove the convergence of the proposed model on GANs and CPC. + +Overall, the paper is easy to read and the idea makes sense. However, I'm a bit concerned about the theory, the significance of improvements and the fairness of the comparison. The paper also fails to discuss and compare with recent works on data augmentation for GANs. + + -- Strength -- + +S1 – The usage of negative examples, which are obtained from some prior knowledge, to provide the evidence of the learning model about the support/geometry of data distribution sounds reasonable. It has been applied in (Sung et al. (2019)) in semi-supervised learning. The proposed method applies to new applications of generative and representation learnings. + +S2 – The experimental results are quite extensive in regards to the applications, and the improvements on GANs are quite significant with jigsaw augmentation.
+ + -- Weakness -- + +W1 – The paper does not provide the very detailed implementations of proposed models, which is a bit difficult to justify the correctness. + +*Generative learning* + +W2 – The detail of how to incorporate NDA into GAN is not clear. Also, the PDA baseline for GAN is not detailedly discussed. Does the PDA, NDA, and baseline BigGAN train with the same batch size? I guess that the PDA and NDA had more augmented samples, therefore batch size is larger than the bigGAN baseline? + +W3 – The paper does not discuss important related works [a,b,c] of Data Augmentation for GAN recently published. In these papers, they show transforming only real samples (if I understand correctly, it likely is similar to PDA of the proposed baseline) only to train GAN changing the target distribution therefore the generator will learn infrequent or out-of-distribution samples. However, if both real/fake are transformed, data augmentation is helpful in training GAN. Can the authors compare the proposed NDA to at least one of them with the same GAN model? + +[a] Differentiable Augmentation for Data-Efficient GAN Training + +[b] On Data Augmentation for GAN Training + +[c] Training Generative Adversarial Networks with Limited Data + + +W4 – Eq. 10 showed that $L_f(\lambda * G_{\theta} + (1 - \lambda) * \overline{P}, D_{\phi}) <= D_f(P||\lambda*Q+(1-\lambda * \overline{P}))$, but then infers the lower bound: $D_f(P||\lambda*Q+(1-\lambda * \overline{P})) >= \lambda * f(\frac{1}{\lambda} + (1 - \lambda) f(0)) = D_f(P||\lambda*P+(1-\lambda * \overline{P}))$. Therefore, theoretically I am concerned a bit about the convergence of the model. I wonder whether the authors need an upper bound instead of the lower-bound in this case? + +W5 – The paper claimed: “Random Horizontal Flip is not effective as an NDA; this is because flipping does not spatially corrupt the image but is rather a semantic preserving transformation”. How about the Random Vertical Flip only? Can it improve the model since this augmentation looks very good to tell us about the boundary of the data? + +*Representation learning* + +W6 – The improvements in representation learning do not look significant to me, and the improvements are not too consistent on different datasets according to the type of augmentations. + +W7 – The lower bound of Eq. 18 looks just like due to adding a larger batch size for negative samples to train CPC. Can authors compare NDA to CPC with just the same batch size training as the NDA method? +",6,4.0,ICLR2021 +UZ4qdCmndTa,4,QxQkG-gIKJM,QxQkG-gIKJM,Combining optimism with Deep RL,"The authors of the submission ""Optimistic Exploration with Backward Bootstrapped Bonus for Deep Reinforcement Learning"" draw inspiration from the theoretical reinforcement learning literature to propose an optimism based bonus for deep q learning. The idea is to compute an optimistic bonus based on a Q function ensemble, and use this to augment the present reward signal and the future return estimator (future optimism) to yield an algorithmic approach that reduces to LSVI in the case of linear MDPs (or at least reduces to something akin to LSVI in that case). + +The authors then introduce a generic procedure (BEBU) that can be plugged into a variety of Q learning algorithms to produce an ""optimistic"" bootstrapped backward episodic update. Crucially the order of the updates matters, since the uncertainty bonuses should be propagated backwards in time. This is a very nice poin: the uncertainty propagation should be done backwards. 
+ +The authors are right in my opinion to claim their work represents a welcome addition to the emerging literature that proposes the use of uncertainty bonuses around the value function as opposed to myopic uncertainty bonuses only at the immediate reward level. There exist other recent works that introduce similar ideas (bringing in bonuses for the future uncertainty as opposed to solely penalizing immediate ) are some missing citations in the related work section, most notably ""On optimism in model based reinforcement learning"" (using value optimism in model based RL and deep RL), ""Efficient model based RL through optimistic policy search and planning""(optimism and GPs in model based RL), and also SUNRISE which looks awfully related to BEBU-UCB. + +The experimental results of this work are strong. It is nevertheless unclear how much of these results are the consequence of accessibility to massive computing resources. I would like to see the paper positioned more faithfully within the relevant optimism-at-value-level literature, even though these works are model based in nature. It would also be very useful to have algorithm boxes for the different methods or method templates that the authors describe in the text. It is hard to follow what they intended to say or at least a table listing succinctly in a reader friendly way what the differences are between the different instantiations of the approach (OEB3, BEBU, BEBU-UCB, etc ...). ",6,4.0,ICLR2021 +rzpj-D7M6o7,2,qYda4oLEc1,qYda4oLEc1,"Paper is very well written and addresses an important topic; using Traveling Observer Model in multi-task learning for tasks that do not have no spatial organization unlike, for example, images. ","Paper is very well written and addresses an important topic; using Traveling Observer Model (TOM) in multi-task learning for tasks that do not have no spatial organization unlike, for example, images. Although the paper is said to be a first implementation of TOM, it does thorough experimenting and result analysis of its preformance from various aspects and by comparing it to many sophisticated models. Future research for improving and testing the algorithm is clearly detailed. + +Related scientific literature is sufficiently addressed, mathematical background and the method are clearly presented, extensive and relevant experiments are done and result analyzed. + + I didn't even find any typos. ",9,4.0,ICLR2021 +Hyxsb3Ha2m,2,rJlk6iRqKX,rJlk6iRqKX,Interesting paper,"This paper proposed a reformulation of objective function to solve the hard-label black-box attack problem. The idea is interesting and the performance of the proposed method seem to be capable of finding adversarial examples with smaller distortions and less queries compared with other hard-label attack algorithms. + +This paper is well-written and clear. + +============================================================================================== +Questions + +A. Can it be proved the g(theta) is continuous? Also, the theoretical analysis assume the property of Lipschitz-smooth and thus obtain the complexity of number of queries. Does this assumption truly hold for g(theta), when f is a neural network classifiers? If so, how to obtain the Lipschitz constant of g that is used in the analysis sec 6.3? + +B. What is the random distortion in Table 1? What initialization technique is used for the query direction in the experiments? + +C. The GBDT result on MNIST dataset is interesting. The authors should provide tree models description in 4.1.3. 
However, on larger dataset, say imagenet, are the tree models performance truly comparable to ImageNet? If the test accuracy is low, then it seems less meaningful to compare the adversarial distortion with that of imagenet neural network classifiers. Please explain. + +D. For sec 3.2, it is not clear why the approximation is needed. Because the gradient of g with respect to theta is using equation (7) and theta is already given (from sampling); thus the Linf norm of theta is a constant. Why do we need the approximation? Given that, will there be any problems on the L2 norm case? +",6,5.0,ICLR2019 +SJlFHGEiaQ,4,r1erRoCqtX,r1erRoCqtX,Requires further clarification and empirical justification,"The paper presents a method for improving the convergence rate of Stochastic Gradient Descent for learning embeddings by grouping similar training samples together. The basic idea is that gradients computed on a batch of highly associated samples encode related information in a single update that independent samples might take multiple updates to capture. These structured minibatches are constructed by independently combining subsets of positive examples called “microbatches”. Two methods are presented for constructing these microbaches; first by grouping positive examples by shared context (called “basic” microbatches), second by applying Locality Sensitive Hashing to further partition the microbatches into groups that are more likely to contain similar examples. + +Three datasets are used for experimental analysis: a synthetic dataset generated using the stochastic block model, and two large scale recommendation datasets. The presented algorithms are compared to a baseline of independently sampled minibatches using the cosine gap and precision for the top k predictions. The authors show the measured cosine gaps over the course of training as well as the gains in training performance for several sets of hyperparameters. + +The motivation and basic intuition behind the work is clearly presented in the introductory section. The theoretical justification for the structured minibatches is reasonably convincing and invites empirical verification. + +General concerns: +Any method for improving the performance of an optimization process via additional preprocessing must show that the additional overhead incurred from preprocessing the data (in this case, organizing the minibatches) does not negate the achieved improvement in convergence time. This work presents no evidence that this is the case. I expected to see 1) time complexity analysis of each new algorithm proposed for preprocessing and 2) experimental results showing that the overall computation time, including the proposed preprocessing steps, was reduced by this method. Neither of these things are present in this work. + +Furthermore, the measured “training gains” are, to my knowledge, not clearly defined. I assume that the authors are using the number of epochs or iterations before convergence as their measure of training performance, but this should be stated explicitly rather than implicitly. + +Finally, the experimental results presented do not seem to entirely support the authors’ conclusions. Figures 2, 3, and 4, as well as several of the figures in the appendix, show some parameter settings for which the gains over the baseline are quite limited. This makes me suspect that perhaps the coordinated minibatches aren’t the only variable affecting performance. 
+ +I have organized my remaining minor concerns and requests for clarification by section, detailed below. + +Section 1 +- In the last paragraph, the acronym SGNS is mentioned before being defined. You should either state the full name of the method (with citation) or omit the mention altogether. + +Section 2 +- I would like a few sentences of additional clarification on what “focus” entities vs. “context” entities are in the more general case. I am familiar with what they mean in the context of Skip Gram, but I think more discussion on how this generalizes is necessary here. Same goes for what kappa (“association strength”) means, especially considering that this concept isn’t really present (to my understanding) in Skip Gram. +- Grammar correction: +“The negative examples provide an antigravity effect that prevents all embeddings to collapse into the same vector” +“to collapse” -> “from collapsing” + +Section 3 +- Maybe this is just me, but I find the mu-beta notation for the microbatch distributions rather odd. Why not just use a single symbol? +- I would like a bit more clarification on the proof for lemma 3.1, specifically on the last sentence, “the product of these events …”; that statement did not follow obviously to me. + +Section 3.1 +- Remove the period and colon near kappas at the end of paragraph 3. It’s visually confusing with the dot index notation right next to them. + +Section 4 +- Typo: “We selects a row vector …” -> “We select a row vector …” + +Section 5 +- I don’t understand what Figure 1 is trying to demonstrate. It doesn’t do anything (as far as I can tell) to defend the authors’ claim that COO provides a higher expected increase in cosine similarity than IND. + +Section 6 +- All figures in this section should have axis labels. The captions don’t sufficiently explain what they are. + +Section 6.2 +- How is kappa computed for the recommendations datasets? This isn’t obvious at all. +",4,3.0,ICLR2019 +rJgSzZK33Q,3,rJzLciCqKm,rJzLciCqKm,New technique for positive-unlabeled learning focussing on addressing selection bias.,"In this paper, the authors present a new technique to learn from positive and unlabeled data. Specifically they are addressing the issues that arise when the positive and unlabeled data do not come from the same distribution. The way to achieve this is to learn a scoring function which preserves -the order- of the label posteriors. In other words, the authors are not making assumptions and then learning the exact posterior of p(y|...) but rather just a function r(x) with the property that if p(y_i) < p(y_j) then r(x_i) < r(x_j). + +I am not super familiar in the area but I didn't see any fundamental flaws. The approach makes sense and although I cannot judge the novelty of this paper, it is a useful tool in the PU learning toolbox addressing an arguably important problem (selection bias). Except for section 5.3, the experiments are not that interesting as they are made up artificially by the authors. + +Thoughts: +- In example 1, be specific about what p(y|...) and p(o|...) are. +- In example 2, I wasn't sure what p(o|...) exactly would be. +- Assumption 1, the first sentence I understand. The ""if and only if"" part I don't see. Can you clarify? +",7,2.0,ICLR2019 +r1dNqr9xf,2,ry9tUX_6-,ry9tUX_6-,Weak Accept,"1) I would like to ask for the clarification regarding the generalization guarantees. 
The original Entropy-SGD paper shows improved generalization over SGD using uniform stability, however the analysis of the authors rely on an unrealistic assumption regarding the eigenvalues of the Hessian (they are assumed to be away from zero, which is not true at least at local minima of interest). What is the enabling technique in this submission that avoids taking this assumption? (to clarify: the analysis is all-together different in both papers, however this aspect of the analysis is not fully clear to me). +2) It is unclear to me what are the unrealistic assumptions made in the paper. Please, list them all in one place in the paper and discuss in details. +",6,3.0,ICLR2018 +SJxdi0zh5S,3,BkxadR4KvS,BkxadR4KvS,Official Blind Review #4,"This paper tries to analyze the similarities and transferring abilities of learned visual representations for embodied navigation tasks. It uses PWCCA to measure the similarity. There are some interesting observations by smart experimental designing. + +I have several concerns. + +- for the non-disjoint experiments, the difference between A and B is that the subsets contain different instances. The objects in subsets A and B may have the same category. The objects with the same category may share similar surrounding environment. Thus, the visual inputs for the training model on A and B may just have minor differences. This point is also related to the spatial coverage used in the paper. Since the visual input is similar, why is the conclusion in Figure1(b) non-trivial? + +- for the transferring experiments, in the beginning, the finetuning way is better than the new training makes sense. But, why do the results of learning a new policy from scratch will inferior to the finetuning way when training to convergence? The two experiments are both performed on the same fixed visual encoder. + +- I think the experiments can not support the argument that residual connections help networks learn more similar representations. Will other structures such as VGG also learn similar representations? Will the degrees of similar representations be proportional to the accuracy of the classification tasks and the modified residual network still outperforms the squeezenet? The more straightforward ablation studies might be that we remove all shortcuts of the ResNet as the plain version. + +========================================================= +After Rebuttal: + +I thank the author for the response. I still think the evaluations and experimental settings cannot fully support the conclusions. So I keep the original score. + +I hope the comments are useful for preparing a future version of this work.",3,,ICLR2020 +Hk4ciAteG,3,SySaJ0xCZ,SySaJ0xCZ,review: questionable utility,"This paper proposes a variant of neural architecture search. It uses established work on network morphisms as a basis for defining a search space. Experiments search for effective CNN architectures for the CIFAR image classification task. + +Positives: + +(1) The approach is straightforward to implement and trains networks in a reasonable amount of time. + +(2) An advantage over prior work, this approach integrates architectural evolution with the training procedure. Networks are incrementally grown; child networks are initialized with learned parameters from their parents. This eliminates the need to restart training when making an architectural change, and drastically speeds the search. 
+ +Negatives: + +(1) The state-of-the-art CNN architectures are not mysterious or difficult to find, despite the paper's characterization of them being so. Indeed, ResNet and DenseNet designs are both guided by extremely simple principles: stack a series of convolutional layers, pool occasionally, and use some form of skip-connection throughout. The need for architectural search is unclear. + +(2) The proposed search space is boring. As described in Section 4, the possibly evolutionary changes are limited to deepening the network, widening the network, and adding a skip connection. But these are precisely the design aspects that have been well-explored by human trial and error and for which good rules of thumb are already available. + +(3) As a consequence of (1) and (2), the result is essentially rigged. Since only depth, width, and skip connections are considered, the end network must end up looking like a ResNet or DenseNet, but with some connections pruned. There is no way to discover a network outside of the principled design space articulated in point (1) above. Indeed, the discovered network diagrams (Figures 4 and 5) fall in this space. + +(4) Performance is worse than the best hand-designed baselines. One would hope that, even if the search space is limited, the discovered networks might be more efficient or higher performing in comparison to the human designs which fall within that same space. However, the results in Tables 3 and 4 show this not to be the case. The best human designs outperform the evolved networks. Moreover, the evolved networks are woefully inefficient in terms of parameter count. + +Together, these negatives imply the proposed approach is not yet at the point of being useful in practice. I think further work is required (perhaps expanding the search space) to resolve the current limitations of automated architecture search. + +Misc: + +Tables 3 and 4 would be easier to parse if resources were simply reported in terms of total GPU hours.",4,4.0,ICLR2018 +SygVYCxF27,2,SyVhg20cK7,SyVhg20cK7,Interesting connections to study of social dilemma and role of peer evaluation; experiments not enough to make the scalability claim ,"The paper introduces a DQN based, hierarchical, peer-evaluation scheme for reward design that induces cooperation in semi-cooperative multi-agent RL systems. The key feature of this approach is its scalability since only local “communication” is required -- the number of agents is impertinent; no states and actions are shared between the agents. Moreover this “communication” is bound to be low dimensional since only scalar values are shared and has interesting connections to sociology. Interesting metaphor of “feel” about a transition. + +Regarding sgn(Z_a) in Eq2, often DQN based approaches clip their rewards to be between say -1 and 1. The paper says this helps reduce magnitude, but is it just an optimization artifact, or it’s necessary for the reward shaping to work, is slightly unclear. + +I agree with the paper’s claim that it’s important for an agent to learn from it’s local observation than to depend on joint actions. However, the sentence “This is because similar partially-observed transitions involving different subsets of agents will require different samples when we assume that agents share some state or action information.” is unclear to me. 
Is the paper trying to just say that it’s more efficient because what we care about is the value of the transition and different joint actions might have the same transition value because the same change in state occured. However, it seems that paper is making an implicit assumption about how rewards look like. If the rewards are a function of both states and actions, r(s,a) ignoring actions might lead to incorrect approximations. + +In Sec 3.2, under scalability and flexibility, I agree with the paper that neural networks are weird and increasing the number of parameters doesn’t necessarily make the task more complex. However the last sentence ignores parameter sharing approaches as in [1], whose input size doesn’t necessarily increase as the number of agents grows. I understand that the authors want to claim that the introduced approach works in non homogeneous settings as well. + +I get the point being made, but Table 1 is unclear to me. In my understanding of the notations, Q_a should refer to Action Q-table. But the top row seems to be showing the perceived reward matrix. How does it relate to Mission Q-table and Action Q-table is not obviously clear. + +Given all the setup and focus on flexibility and scalability, as I reach the experiment section, I am expecting some bigger experiments compared to a lot of recent MARL papers which often don’t have more two agents. From that perspective the experiments are a bit disappointing. Even if the focus is on pedagogy and therefore pursuit-evasion domain, not only are the maps quite small, the number of agents is not that large (maximum being 5). So it’s hard to confirm whether the scalability claim necessarily make sense here. I would also prefer to see some discussion/intuitions for why the random peer evaluation works as well as it did in Fig 4(a). It doesn’t seem like the problem is that of \beta being too small. But then how is random evaluation able to do so much better than zero evaluation? + +Overall it’s definitely an interesting paper. However it needs more experiments to confirm some of its claims about scalability and flexibility. + +Minor points +I think the section on application to actor critic is unnecessary and without experiments, hard to say it would actually work that well, given there’s a policy to be learned and the value function being learned is more about variance reduction than actual actions. +In Supplementary, Table 2: map size says 8x7. Which one is correct? + +[1]: https://link.springer.com/chapter/10.1007/978-3-319-71682-4_5 +",5,4.0,ICLR2019 +rJe2gvhV9S,3,HJlXC3EtwB,HJlXC3EtwB,Official Blind Review #1,"This paper studies the problem of improving proximity graph for nearest neighbor search. It formulates the task of pruning the graph as a problem of learning annealable proximity graph. A hard pruning processes is used after the learning process, and the results shows that the proposed method can reduce 50% of the edges and speed up the search time by 16-41%. + +The biggest concern I have is how to evaluate the performance. The proposed method is mainly based on the comparison with [Malkov 2016], which did not use an extra training set to learn the NPG as proposed in this paper. So it is not surprising the proposed method will perform better. 
I would like to see more comparisons with at least the following methods: (1) a heuristic pruning strategy (2) the state of the arts of tree based NN search and hashing based search (3) the recent work in proximity graph [Fu et al 2019] + +To summarize, I think the paper studies an important problem and the proposed method is reasonable. However, I cannot be convinced it is the state of the art for large scale nearest search unless I see more comparisons in the new version. + +Detailed comment: +- in section 5.2, ""APG reduce the number of edges by 2 times "" -> ""APG reduce the number of edges by 50\%""",6,,ICLR2020 +rJ6ELPbVe,1,BJa0ECFxe,BJa0ECFxe,"interesting idea, but not very convincing","The authors propose ""information dropout"", a variation of dropout with an information theoretic interpretation. A dropout layer limits the amount of information that can be passed through it, and the authors quantify this using a variational bound. + +It remains unclear why such an information bottleneck is a good idea from a theoretical standpoint. Bayesian interpretations lend a theoretical basis to parameter noise, but activation noise has no such motivation. The information bottleneck indeed limits the information that can be passed through, but there is no rigorous argument for why this should improve generalization. + +The experiments are not convincing. The CIFAR-10 results are worse than those in the paper that originally proposed the network architecture they use (Springenberg et al). The VAE results on MNIST are also horrible.",4,4.0,ICLR2017 +G0JD4iWp1U,3,JNP-CqSjkDb,JNP-CqSjkDb,Review #3,"The authors propose to make use RNN cell to replace the connection between continuous layers in Transformer. Although the proposed method makes use RNN cell to replace the heavy MLP layer after self-attention. I still think there's no significant difference between vanilla Transformer. The authors also propose to use position aware bi-directional attention mechanism. While it's hard to say whether it's better than multi-head self-attention in Transformer and the authors don't have an ablation study on this. Moreover, I would like to see some model comparisons on more popular datasets such as GLUE, SQuAD, etc. + +Cons: +1. It is hard to tell the benefits of the proposed method compared to vanilla Transformer. The motivation for integrating RNN cell is not clear enough for me. If Feed Forward Network is a big concern on speed, there should have some fair comparison on speed. +2. If the authors care more about final performance. It should be evaluated on more popular datasets like GLUE/SQuAD. Pre-training a base model on Wikipedia should not take too much time. I would appreciate it if the authors can have this experiment. +3. The paper needs some further proofread. + +Minor comments: +1. improve the computation efficency. (second paragraph in Intro) --> efficiency +2. third paragraph in page 2, in Transformer achieving still competive -->competitive +3. first paragraph in page 3, retraining on a dadtaset --> dataset +4. first paragraph in page 6,we concatenate the two input sentences with a seperation --> separation or special token? + +###update### + +As no author response, I would like to keep my rating.",4,4.0,ICLR2021 +rkn9pmD4e,2,SJBr9Mcxl,SJBr9Mcxl,"review of ""UNDERSTANDING TRAINED CNNS BY INDEXING NEURON SELECTIVITY""","The authors analyze trained neural networks by quantifying the selectivity of individual neurons in the network for a variety of specific features, including color and category. 
+ +Pros: + * The paper is clearly written and has good figures. + * I think they executed their specific stated goal reasonably well technically. E.g. the various indexes they use seem well-chosen for their purposes. + +Cons: + * I must admit that I am biased against the whole enterprise of this paper. I do not think it is well-motivated or provides any useful insight whatever. What I view their having done is produced, and then summarized anecdotally, a catalog of piecemeal facts about a neural network without any larger reason to think these particular facts are important. In a way, I feel like this paper suffers from the same problem that plagues a typical line of research in neurophysiology, in which a catalog of selectivity distributions of various neurons for various properties is produced -- full stop. As if that were in and of itself important or useful information. I do not feel that either the original neural version of that project, or this current model-based virtual electrophysiology, is that useful. Why should we care about the distribution of color selectivities? Why does knowing distribution as such constitute ""understanding""? To my mind it doesn't, at least not directly. + +Here's what they could have done to make a more useful investigation: + + (a) From a neuroscience point of view, they could have compared the properties that they measure in models to the same properties as measured in neurons the real brain. If they could show that some models are better matches on these properties to the actual neural data than others, that would be a really interesting result. That is is to say, the two isolated catalogs of selectivities (from model neurons and real neurons) alone seem pretty pointless. But if the correspondence between the two catalogs was made -- both in terms of where the model neurons and the real neurons were similar, and (especially importantly) where they were different --- that would be the beginning of nontrivial understanding. Such results would also complement a growing body of literature that attempts to link CNNs to visual brain areas. Finding good neural data is challenging, but whatever the result, the comparison would be interesting. + +and/or + + (b) From an artificial intelligence point of view, they could have shown that their metrics are *prescriptive* constraints. That is, suppose they had shown that the specific color and class selectivity indices that they compute, when imposed as a loss-function criterion on an untrained neural network, cause the network to develop useful filters and achieve significantly above-chance performance on the original task the networks were trained on. This would be a really great result, because it would not only give us a priori reason to care about the specific property metrics they chose, but it would also help contribute to efforts to find unsupervised (or semi-supervised) learning procedures, since the metrics they compute can be estimated from comparatively small numbers of stimuli and/or high-level semantic labels. To put this in perspective, imagine that they had actually tested the above hypothesis and found it to be false: that is, that their metrics, when used as loss function constraints, do not improve performance noticeably above chance performance. What would we then make of this whole investigation? It would then be reasonable to think that the measured properties were essentially epiphenomenal and didn't contribute at all to the power of neural networks in solving perceptual tasks. 
(The same could be said about neurophysiology experiments doing the same thing.) + [--> NB: I've actually tried things just like this myself over the years, and have found exactly this disappointing result. Specifically, I've found a number of high-level generic statistical property of DNNs that seem like they might potentially ""interesting"", e.g. because they apparently correlate with complexity or appear to illustrate difference between low, intermediate and high layers of DNNs. Every single one of these, when imposed as optimization constraints, has basically lead nowhere on the challenging tasks (like ImageNet) that cause the DNNs to be interesting in the first place. Basically, there is to my mind no evidence at this point that highly-summarized generic statistical distributions of selectivities, like those illustrated here, place any interesting constraints on filter weights at all. Of course, I haven't tried the specific properties the authors highlight in these papers, so maybe there's something important there.] + +I know that both of these asks are pretty hard, but I just don't know what else to say -- this work otherwise seems like a step backwards for what the community ought to be spending its time on. + ",3,5.0,ICLR2017 +KP68nw6ma4d,2,hecuSLbL_vC,hecuSLbL_vC,Novel Theory for Continual Learning in the context of Orthogonal Gradient Descent.,"The authors use a Neural Tangent Kernel (NTK) approximation of wide neural nets to establish generalization bounds for continual learning (CL) using stochastic gradient descent (SGD) and orthogonal gradient descent (OGD). In this regime, the authors prove that OGD does not suffer from catastrophic forgetting of training data. The authors additionally introduce a modification to OGD which causes significant performance improvements in the Rotated MNIST and Permuted MNIST problems. OGD involves storing feature maps from data points from previous tasks. The modified OGD method (OGD+) additionally stores feature maps from the current task. + +The primary contribution of this paper is the theoretical analysis of continual learning. Given that the CL problem does not have an extensive theoretical foundation, the generalization bound in this paper is a notable advance. The theory presented also provides a justification for the empirical observations observed by the authors that as overparameterization increases, the effect of catastrophic forgetting decreases in a variety of CL task setups. The primary drawback of the paper is that the authors do not compare the OGD+ algorithm to other continual learning algorithms (synaptic intelligence, elastic weight consolidation, etc.). As a result it is difficult to know how OGD+ compares to alternatives. It is not clear to the reviewer why improving OGD to OGD+ is itself a contribution. Given the expense occurred by OGD-type methods in storing ever increasing numbers of directions, it would be important to know the comparison of this method with others. + +Minor comments: + +(1) Section 3.2: f^* is not defined as of this point in the paper. 
+(2) Theorem 1: The theorem needs a quantifier of lambda +(3) Line above Remark 1 k_\tau -> \kappa_\tau +(4) Theorem 2: The paper should define what ""is in the memory"" means when introducing OGD +s +(5) Theorem 3: Definition of R_T has incorrect dummy index in the summation +",6,3.0,ICLR2021 +TBsQHrJEGcw,1,tyd9yxioXgO,tyd9yxioXgO,Sound method for video generation based on action graphs,"This paper proposes a generative method (AG2Vid) that generates video conditioned by the first frame, first layout and an action graph. An action graph is defined such that nodes represent objects in the scene and edges represent actions. To capture the temporal dynamics, each pairwise connection is enriched with a time interval to indicate the temporal segment when the action happens. For each time step, the method consists of several stages: First, it creates the layout corresponding to the current time step based on the current graph and previous layout. Then it extracts the optical flow based on the last two layouts and the previous generated frame and finally, it generates the current frame (at pixel level) based on the predicted optical flow and the previous frame. Several metrics, including human evaluation, indicates that the method outperforms powerful baselines on two datasets: CATER and Something-Something v2. + +Pro: +- Generating video content is a difficult task and the idea of generating frames based on action graphs to more explicitly focus on the activity class is interesting and naturally integrated into the proposed architecture. +- The method clearly outperforms the baselines and produces high-quality videos. +- The experiments regarding the generalization to novel compositions of actions are interesting and show promising results for generating videos beyond the training domain. +- The paper is generally well written, with clear explanations on the main aspects and a good balance between quantitative evaluation and qualitative examples. + +Cons: +- Since Action Genome [1] dataset provides more complex scene-graph annotations for videos in Charades, quite similar to the one required in this work (especially the contact subset of Action Genome), why do the authors choose Something-Something dataset instead of Charades? +- From the paper, it seems to me that the subset of videos picked from Smt-Smt dataset only contains 2 objects (nodes), thus the action graph is very simple and the temporal segment covers the entire video. This aspect is only briefly mentioned in the main paper. Moreover, I think the ability of the method would be more clearly demonstrated in classes that have more than 2 objects if the extraction of the AG would be possible in that case. +- The RNN ablation is interesting, showing the necessity of GNN processing. However, more details about the motivation and intuition behind these experiments should be added in section 5. + +Minor: +- I think there is a typo in Eq (4). Shouldn’t VGG be applied also on the predicted v_t? +- In the same manner, as the compositional experiment, it would be interesting to test the model using the same first frame from training videos, but changing the action labels from the Action Graphs (on Smt-Smt). In this way, it would be clearer that the model doesn’t use any kind of biases in the dataset. + +Since video generation could be a sensitive task, especially when conditioned on a set of actions, the ethical aspect of that work should be taken into consideration and discussed. 
+ +[1] Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs, Ji et. al, CVPR 2020 + +I found the proposed method interesting and suitable for the video generation tasks. Moreover, both the ablation study and the quantitative evaluation show good performance, so I recommend the acceptance. + +########### UPDATE ######### + +I thank the authors for their responses and for updating the paper. I think this work introduces some new and valuable ideas for generating videos conditioned by an action graph and I recommend the acceptance. + +",7,3.0,ICLR2021 +HkxcSGnntB,2,BJlPOlBKDB,BJlPOlBKDB,Official Blind Review #1,"The paper proposes an uncertainty driven acquisition for MRI reconstruction. Contrary to most previous approaches (which try to get best reconstruction for a fixed sampling pattern) the method incorporates an adaptive, on-the-fly masking building (which is similar in spirit to Zhang at al. 2019). The measurements to acquire are selected based on variance/uncertainty estimates coming from a conditional GAN model. This is mostly an ""application"" paper that is evaluated on one dataset. + +Strengths: +- The paper studies an interesting problem of adaptive MRI reconstruction +- The review of MRI reconstruction techniques is well scoped + +Weaknesses: +- The evaluation is rather limited and performed on one, proprietary, relatively small sized dataset +- Some simple baselines might be missing + + +I like the idea of adaptive sampling in MRI. However, I'd slightly lean towards rejection of the paper. My main concerns are as follows: + +The presentation of the paper could be improved. At the moment, the Theory section describes background information, related work and problem definition as well as the contribution of the paper. Maybe braking the section into related work, background and methodology (where the main contribution is presented) sections would improve the paper readability. + +The paper uses a conditional GAN model (with a discriminator from Adler & Oktem, 2018 and a generator that is based on Schlemper et al. 2018). Making the methodological contribution to be rather limited. The main difference w.r.t. the previous papers seem to be the last paragraph of section 2.2 - the empirical variance estimation is performed in Fourier space. + +A simple baseline to compare might be to train a Unet-like model (e. g. Schlemper et al 2018) with a Gaussian observation model (outputting a mean and a variance per each pixel) and train it to minimize Gaussian NLL. At the test time, one could simply sample from the Gaussian model instead of taking just the argmax of the output. It might be the case that the assumption of gaussian image might be too simplistic, however, it would be interesting to show it experimentally. Note that when sampling from such model the empirical variance estimation could be performed is the Fourier space too. + +The experimental evaluation is rather limited and the dataset used in the experimental section is small. Adding another dataset would make the paper stronger. + +Other comments: + +There is a mention on training dataset and testing dataset -- there is no mention on validation set. How were the hyperparamenters of the conditional GAN selected? + +As acknowledged by the authors, this paper bears several similarities with the work of Zhang at al. 2019. However, the approach is not compared to Zhang et al. Including this comparison would make the paper stronger. + +It is interesting to see that CLUDAS outperforms CLOMDAS in terms of SSIM. 
If I understand this part properly, CLOMDAS uses ground truth image to estimate MSE. Is it expected that CLUDAS would outperform CLOMDAS? + +Section 5, Adaptive vs. fixed mask: ""We also have a simple generalization bound of the obtained mask, relaying on a simple application of Hoeffding's inequality."" Could the authors add a citation or explain this part in more detail? + + +Some typos: +""...we aim make a series..."" +"".. define an closed-loop..."" +""We choose adopt a greedy"" +""... we we found that..."" +",3,,ICLR2020 +rkxP3xuEYB,1,BJeKwTNFvB,BJeKwTNFvB,Official Blind Review #3,"The paper proposes to integrate model-based physical simulation and data-driven (deep) learning. In a nutshell, one deep network predicts the state variables of the physics simulation (such as objet location, shape and velocity) from an image. A second network does the inverse task, to render images given the state variables (and a background image). In this way, one can go from a video frame to a physical system state, modify the state with physics simulation, and then go back from the modified state to a video frame. Together with a differentiable physics engine, through which one can back-propagate, this makes it possible to use the un-annotated video itself as supervision. At the same time, the two neural networks can be seen as an auto-encoder, in which the latent state is explicitly constrained to correspond to the desired physical state variables. + +The topic of the paper is hot: a proper integration of physical models with data-driven deep learning is, arguably, one of the big short- to mid-term themes of machine learning research. The way it is done in the present paper intuitively makes sense. The approach is fairly obvious at the conceptual level; but in the details poses a number of technical challenges especially for the decoder, which are nicely analysed and resolved. + +Some minor design choices are not well justified and at first sight appear a bit l'art pour l'art. While it is a sensible, pragmatic choice to first predict object masks, then extract their location ands and velocities in a second step; I do not quite see why one would have to do the latter with neural networks. it would seem that once the masks have been found, their location can be chosen as something like the (perfectly differentiable) mask-weighted centroid and does not need a multi-layer network; and similarly that deriving velocity from locations in adjacent frames can be hard-coded and does not need a 3-layer network. + +The experiments are still at an early ""toy"" level, with synthetic videos where high-contrast, homogeneous objects move in front of a uniform or blurry background. The baselines are sensible and ablation studies are done with care. Still, it would have been nice to also run the method on some real video. To my understanding, this would be easily possible at least for future frame prediction, all one has to do is either annotate the objects in the target frame or measure success by comparing the predicted and true frames at the image level. It is also not clear whether the videos were synthesised with the same physics engine also use inside the system - which would be slightly questionable, in the sense that the learnable pipeline is then a-priori matched to the biases in the data. + +One comment on the presentation: while the paper is generally well-written and easy to follow, the wording could at times be more careful. 
There is a slight tendency to identify the particular (simple) physical systems of the paper with physics as a whole. E.g., not all physics simulation must have objects - for instance, fluid dynamics or radiative transfer do not have individual objects, but are nevertheless relevant in the context of visual data. Similarly, even for defined objects, position and velocity are not always a sufficient state, for instance objects might deform, or have different elastic properties when colliding. + +Overall, I find the work interesting and well-executed. It is a natural step to take towards the important goal of integrated data-driven and physical models, including the associated theme of self-supervision via physical constraints. On the negative side the paper does make a slightly rushed and unfinished impression by not showing any, even qualitative, experiments on real video. Most people - rightly - use simple toy-like datasets for development and analysis. But showing only those gives me the impression that the paper was written too early, just to be the first and to make the deadline. Or that moving to real video poses a much greater challenge than expected - but then this should be stated and discussed. +",6,,ICLR2020 +CioF1RY_yW1,3,eJIJF3-LoZO,eJIJF3-LoZO,The paper need more clarification.,"This paper proposes concept learners to effectively combines the outputs of independent concept learners. The model is evaluated on several datasets from different domains. + +First of all, why the authors define it as generalizable few-shot learning, the settings targeted in this paper seem to do no different from traditional few-shot learning. Why is it called generalizable few-shot learning? + +The other two main concerns are: + +The idea of learning to attend different segments of an image or learning to the segment has been proposed in previous literature [a,b,c,d]. Even if they are not specifically targeted on a few-shot image classification, the proposed concept learners are still pretty similar to previous works and are not specifically designed for few-shot image classification tasks. Thus, I believe the novelty is somewhat limited for this submission. + +In experiments, even if the authors choose three datasets for comparisons. I am more interested in results on standard benchmarks, such as miniImageNet, tieredImageNet. The results on the current datasets are not a convincing performance in my point of view. It is expected to see experiments on more standard and large-scaled datasets. + +A minor point I am curious about is that by simple data augmentation method: crop, is it possible that multiple random cropping can generate different concepts and achieve similar effects by simply cropping multiple times on one image? + + +[a] Linsley, D., Shiebler, D., Eberhardt, S. and Serre, T., 2018. Learning what and where to attend. arXiv preprint arXiv:1805.08819. + +[b] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. CVPR'16 (arXiv:1512.04150, 2015). + +[c] Zhu Y, Liu C, Jiang S. Multi-attention meta learning for few-shot fine-grained image recognition[C]//Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence. 2020: 1090-1096. + +[d] Hou R, Chang H, Bingpeng M A, et al. Cross attention network for few-shot classification[C]//Advances in Neural Information Processing Systems. 2019: 4003-4014. 
+",5,4.0,ICLR2021 +Hygc2fIhtS,2,ryxn8RNtvr,ryxn8RNtvr,Official Blind Review #2,"Months ago, I read this article on arxiv (https://arxiv.org/pdf/1909.04200.pdf). It is an interesting work that tries to propose a simple yet effective interpretable model. I am not familiar with this research direction, and I try to make an educated guess. + +Pros: +-- As the author suggested, the method is simple and effective. +-- The authors conducted user studies to demonstrate that the results generated by their proposed method is strongly favored over previous methods. + +Cons: +-- Subscripts in equations need improvement to make them consistent. For example, in Equation (7), we have E_{y}, but in Equation (8), we have E_{Y_j} and E_{Y_k}. +-- Section 4, Figure 3, top, it seems obvious to choose the fourth one to distinguish the number 9? I feel this example is too easy and not convincing enough.",6,,ICLR2020 +Skg0uXMXqS,3,r1ghgxHtPH,r1ghgxHtPH,Official Blind Review #1,"In this paper, the authors proposed a semi-structured composition of free-form filters and structured Gaussian filters to learn the deep representations. Experiments demonstrate its effectiveness in semantic segmentation. The idea is interesting and somewhat reasonable but I still have several concerns. However, I still have several concerns: +1. The authors proposed to compose the free-form filters and structured filters with a two-step convolution. The authors are expected to clarify why and how the decomposition can realized its purpose? The authors need to further justify the methods by providing more theoretical analysis, and comparing with alternative methods. +2. The experiments are rather insufficient, and the authors are expected to make more comprehensive evaluations, e.g., more comparisons with the traditional CNN models. +3. The improvement is rather incremental compared with the alternative methods. The authors actually introduce some prior to the learning process. It would be better if the authors could show some other advantages, e.g., whether it can train the model with smaller number of samples, and whether we can integrate other prior besides Gaussian filters for other structures since Gaussian is a good prior for blurring. +",3,,ICLR2020 +HJgVjdDRFB,2,rJxGLlBtwH,rJxGLlBtwH,Official Blind Review #2," +Summary +--- + +(motivation) +To develop language speaking agents we can teach them to mimic human language +or to solve tasks that require communication. The latter is efficient, but +the former enables interpretability. Thus we combine the two in an attempt +to take advantage of both advantages. This paper studies a variety of ways to +combine these approaches to inform future work that needs to make this tradeoff. + +(approach) +The trade-off is studied using reference games between a speaker and a +listener. Goal oriented _self-play_ and human _supervision_ are considered two contraints one +can put on a network during learning. This work considers algorithms that vary +when self-play and supervision are used (e.g., training with self-play then supervision, +or supervision then self-play, or alternating back and forth between the two). +Additional variations freeze the speaker or distill an ensemble of agents into one agent. + +(experiments) +A synthetic Object Reference game (OR) and a Image-Base Reference game (IBR) with real images are used for evaluation. Performance is accuracy at image/object guessing. +1. 
(OR) Like previous work, this work finds that emergent languages are imperfect at supporting their goals and cannot be understood by agents that only understand a human language like English. +2. (OR) Pre-training with supervision then fine-tuning with self-play is superior to pre-training with self-play then fine-tuning with supervision. This is presented as surprising from the perspective of language emergence literature, which is though of as pre-training with self-play. +3. (IBR) Distilling an agent from an ensemble of 50 independently trained agents outperforms training single agents from scratch, but is still not as good as the whole ensemble. + +Self-play vs supervision schedules: +4. (IBR) Supervision (using image captions) followed by self-play performs much worse than all other approaches. +5. (IBR) Alternating between supervision and self play (e.g., randomly choosing supervision or self-play every iteration) performs best. + + + +Strengths +--- + +The curricula considered by this paper seem to have a sigificant impact on performance. These are new and could be important for future work on language learning, which may have considered the sup2sp setting from figure 7a without considering the sched setting. + +The diversity of experiments provided and the analysis help the reader get a better sense for how emergent communication models work. + +It's nice to see experiments on both a toy setting and a setting with realistic images. + +Future directions suggested throughout the paper are interesting. + + +Weaknesses +--- + + +* The 3rd point of section 5 is presented as a major conclusion of this paper, but it is not very surprising and I don't see how it's very useful. The perspective of language emergence literature is presented a bit strangely. The self-play to supervision baseline seems to be presented as an approach from the language emergence literature. I don't think this is what any of that literature promotes exactly, though it is close. Generally, I (and likely others) don't think it's too surprising that trying to fine-tune a self-play model with language supervision data doesn't work very well, for the same reasons cited in this paper (point 3 of section 5). I think the general strategy when trying to gain practical benefits from self-play pre-training is a translation approach where the learned language is translated into a known language like English rather than trying to directly align it to English as does the supervision approach in this paper. This particular baseline would be more useful if the paper considered learning some kind of translation layer on top of the self-play pre-trained model. + +* How significant are the performance differences in figure 7a, especially those between the frozen and non-frozen models? Is the frozen model really better or this performance difference just due to noise? + +* I'm somewhat skeptical that these trends will generalize to other tasks/models. The main goal of this paper is to inform future work. That makes it even more important than normal that the trends identified here are likely to generalize well. Are these trends likely to generalize well? Does the paper address when these trends are expected to hold anywhere? + + +Minor Presentation Weaknesses: + +* Figure 4: I think the sub-figures are mis-labeled in the caption. + +* In the related work I'm not sure the concept of generations is right. 
I think it should refer to different languages of different agents across time rather than different languages of the same agent across time. + + +Missing details / clarification questions: + +* What exactly does Figure 4c compare? Are both methods distilled from ensembles or is the blue line normal S2P while the other is distilled from an ensemble of compositional languages? It's not clear since point (3) in section 5 refers to the S2P result (not Pop-S2P) in that plot. I'm also assuming that PB-S2P means the same thing as Pop-S2P, but that's not made clear anywhere. Does PB stand for Population Based? + +* In the rand setting how is convergence defined? Do both objectives need to converge or just one? + +* In the sched_rand_frz setting what is r? + +* In the IBR how are the distractor images picked? + + +Suggestions: + +* Can't both self-play and supervision be used at the same time (just use a weighted combination of the two objectives)? I don't think the paper ever did this but it seems like a very useful variation to consider. + + +Preliminary Evaluation +--- + +Clarity: The writing is fairly clear, though some details are lacking. +Significance: This work could help inspire some future models in the language emergence literature. +Quality: Experiments are aligned with the paper's goals and support its conclusions. +Originality: The distillation approach and curricula are novel. + +Overall the work could prove to be an interesting and useful reference point inside the language emergence literature so I recommend it for acceptance. + +",8,,ICLR2020 +S1lS16Ohsm,1,rJxpuoCqtQ,rJxpuoCqtQ,Lacks clarity,"In the manuscript entitled ""Likelihood-based Permutation Invariant Loss Function for Probability Distributions"" the authors propose a loss function for training against instances in which ordering within the data vector is unimportant. I do not find the proposed loss function to be well motivated, find a number of confusing points (errors?) in the manuscript, and do not easily follow what was done in the examples. + +First, it should be noted that this is a very restricted consideration of what it means to compare two sets since only sets of equal size are under consideration; this is fundamentally different to the ambitions of e.g. the Hausdorff measure as used in analysis. The logsumexp formulation of the proposed measure is unsatisfactory to me as it directly averages over each of the independent probabilities that a given element is a member of the target set, rather than integrating over the combinatorial set of probabilities for each set of complete possible matches. Moreover, the loss function H() is not necessarily representative of a generative distribution. + +The definition of the Hausdorff distance given is directional and is therefore not a metric, contrary to what is stated on page 2. + +I find the description of the problem domain confusing on page 3: the space [0,1]^NxF is described as binary, but then values of log y_i and log (1-y_i) are computed with y in [0,1] so we must imagine these are in fact elements in the open set of reals: (0,1). + +Clarity of the examples could be greatly improved, in particular by explaining precisely what is the objective of each task and what are the 'ingredients' we begin with.",4,4.0,ICLR2019 +SkxXjkpqKr,2,B1gcblSKwB,B1gcblSKwB,Official Blind Review #3," +In this paper, the authors propose a new method to alleviate the effect of overfitting in the meta-learning scenario. The method is based on network pruning. 
Empirical results demonstrate the effectiveness of the proposed method. + +Pros: ++ The problem is very important in the meta-learning field. The model is intuitive and seems useful. In addition, the generalization bound further gives enough insights for the model. + +Cons: +- The proposed method is simple and lacks technical contributions. Adding sparse regularizers is a common practice to alleviate over-fitting in the machine learning field. In addition, the retraining process increases the time complexity of the proposed model (i.e., we need to train three times to get the powerful model). + +- In the experiment parts, it will be more interesting if the authors can do the experiments without pre-training. Since in traditional meta-learning settings (e.g., Reptile and MAML), pre-training process does not be introduced. Thus, it might be more convincing if the authors can training the mask and initialization together. + + +Post rebuttal: + +I have read other reviewers and the authors' responses. I still think the contribution is not enough to be accepted. I do not change my mind.",3,,ICLR2020 +S1eWiCYE3m,2,S1geJhC9Km,S1geJhC9Km,The presentation is unclear ,"This paper presents a feature quantization technique for logistic regression, which has already been a common practice in + many finance applications. The text feels rushed. From the current presentation, I find it difficult to understand what is the motivation of adopting the proposed relaxation of the optimization method, and how is the neural network-based estimation strategy connected to the logistic regression model. It seems the difference lies in the parameterized nonlinear transformation such that the cutting points can be somehow optimized. The quality of the experiments performed is way below the expectation for ICLR. Although numerical experiments are performed on both simulated data and credit scoring data, it is still unclear whether the proposed method has superiority over competitors. + +Question: In the test phase, how would the proposed method handle features that are not seen in the training phase? ",3,3.0,ICLR2019 +ceKs9niPx0Z,3,OthEq8I5v1,OthEq8I5v1,Interesting and well-motivated approach with some slight issues in clarity,"Summary: This paper introduces MUSIC, a reinforcement learning approach that separates the agent state from its surrounding state and trains the agent to maximize the mutual information between the two. This implies that the agent has control over the surrounding state. The approach is evaluated within four environments and compared to multiple baselines. + + +This paper is well-motivated and the approach is interesting. The paper is mostly well-written, though I found parts to be somewhat confusing. Code is provided as well as hyperparameters so the approach seems reproducible. The experiments are strong as the approach is evaluated within multiple environments with extensive comparisons to relevant baselines. + + +MUSIC is shown to achieve very good performance on simulated robotic tasks, and was able to improve performance when combined with other intrinsic reward and RL methods. I think this is an interesting direction and it does make sense to separate out the agent’s state from the environment state. For these reasons I do think the paper should be accepted. + + +However, I found the description of the methodology in section 3.2 to be very confusing. The equations are referred to before they are introduced which was unexpected. 
Hence, this section would be greatly improved by some rearranging. I also did not understand what exactly T was. What does this function output and how is it trained? + + +Comments: + + +- Some other related works are [1] which uses an intrinsic reward to maximize the controllable entities in the environment and [2] which learns an intrinsic reward that maximizes controllable features. + + +- Question 3 in the paper does not refer to any figure (does this correspond to figure 5?). Where are the MUSIC + DIAYN results? + + +- Is the reward in Question 8 the negative L2 norm? + + +- How does MUSIC alone perform in Table 1? This should be included here as well. + + +[1] Mega-Reward: Achieving Human-Level Play without Extrinsic Rewards. Song et al. + + +[2] Feature Control as Intrinsic Motivation for Hierarchical Reinforcement Learning. Dilokthanakul et al. +",7,3.0,ICLR2021 +BkMR49YxM,2,Byt3oJ-0W,Byt3oJ-0W,Gumbel-Sinkhorn networks for learning permutations,"Learning latent permutations or matchings is inherently difficult because the marginalization and partition function computation problems at its core are intractable. The authors propose a new method that approximates the discrete max-weight matching by a continuous Sinkhorn operator, which looks like an analog of softmax operator on matrices. They extend the Gumbel softmax method (Jang et al., Maddison et al. 2016) to define a Gumbel-Sinkhorn method for distributions over latent matchings. Their empirical study shows that this method outperforms competitive baselines for tasks such as sorting numbers, solving jigsaw puzzles etc. + +In Theorem 1, the authors show that Sinkhorn operator solves a certain entropy-regularized problem over the Birkhoff polytope (doubly stochastic matrices). As the regularization parameter or temperature \tau tends to zero, the continuous solution approaches the desired best matching or permutation. An immediate question is, can one show a convergence bound to determine a reasonable choice of \tau? + +The authors use the Gumbel trick that recasts a difficult sampling problem as an easier optimization problem. To get around non-differentiable re-parametrization under the Gumbel trick, they extend the Gumbel softmax distribution idea (Jang et al., Maddison et al. 2016) and consider Gumbel-Sinkhorn distributions. They illustrate that at low temperature \tau, Gumbel-matching and Gumbel-Sinkhorn distributions are indistinguishable. This is still not sufficient as Gumbel-matching and Gumbel-Sinkhorn distributions have intractable densities. The authors address this with variational inference (Blei et al., 2017) as discussed in detail in Section 5.4. + +The empirical results do well against competitive baselines. They significantly outperform Vinyals et al. 2015 by sorting up to N = 120 uniform random numbers in [0, 1] with great accuracy < 0.01, as opposed to Vinyals et al. who used a more complex recurrent neural network even for N = 15 and accuracy 0.9. + +The empirical study on jigsaw puzzles over MNIST, Celeba, Imagenet gives good results on Kendall tau, l1 and l2 losses, is slightly better than Cruz et al. (arxiv 2017) for Kendall tau on Imagenet 3x3 but does not have a significant literature to compare against. I hope the other reviewers point out references that could make this comparison more complete and meaningful. + +The third empirical study on the C. elegans neural inference problem shows significant improvement over Linderman et al. (arxiv 2017). 
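To make the central object concrete for readers less familiar with it, here is a minimal numpy sketch of the Sinkhorn operator as I understand it from the paper (my own illustration, not the authors' implementation; the function name sinkhorn, the temperature tau and the iteration count are arbitrary choices of mine):

import numpy as np

def sinkhorn(X, tau=1.0, n_iters=20):
    # Map a square score matrix X to a (near) doubly stochastic matrix by
    # iterated row/column normalization of exp(X / tau); smaller tau pushes
    # the output toward a hard permutation, i.e. the max-weight matching.
    M = np.exp(X / tau)
    for _ in range(n_iters):
        M = M / M.sum(axis=1, keepdims=True)  # normalize rows
        M = M / M.sum(axis=0, keepdims=True)  # normalize columns
    return M

# toy check in the spirit of the number-sorting experiment
rng = np.random.default_rng(0)
x = rng.uniform(size=5)
scores = -np.abs(x[:, None] - np.linspace(0, 1, 5)[None, :])
print(np.round(sinkhorn(scores, tau=0.05), 2))  # close to the permutation that sorts x

Adding i.i.d. Gumbel noise to X before this transformation is, as far as I can tell, exactly the Gumbel-Sinkhorn construction discussed above.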
+ +Overall, I feel the main idea and the experiments (especially, the sorting and C. elegance neural inference) merit acceptance. I am not an expert in this line of research, so I hope other reviewers can more thoroughly examine the heuristics discussed by the authors in Section 5.4 and Appendix C.3 to get around the intractable sub-problems in their approach. ",7,4.0,ICLR2018 +SkJi7LQEg,3,rJiNwv9gg,rJiNwv9gg,Final Review,"This work proposes a new approach for image compression using auto encoders. The results are impressive, besting the state of the art in this field. + +Pros: ++ Very clear paper. It should be possible to replicate these results should one be inclined to do so. ++ The results, when compared to other work in this field are very promising. I need to emphasize, and I think the authors should have emphasized this fact as well: this is very new technology and it should not be surprising it's not better than the state of the art in image compression. It's definitely better than other neural network approaches to compression, though. + +Cons: +- The training procedure seems clunky. It requires multiple training stages, freezing weights, etc. +- The motivation behind Figure 1 is a bit strange, as it's not clear what it's trying to illustrate, and may confuse readers (it talks about effects on JPEG, but the paper discusses a neural network architecture, not DCT quantization)",8,5.0,ICLR2017 +HygYH9sl5H,3,rkgc06VtwH,rkgc06VtwH,Official Blind Review #1,"In this paper, a method for re-ranking beam search results for semantic parsing is introduced and experimentally evaluated. The general idea is to train a paraphrase critic. Then, the critic is applied to the each pair (input sentence, logic form) in the beam to determine if they are close. + +The main problem with the proposed method is that the critic does not receive high quality negative examples. The generator is never trained to adapt to the critic. Second big problem is that the critic trained on two sources of data: the original dataset and the Quora paraphrasing dataset. It is very unclear what is the impact of each of the data sources. Also, it is unclear how the critic works in this case. It seems to be an easy task to distinguish a logical form from a natural sentence. + +In general, the paper is well written. I would suggest to reduce the size of the introduction and dedicate this space to more detailed explanation how reranking works and the experimental details. Figures don't add much to understanding. + +The experimental part is rather weak. The error analysis part is great, but not very methodical. It is not clear is these examples are cherry picked or it is frequent mistake of the baseline. I would like to the accuracy of the critic and the analysis of its performance. The critic is the main contribution of this paper and it is strange that so little attention is dedicated to it. Other aspects that need to be highlighted in the experimental section: +- how the Quora pretraining helps +- do other strategies for negative sample work +- how important is not to rerank in certain cases (Sec 3.3) + +In conclusion, I encourage the authors to develop the idea further. Taking in into account the issues with the method (or its presentation) and the experimental weaknesses I recommend reject for now. + +Typos: +- [CLS] is not defined in text +- Shaw et al. 
should be in Previous methods in Table 3",3,,ICLR2020 +i2IJns_QCwN,2,D2TE6VTJG9,D2TE6VTJG9,"Review of ""Predicting What You Already Know Helps: Provable Self-Supervised Learning""."," + + * Summary of the paper. + The paper shows theoretical results to support the claim that (approximate) conditional independence is a good way to quantify the link between the downstream task and the pretext task in self-supervised learning. Doing so, the authors prove theoretical results showing that indeed self-supervised learning decrease the estimation error in a classification task when the downstream task and the pretext task are linked through conditional independence or the weaker approximate conditional independence showing that approximate conditional independence is a good quantification of the link that the downstream task and the pretext task must have in order for the self-supervised learning to be efficient. The authors also provide numerical illustrations. + + * Strong points: theoretical guarantees in SSL (there are not a lot of those) while using approximate conditional independence which is more realistic than conditional independence. The paper has some experiments that support its claim. The paper takes a gradual approach, first with jointly Gaussian variables and then with general (sub-Gaussian) random variables which helps the comprehension because at least in the Gaussian case, things are not so hard. + + * Weak points: Use of mean squared error for a classification task which negates most of the practical aspect of this paper. Not enough details in the proofs. The bounds are not optimal. The experiments are not reproducible and the presentation of those experiments is somewhat lacking. Overall problems with experiments. + + * Recommendation. + I vote to accept because this is one of the rare theoretical guarantees for SSL, the approximate conditional independence is very interesting as a way to quantify the link between downstream and pretext task and this gives more understanding on how to choose a pretext task in practice when doing SSL. + + + * Questions: + * Why did you use MSE ? Why not Cross entropy for instance ? + * $Tr(\Sigma_{YY})$ in Theorem 3.3 is of order k for instance in the case where $\Sigma_{YY}$ is the identity matrix, the bound is $O(k^2)$ not $O(k)$, is it not? (contradiction with what you say in the text). + + + * Additional feedback. + * Please provide the minimax bound (i.e. optimal bound) in the ""not self-supervised"" context for us to compare with your bounds for self-supervised, below Theorem 3.3 you say that we gain from $O(d1)$ to $O(k)$ but you don't provide the bound that explain this $O(d1)$. + * Please include the results of the simulation study in the main text. In my opinion, a simulation study must firstly show that your method works as intended, a sanity check of sorts. You simulation study does not do this because we don't see the results of this study right away, it is hidden at the end of the appendix. + * Be careful, Theorem A.6 is only true for sub-Gaussian, please include this in the theorem, this is important in my opinion. + * Theorem A.6 can be improved, for now it is of order $\sqrt{\frac{Tr(\Sigma)}{n}}(1+t)$ but it can be made of order +$\sqrt{\frac{Tr(\Sigma)}{n}}+\sqrt{\frac{||\Sigma||_{op}t}{n}}.$ + + The difference can be important when the dimension is large (for instance if Sigma is the identity matrix in $R^d$, $Tr(\Sigma)= d$, $||\Sigma||_{op}=1$). 
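To spell out that case (my own back-of-the-envelope illustration, to make the point concrete): the current bound then reads $\sqrt{\frac{d}{n}}(1+t)$ while the suggested one reads $\sqrt{\frac{d}{n}}+\sqrt{\frac{t}{n}}$, so for large $t$ the ratio between the two grows roughly like $\sqrt{dt}$ and the deviation term in the suggested bound no longer carries a dimension factor.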
To obtain this for sub-Gaussian vectors, one can for instance use the article ""A tail inequality for suprema of unbounded empirical processes with applications to Markov chains"" by R. Adamczak. This is important because the dimension is important for your conclusion. + * Lemma A.7 is not true for any delta, please state in the lemma for which delta it is true (delta is at most $ke^{-n}$ if I am not wrong) + * Generally, when reading the proofs I would have appreciated if you included more details. The ""therefore"" in the proof of theorem 3.3 was not clear right away for me, there are numerous lines where you do a lot of manipulations that the reader has to guess to obtain the good results and this makes it very hard for us to check the results you claim. + * In the proof of Lemma C.2, please provide a reference for ""Davis kahan"", for those that are not familiar with this result. + * Typo: beginning of the first sentence of A1. + + + + + + +",6,3.0,ICLR2021 +SygQJELD3X,1,H1g0piA9tQ,H1g0piA9tQ,Interesting attack but an unclear paper with limited experimental support.,"The paper presents an evaluation methodology for evaluating attacks on confidence thresholding methods and proposes a new kind of attack. In general I find the writing poor, as it is not exactly clear what the focus of the paper is - the evaluation or the new attack? The experiments lacking and the proposed evaluation methodology & theoretical guarantees trivial. + +Major remarks: +- Linking the code and asking the reviewers not to look seems like bad practice and close to violating double blind, especially when considering that the cleavhans library is well known. Should have just removed the link and cleavhans name and state it will be released after review. + +- It is unclear what the focus of the paper is, is it the evaluation methodology or the new attack? While the evaluation methodology is presented as the main topic in title, abstract and introduction most of the paper is dedicated to the attack. + +- The evaluation methodology is a good idea but is quiet trivial. Also, curves are nice visually but hard to compare between close competitors. A numeric value like area-under-the-curve should be better. + +- The theoretical guarantees is also quiet trivial, more or less saying that if a confident adversarial attack exists then finding the most confident attack will be successful. Besides that the third part of the proof can be simplified significantly. + +- The experiments are very lacking. The authors do not compare to any other attack so there is no way to evaluate the significance of their proposed method + +- That being said, the max-confidence attack by itself sounds interesting, and might be useful even outside confidence thresholding. + +- One interesting base-line experiment could be trying this attack on re-calibrated networks e.g. “On Calibration of Modern Neural Networks” + +- Another baseline for comparison could be doing just a targeted attack with highest probability wrong class. + +- I found part 4.2 unclear + +- In the conclusion, the first and last claims are not supported by the text in my mind. + + + +Minor remarks: + +- The abstract isn’t clear jumping from one topic to the next in the middle without any connection. + +- Having Fig.1 and 2 right on the start is a bit annoying, would be better to put in the relevant spot and after the terms have been introduced. + +-In 6.2 the periodically in third line from the end seems out of place. 
+ +",4,4.0,ICLR2019 +B1gXtGEb5S,2,B1gZV1HYvS,B1gZV1HYvS,Official Blind Review #3,"The authors propose a decentralized adversarial imitation learning algorithm with correlated policies, which recovers each agent’s policy through approximating opponents action using opponent modeling. Extensive experimental results showed that the proposed framework, CoDAIL, better fits scenarios with correlated multi-agent policies. + +Generally, the paper follows the idea of GAIL and MAGAIL. Differing from the previous works, the paper introduces \epsilon-Nash equilibrium as the solution to multi-agent imitation learning in Markov games. It shows that using the concept of \epsilon-Nash equilibrium as constraints is consistent and equivalent to adding the difference of the causal entropy of the expert policy and the causal entropy of a possible policy in RL procedure. It makes sense. + +Below, I have a few concerns to the current status of the paper. + +1. The authors propose \epsilon-Nash equilibrium to model the convergent state in multi-agent scenarios, however, in section 3.1 the objective function of MA-RL (Equation 5) is still the discounted causal entropy of policy, the same as that of MA-GAIL paper. It is unclear how the \epsilon-NE is considered in modeling MA-RL problem. + +2. Rather than assuming conditional independence of actions from different agents, the authors considered that the joint policy as a correlated policy conditioned on state and all opponents’ actions. With the new assumption, the paper re-defines the occupancy measure and introduces an approach to approximate the unobservable opponents’ policies, in order to access opponents’ actions. However, in the section 3.2 when discussing the opponents modeling, the paper did not clearly explain how the joint opponent function \sigma^{(i)} is designed. The description \sigma^{(i)} is confusing. + +3. Typos: in equation 14 “i” or “-i”; appendix algorithm 1 line 3 “pi” or “\pi”. +",6,,ICLR2020 +H1MkPyf4g,3,Bk3F5Y9lx,Bk3F5Y9lx,skeptical of motivation and experiments,"This paper replaces the Gaussian prior often used in a VAE with a group sparse prior. They modify the approximate posterior function so that it also generates group sparse samples. The development of novel forms for the generative model and inference process in VAEs is an active and important area of research. I don't believe the specific choice of prior proposed in this paper is very well motivated however. I believe several of the conceptual claims are incorrect. The experimental results are unconvincing, and I suspect compare log likelihoods in bits against competing algorithms in nats. + +Some more detailed comments: + +In Table 1, the log likelihoods reported for competing techniques are all in nats. The reported log likelihood of cVAE using 10K samples is not only higher than the likelihood of true data samples, but is also higher than the log likelihood that can be achieved by fitting a 10K k-means mixture model to the data (eg as done in ""A note on the evaluation of generative models""). It should nearly impossible to outperform a 10K k-means mixture on Parzen estimation, which makes me extremely skeptical of these eVAE results. However, if you assume that the eVAE log likelihood is actually in bits, and multiply it by log 2 to convert to nats, then it corresponds to a totally believable log likelihood. Note that some Parzen window implementations report log likelihood in bits. Is this experiment comparing log likelihood in bits to competing log likelihoods in nats? 
(also, label units -- eg bits or nats -- in table) + +It would be really, really, good to report and compare the variational lower bound on the log likelihood!! Alternatively, if you are concerned your bound is loose, you can use AIS to get a more exact measure of the log likelihood. Even if the Parzen window results are correct, Parzen estimates of log likelihood are extremely poor. They possess any drawback of log likelihood evaluation (which they approximate), and then have many additional drawbacks as well. + +The MNIST sample quality does not appear to be visually competitive. Also -- it appears that the images are of the probability of activation for each pixel, rather than actual samples from the model. Samples would be more accurate, but either way make sure to describe what is shown in the figure. + +There are no experiments on non-toy datasets. + +I am still concerned about most of the issues I raised in my questions below. Briefly, some comments on the authors' response: + +1. ""minibatches are constructed to not only have a random subset of training examples but also be balanced w.r.t. to epitome assignment (Alg. 1, ln. 4)."" +Nice! This makes me feel better about why all the epitomes will be used. + +2. I don't think your response addresses why C_vae would trade off between data reconstruction and being factorial. The approximate posterior is factorial by construction -- there's nothing in C_vae that can make it more or less factorial. + +3. ""For C_vae to have zero contribution from the KL term of a particular z_d (in other words, that unit is deactivated), it has to have all the examples in the training set be deactivated (KL term of zero) for that unit"" +This isn't true. A standard VAE can set the variance to 1 and the mean to 0 (KL term of 0) for some examples in the training set, and have non-zero KL for other training examples. + +4. The VAE loss is trained on a lower bound on the log likelihood, though it does have a term that looks like reconstruction error. Naively, I would imagine that if it overfits, this would correspond to data samples becoming more likely under the generative model. + +5/6. See Parzen concerns above. It's strange to train a binary model, and then treat it's probability of activation as a sample in a continuous space. + +6. ""we can only evaluate the model from its samples"" +I don't believe this is true. You are training on a lower bound on the log likelihood, which immediately provides another method of quantitative evaluation. Additionally, you could use techniques such as AIS to compute the exact log likelihood. + +7. I don't believe Parzen window evaluation is a better measure of model quality, even in terms of sample generation, than log likelihood.",4,5.0,ICLR2017 +ryOXS6N4g,2,rJbPBt9lg,rJbPBt9lg,An interesting paper but only initial work about neural network based code completion.,"Pros: + using neural network on a new domain. +Cons: + It is not clear how it is guaranteed that the network generates syntactically correct code. + +Questions, comments: + How is the NT2N+NTN2T top 5 accuracy is computed? Maximizing the multiplied posterior probability of the two classifications? + Were all combinations of NT2N decision with all possible NTN2T considered? + + Using UNK is obvious and should be included from the very beginning in all models, since the authors selected the size of the + lexicon, thus limited the possible predictions. + The question should then more likely be what is the optimal value of alpha for UNK. 
+ See also my previous comment on estimating and using UNK. + + Section 5.5, second paragraph, compares numbers which are not comparable. + +",5,4.0,ICLR2017 +pMMVg722sm_,3,66H4g_OHdnl,66H4g_OHdnl,Closed formed solution for MLPs with ReLU activations,"This work uses dual formulations of Neural Networks with ReLU activations. It starts explaining the dual formulations with simplest single layer unregularised linear neural networks with a single dimensional output layer. Then gradually extends the models to deep, regularised models with ReLU activations. There is also an assumption on the data to be of rank one or whitened. The experiments are limited and not essential, since they only show that the theory can be confirmed with experiments, albeit they also demonstrate the limitations of simplified models studied here. +This work builds on a recently published work by Ergen and Pilanci, where more limited NNs have been studied, although with a similar dual formulation approach. The main contributions of this paper are the proofs of Theorems 4.1 and 4.2 (given in the appendix). The theorems in section 3 are also novel, but the simplified case of the results in section 4. While very interesting, these results are not applicable in practice, because as the experiments in section 5 show, the models studied here are too simple. However, this is a good step forward towards building a more complete framework to study better NNs. Specifically, it would be interesting to see if a NN with a softmax activation on the output layer can be reformulated with a similar technique. +I did not find the paper easy to read. For example, I haven't found the proof of Lemma 1.1. Several other results have proofs in the appendix, but it is not clear from the main article which proofs are available in the appendix.",6,4.0,ICLR2021 +M7SWXscDUil,1,H8VDvtm1ij8,H8VDvtm1ij8,"seems like a good paper, but I am not an expert","The paper proposes to use normalizing flows to improve estimates of aleatoric uncertainty in regression tasks. First, the paper suggests that since normalizing flows can improve the flexibility of output distribution, they can be used to mitigate issues of underfitting. Second, the paper proposes an approach that uses normalizing flows for recalibration, which allows people to still query the model’s statistical properties. The authors also introduce a plot that is useful for diagnosing calibration issues. + +The paper’s approach of applying normalizing flows in recalibration seems that it could be useful to the community. The supporting experimental results look reasonable. In addition, the presentation of the paper looks nice, the experiments are well documented, and the diagnosing plots seem like a helpful tool. Given these plus points, I would recommend acceptance. + +One suggestion is that it feels quite obvious that since flows can represent flexible distributions, using them to model aleatoric uncertainty can reduce underfitting issues. I am not sure if it is worth stating in the paper. It seems that the interesting part of the paper is recalibration, so perhaps it might be better to focus more on that. + +Questions for the authors: +- I wonder whether using flows to recalibrate is susceptible to overfitting. +- The paper focuses on aleatoric uncertainty. 
How can the approach proposed in the paper be combined with approaches for addressing epistemic uncertainty?",7,1.0,ICLR2021
734BN_BtTwW,4,TCAmP8zKZ6k,TCAmP8zKZ6k,Review of Towards a Reliable and Robust Dialogue System for Medical Automatic Diagnosis,"The paper proposes a method for automatic medical diagnosis in a dialog system which is comprised of two modules: one which proposes symptoms to inquire about, and another which decides whether to go ahead with the inquiry or inform a disease. The second module makes the decision by looking ahead and checking whether the symptom would cause differences in determining the disease. The authors argue that existing systems have only been evaluated with respect to diagnosis accuracy, and introduce two additional metrics to evaluate the reliability and robustness of the method. They evaluate their method and compare with existing state-of-the-art methods on two datasets. Furthermore, they do a small human evaluation and report results on how evaluators perceive the diagnosis validity, symptom rationality, and topic transition smoothness of different methods. Overall, the paper is well motivated and written, and the authors show equal or better results compared to other state-of-the-art methods.

I have the following questions/comments. Given clarifications in an author response, I would be willing to increase the score.
- Can you add more information about the datasets/augmented test sets used (e.g., train/dev/test sizes)? What data is the disease classifier trained on? What do the diseases in Table 4 represent?
- Some details about the training are missing. For example, how many seeds were used? How long does it take for the training to converge using each method?
- For the results of the DX dataset, is the same NLU model used for all three models? If not, how does that impact the results?
- What do the numbers in Table 2 indicate? Are they accuracy results on the noisy test-sets?
- The section ""Accuracy of diagnosis"" is not well written. Sequicity is introduced without any discussion on what it is. Additionally, new methods are introduced at the end of the paragraph. Why are these methods not used in Table 1? What is the intuition behind different results for different diseases? I suggest rewriting this section to first introduce the methods that are used, and then discuss the results. It might make sense to put this section before the ""Trusts for reliability"" section on page 6.
- In the same section, for the sentence ""And the results of each disease of Basic DQN is missing because the results are from the paper"", it is not clear that this refers to Table 4.
- Since you discuss Figure 2 in detail in section ""Qualitative Analysis"", I suggest removing it from section ""Accuracy of diagnosis"" as mentioning it just briefly in this section is confusing.
- There is no discussion of the errors that the proposed approach is making. Are the errors the same as the ones made with the other methods?

The text has spelling errors:
 - Page 3, P2: measures DSMAD -> measuring
 - Page 3, P3: dataset is includes -> dataset includes
 - Page 3, P4: paper is the same -> paper are the same
 - Page 4, P1: record -> records, produce -> produces, Remove And from the beginning of the sentence
 - Page 4, P2: imformed -> informed
 - Page 8, P1: an reckless -> a reckless (what does a reckless diagnosis refer to?)
 - Page 8, P2: ask -> asked. 
Remove And from the beginning of the sentence + - Page 8, P3: mimics -> mimic, could -> can, therefore insensitive -> therefore is insensitive + + +",6,4.0,ICLR2021 +BylQuBY6tB,2,B1g_BT4FvS,B1g_BT4FvS,Official Blind Review #3,"This paper proposes a novel way to denoise the policy gradient by filtering the samples to add by a criterion ""variance explained"". The variance explained basically measures how well the learn value function could predict the average return, and the filter will keep samples with a high or low variance explained and drop the middle samples. This new mechanism is then added on top of PPO to get their algorithm SAUNA. Empirical results show that it is better than PPO, on a set of MuJoCo tasks and Roboschool. + +From my understanding, this paper does not show a significant contribution to the related research area. The main reason I tend to reject this paper is that the motivation of their proposed algorithm is very unclear, lack of theoretical justification and the empirical justification is restricted on PPO -- one policy gradient method. + +1) It's unclear to me how it goes to the final algorithm, and what is the intuition behind it. Second 3.1 is easy to follow but the following part seems less motivated. In section 3.2 it's unclear to me why we need to fit a parametric function of Vex. In section 3.2, it's unclear to me why the filter condition is defined as Eq (7). The interpretation is a superficial explanation of what Eq 7 means but does not explain why I should throw out some of my samples, why high and low score means samples are helpful for learning and score in between does not? + +2) This paper argues the filter condition improves PG algorithms by denoising the policy gradient. This argument is not justified at all except a gradient noise plot in one experimental domain in figure 5b. That's not enough to support the argument that what this process is really doing. Some theoretical understanding of what the dropped/left samples will do is helpful. + +3) The method of denoising the policy gradient is expected to help policy gradient methods in general. It's important to show at least one more PG algorithm (DDPG/REINFORCE/A2C) where the proposed method can help, for verifying the generalizability of algorithm. + +In general, I feel that the content after section 3.1 could be presented in a much more principled way. It should provide not just an algorithm and some numbers in experiments, but also why we need the algorithm, what's the key insight of designing this algorithm, what the algorithm really did by the algorithm mechanism itself instead of just empirical numbers. +",3,,ICLR2020 +SJncf2gWz,3,HkL7n1-0b,HkL7n1-0b,A well-written paper that generalizes Wasserstein distance to VAEs ,"This paper provides a reasonably comprehensive generalization to VAEs and Adversarial Auto-encoders through the lens of the Wasserstein metric. By posing the auto-encoder design as a dual formulation of optimal transport, the proposed work supports the use of both deterministic and random decoders under a common framework. In my opinion, this is one of the crucial contributions of this paper. While the existing properties of auto-encoders are preserved, stability characteristics of W-GANs are also observed in the proposed architecture. The results from MNIST and CelebA datasets look convincing, though could include additional evaluation to compare the adversarial loss with the straightforward MMD metric and potentially discuss their pros and cons. 
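For readers who have not seen it, the MMD penalty mentioned above is cheap to state; a minimal numpy sketch follows (my own illustration, not the authors' code; the function name rbf_mmd2, the RBF kernel, the bandwidth sigma and the toy latent dimension are all arbitrary choices of mine):

import numpy as np

def rbf_mmd2(x, y, sigma=1.0):
    # Biased estimate of the squared MMD between two samples (rows are points),
    # usable as a penalty matching the encoded codes to the latent prior.
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

rng = np.random.default_rng(0)
z_prior = rng.normal(size=(256, 8))             # draws from the prior on the latent space
z_encoded = rng.normal(loc=0.3, size=(256, 8))  # stand-in for encoded training points
print(rbf_mmd2(z_encoded, z_prior))             # penalty added to the reconstruction term

Unlike the adversarial variant it needs no extra discriminator network, which is why I called it the more straightforward option above.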
In some sense, given the challenges in evaluating and comparing closely related auto-encoder solutions, the authors could design demonstrative experiments for cases where the Wasserstein distance helps, and perhaps also discuss its potential limitations.

The closest work to this paper is the adversarial variational bayes framework by Mescheder et al., which also attempts to unify VAEs and GANs. While the authors describe the conceptual differences and advantages over that approach, it would be beneficial to actually include some comparisons in the results section.",8,4.0,ICLR2018
S1lPyVNjFS,2,BJxSI1SKDH,BJxSI1SKDH,Official Blind Review #1,"
This paper proposes a method, Latent Morphology Model (LMM), for producing word representations for a (hierarchical) character-level decoder used in neural machine translation (NMT). The main motivation is to overcome the vocabulary sparsity of highly inflectional languages such as Arabic, Czech, and Turkish (the languages experimented with in the paper). To model the morphological inflection, they decouple lemmas and inflection types into 2 latent variables (z and f), where f is enforced to be sparse (arguably mimicking the human process). The literature review of NMT and the discussion on the potential advantage of morphology are concise. The proposed model is a variation of the Luong & Manning (2016) and Schulz et al. (2018) models; thus, their main contribution is the introduction of the latent morphological features to the decoder. The proposed LMM is trained by sampling z and f directly from the prior, and the sparsity of the morphological features is encouraged by an L0 penalty on the feature vector (parameterized as independent Kumaraswamy variables). They perform the main empirical study using 3 languages by translating from English to justify the proposed LMM. Lastly, they provide a quantitative analysis of the perplexities of unseen words and a qualitative analysis of words generated from a lemma with different feature vectors.
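Since the sparsification step is the part most readers will be unfamiliar with, here is my own toy sketch of the generic stretch-and-clip Kumaraswamy gate that relaxed-L0 approaches of this kind rely on (the function name sample_hard_kuma and the shape and stretch parameters are made up by me, and the authors' exact parameterization may well differ):

import numpy as np

def sample_hard_kuma(a, b, l=-0.1, r=1.1, rng=None):
    # Draw a Kumaraswamy(a, b) variable via its inverse CDF, stretch it to (l, r)
    # and clip to [0, 1]; the clipping puts positive probability mass on exactly 0
    # and exactly 1, which is what makes an expected-L0 penalty meaningful.
    rng = rng or np.random.default_rng()
    u = rng.uniform(size=np.shape(a))
    k = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)
    return np.clip(l + (r - l) * k, 0.0, 1.0)

f = sample_hard_kuma(np.full(10, 0.3), np.full(10, 3.0), rng=np.random.default_rng(1))
print(np.round(f, 2))  # most coordinates come out exactly 0, i.e. a sparse feature vector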
Finally, the feature variation analysis is interesting, but we only see one lemma from one language. Further discussion such as the consistency of features and the types of morphology, or the similarity of the lemma vector across context, will be helpful. + + +Minor Questions: +1. Given a word, how ambiguous it is to determine the stem and the morphological type in the subject languages? +2. How do you compute char-level PPL of the subword model? +3. In 4.4.4, did you obtain z of `go` from just translating `go` without any context? + +",6,,ICLR2020 +rkgxcuLcYr,2,rklj3gBYvH,rklj3gBYvH,Official Blind Review #2,"- The paper targets the scalability issue of certain meta-learning frameworks, and is therefore addressing an important and interesting topic. + +- Theoretical and technical novelty is rather minimal. + +- The paper writing is well beyond the ICLR level, and is honestly beyond the level required by a scientific manuscript in general. Writing needs a massive overhaul. + +- There are way too many grammatical mistakes. In addition, there are citations written the wrong way -e.g. Rajeswaran (2019)-, and also the flow of the ideas has room for improvement. +- For the aforementioned reference, the author ordering is wrong, and not consistent with the way it is cited in the text. + +- page 1: ""while the learner can learn a set of initial task parameters are easily optimized for a new task. "": Please fix and/or clarify. + +- page 2: ""where the classes contained in each meta-set is disjointed"". + +- page 3: ""The LSTM-based meta-learner proposed in this work, allow gradients to"" + + +- ""Memory-based Under review as a conference paper at ICLR 2020 methods (Ravi & Larochelle (2017)) that use all of the learner’s parameters as input to a meta-learner tend to break down when using a learner with a large number of parameters (Andrychowicz et al., 2016)."": Just to clarify: The latter paper, which is criticising the former category, is older than the paper representing the former category. Is that right? + +- page 2: ""superior parameter updates"". I see that the result has been based on classification accuracy. Notwithstanding the ablative study in the experiments secion, maybe superior parameter updates can be either replaced by a more unequivocal description of the mentioned comparison, or formally defined from there onwards. + +- At the ICLR level, I do not think that all this detailed description of backpropagation would be necessary. + +- Equation 1 and the description that follows: ""a_l is the layer’s pre-activation output"". Is a_l the post-activation output as well? Equation 1 implies so, doesn't it? + +- page 3: ""This limits MAML to domains where a small amount of update steps are sufficient for learning."": What do you mean by ""This""? Is it to have inner loops consisting of multiple sequential updates, which is what was referred to as ""preferable"" in the beginning of the same paragraph? i.e. Is the MAML limitation noted in this paragraph a limitation of the vanilla MAML (plus the other versions) as well or solely due to the ""preferred, yet detrimental"" extension of adopting multiple sequential updates? + + + +Minor: +- page 1: Supervised few-shot learning ""aims to challenge machine learning models to ..."": It does not challenge ML models; it is a specific ML paradigm. 
",1,,ICLR2020 +HylZ2Rw6FH,1,Skxn-JSYwr,Skxn-JSYwr,Official Blind Review #2,"The authors propose a method to extract features utilizing the adjacency between patches, for better classification/regression of satellite image patches. The proposed method achieves better results compared to a straightforward baseline method. + + +I have several significant concerns: + +- In the abstract, the authors claim that existing approaches such as post-classification add computational overhead to the task, whereas the proposed method does not add significant overhead. However, to me, post-classification can be very simple and straightforward, whereas the proposed method adds a series of computations: the proposed method not only extracts features from the input image, but also for another neighboring image; then features are combined (if two images are similar), before feeding into the network. The authors need to validate the claim that their method is more efficient. + +- The baseline the authors compare to is weak. There are existing works on satellite image classification/regression. Many of them also use semantic/contextual information, or aim to improve the robustness of features. For example: + +[1] Derksen et al. Spatially Precise Contextual Features Based on Superpixel Neighborhoods for Land Cover Mapping with High Resolution Satellite Image Time Series. IGARSS 2018. + +[2] Ghassemi et al. Learning and Adapting Robust Features for Satellite Image Segmentation on Heterogeneous Data Sets. Geoscience and Remote Sensing 2019. + +I understand that the authors cannot compare to everything. But the authors should compare to representative baseline methods. Methods mentioned in the related work section (Section 2.1) can also be compared to. + +- The proposed method is very application specific. The author only discussed the remote sensing application. Given the ICLR community's interest in general methods that can be applied to (or already been tested on) multiple applications, the paper would have been stronger if the methods applicabilityto other domains was discussed (and even better demonstrated).",3,,ICLR2020 +4LmOLgxI-_P,2,Fblk4_Fd7ao,Fblk4_Fd7ao,"Introduces two interesting ideas at once, but this makes interpretation difficult","*** Summary *** + +The paper investigates emergent gesture-based communication in Embodied Multi-Agent Populations. A noticeable feature of the paper is that it investigates emergent communication in the case of non-uniform distribution of intents and costly communication (i.e. agents are penalized for effort). The authors find that in certain scenarios, these conditions may lead to communication generalization of learned communication strategies to new, previously unseen agents. + +*** Relevance, Originality, Novelty *** + +I believe that the paper meets ICLR conference standards when it comes to the importance of the problem considered, the originality of the approach and the novelty of the proposed ideas. + +*** Clarity *** + +The paper is, in general, clearly written and is a pleasure to read. Figure 4 is deeply confusing, however. As mentioned in the appendix, the actual number of timesteps is equal to 5 and is upsampled during visualization. I believe that it is important to mention it right away (i.e. in the image description). +Otherwise, the reader can easily misunderstand the setup after seeing this picture. + +*** Quality *** + +The quality of the experimental support is, unfortunately, the weakest part of the contribution. 
+ +The authors introduce a number of exciting ideas, all of which make the communication learning setting more realistic. Specifically, the authors are studying communication in an embodied setting, Zipfian intent distribution, and the idea of effort-based action cost. + +At the same time, while the ideas sound very promising in theory, in practice, the complications introduced by the embodied setting complicate the interpretation of the obtained results when it comes to the effects of Zipfian intent distribution and effort-based action costs. + +In order to mitigate the difficulties, authors propose additional measures: providing explicit feature information to the outside observer (i.e. the observer directly gets the action effort); using ""torque pretraining"" to make the model more familiar with the reward landscape, dramatically reducing the number of actions. +These measures, in my opinion, are significantly detrimental to the potential impact of the paper. Providing effort information directly they makes the setting less realistic, defeating the purpose of the embodied approach. Torque curriculum makes the results very specific to the considered setting. + +There are other significant limitations of the experimental results that are probably indirectly induced by the embodied setting. Specifically, figure 2 and table 1 report no variability measures for the obtained result, and it's not clear, how many runs were made to obtain these results. Moreover, the agent population considered in these experiments is 10 (mentioned in the appendix). With such a small population, it is not surprising that the observer can simply memorize the individual patterns, and is not pressured to infer the underlying structure. It seems very natural, therefore, to increase the number of agents in the simulation. +I suspect that these limitations are a consequence of the fact that training in the multidimentional setting is very costly and time-consuming. + +Moreover, some of the limitations I describe in the ""technical soundness"" section could have been controlled for in a simpler setting. + +Overall, all things considered, it's not clear what are the insights gained from the embodied approach. + +*** Technical soundness *** + +There are certain technical problems with the contribution: + +Firstly, the justification for why Zipfian intents, together with effort-based penalty should lead to generalization is not rigorous (paragraph 1, section 4.1), and not completely convincing. + +One of the sentences in the intuitive justification raises particularly serious concerns. The authors say ""Coupled, [Zipfian intents and Energy regularization] incentivize an inverse relationship between energy exertion and intent frequency, assigning minimum energy trajectories to maximally occurring intents."". This assumes that energy levels are ""occupied"" by a limited number of actions. Why that would be true is entirely unclear to me. + +For example, for any action with a certain energy level, there is a mirror action (with all joint torques multiplied by -1) with exactly the same energy consumption. Even if for some reason such specific pair of distinct actions with exactly the same effort +is impossible in a given setting, given how high the degree of freedom of the system is, it is still easy to imagine that every energy level offers a plethora of actions to choose from. + +This raises a question of why then we see good performance in 2-action scenario. 
There is an alternative explanation that was not explored: there is a unique action with zero effort: doing nothing. +I suspect that it may be the only special case, where energy level is mapped uniquely onto an action (i.e. there is no different action with the same (0) energy penalty). + +Another alternative explanation that must be addressed is the exposure bias. Rare intents are sampled less frequently, therefore the model has less time to optimize its actions for that intent. This may explain some of the differences in energy costs between the rare and frequent actions. + +Lastly, the architecture choice and problem setting formalization also raise certain concerns. Specifically, the sender only receives joint positions and angles at time t, with no history. This does not seem to be an appropriate state specification. Coupled with the fact that the network is a simple feedforward architecture (no recurrence), this +limits the number of possible jestures that can be generated. + +*** Conclusion *** + +The paper addresses a highly relevant problem and proposes a number of interesting ideas. Unfortunately, in my opinion, +some of the proposed ideas (studying communication in an embodied setting, Zipfian intent distribution, costly actions), when introduced together, are not interacting well and are hindering result interpretation. + +As authors mention, using an embodied setting with multidimentional continuous action and observation spaces makes the problem extremely challenging. Therefore, when we see only partial success or no success in learning, it becomes difficult to identify the true source of the problem. That is, it is unclear if we are hitting a fundamental limit of what can be achieved by the combination of the Zipfian intent + costly actions, or is whether it a just a limitation of the architectures/algorithms that the authors used, when those algorithms are applied in a challenging multidimentional setting. + +The latter concern is exacerbated by some technical questions. In particular, the choice of a feedforward architecture seems extremely limiting, when combined with the state description that the authors used. The state space for the policy (sender) agent only includes the current joint configuration, without the history. This renders a large number of actions impossible (i.e. the ""pendulum-like"" wave of a hand becomes impossible, since when the hand returns to the middle after the first half-swing, the policy would act as if the action just started). + +Lastly, the logic behind the main hypothesis is not fully theoretically supported, and the alternative explanations are not addressed. + +Overall, I believe that the contribution is extremely promising, but it still requires some polishing and potential restructuring in order to be up to the standards of the ICLR conference. I do deeply hope to see this paper improved and published in the future, as I believe that after some improvements, it will be of interest to a large portion of the community. + +*** Suggestions *** + +I believe that the paper can be made much stronger if the authors take a step back and start with a simplified communication game setting. E.g. instead of actually producing an action trajectory, agents could pick among a number of actions with pre-defined effort levels, or they can separately pick the gesture and the amount of effort to allocate to that gesture (i.e. a subtle hand-wave vs flailing one's arms in the air). 
+This would allow to explore the benefits and limitations of the proposed Zipfian intent+Costly actions approach. In that case, demonstrating that the approach also works in an emboddied setting would be a beautiful cherry on top. Alternatively, it is possible to ""boost"" the embodiment part of the paper, +e.g. by pre-training the agents on some other tasks, thus making certain actions more natural, etc. Currently, it just seems that embodiment itself is not adding additional insights, but complicating the experiments. + +Adding noize (mentioned in the appendix) is a very good idea as intuitively, it seems that it is the only reason why the agents don't converge on an array of extremely subtle actions with near-zero effort. I believe that it was not discussed enough in the paper and maybe its importance is somewhat overlooked. + +*** Questions *** + +- ""Furthermore, with an increase in task complexity to 10 concepts, the external observer never performs better than this baseline strategy"" I thought the baseline accuracy was 0.34, so it seems that with curriculum the external observer does outperform the baseline. +- Apart from the zero-effort action, why can't the agents converge on minimally distinguishable actions with the same effort level? + +*** Typos and other minor suggestions *** + +""execution decentralized"" -> ""decentralized execution"" +""Agent reward"" verbally described as a ""function of state and actions taken"", but the formula is written as if that it is a function of observation and actions taken. + +Table 1 is slightly confusing to read, it may be better to make it broader so that the phrase ""Train Input"" does not break into two lines. + +Conflicting notation: N is used as the number of agents on page 3, but later N refers to the number of concepts. 10 agents - not enough to infer the latent variable, especially if no randomness is added to the agents. + +I personally believe that the use of the term ""zero-shot"" in this setting is not ideal. The observer is trained on a sample of communication protocols and is then tested on a held-out set. +In my view, it's analogous to ""generalization performance"", not ""zero-shot"" performance. I understand that the authors are referring to another study that did use ""zero-shot"", so this consideration did not affect my score. +Before the name for this problem setting is thoroughly established, it may be possible to use ""out of sample coordination"" or some other option. + +*** Update after rebuttal *** + +I have read the rebuttal and I deeply appreciate the detailed response by the authors. I especially appreciate the introduction of a number of experiments on a simple domain that help to illustrate the main point of the paper. + +At the same time, I believe that some serious issues still remain unresolved. For example, my main concern remains: I believe that wrapping the problem in the embodied setting is not introducing additional insights. I must clarify that I fully agree that studying how embodiment affects cognition/behavior is an extremely important and exciting area of research. But in the present paper, the models have no chance to benefit from embodiment (since there is no prior / shared embodied experience), but rather have to solve the task despite being placed in an embodied setting. + +The main insight (in my opinion) is the observation that Zipfian intent distribution together with energy costs could be a good basis for zero-shot communication. 
I think that additional experiments that the authors introduced help to strengthen this point, although more experiments could still be beneficial (e.g. systematically varying the population size), as well as a more thorough theoretical discussion. At the same time, the limitations of the main ""embodied"" experiment remain (most importantly, the fact that fairly high accuracy can be achieved because of the unique ""do nothing"" action trajectory). + +In short, a large part of the paper (on embodiment) contributes relatively little in terms of its impact and conclusions that can be drawn from it, which necessarily limits the extent to which the main insightful point can be explored. The main point is truly interesting, however, which makes the paper borderline. + +Overall, I believe that the paper is extremely promising and I would love to see an expanded version published. I feel very torn about the decision, but at the moment, I believe that the paper is still below the acceptance threshold, although only marginally. I am happy to adjust the score up, and I regret that I can not switch it to an ""accept"" recommendation. + +As a minor aside - the newly introduced Colab Notebook does not fully run and crashes at the cell #4 (model loading), so I can not fully explore the newly introduced experiments. That being said, I think that after fixing, this resource can be very useful in the future. This minor issue did not affect my evaluation.",5,4.0,ICLR2021 +SkxED7_nKS,1,HygSq3VFvH,HygSq3VFvH,Official Blind Review #3,"I take issue with the usage of the phrase ""skill discovery"". In prior work (e.g. VIC, DIAYN), this meant learning a skill-conditional policy. Here, there is only a single (unconditioned) policy, and the different ""skills"" come from modifications of the environment -- the number of skills is tied to the number of environments. This is not to say that this way of doing things is wrong, but rather that it is misleading in the context of prior work. Skill discovery in this context implies being able to have a single agent execute a variety of learned skills, rather than having one agent per environment with each environment designed to elicit a specific skill. + +Rather than ""skill discovery"", I suggest the authors position MISC relative to earlier work on empowerment, wherein a single policy was used to maximize mutual information of the form I(a; s_t | s_{t-1}). Modifying the objective to incorporate domain knowledge (as done in your DIAYN baseline) yields I(a; s_i | s_{t-1}) and is amenable to maximization by either of the lower bounds considered here. Indeed, your DIAYN baseline with skill length set to 1 and the number of skills equal to the number of actions (or same parameterization in the case of continuous actions) should recover this approach. I believe this would be a much more appropriate baseline, and I'd be curious to hear the intuition for why I(s_c ; s_i) should be superior. + +Apart from this missing baseline, the experimental results seem convincing. However, it is unclear whether or not VIME and PER were modified to incorporate domain knowledge (i.e. s_i/s_c distinction). Indeed, an appendix would be greatly appreciated, as many experimental details were omitted. Ideally, an experimental setup with previously published results (e.g. control suite for DIAYN, Seaquest for DISCERN) would be considered, but I can understand why this wasn't done as incorporating domain knowledge is the main contribution of the paper. 
That said, the claims should be weakened to reflect this gap, and domain knowledge should be mentioned more prominently (e.g. states of interest vs context are given, not learned). + +Rebuttal EDIT: + +The language around skills and the extent of prior knowledge still downplays things a bit too much for my liking. Needing new environment variations to obtain new skills is a large step backwards from things like DIAYN (the MISC/DIAYN combination needs more evidence to be considered a possible solution), and the s_i/s_c distinction is non-trivial to specify or learn for harder problems (e.g. pixel observations). + +That said, in the sort of settings under consideration (low dimensional state variables and environmental variations are simple to create) MISC does appear to be superior to prior work. The empowerment baseline is much appreciated, and while modifications of PER and VIME that incorporate prior knowledge would've also been nice, the experimental results pass the bar for acceptance in my view. + +",6,,ICLR2020 +H1lXESRY2X,1,H1gZV30qKQ,H1gZV30qKQ,"Has potential, needs some more investigation","This paper proposes a model-based value-centric (MVC) framework for transfer learning in continuous RL problems, and an algorithm within that framework. The paper attempts to answer two questions: (1) ""why are current RL algorithms so inefficient in transfer learning"" and (2) ""what kind of RL algorithms could be friendly to transfer learning by nature""? I think these are very interesting questions to investigate, and researchers that work on transfer learning could benefit from insights on them. However, I am not yet convinced that this paper answers these questions satisfyingly. It would be great to hear the author's thoughts on my questions below. + +The main insight I take away from the paper is that policy gradient methods are not suitable for transfer learning compared to model-based and value-centric methods for some assumptions (the reward function not changing and the transition dynamics being deterministic). This insight and the experiments in the paper are interesting, but I am unsure if the paper as it is presented now passes the bar for ICLR. + +In general the paper has two contributions: +A) analysis of value-centric vs policy-centric methods +B) an algorithm that is more useful for transfer learning. + +Regarding A) +The authors argue that policy-centric algorithms are less useful for transfer learning than value-centric methods. + +They first illustrate this with an example in Section 3. Since this is just one example, as a reader I wonder if it would not be possible to construct an example that shows the exact opposite, where value iteration fails but policy gradient doesn't. It feels like there are many assumptions that play into the given example (the reward function not changing; the transition dynamics being deterministic; the choice of using policy gradients and value iteration). + +In addition, the authors provide a theoretical justification in the Appendix (which I have briefly scanned) and the intuition behind it in Section 5. From what I understand, the main problem arises from the policy's output space being a Gaussian distribution, which causes the policy being able to get stuck in a local optimum. Further, the authors show (in the Appendix) that under some assumtions the value function always converges. Are there any guarantees on this when we don't have access to the true reward and transition functions (which themselves could get stuck in a local optimum)? 
+ +Would the authors say that the phenomenon is more a problem with the algorithm (policy gradient vs value iteration) than policy-centric and value-centric methods in general? Are there other methods that would be able to transfer policies better than policy gradient methods? + +Regarding B) +The author's proposed method (MVC) has three components: the value function, the dynamics model and the reward model, all of which are learned by neural networks. It seems like the main advantage comes from using a model (since that's the aspect which changes when having to transfer to an altered MDP). Does the advantage of this method over DDPG and TRPO come from the fact that the dynamics model changes smoothly, and we have an approximation to it? Then it is not surprising that this outperforms a policy gradient method. + +Other comments: + +- Could you explain what is meant by ""precise"" and ""imprecise"" when speaking about policies or value functions? +- Could you explain what is meant by the algorithm being ""accessible"" (e.g., Definition 1)? + +- Section 2.1: In Property 1, what is f? Could you make explicit why we are interested in the two properties listed? By ""not rigorously"", do you mean that those properties are based on intuition? These properties are used later in the paper and the appendix, so I wonder how strong of an assumption this is. +- Section 2.2: Could you explain what is meant by ""task""? You say that within the MDP, the transition dynamics and reward functions change, but the task stays the same. However, earlier (in the introduction) you state that only the environment dynamics change. I find it confusing that ""the task"" is something hand-wavy and not part of the formal definition of the MDP. In what exact ways can the reward function be influenced by the change in the transition dynamics? +- Section 3: Replace ""obviously"" with ""hence""; remove ""it is not hard to find that"". This might not be so trivial for some readers. +- Appendix B: Refer to Table 1 in the text. + +Clarity: The paper is written well, but I think some assumptions and their affects should be stated more clearly and put into context. The paper misses a discussion / conclusion section. It would be great to see a discussion on some of the assumptions; e.g., what if the low dimensional assumtion breaks down? What if we assume that also the reward function can change? The authors are in a unique position to give insight into these things (even if the results from the paper do not hold after dropping some assumptions) and it would be very helpful to share these with the reader in a discussion section.",5,2.0,ICLR2019 +HJe4_DGZ9S,3,S1e4Q6EtDH,S1e4Q6EtDH,Official Blind Review #3,"This paper proposes to use TensorTrain representation to transform discrete tokens/symbols to its vector representation. +Since neural networks can only work with numerical numbers, in many NLP tasks, where the raw inputs are in the discrete token/symbol format, the popular technique is to use ""embedding"" matrices to find a vector representation of those inputs. + +As the authors point out, the embedding matrices usually require huge number of parameters, since it assigns one vector for each input token for one embedding vector, but to attain a competitive performance in the real world applications, we need to use large number of embedding vectors, which results in a large number of parameters in the neural networks. 
+ +The paper assumes that those embedding matrices can be compressed by assuming that the low-rank property of embedding matrices. I think this is a valid assumption in many cases, and the paper shows the performance degradation according to this assumption is relatively small compared to the gain, a dramatically reduced size of parameters in the embedding stage, is substantial. + +I think the paper is well written and proposes a new direction to find a memory efficient representation of symbols. I am not sure the current initialization techniques, nor the training method in the paper are the right way to train a tensor train ""embedding"" but I expect that the authors would perform the follow up work on those topics.",6,,ICLR2020 +sHFHhpP26tO,4,NeRdBeTionN,NeRdBeTionN,Interesting proposal with some missing experiments ,"This paper proposes to use image representations from trained self-supervised models to evaluate GANs more accurately. Compared to the currently used representations from supervised-pretrained models e.g. InceptionV3, the authors claim, that such embeddings suppress information not critical for the classification process which, however, are assumed to be crucial for assessing the full distributions of real and generated images. The authors use 5 datasets and their respective representations from 5 models, 3 supervised and 2 self-supervised, to show that representations from self-supervised models lead to better GAN evaluations. The representations were used to evaluate 6 GAN models with 3 metrics, namely FID, Precision and Recall. A ranking of the GAN models shows inconsistencies between supervised and self-supervised based representations. By visual inspection, prediction accuracy tests, and a comparison of representation invariances the authors show that rankings via self-supervised embeddings are more plausible. + +Pros: Interesting proposal to better evaluate GANs and generative models for image data. The paper is well written and easy to understand. The experiments are extensive and support the claim of the authors. Testing for invariances of representations is an interesting idea and the results support the use of embeddings from self-supervised models. + +Cons: The authors argue that latent representations from an autoencoder capture all the information from images. It would be interesting to see how such representations, e.g. from the autoencoder used to show the invariances described in section A.1, behave compared to the proposed self-supervised representations. I would like to see them to be included in the experiments. + +Minor comment: Typo in A.1: corrseponding + +Edit: The authors have not responded to any of the reviews, i lower my rating to 4 + +Edit2: Oh there was a misunderstanding, i probably was not logged in and didn't see any comments and reviews. I raised the rating and will read the answers and will rate again. + +Edit3: After reading the rebuttals, i raise my rating to 7 ",7,4.0,ICLR2021 +r11pKi1i7,1,B1x9siCcYQ,B1x9siCcYQ,"Interesting idea and fleshed-out experiments, but somewhat niche appeal.","The paper proposes node embedding methods for applications where nodes are sequentially related. An example application is the ""Wikispeedia"" dataset, in which nodes are connected in a graph, but a datapoint (a wikispeedia ""game"") consists of a sequence of nodes that are visited. Each node is further attributed with textual information. 
+ +The methods proposed are most closely related to skipgrams, whereby the sequence of nodes are treated like words in a sentence. Then, node attributes (i.e., text) and node representations must be capable of predicting neighboring nodes/words. (Fig.s 1/2 are a pretty concise overview of the proposed architecture). + +Positively, this is a quite sensible extension and modification of existing ideas in order to support a new (or different) problem setting. + +Negatively, I'd say the applications for this technique are fairly niche, which may limit the paper's readership. The method is mostly fairly straightforward and not methodologically groundbreaking (probably borderline in terms of expected methodological contribution for ICLR). I also didn't understand whether the theoretical claims were significant. + +The wikispedia/physics experiments feel a bit more like proofs-of-concept rather than demonstrating that the technique has compelling real-world uses. The experiments are quite well fleshed-out and detailed though. +",5,3.0,ICLR2019 +kiA3o9QHwf,2,WGWzwdjm8mS,WGWzwdjm8mS,A new generalization metric as an early stopping criterion,"Summary: +The paper proposed to use the average of l_2 distance between stochastic gradients called gradient disparity as a metric to predict generalization. Early stopping is achieved by monitoring such a metric without needing a validation set. Some theoretical results are also given to motivate the use of gradient disparity as a generalization metric. + +Pros: +Overall the paper is well written. The use of a generalization metric to replace the validation set is an interesting idea. From the experiments in the paper, it seems the proposed metric indeed shares a similar trend as the test error. Also, the paper compared with a few other generalization metrics in Appendix H and showed the proposed one performs better than the baseline metrics when used as early stopping criteria. The experiments look quite comprehensive to me and the effects of different factors (label noise, batch size, network width, etc) on the gradient disparity are studied. + +Cons: +1. It looks like gradient disparity can be more correlated to generalization error instead of test error on some datasets (e.g., in Figure 13 for MNIST). This could make the algorithm stop too early since in some cases, the generalization error is increasing, but the training error decreases even faster and overall the test error is decreasing. However, gradient disparity has a similar trend with generalization error and when it increases for 5 epochs, we will terminate the algorithm while the test error is still decreasing. In addition, using test error + gradient disparity as a proxy for test error is not valid since gradient disparity has a different scale instead of while test error is between 0 and 1. + +2. The scale of gradient disparity may change with gradient magnitude and the authors used a re-scaling heuristic to stabilize the scale. I think the metric scale is an important issue when the metric is used for early stopping. However, the effect of re-scaling is not comprehensively studied and discussed in the experiments. + + +",5,4.0,ICLR2021 +BylK3m7RtH,1,BJxnIxSKDr,BJxnIxSKDr,Official Blind Review #3,"This paper introduces a factorization method for learning to share effectively in deep multitask learning (DMTL). 
The approach has some very satisfying properties: it forces all sharable structure to be used by all tasks; it theoretically captures the extremes of total sharing and total task independence; it is easy to implement, so would be a very useful baseline for future methods; and it is able to effectively exploit task similarities in experiments, and outperforms some alternative DMTL methods. + +I have two main concerns with the work: (1) It is most closely related to MTL factorization methods, but does not discuss this literature, or provide these experimental comparisons; (2) the interpretation of why Mint works is not clear: it is not clear that the universality is what makes it work, and there are no experimental analyses of what Mint learns. + +W.r.t. (1), there are several DMTL approaches that factorize layers across shared and task-specific components, e.g., [1], [2]. Such approaches are extensions of factorization approaches in the linear setting, e.g., [3], [4]. Compared to previous DMTL approaches, Mint is more closely related to these linear methods, as it takes the idea of factorizing each model matrix into two components and applies it to every applicable layer. In particular, the formal definition (i.e., without nonlinear activation between M and W) of Mint appears to be a special case of the more general factorizations in [1]; an experimental comparison [1] would make the conclusions more convincing, e.g., that universality is important. + +However, in the Mint experiments, a non-linear activation is added between the two components of each layer. This could void the universality property. Is there some reason why this is not an issue in practice? + +More generally, it is not clear that universality is the important advantage of Mint. Some existing DMTL methods already have this property, including Cross-stitch, which is compared to in the paper. The intriguing difference with Mint is that shared and unshared structure are applied sequentially instead of in parallel. Could there be an advantage in in this difference? E.g., is Mint a stronger regularizer because it forces all tasks to use all shared layers (learning the identity function for shared layers is hard), while something like cross-stitch could more easily degenerate to only use task-specific layers even when tasks are related? + +Beyond performance, analysis on what Mint actually learns would be clarifying. Can the sharing behavior be analyzed by looking at the trained Mint layers? Is Mint actually able to learn both of the extreme settings in practice? The non-synthetic experiments in the paper are only performed on tasks that are closely related. + +As a final note, adding layers to non-Mint models to make the topologically more similar to Mint models may not help these other models. It may make them more difficult to train or overfit more easily, since they are deeper, but do not have Mint method to assist in training. Comparisons without these extra layers would make the experiments more complete. Do cross-stitch and WPL share the conv layers across all tasks like Mint in Table 1? They should to make it a clear comparison. + +Other questions: +- What exactly are the “two simple neural networks” that produce the goal-specific parameters for goal-conditioned RL? Do these maintain the universality property? +- Can Mint be readily extended to layer types beyond FC layers? This may be necessary when applying to more complex models. + + +[1] Yang, Y. & Hospedales, T. M. 
“Deep Multi-task Representation Learning: A Tensor Factorisation Approach,” ICLR 2017. +[2] Long, M., Cao, Z., Wang, J., & Philip, S. Y. “Learning multiple tasks with multilinear relationship networks”, NIPS 2017. +[3] Argyriou, A., Evgeniou, T., & Pontil, M. “Multi-task feature learning,” NIPS, 2007. +[4] Kang, Z., Grauman, K., & Sha, F. “Learning with Whom to Share in Multi-task Feature Learning,” ICML 2011. +",3,,ICLR2020 +wCRYHCRpEkE,4,LVotkZmYyDi,LVotkZmYyDi,Proximal Gradient Descent-Ascent: Variable Convergence Under KL Geometry,"In this paper, the authors analyze the convergence of a proximal gradient descent ascent (GDA) method when applied to non-convex strongly concave functions. To establish convergence results, the authors show that proximal-GDA admits a novel Lyapunov function that monotonically decreases at every iteration. Along with KL-parametrized local geometry, the Lyapunov function was used to establish the convergence of decision variables to a critical point. Moreover, the rate of convergence of the algorithm was computed for various ranges of KL-parameter. + +Pros: +The paper studies an interesting and relevant problem in a vibrant field of research. + + +The convergence analysis of proximal-GDA are detailed and well presented (covering function value convergence, variable convergence, function value convergence rates, and variable value convergence rates). To show the results, the authors provided a novel Lyapunov function and analyzed the convergence using KL local geometry. + +The paper is very clear, concise and neatly written (almost free of typos). + +The related material are referenced and well-discussed in the paper. The authors clearly positioned their work in the related field and discussed their contributions in comparison to other similar works. + + +Cons: +The paper lacks any experimental results. Demonstrating the different convergence rates on critical points with various KL-parameter can be good experiment. + +Minor Comments: +1. Definition 2: Should it be $h(z)$ instead of $f(z)$? +2. Equation (19) in the Appendix, and P Third Inequality should be =. +3. Page 14: upper bound on distance of subgradient set to 0, second inequality should be =. +4. Appendix F: Theorem 3, Function missing a ``'c' +5. Equation (36): $d_{t-1}$ instead of $t_{t-1}$. +6. Last expression of Page 19: I think 1/2 is missing in the left hand-side. + +--------------------------------------------------------------------------------------------------- +Satisfied with the response, will keep my score the same.",8,4.0,ICLR2021 +r1BvYvAeG,3,r1YUtYx0-,r1YUtYx0-,Clear accept,"The paper studied the generalization ability of learning algorithms from the robustness viewpoint in a deep learning context. To achieve this goal, the authors extended the notion of the (K, \epsilon)- robustness proposed in Xu and Mannor, 2012 and introduced the ensemble robustness. + +Pros: + +1, The problem studied in this paper is interesting. Both robustness and generalization are important properties of learning algorithms. It is good to see that the authors made some efforts towards this direction. +2, The paper is well shaped and is easy to follow. The analysis conducted in this paper is sound. Numerical experiments are also convincing. +3, The extended notion ""ensemble robustness"" is shown to be very useful in studying the generalization properties of several deep learning algorithms. + +Cons: + +1, The terminology ""ensemble"" seems odd to me, and seems not to be informative enough. 
+2, Given that the stability is considered as a weak notion of robustness, and the fact that the stability of a learning algorithm and its relations to the generalization property have been well studied, in my view, it is quite necessary to mention the relation of the present study with stability arguments. +3, After Definition 3, the author stated that ensemble robustness is a weak notion of robustness proposed in Xu and Manner, 2012. It is better to present an example here immediately to illustrate. ",8,4.0,ICLR2018 +S1xBSNsZKH,1,rkerLaVtDr,rkerLaVtDr,Official Blind Review #1,"This paper introduces a new upper bound of unsupervised domain adaptation, which takes the adaptability term lambda into consideration. The new theory can be expanded into a novel algorithm. Experiments on domain adaptation datasets demonstrate improvement over previous state-of-the-art methods. +The authors propose to incorporate lambda into adversarial feature learning. Specifically, the authors assume that f_s and f_t are from some hypothesis space H. Then relaxing f_s and f_t to f_1 and f_2, we can turn the problem into a minimax game between f_1, f_2 and feature extractor g. To further implement their method, the authors propose to constrain f_1 and f_2 with source accuracy and target pseudo label accuracy. Based on the margin theory, the authors also introduce the cross margin discrepancy, which increase the reliability of adversarial adaptation. +The paper is well-written and the contributions are stated clearly. The attempt to incorporate lambda into feature learning is really interesting. + +However, I have several concerns: +*The proposed theory of equation (4), (5), and (6) is problematic. h is the hypothesis which belongs to a hypothesis class H. f_s and f_t are true labeling functions, and do not necessarily belong to the hypothesis space H. In this sense, the inequality of equation (4) does not hold. Problems of equation (5) and (6) are similar. The authors do realize that the supremum term can be arbitrarily large and put constraints to f_1 and f_2. But no matter what hypothesis class we are using, it generally does not contain the true labeling functions, and what we can do is only approximating them. Thus, in spite of the good performance of the proposed method, the proposed upper bound is not reliable. +*Lack of experimental results on the role of f_1 and f_2. The proposed method demonstrates good performance, but the manuscript does not provide some experimental results on the source of performance gain. In particular, how is f_1 and f_2 changed during training? Are they substantially different from the h’in MCD and MDD? Besides, how does each part contribute to the performance gain? Is it from the novel loss function or just the new adversarial adaptation method itself? A proper ablation study would be helpful. +",1,,ICLR2020 +r1glU6XCnQ,3,HkgSEnA5KQ,HkgSEnA5KQ,Interesting problem setup; insufficient experiments,"This paper provides a meta learning framework that shows how to learn new tasks in an interactive setup. Each task is learned through a reinforcement learning setup, and then the task is being updated by observing new instructions. They evaluate the proposed method in a simulated setup, in which an agent is moving in a partially-observable environment. They show that the proposed interactive setup achieves better results than when the agent all the instructions are fully observable at the beginning. + +The task setup is very interesting. 
However, the experiments are rather simplistic, and does not evaluate the full capability of the model. Moreover, the current experiments does not convince the reviewer if the claims are true in a more realistic setup. The authors compare the proposed method with one algorithm (their baseline) in which all the instructions are given at the beginning. I am wondering how the method will be compared with a state-of-the-art method that focuses on following instructions, e.g., Artzi and Zettlemoyer work. Moreover, the authors need to compare their method in an environment that has been previously used for other domains with instructions. ",6,4.0,ICLR2019 +B1gYdNvCFS,2,Byl8hhNYPS,Byl8hhNYPS,Official Blind Review #2,"This paper provides an approach to use visual information to improve text only neural machine translation systems. The approach creates a ""topic word to images"" map using an existing image aligned translation corpora. Given a source sentence, the model extracts relevant images, extracts their Resnet features and fuses them with the features generated from the word sequence. The decoder uses these fused representation to generate the target sentence. Overall, I like the approach, seems like it can be easily augmented to existing NMT systems. + +One of the claims of the paper was to be able to use monolingual image aligned data. However image captioning datasets are not mentioned. It would make sense to use image captioning data to create the image lookup. Also, what will be the performance of a standard image captioning system on the task ? I believe it will not be great, but I think for completeness, you should add such a baseline. + +Minor comments: +1. What is M in Algorithm 1 ? +2. First paragraph in related work is very unrelated to the current subject, please remove. +",8,,ICLR2020 +rk5LF4OeM,1,BkUDW_lCb,BkUDW_lCb,The contributions have already been done in the past,"This paper proposes a model for solving the WikiSQL dataset that was released recently. + +The main issues with the paper is that its contributions are not new. + +* The first claimed contribution is to use typing at decoding time (they don't say why but this helps search and learning). Restricting the type of the decoded tokens based on the programming language has already been done by the Neural Symbolic Machines of Liang et al. 2017. Then Krishnamurthy et al. expanded that in EMNLP 2017 and used typing in a grammar at decoding time. I don't really see why the authors say their approach is simpler, it is only simpler because the sub-language of sql used in wikisql makes doing this in an encoder-decoder framework very simple, but in general sql is not regular. Of course even for CFG this is possible using post-fix notation or fixed-arity pre-fix notation of the language as has been done by Guu et al. 2017 for the SCONE dataset, and more recently for CNLVR by Goldman et al., 2017. + +So at least 4 papers have done that in the last year on 4 different datasets, and it is now close to being common practice so I don't really see this as a contribution. + +* The authors explain that they use a novel loss function that is better than an RL based function used by Zhong et al., 2017. If I understand correctly they did not implement Zhong et al. only compared to their numbers which is a problem because it is hard to judge the role of optimization in the results. 
+ +Moreover, it seems that the problem they are trying to address is standard - they would like to use cross-entropy loss when there are multiple tokens that could be gold. the standard solution to this is to just have uniform distribution over all gold tokens and minimize the cross-entropy between the predicted distribution and the gold distribution which is uniform over all tokens. The authors re-invent this and find it works better than randomly choosing a gold token or taking the max. But again, this is something that has been done already in the context of pointer networks and other work like See et al. 2017 for summarization and Jia et al., 2016 for semantic parsing. + +* As for the good results - the data is new, so it is probable that numbers are not very fine-tuned yet so it is hard to say what is important and what not for final performance. In general I tend to agree that using RL for this task is probably unnecessary when you have the full program as supervision.",3,4.0,ICLR2018 +AyE47WgzzW,2,8qsqXlyn-Lp,8qsqXlyn-Lp,Interesting paper + Important problem but the method formulation is somewhat contrived ,"Summary: +======= +The distance metric learned by low-dimensional embeddings typically captures the knowledge that we already know. This paper proposes a principled way of factoring out prior knowledge (in the form of distance matrices) from tSNE and UMAP embeddings. Two algorithms are proposed for factoring out prior knowledge. JEDI (for tSNE embeds) uses a parameterized JS divergence-- the objective is to learn a low-dimensional distance metric that preserves high-dimensional distances but is orthogonal to the prior distance matrix. CONFETTI is the second algorithm which also optimizes a similar objective to JEDI but doesn't employ JS divergence and is algorithm independent, so one can use it for tSNE or UMAP. Results are shown on synthetic and real-world flower and cell-sequencing data and they highlight the superior ability of JEDI and CONFETTI algorithms in factoring-out prior knowledge compared to the baselines. + + + +Comments: +========== +The paper is well written and puts itself nicely in context of previous work. Given the ubiquity of low-dimensional embeddings these days, the paper addresses an important problem of factoring out prior information from the embeddings. + + +1). The paper doesn't describe the details of optimizing the parameterized JS Divergence (pJSD) metric that they propose. Is it even convex? + +2). How is the beta parameter of pJSD chosen? + +3). I didn't fully understand why the paper makes a big deal about UMAP, when JEDI is based on tSNE and CONFETTI can work with any embedding formulation? How not just mention a ""general embedding."" + +4). The formulation of the CONFETTI method seems a little arbitrary. Why are the prior distances factored out linearly? Again, no optimization details are provided regarding the CONFETTI method. + +5). It seems that, as an application, the proposed methods can also be used for factoring out demographic information from word embeddings. For such applications how would one define the prior distance matrix? The datasets used in the paper and the broader setup seems a little contrived. It would be nice to provide guidelines on how one can readily define such prior distance matrices for other applications. 
+ + +Typo: + +Conclusion: ""This shows that both are able applicable to real world..."" ",6,4.0,ICLR2021 +SkgtrSEZ67,3,r1lohoCqY7,r1lohoCqY7,A good problem discussed and the proposed ML approach seems reasonable.,"The authors are proposing an end-to-end learning-based framework that can be incorporated into all classical frequency estimation algorithms in order to learn the underlying nature of the data in terms of the frequency in data streaming settings and which does not require labeling. According to my understanding, the other classical streaming algorithms also do not require labeling but the novelty here I guess lie in learning the oracle (HH) which feels like a logical thing to do as such learning using neural networks worked well for many other problems. + +The problem formulation and applications of this research are well explained and the paper is well written for readers to understand. The experiments show that the learning based approach performs better than their all unlearned versions. + +But the only negative aspect is the basis competitor algorithms are very simple in nature without any form of learning and that are very old. So, I am not sure if there are any new machine learning based frequency estimation algorithms. + + +",7,4.0,ICLR2019 +Byxla5U9h7,1,SklXvs0qt7,SklXvs0qt7,The method proposed in the paper may have poor generalization and scaling performance,"This work considers a version of importance sampling of states from the replay buffer. Each trajectory is assigned a rank, inversely proportional to its probability according to a GMM. The trajectories with lower rank are preferred at sampling. + +Main issues: + +1. Estimating rank from a density estimator + +- the reasoning behind picking VGMM as the density estimator is not fully convincing and (dis)advantages of other candidate density estimators are almost not highlighted. + +- it is unclear and possibly could be better explained why one needs to concatenate the goals (what would change if we would not concatenate but estimate state densities rather than trajectories?) + +2. Generalization issues + +- the method is not applicable to episodes of different length +- the approach assumes existence of a state to goal function f(s)->g +- although the paper does not expose this point (it is discussed the HER paper) + +3. Scaling issues + +- length of the vector grows linearly with the episode length +- length of the vector grows linearly with the size of the goal vector + +For long episodes or episodes with large goal vectors it is quite possible that there will not be enough data to fit the GMM model or one would need to collect many samples prior. + +4. Minor issues + +- 3.3 ""It is known that PER can become very expensive in computational time"" - please supply a reference + + +- 3.3 ""After each update of the model, the agent needs to update the priorities of the transitions in the replay buffer with the new TD-errors"" - However the method only renews priorities of randomly selected transitions (why would there be a large overhead?). Here is from the PER paper ""Our final implementation for rank-based prioritization produced an additional 2%-4% increase in running time and negligible additional memory usage"" +",6,4.0,ICLR2019 +0UoDJeBJvxC,1,VbCVU10R7K,VbCVU10R7K, ,"## Review + +Given as set of pre-specified policies, this paper proposes a Bayesian method to estimate the posterior distribution of their average values, by estimating posterior distributions of their discounted stationary distribution ratios. 
These posterior distributions are used for off-policy evaluation in various ways. + + + +## Positives + ++ The idea focusing on estimating nonlinear functionals of multiple policy values is appealing. + + + +## Major concerns + ++ The results in Figure 2 are a bit surprising. We expect that methods based on concentration inequalities like Bernstein or student-t to be somewhat conservative, but the results suggest that their confidence intervals are extremely wide. For example, in the Bandit case, even after 200 samples, the interval log-width would suggest that Bernstein's confidence intervals is more than 7x the confidence intervals suggested by BayesDICE. What explains these results? + ++ Again on Figure 2, if ""Bernstein"" and ""Student t"" are unbiased methods, then having a very wide confidence interval should translate into over-coverage. However, they seem to be *under*-covering the true value. Are these methods somehow heavily biased? If not, what explains the under-coverage? + ++ The paper proposes a method for evaluating non-linear functionals of policy values, such as ranking scores over their values. However, it seems to me that in order to evaluate such nonlinear functions one would require knowledge about the *joint* distribution of values over all policies of interest. In the notation of the paper, one would require knowledge of $q(\bar{\rho}_1, ..., \bar{\rho}_N)$. However, it is not clear from the method description in Section 3.2 how one is able to estimate this joint distribution. Instead, it seems to me that all we get is $q(\bar{\rho}_i)$ for each policy $i$ -- that is, their marginal distributions. If that is the correct interpretation of what's going on in Section 3.2, then that raises the question of whether these distributions are independent. + + + + +## Minor concerns + ++ In appendix C.1, I did not understand the description of the ""bandit"" environment. Are rewards binary? + ++ Several symbols are not formally defined. E.g., on page 4, (lowercase) r(s,a) is not defined. + ++ In Algorithm 1, what is the role of quantity L*? Why do we need it as a stopping rule? + ++ Some notes on exposition. + + - The authors take some time to reveal what is their estimand --- the discounted stationary distribution ratios. As a reader, I would have benefited from having that explained much earlier, even before the conversation about ranking evaluation. + + - Section 3.2: the authors could have dedicated some more space developing the intuition for their method (e.g. an abridged version of Nachum and Dai 2020), even if that meant relegating some of the mathematical details to the appendix. As it stands, the section makes the paper incomprehensible as a standalone piece of research. + + +## Typos + ++ The indices on the sum in the definition of ""stationary visitation"" (p.4) are wrong. On the next line, the last conditioning should have been s[i+1] ~ T(.|s[i],a[i]) instead of s[i+1] ~ T(.|s[t],a[t]). ++ Philip Thomas' ""High-confidence off- policy evaluation"" citation shows up twice in the bibliography. ++ indentical --> identical (Pg. 5)",5,4.0,ICLR2021 +HyeCZACtYr,1,Hyl7ygStwB,Hyl7ygStwB,Official Blind Review #1,"The paper proposes an approach to incorporate BERT pretrained sentence representations within a NMT architecture. +It shows that simply pretraining the encoder of a NMT model with BERT does not necessarily provide gains (and can even be detrimental) and proposes instead to add a new attention mechanism, both in the encoder and in the decoder. 
The modification is relatively simple, but provides significant improvements in supervised and unsupervised MT, although it makes the model slower and computationally more expensive. The paper contains a lot of experiments, and a detailed ablation study. + +=== + +I'm very surprised by the results in Table 1, i.e. the fact that pretraining can decrease the performance significantly. The provided explanation ""Our conjecture is that the XLM model is pre-trained on news data, which is out-of-domain for IWSLT dataset mainly about spoken languages"" is not satisfactory to me. The domain mismatch is also there in the majority of GLUE tasks, SQUAD, etc. and yet pretraining with BERT significantly improves the performance on these tasks. When the encoder is pretrained with a BERT/XLM model, I assume the encoder is not frozen, but finetuned? + +The description of the algorithm in Section 4 could be simplified a lot I feel. Overall, the attention in the encoder is simply replaced by two attention layers: one over the previous layer like in a standard setting, and one on top of the BERT representation. Also I don't understand why the attention over the BERT sequence is also necessary in the decoder. Shouldn't this information already be captured by the encoder output? + +The Drop-Net Trick is interesting. But the fact that 1.0 gives the best performance (Section 6.2) is very unintuitive to me. This means that the model will never consider the setting with two attentions at training time, although this is what it does at test time. + +In Table 6, you propose experiments with 12 and 18 layers for fair comparison, because as you mention, your model with BERT-fused has more parameters. But IWSLT is a very small dataset and it would have been surprising that using 18 layers actually helps (overfitting is much more likely in that setting). Instead, I think something like an ensemble model would be a more fair comparison. In fact, the BERT-fused is essentially an ensemble model of the encoder. +Could you try the following experiment on IWSLT, where you do not pretrain the BERT model with the BERT objective, but with a NMT encoder trained in a regular supervised setting (i.e. do not reload a BERT model, but a NMT encoder that you previously trained without the fused architecture)? + +Overall, I think the gains are nice, but I would really like to see the comparison I mentioned just above, and comparisons with ensemble models. The proposed model is significantly larger / slower than the baseline models considered, and I wonder if you could not achieve the same gains with ensemble models. + +Something I like about the approach is that is it quite generic in the sense that you can provide any external sequence of vectors as input to your encoder. As a result, it is possible to leverage a model pretrained with a different tokenization. Tokenization is often an issue with pretraining in NLP (how do you leverage a model trained without BPE if you actually want to use BPE in your new model). The proposed approach does not has this constraint and I think this is something you should highlight more in the paper. 
+ +=== + +Small details in the related work section: +- I would cite ""Sutskever et al, 2014"" for the LSTM encoder, along with ""Hochreiter & Schmidhuber"", and not only ""Wu et al, 2016"" +- Removing the NSP task was proposed in ""Lample & Conneau, 2019"", not in ""Liu et al, 2019""",6,,ICLR2020 +elLyEavOhV9,2,VVdmjgu7pKM,VVdmjgu7pKM,Very interesting work with good experiments.,"The authors propose SCOFF, a novel architectural motif, one with memory, which, as they describe, can serve as a drop-in for an LSTM or GRU within any architecture. It is inspired by the notion that when modeling a structured, dynamic environment (such as one with objects moving around), one must keep track of both declarative knowledge and procedural knowledge. They propose that these two types of knowledge be factored, creating an architecture consisting of ""object files"" (OF) whose evolution is governed by input, all objects, and ""schemata"" which can be selectively applied to each OF. + +They evaluate SCOFF along several axes: +(1) Does SCOFF successfully factorize knowledge into OFs and schemata? +(2) Do the learned schemata have semantically meaningful interpretations? +(3) Is the factorization of knowledge into object files and schemata helpful in downstream tasks? +(4) Does SCOFF outperform state-of-the-art approaches? +To do this, they use several video prediction tasks as well as an RL task. + +This is a very interesting paper, with a natural, novel motif. It is well-written -- the motif has several important attributes, and one can quickly come to understand them (though perhaps a more involved diagram, like Figure 2 but showing what parameters come into play where, might be useful). The experiments appear to be carefully done, with considerable effort, and they put forth interesting evidence towards an affirmative in each of the above four questions. + +While the evidence put forth is useful, I do think there is considerable follow-up work needed to really demonstrate the efficacy of this system. In the realm of video prediction, the tasks considered are fairly simple, and certainly what is of true interest is video prediction much closer to the real world. With this, there are a wide variety of techniques and benchmarks (https://arxiv.org/abs/1804.01523 and its follow-ons come to mind as useful to quickly try). Much recent work has deferred the task of obtaining a reasonable encoding from/decoding to real-world images and assumed it in order to make progress on predicting the future within a fixed encoding (https://arxiv.org/abs/1612.00222, https://arxiv.org/abs/2002.09405, https://arxiv.org/abs/1806.08047) and have with them benchmarks to which the authors' method could be adapted. Certainly for some of these, SCOFF could be a subcomponent that augments these methods. I think these more complex future predictive tasks could better stress-test the structure of how information passes through SCOFF. Related, many of the experiments rely on the encoding/decoding provided by [Van Steenkiste 2018], which showed considerable success on the environments used, and it would be interesting to see how crucial that is. + +I recommend acceptance. The paper is interesting, well-written, and the experiments are useful. While I look forward to more definitive demonstrations of the utility of this approach, I do think that the amount done is considerable and warrants publication, and these important follow-ups would take a great deal more effort and are for future work. 
+ +A more minor comment, I think more details could be given for the RL task, both in model implementation and in exactly how the test task is specified -- apologies if I missed but I only see train environment details in the appendix. + +Minor, wrong ""two""/""to"" in ""Single object with switching dynamics"" experiments description. +",8,4.0,ICLR2021 +SygQwnxPaX,3,BJGVX3CqYm,BJGVX3CqYm,Inetresting approach to quantization and interesting experimental results,"The authors propose a network quantization approach with adaptive per layer bit-width. The approach is based on a network architecture search (NAS) method. The authors aim to solve the NAS problem through SGD. Therefore, they propose to first reprametrize the the discrete random variable determining if an edge is computed or not to make it differentiable and then use Gumbel Softmax function as a way to effectively control the variance of the obtained unbiased estimator. This variance can indeed make the convergence of the procedure hard. The procedure is then adapted to the problem of network quantization with different band-widths. + +The proposed approach is interesting. The differerentiable NAS procedure is particularly important and can have an important impact. The idea of having an adaptive per layer precision is also well motivated, and shows competitive (if not better) results empirically. + +Some additional experiments can make the paper stronger: +* Compare the result of the procedure to an exhaustive search in a setting where the latter is feasible (shallow architecture on an easy task with few possible bit widths) +* Compare the procedure to other state of the art NAS procedures (DARTS and ENAS) with the same search space adapted to the quantization problem, to empirically show that the proposed procedure is a compromise between these two methods as claimed by the authors. ",6,3.0,ICLR2019 +b7dA9eXjf9M,3,NMgB4CVnMh,NMgB4CVnMh,"Interesting idea, experimental validation falls short","### Overview: + +This paper proposes a new acoustic word embedding approach, where acoustic and text embeddings are jointly learned. The text encoder takes either phonemes or characters as input. The novelty of the paper lies in a new loss, which is based on stochastic neighbour embeddings (SNE). The acoustic embedding network is first trained with this loss, after which the text network is trained to produce similar embeddings for matched (acoustic, text) input. The proposed model is evaluated in a word recognition task, where an isolated spoken word's acoustic embedding is compared to the text embeddings and the nearest neighbour is used to classify the spoken word. + +**Note the edit at the bottom of this review, based on the authors' feedback.** + +### Strengths: + +- I do not believe that existing acoustic embedding methods have considered the idea of including a SNE-like objective, especially using the text-view to identify ""neigboring"" items. +- The use of posteriorgrams as input is well-motivated, in the context of what the paper tries to accomplish. +- The paper is generally easy to follow. + + +### Weaknesses: + +- The models aren't sufficiently compared to previous models on established tasks. (I make this more concrete below.) +- The experiments don't show the effect of using e.g. the posteriorgrams over standard features (like MFFCs). Posteriorgrams means that this approach is essentially reliant on a first-pass ASR model. +- As a minor weakness, some recent related work aren't cited (references given at the bottom). 
+ + +### Detailed questions and suggestions: + +A number of studies in the acoustic word embedding literature have used the same-different task to evaluate performance, e.g. in (He et al., 2017) and [1] and [2]. Given the (somewhat surprising) poor performance of the triplet-based model, I would suggest that the paper does a comparison on the same data and task, to confirm the validity of their triplet-based model. This will also make the work more valuable, in that it can be directly compared to previous studies. One potential issue with the triplet model in this paper is that I believe the model of (He et al., 2017) makes use of cosine similarity, instead of the Euclidean distance. Since labelled examples are available, it would also have been good to compare to a direct classification model, as in [3]. + +I did not state this as a weakness, since I am worried that I am missing some details, but I am concerned by the overall analysis of Section 4. First, by defining the neighbourhood as in equations (7) and (8), it seems that this model essentially optimises the loss based on whether two acoustic realisations are from the same or different words, and no finer-grained neighbourhood information is included. The only ""strength"" that is imposed comes from $c_i$, which is essentially linked to the word frequency. The section concludes that ""ANE has the added subtlety of pushing or pulling with more 'measured strength' based on how good the embeddings currently are,"" but if the loss is purely based on a weighing according to the word frequency, could something similar be accomplished by having a type-specific margin for the triplet loss in equation (13)? If I read the analysis correctly the margin is actually completely ignored: ""we can ignore the max operator and set $\alpha = 0$ in (13), since they are merely for data selection."" I do not believe that this last statement is correct. + +One further suggestion is to look at more advanced sampling strategies in the triplet model, as e.g. in [1] or [6]. + + +### Overall assessment: + +Given the shortcomings in the experimental investigation, I do not believe the paper can be accepted as it is. I would recommend that the authors include the above-mentioned additional experiments; this would be a non-trivial extension, and I, therefore, recommend the paper then be submitted a future conference. I award an ""Ok but not good enough - rejection"". + + +### Typos, grammar and style: + +- ""approximate phonetic match task"" -> ""approximate phonetic matching task"" +- ""As we can see in Table. 1"" -> ""As we can see in Table 1"" +- ""this beautiful equation"". I would suggest that the authors remove subjective words such as ""beautiful"". (I agree it's a beautiful equation, but I don't think this type of language is appropriate for such a paper.) +- ""it is not a good idea"". Similar to the above. +- Make sure to write phoneme sequences correctly. See [4]. + + +### Missing references: + +1. https://arxiv.org/abs/2006.02295 +2. https://arxiv.org/pdf/2006.14007 +3. http://arxiv.org/pdf/1510.01032 +4. https://arxiv.org/abs/1907.11640 +5. https://arxiv.org/abs/2007.00183 +6. https://arxiv.org/abs/1804.11297 +7. https://arxiv.org/abs/2007.13542 + + +### Edit based on the authors' response + +I believe the author(s) were able to address many of the major concerns that I and the other reviewers had. 
One issue is that much of this is placed in an appendix, so it doesn't form a core part of the main thread of the paper; I also disagree about one small point (see my separate comment to the Part 2 message below), but this is minor. Based on their more extensive investigation, I am changing my rating from ""4: Ok but not good enough - rejection"" to ""6: Marginally above acceptance threshold"".",6,3.0,ICLR2021 +qjH00SIdTUy,1,JeweO9-QqV-,JeweO9-QqV-,Review for SoGCN: Second-Order Graph Convolutional Networks,"This paper proposes a so-called second-order graph convolution, where an additional second-order term is introduced into traditional first-order graph convolution. The authors explain the merits of second-order graph convolution from the perspective of representation ability. The resulted second-order graph convolutional networks are compared with several graph convolutional networks on three benchmarks. + +Strengths: ++: Incorporation of second-order information into graph convolution seems interesting, and such analogous idea has been verified in 2D convolutions. + ++: The proposed method is simple and easy to implement. + +Weaknesses: +-: The idea on incorporation of second-order or higher-order information into convolution is not novel. For example, Factorized Bilinear (FB) [r1] and Second-Order Response Transform (SORT) [r2] introduce second-order terms into traditional 2D convolutions, and they also claim second-order terms have better representation ability. Besides, second-order or higher-order information have also been used for global pooling for convolution networks [r3, r4, r5], which also show better representation ability. This paper lacks discussions on above these works, which will bring a side effect on contributions of this paper. +[r1] Factorized bilinear models for image recognition, ICCV 2017. +[r2] SORT: Second-order response transform for visual recognition, ICCV 2017. +[r3] Second-Order Pooling for Graph Neural Networks, arXiv, 2020. +[r4] Deep CNNs Meet Global Covariance Pooling: Better Representation and Generalization, TPAMI 2020. +[r5] Kernel pooling for convolutional neural networks, CVPR, 2020. + +-: The experimental results are not very convincing. +(1) As shown in Table 2, pure soGCN achieves no improvement over compared methods, i.e., GatedGCN. GRU brings further gains for soGCN, but could GRU bring improvement for other compared methods? +(2) Number of parameters hardly represents model complexity totally. The model with the same number of parameters can have different computational complexity. Therefore, more metrics on model complexity (e.g., FLOPs) are suggested for comparison in Table 2. +(3) CIFAR10 and MNIST are too small and old to verify the effectiveness of different methods. The authors would better conduct experiments on more graph benchmarks. Besides, why higher-order GCN are not compared on real-world benchmarks. +(4) Why parameters number of 3WLGNN is 100K on ZINC ? + +-: The writing needs significant improvement. +(1) The authors would better give more detailed descriptions on differences between the proposed soGCN and related works ([Defferrard, 2016] and [Kipf&Welling, 2017]), further clarifying the contributions of the proposed method. +(2) I wonder the detailed computation methods and each curve in Fig.2, and why SoGCN is better than Vanilla GCN? +(3) The comparisons in terms of representation ability in section 4.3 is not very clear. 
The authors would better add a table to summarize representation ability of different graph convolution. +(4) Which method does MoNet indicate in Table 2 ? + +",5,4.0,ICLR2021 +Byg52vRAYr,3,ryeFY0EFwS,ryeFY0EFwS,Official Blind Review #2,"Summary +The surprising generalization properties of neural networks trained with stochastic gradient descent are still poorly understood. The present work suggests that they can be explained at least partly by the fact that patterns shared across many data points will lead to gradients pointing in similar directions, thus reinforcing each other. Artefacts specific to small numbers of data points however will not have this property and thus have a substantially smaller impact on the learning. Numerical experiments on MNIST with label-noise indeed show that even though the neural network is able to perfectly fit even the flipped labels, the ""pristine"" labels are fittet much earlier during training. The authors also experiment with explicitly clipping ""outlier gradients"" and show that the resulting algorithm drastically reduces overfitting, thus further supporting the coherent gradient hypothesis. + +Decision +The present work proposes a plausible, simple mechanism that might be contributing to the generalization of Neural Networks trained with gradient descent. Parts of the discussion stay informal as the authors themselves admit, but I appreciate that rather than providing mathematical decoration the authors focus on well-designed experiments that support their claims. Overall, the paper is of high quality and provides an interesting perspective on an important topic, which is why I think it should be accepted. + +Questions for the authors +The coherent gradient hypothesis seems equally valid in the absence of stochasticity. However, the latter is often seen as an explanation of the generalization performance of SGD. My understanding is that you are also using minibatched gradient descent. Would you expect your experiments to still be valid when using deterministic gradient descent (full batch)? Did you study the effects of large batch sizes on the experiments?",8,,ICLR2020 +Byg1CKm9KH,2,SJx4Ogrtvr,SJx4Ogrtvr,Official Blind Review #2,"Summary: This paper tries to improve the training for the binary neural network. + +Weaknesses: +[-] A lack of related works. There have been many related works about BNN in these years (after 2017), but the authors do not have a quick summary of them. +[-] More reference. e.g, when authors mention 'many related works require to store the full-precision activation map during the inference stage', some reference is necessary. +[-] Weak Motivation: The authors argue 'We analyze the behaviour of the full-precision neural network with ReLU activation' in the abstract. However, in Section 3, I cannot find any analysis. Only writing down the backward and forward cannot be called analysis. Initialization is different from the training dynamics. Assumptions and theorems should be highlighted. +[-] Poor writing: A lot of typos. Only in the last paragraph in Section 2, I find many typos, e.g. 'replaced replacing ReLU activation', 'any relaated works'. + +Questions: +[.] In experiments, what structure is used for ResNet? ResNet-18-like or ResNet-110-like? (The results for these two kinds of structure are totally different for binary neural network, as the difference in the number of channels) +[.] In experiments, the performance of the baselines seems lower than related papers? 
Do the authors increase the number of channels in each layer as the other people do? It can improve the result a lot, and I wonder whether the improvement still exists in this setting. +[.] In experiments, only CIFAR10 results have been reported, but I wonder what is the error bar looks like? (Do the authors run the experiments several times and calculate the variance?) +",1,,ICLR2020 +BkwNvgRgf,3,S1sRrN-CW,S1sRrN-CW,Review,"The paper proposes a new method to train knowledge base embeddings using a least-squares loss. For this purpose, the paper introduces a reweighting scheme of the entries in the original adjacency tensor. The reweighting is derived from an analysis of the cross-entropy loss. In addition, the paper discusses the connections of the margin and cross-entropy loss and evaluates the proposed method on WN18 and FB15k. + + The paper tackles an interesting problem, as learning from knowledge bases via embedding methods has become increasingly important for tasks such as question answering. Providing additional insight into current methods can be an important contribution to advance the state-of-the-art. + +However, I'm concerned about several aspects in the current form of the paper. For instance, the derivation in Section 4 is unclear to me, as eq.4 suddenly introduces a weighted sum over expectations using the degrees of nodes. The derivation also seems to rely on a very specific negative sampling assumption (uniform sampling without checking whether the corrupted triple is a true negative). This sampling method isn't used consistently across models and also brings its own problems, e.g., see the LCWA discussion in [4] + +In addition, the semantics that are introduced by the weighting scheme are not clear to me either. Using the proposed method, the probability of edges between high-degree nodes are down-weighted, since the ground-truth labels are divided by the node degrees. Since these weighted labels are then fitted using a least-squares loss, this implies that links between high-degree nodes should be less likely, which seems the opposite of what the scores should look like. + +With regard to the significance of the contributions: Using a least-squares loss in combination with tensor methods is attractive because it enables ALS algorithms with closed-form updates that can be computed very fast. However, the proposed method still relies on SGD optimization. In this context, it is not clear to me why a tensor framework/least-squares loss would be preferable. + +Further comments: +- The paper seems to equate ""tensor method"" with using a least squares loss. However, this doesn't have to be the case. For instance see [1,2] which propose Logistic and Poisson tensor factorizations, respectively. +- The distinction between tensor factorization and neural methods is unclear. Tensor factorization can be interpreted just as a particular scoring function. For instance, see [5] for a detailed discussion. +- The margin based ranking loss has been proposed earlier than in (Collobert et al, 2011). For instance see [3] +- p1: corrupted triples are not described entirely correct, typically only one of s or o is corrputed. +- Closed-form tensor in Table 1: This should be least-squares loss of f(s,p,o) and log(...)? +- p6: Adding the constant to the tensor as proposed in (Levy & Goldberg, 2014) can done while gathering the minibatch and is therefore equivalent to the proposed approach. + +[1] Nickel et al: Logistic Tensor Factorization for Multi-Relational Data, 2013. 
+[2] Chi et al: ""On tensors, sparsity, and nonnegative factorizations"", 2012 +[3] Collobert et al: A unified architecture for natural language processing, 2008 +[4] Dong et al: Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion, 2014 +[5] Nickel et al: A Review of Relational Machine Learning for Knowledge Graphs, 2016.",3,4.0,ICLR2018 +dxszZI1MEyy,3,x1uGDeV6ter,x1uGDeV6ter,Review for the paper: Adaptive Automotive Radar data Acquisition,"The authors propose an algorithm to select radar return regions that potentially contain objects inside. + +Strengths: ++ It seems to be an interesting topic that uses radar data together with RGB images from the camera. + +Weaknesses: +- The problem to solve in this paper is not reasonably stated. The authors claim that the purpose of detecting the ""important regions"" in the radar domain is to implement compressed sensing during radar data acquisition. However, since object detection on radar data is already proposed and studied by some researchers (e.g., [1][2][3]), the reason why this problem needs to be proposed separately is not clear to me. +- The innovation is limited for both Algorithm 1 and 2. It seems that algorithms that trying to provide some bounding boxes on radar from Faster R-CNN and CFAR detections. +- The experiments are not adequate to illustrate the performance. It's not clear that the dataset used in this paper. The radar points in the nuScenes dataset are significantly different from the Oxford radar dataset, especially on the density of the radar points due to different kinds of radar sensors. Besides, the results in Table 1 seems very simple by selecting some special cases from the testing set. + +Overall, I think the paper is not good enough to be accepted by ICLR. + + +[1] Major, Bence, et al. ""Vehicle Detection With Automotive Radar Using Deep Learning on Range-Azimuth-Doppler Tensors."" Proceedings of the IEEE International Conference on Computer Vision Workshops. 2019. +[2] Nobis, Felix, et al. ""A Deep Learning-based Radar and Camera Sensor Fusion Architecture for Object Detection."" 2019 Sensor Data Fusion: Trends, Solutions, Applications (SDF). IEEE, 2019. +[3] Wang, Yizhou, et al. ""RODNet: Object Detection under Severe Conditions Using Vision-Radio Cross-Modal Supervision."" arXiv preprint arXiv:2003.01816 (2020). +",4,4.0,ICLR2021 +oLq0Qn3RqyX,3,yrDEUYauOMd,yrDEUYauOMd,clarify the setting in more detail,"The paper studies the attainability of the equalized-odd fairness criteria introduced by Hardt et al'16 in classification and regression tasks. In particular, the paper claims that under certain conditions EQ is not even attainable. They proved the claim for the regression task but I could not find exactly where they discuss the classification attainability. In fact, the (non)attainability claim about EQ is confusing to me since by definition, a *perfect* predictor (which is non-trivial) satisfies EQ. Intuitively, their result is meaningful for the task of linear regression as a perfect predictor may not exist. But as we consider classification or non-linear regression it becomes less believable. For instance as the authors mentioned, in a previous work, Woodworth et al.'17 showed that even checking a predictor is fair w.r.t. EO notion is not possible when we use *finite* many samples. + +The results on classification with deterministic prediction seems restrictive as condition (i) seems very strong. Assuming condition (i), Theorem 4.1 seems straightforward to prove. 
Also, what is the difference between Theorem 4.2 and the previous results on the existence of fair predictor w.r.t. EQ in Hardt et al.'16 (e.g., Proposition 4.4.). Please elaborate on this. Lastly, Theorem 4.3 seems very interesting. + +Minor comments: +- The figure 1 which has been referred to several times of the paper is missing. +- Hardt et al. entry in References has typo. +- page 2: outputing -> outputting, utiziling ->utlizing +- page 3: Eqaulized -> Equalized +- page 4: Eqalized -> Equalized +- page 6: unconstrianed -> unconstrained, seperately -> separately +- page 8: convariance -> covariance, appximated -> approximated, numerial -> numerical +- page 13: quadractic -> quadratic +- page 15: insection -> intersection + +=====POST-REBUTTAL COMMENTS======== +I would like to thank authors for their clarifications. Accordingly, I have increased my score to 6.",6,3.0,ICLR2021 +26xhwWFEMq5,1,HjD70ArLTQt,HjD70ArLTQt,"The paper provides a systematic evaluation of scene generation methods, but there are some concerns regarding novelty and the proposed setup.","Summary: + +The paper provides a set of comparisons among different scene generation methods. It assesses ability of the models to fit the training set (seen conditionings), generalize to unseen conditionings of seen object combinations, and generalize to unseen conditionings composed of unseen object combinations. It finds that these models fit the training distribution with a moderate success, display decent generalization to unseen fine-grained conditionings, and have significant space for improvement when it comes to generating images from unseen coarse +conditionings. + +############################################################ + +Strengths: + +The authors provide a comprehensive set of experiments to compare performance of different scene generation methods using several metrics. This can be helpful for researchers to assess strengths and weaknesses of each model and its components, and helps them to gain insights into which aspect of models need to be improved. + +########################################################### + +Weaknesses: + +1. While the comparisons among different scene generation methods are valuable, there are concerns about novelty of the paper especially since similar works have been published considering other computer vision tasks (e.g. [A, B]). The authors assess existing models in different settings and report their findings. +2. There are some concerns regarding the overall setup for the experiments. Out of the three cases considered, the ability of a model to fit its training set (case 1) is not very interesting practically as we are more interested in the model’s generalization. Case 3, generalizing to unseen conditionings composed of unseen object combinations, is also not expected as the models are not particularly trained to be generalizable to unseen coarse conditionings. If one is seeking models with generalization ability to unseen coarse conditionings, he/she needs to incorporate a form of transfer/meta-learning and train the models differently. Generalization to novel categories is also an issue in object-level GANs. +3. The paper claims that it is very hard to assess which models perform better due to “models being trained to fit different data splits, using different conditioning modalities and levels of supervision, and reporting inconsistent quantitative metrics (e.g. 
repeatedly computing previous methods’ results using different reference distributions, and/or using different image compression algorithms to store generated images), among other uncontrolled sources of variation”. However, I do not see a clear evidence supporting this. Each of the other papers (LostGAN, OC-GAN, etc.) provides a set of comparisons with other methods. The authors need to note specifically which setups are inconsistent among different papers. They report that LostGAN-v2 outperforms other models in most tasks. This is consistent with results reported in the LostGAN-v2 paper (although they evaluate their method on a smaller number of metrics). + +############################################################## + +Reason for rating: + +While the paper provides a systematic evaluation of scene generation methods, there are some concerns regarding novelty and the proposed setup. I hope the authors clarify these in the rebuttal. + +############################################################## + +Additional comments: + +There are some minor grammatical errors in the paper, and it needs further proofreading. + +############################################################## + +References: + +[A] Are GANs Created Equal? A Large-Scale Study; Lucic et al.; NeurIPS 2018 + +[B] A metric learning reality check, Musgrave et al., ECCV 2020 + + +################################################################ + +After author response: The authors have addressed my comment about inconsistent evaluation setups among different papers. However, I sill think novelty of the paper is limited as it is a conditional counterpart of [A]. As mentioned by other reviewers, findings of the paper are quite incremental and are in line with LostGAN-v2 although the authors use a more consistent evaluation setup. +Overall, I keep my current rating. +",6,5.0,ICLR2021 +HJltqtIeKS,1,BkxA5lBFvH,BkxA5lBFvH,Official Blind Review #2,"This paper studies the problem of safe adaptation to avoid catastrophic failure in a new environment. It draws intuition from human behavior. The proposed method (risk-averse domain adaptation (RADA)) learns probabilistic model-based RL agents from source domains, and uses them to select actions that has the best worst-case performance in the target domain. + +The paper mentions safety-critical applications like auto-driving. However, generally, I don't think black-box models are suitable for these safety-critical applications.",3,,ICLR2020 +Hkl6hcj55H,2,Bke_DertPB,Bke_DertPB,Official Blind Review #4,"Summary: Virtual Adversarial Training (Miyato et al., 2017) can be viewed as a form of Lipschitz regularization. Inspired by this, the paper proposes a Lipschitz regularization technique that tries to ensure that the function being regularized doesn’t change a lot in virtual adversarial directions. This method is shown to be effective in training Wasserstein GANs. + +Motivation and placement in literature: Being able to effectively enforce the Lipschitz constraint on neural networks has wide ranging applications. Even though the paper predominantly considers the WGAN setting, the topic at hand is within the scope of NeurIPS and will of interest to the machine learning community at large. + +Claimed Contributions and their significance: +1. Practical method with good performance: The proposed method can be used to train WGANs with high (subjective) sample quality. 
Although better, quantitative evaluation methods are needed to make stronger claims about the efficacy of this approach for GAN training in general (see below), the method described here will likely be useful for practitioners and GAN community. I’m also convinced that this method has the potential to work for higher dimensions. +2. VAT as Lipschitz regularization: There is a relatively straightforward connection between the Lipschitz constraint and adversarial robustness - both imply that small changes in the inputs should lead to small changes in the outputs, in their respective space. There are also a number of papers that make strong connections between adversarial training and Lipschitz regularization (Parseval Networks (Cisse et. al, 2017) for example). Therefore, it is perhaps not too surprising that the LDS term from Miyato et. al. can be rephrased as a Lipschitz regularization term by picking suitable input-output (pre-)metrics (in Section 3). I currently don’t see this as a major contribution of this paper, although I’m open to changing my mind if this involves a subtlety that I’m missing. + +Related Work: +Khrulkov et al (2017) looks like a related work - especially related to how the way the adversarial perturbation is computed and backpropagation is performed. Also Gemici et. al. also discuss the limitations of the original gradient penalty paper (for Section 2.2) + +Questions and Points of Improvement +1. Better evaluation of GANs: Could you further convince us that this method alleviates common pitfalls of GAN training, such as mode collapse? There are a number of papers that give quantitative metrics for this purpose (such as Xu et. al. 2018). Since the quality of the WGANs presented is one of the biggest strengths of this paper, further evidence in this direction will make the paper stronger. + +2. Different tasks: +The method described looks flexible enough to be applied on domains other than Wasserstein distance estimation. Did you try other tasks where a Lipschitz penalty might help, such as adversarial robustness? The semi-supervised setting mentioned in the appendix look promising yet perhaps under-explored. + +3. Resultant Lipschitz constant: +Since this paper is about enforcing the Lipschitz constraint through regularization, more experiments on how well the Lipschitz constraint is enforced in practise would be helpful. For example, how much do your WGAN critics violate the 1-Lipschitz constraint? Once this is quantified, how does ALR compare to other Lipschitz regularization techniques? The function approximation task in Section 4.2 seems simple enough that you can probably compute gradient norms on a 2D grid and draw a histogram. How would the histograms look if you did this, for different methods? + +4. Sample efficiency: +Section 4.2 claims that using the explicit Lipschitz penalty is inefficient because violations of the Lipschitz constraint on samples from P_real, P_generated or P_interpolated likely be non-maximal. Could you make a theoretical or empirical case that the additional time spent for finding adversarial directions is actually worth it? If you have a way of quantifying how well the Lipschitz constraint is satisfied (as described above), then doing this empirically should be possible. + +5. Problematic baseline for spectral normalization: +The way spectral normalization (SN) was used/described in Section 4.1 seem to have some issues. First of all, batch normalization is incompatible with methods that achieve Lipschitz constraint via. 
architectural constraints, such as spectral normalization. Also, this statement looks problematic: “It can be seen that SN has a very strong regularization effect, which is because SN works by approximating the spectral norm of each layer and then normalizing the layers by dividing their weight matrices by the corresponding spectral norms, thereby resulting in overregularization if the approximation is greater than the actual spectral norm.“ In most practical cases, power iteration used in spectral normalization can get a very close approximation of the spectral norm of the weight matrices with a reasonable number (<20 is a conservative guess) of iterations. The over-regularization effect, however, does exist and is more connected to the loss of gradient norm as described in Anil et. al. than bad approximations to the spectral norm of weight matrices. + +Writing: The paper is well-written and easy to understand. + +Decision: Weak Accept. + +Other, lesser important points of improvement: +1. The argmax expression in (18) looks problematic - r doesn’t seem bounded, hence can be chosen arbitrarily large. +2. Equation (25) describes the optimal approximation. According to which metric is this optimal? +3. Use \leq for “less than or equal to” in 25. +4. Consider adding a colormap to Figure 1. + +________ +Post-rebuttal edit: The revisions made to the paper address some of the points of improvement listed above. I maintain my initial assessment of weak accept (leaning more towards accept), as I believe the methods discussed in this paper will be of interest to the research community. +",6,,ICLR2020 +_s6FkEfaiat,3,Qm7R_SdqTpT,Qm7R_SdqTpT,Official Blind Review #3,"Summary: +This paper proposes a future frame prediction framework where the video generation can transition between different actions using a Gaussian process trigger. The framework consists of three components: an encoder which encodes the frame to a latent code, an LSTM which predicts the next latent code given the current one, and a Gaussian process which samples a new latent code. The framework can decide whether to switch to the next action by adopting the new latent code, depending on the number of frames passed or the variance of Gaussian. + +Strengths: +The paper is easy to follow overall. The usage of Gaussian process to trigger the transition to the next action is reasonable and intuitive. Quantitative evaluations show that the method outperforms existing works for both reconstruction and output diversity for various datasets. + +Weaknesses and comments: +There are quite a few typos in the writing, especially toward the latter part of the paper. I’d encourage the authors to do a thorough check to ensure the paper is typo-free. +It seems switching actions at some fixed number of frames beats using the Gaussian variance for FVD, which is quite surprising. Can the authors provide some insights? Is it due to some inherent nature of FVD, or there’s still some room for improvement for the choosing criteria? +How important is the heuristic of changing states when using GP? Currently it is triggered when the variance is larger than two standard deviations. How will it affect the performance if a different threshold is used? +There’s a mistake in Table 1. The diversity score for DVG@15,35 is the best for KTH frames [10,25] (48.30), but DVG GP is bolded (47.71). This might also be an interesting point to discuss about why fixed number of frames performs better than GP. 
+",6,3.0,ICLR2021 +rJegO1tfKB,1,SJeLIgBKPS,SJeLIgBKPS,Official Blind Review #2,"This paper studies the implicit regularization phenomenon. More precisely, given separable data the authors ask whether homogenous functions (including neural networks) trained by gradient flow/descent converge to the max-margin solution. The authors show that the limit points of gradient descent are KKT points of a constrained optimization problem. + +-I think that the topic is important and the authors clearly made some interesting insights. +-The main results of this paper (Theorem 4.1 and Theorem 4.4) require that assumption (A4) is satisfied. Assumption (A4) essentially means, that gradient flow/descent is able to reach weights, such that every data x_n is classified correctly. To me this seems to be a quit restrictive assumption as due to the nonconvexity of the neural net there is a priori no reason to assume that such a point is reached. In this sense, the paper only studies the latter part of the training process. + +I feel that Assumption (A4) clearly weakens the strength of the main results. However, because the topic studied by the paper is interesting and the authors have obtained some interesting insights, I decided to rate the paper as a weak accept. + +Typos: +-p. 4: ""Very Recently"" +-p. 7 and p. 9: ""homogenuous"" (instead of ""homogeneous"") + +---------- + +I want to thank the authors for their response. However, I will stand by me evaluation and will not change it. +I agree though that assumption (A4) is indeed reasonable, although of course very strong. +",6,,ICLR2020 +#NAME?,1,qOCdZn3lQIJ,qOCdZn3lQIJ,Official Blind Review #1,"This paper proposed an extension of blockwise scaled sign compressor in Zheng et al. (2019). The proposed method exploits the temporal correlation between two consecutive gradients. The authors show that one can have a higher compression rate by inserting distortion to the compressed gradient. A tighten bound is provided such that the asymptotic rate (including constant) is exactly the same as the full-precision counterpart. The experiments show that the proposed compressor can achieve additional 40%-50% reduction on communication compared to the scaled sign. Overall, the reviewer thinks the idea is interesting. The reviewer has a few comments: + +1. The proposed method considers randomly flipping the direction for elements that have the same sign as the averaged gradient in the last step. In this way, the sign is always correct for the elements that have opposite direction from the last gradient. I wonder will the results change if we consider flipping the sign of the elements that have opposite direction? + +2. Since alpha has a very small upper bound, it is hard to see any theoretical improvement over scaled sign. + +3. Theorem 4.2 does not show that one can achieve a linear speedup, i.e., O(1/\sqrt{nT}) rate. + +4. For the distributed training with high speed network, the extra overhead incurred by compression is not trivial and cannot be overlooked. As there is no results against CPU wall clock time, it is not clear if the proposed method is really faster than the scaled sign in terms of elapsed time. + +5. Can you show the final test accuracies on ImageNet achieved by each algorithm? 
It seems that scaled sign has slightly higher accuracy.",6,4.0,ICLR2021 +ecOgK7MiCg,2,YhhEarKSli9,YhhEarKSli9,"Very interesting idea and results, but relevant previous work regarding Bayesian networks and structure learning is not discussed at all.","The authors propose a framework, called AutoBayes, to automatically +detect the conditional relationship between data features (X), task +labels (Y), nuisance variation labels (S), and potential latent +variables (Z) in DNN architectures. Assuming a Bayesian network (BN) +which represents the (possibly) conditional independencies between the +aforementioned variables, the authors propose a learning algorithm +which consists of applying Bayes-ball to detect and prune unnecessary +edges in the graph (effectively finding a subgraph, independence map +of the BN), train the resulting DNN architecture, and choose the network +which achieves the highest validation performance. This idea is +interesting, especially compared to hyperparameter optimization +approaches for model tuning, and the results seem convincing. + +However, relevant previous work is not cited and discussed in the paper. +Specifically, BN structure learning and inference in BNs (both of +which are well studied and have extensive literature) are fully +relevant, but are not discussed or mentioned at all. For instance, +the paper uses undefined terms such as ""Bayesian graph model,"" ""Bayesian +graphs,"" and ""graph model,"" in place of Bayesian network (which is +rigorously defined). It is important that such related previous work +be discussed to delineate what is novel in the presented approach and +place its contributions within the greater context of this previous +work. This inclusion would also help the presentation of concepts in +the paper. For instance, the discussion surrounding equations 1 +and 2, i.e.: +""The chain rule can yield the following factorization for a generative +model from Y to X (note that at most 4! factorization orders exist +including useless ones such as reverse direction from X to Y )...,"" is +the concept of an elimination ordering in the elimination +algorithm for BNs (and graphical models in general). Showcasing the +presented work in this light, (i.e., as a natural +combination of BN structure learning with macro-level neural-architecture +optimization) would be particularly novel and compelling. + +Finally, it is important to discuss the complexity of the presented +algorithm. Given the Bayesian networks (BNs) in Algorithm 1, each +independence map (and the underlying DNN +architecture) must be trained then validated. This algorithm scales +factorially in the number of nodes in the BN. It is great that the +selected subgraph performs so well (Figure 2), but super-exponential +complexity multiplied by DNN cross-validation training is going to be +very hard to do as m and n grow in Algorithm 1. + +Other comments: + +-The definition of nuisance and nuisance-variables are implicitly +assumed throughout the paper. An exact definition of what is meant by +nuisance, in the context, of the paper would be very helpful. + +-Algorithm 1 is mostly one large block of text, and is very hard to +parse on the first read. + +-In the main contributions enumerated from pages 2-3, contribution 3 +looks redundant given contribution 1. + +-""Besides fully-supervised training, AutoBayes can automatically build some relevant graphi- +cal models suited for semi-supervised learning."" <- please include a +link to where this is discussed. 
The enumerated list of contributions +would be a perfect roadmap for the paper (just include references to +sections after every contribution) + +-""We note that this paper relates to some existing literature... as addressed in Appendix A.1. Nonetheless, AutoBayes +is a novel framework that diverges from AutoML, which is mostly employed to architecture tuning +at a micro level. Our work focuses on exploring neural architectures at a macro level, which is not +an arbitrary diversion, but a necessary interlude."" <- Appendix A.1 +should really be included in the +paper, related work is a not an optional section. For instance, the +reader may not know what the authors mean in terms of micro versus +macro level. Reading this without further explanation until later +in the paper, it would +seem that micro-level is more nuanced than macro-level and the former +perhaps subsumes the latter; the authors should detail what they mean +and distinguish their work from previous works in the main paper. + +-The terminology is very clumsy: ""Bayesian graph models,"" ""Bayesian +graph,"" and ""graph model"" are not established terms and, as such, +should be defined so the reader knows what type of ML method is being discussed. The authors should specify that they have a Bayesian +network whose factorization describes the conditional relationship +between (random) variables. + +-""VAE Evidence Lower Bound (ELBO) concept"" <- please include citation +for the ELBO + +-""How to handle the exponentially growing search space of possible +Bayesian graphs along with the number of random variables remains a +challenging future work."" <- this is exactly structure learning in +Bayesian networks (see the Bayesian information criterion, i.e., BIC score).",5,4.0,ICLR2021 +Bkyu46Xlz,1,r1Zi2Mb0-,r1Zi2Mb0-,Official review,"The paper explores neural architecture search for translation and reading comprehension tasks. It is fairly clearly written and required a lot of large-scale experimentation. However, the paper introduces few new ideas and seems very much like applying an existing framework to new problems. It is probably better suited for presentation in a workshop rather than as a conference paper. + +A new idea in the paper is the stack-based search. However, there is no direct comparison to the tree-based search. A clear like for like comparison would be interesting. + +Methodology. The test set newstest2014 of WMT German-English officially contains 3000 sentences. Please check http://statmt.org/wmt14. +Also, how stable are the results you obtain, did you rerun the selected architectures with multiple seeds? The difference between the WMT baseline of 28.8 and your best configuration of 29.1 BLEU can often be simply obtained by different random weight initializations. + +The Squad results (table 2) should list a more recent SOTA result to be fair as it gives the impression that the system presented here is SOTA.",3,4.0,ICLR2018 +gaeMs7gLkdb,4,uHNEe2aR4qJ,uHNEe2aR4qJ,The paper systematically investigates the sequential learning problem in neural networks through the lens of negative pretraining however there are several concerns relating to the experimental setup,"**Summary:** This paper conducts an empirical study to examine the well-known negative transfer phenomenon (termed as a negative pretraining effect in this work) in neural networks. In particular, a network trained on a sequence of tasks performs inferior to a network trained from scratch on the intended target task. 
The main idea of the paper is to study this phenomenon by formulating and intervening on different constituents of the sequential learning process - (1) changing the learning rate across tasks, (2) number and type of tasks encountered in the learning process, and (3) resetting the model biases when going from one task to another. The paper conducts experiments on four visual classification datasets (CIFAR-10, FashionMNIST, MNIST, SVHN) and report their findings for sequential training of ResNet-18 architecture. They show that increasing the learning rate after training on the first task can alleviate the negative pretraining effect. They further showcase how different task discretization and resetting model biases help to reduce the effect. + +**Pros:** +The paper investigates an important problem of negative pretraining which has implications for different time-dependent learning paradigms (lifelong/ continual learning). Systematically studying how different constituents of the sequential learning process affect the severity of the negative pretraining effect is a reasonable approach and this paper attempts to take an initial step with this approach. + +**Cons:** +There are several concerns about the experimental setup which makes the results unconvincing: +- The paper examines a single model (ResNet-18) on four datasets and through different experiments, the paper demonstrates that on MNIST, FashionMNIST, and SVHN datasets there is no clear (or less substantial) negative pretraining effect. However, this paper is about how different interventions can help mitigate the negative pretraining effect. If the current experimentation setup does not render the phenomena (except for CIFAR-10), this raises the question of whether the paper is analyzing the right setup (datasets and model)? Effectively a single model (ResNet-18) is examined on the CIFAR-10 dataset for the interventions, which is not representative enough. What happens if we increase the model complexity? What happens if we change the input modality? (see [1] for more details) +- The paper concludes that increasing the learning rate on the subsequent task helps to **remove** the negative pretraining effect. In Figure 3 (discussion in Section 4) they report **a single case** of increasing the learning rate (10e-4 to 20e-4). It is unclear whether this is always the case. What happens if the learning rate is increased from 10e-4 to 50e-4 or 10e-4 to 10e-3? What is the general recipe to set the learning rate for the next task in the sequence? +- The paper claims that it has proposed three distinct ways to **remove** the negative pretraining effect. Given the empirical evidence, this **(remove)** is too strong a claim to make. In some experiments, it alleviates the negative effect (see Figure 3, 4, 6: CIFAR-10) while in other experiments, results are inconclusive (see Figure 5: CIFAR-10, Shift-Then-Blur, Contrast-Then-Blue, and most of the experiments on MNIST, FashionMNIST, SVHN). + +**General comment:** There is no denying the fact that this paper studies an interesting phenomenon through systematic experiments. However, the authors should consider evaluating the proposed interventions on the setup (datasets/ models) where the negative pretraining is a clearly visible phenomenon to conclude generic applicability of the discussed interventions. + +Please cite the below-mentioned work as it also empirically demonstrates the difficulty of warm-starting with pre-trained initialization. + +[1] Ash, Jordan T., and Ryan P. Adams. 
""On the difficulty of warm-starting neural network training."" arXiv preprint arXiv:1910.08475 (2019). + +Please cite the peer-reviewed version of the related literature instead of the arXiv (for available ones), e.g.: + +Dyer, Ethan, and Guy Gur-Ari. ""Asymptotics of Wide Networks from Feynman Diagrams."" International Conference on Learning Representations. 2019. +",4,4.0,ICLR2021 +BJNCqPqxM,3,SJx9GQb0-,SJx9GQb0-,Official review for paper 1144,"Updates: thanks for the authors' hard rebuttal work, which addressed some of my problems/concerns. But still, without the analysis of the temporal ensembling trick [Samuli & Timo, 2017] and data augmentation, it is difficult to figure out the real effectiveness of the proposed GAN. I would insist my previous argument and score. + +Original review: +----------------------------------------------------------------------------------------------------------------------------------------------------------------------- +This paper presented an improved approach for training WGANs, by applying some Lipschitz constraint close to the real manifold in the pixel level. The framework can also be integrated to boost the SSL performances. In experiments, the generated data showed very good qualities, measured by inception score. Meanwhile, the SSL-GANs results were impressive on MNIST and CIFAR-10, demonstrating its effectiveness. + +However, the paper has the following weakness: + +Missing citations: the most related work of this one is the DRAGAN work. However, it did not cite it. I think the author should cite it, make a clear justification for the comparison and emphasize the main contribution of the method. Also, it suggested that the paper should discuss its relation to other important work, [Arjovsky & Bottou 2017], [Wu et al. 2016]. + +Experiments: as for the experimental part, it is not solid. Firstly, although the SSL results are very good, it is guaranteed the proposed GAN is good [Dai & Almahairi, et al. 2017]. Secondly, the paper missed several details, such as settings, model configuration, hyper-parameters, making it is difficult to justify which part of the model works. Since the paper using the temporal ensembling trick [Samuli & Timo, 2017], most of the gain might be from there. Data augmentation might also help to improve. Finally, except CIFAR-10, it is better to evaluate it on more datasets. + +Given the above reason, I think this paper is not ready to be published in ICLR. The author can submit it to the workshop and prepare for next conference. ",4,4.0,ICLR2018 +BJelopbHn7,1,SJequsCqKQ,SJequsCqKQ,"Interesting for the ICLR community, but somewhat straightforward ","The paper proposes deep learning extension of the classic paradigm of 'conformal prediction'. Conformal prediction is similar to multi-label classification, but with a statistical sound way of thresholding each (class-specific) classifier: if our confidence in the assignment of an x to a class y is smaller than \alpha, then we say 'do not know / cannot classify'). This is interesting when we expect out of distribution samples (e.g., adversarial ones). + +I think this paper, which is very well written, would make for nice discussions at ICLR, because it is (to my knowledge) the first that presents a deep implementation of the conformal prediction paradigm. However, there are a couple of issues, which is why I think it is definitely not a must have at ICLR. 
The concrete, deep implementation of the approach is rather straightforward and substandard for ICLR: Features are taken from an existing, trained SOTA DNN, then input KDE, based on which for each class the quantiles are computed (using a validation set). Thus, feature and hypothesis learning are not coupled, and the approach requires quite a lot of samples per class (however, oftentimes in multi-label prediction we observe a Zipf law, ie many classes have fewer than five examples). Furthermore, there is no coupling between the classes; each class is learned separately; very unlikely this will work better than a properly trained multi-class or (e.g., one-vs.-rest) multi-label classifier in practice. Since a validation set is used to compute the quantiles, substantial 'power' is lost (data not used very efficiently; although that could be improved at the expense of expensive CV procedures). +",4,5.0,ICLR2019 +S1lORAAAcB,3,HyxjOyrKvr,HyxjOyrKvr,Official Blind Review #3,"This paper focuses on the problem of neural network compression, and proposes a new scheme, the neural epitome search. It learns to find compact yet expressive epitomes for weight parameters of a specified network architecture. The learned weight tensors are independent of the architecture design. It can be encapsulated as a drop in replacement to the current +convolutional operator. It can incur less performance drop. Experiments are conducted to show the effectiveness of the proposed method. However, there are some concerns to be addressed. +-It is not too clear how to learn the epitomes and transformation functions. +-Authors stated that the proposed method is independent of the architecture design. From the current statements, it is not explained clearly.",6,,ICLR2020 +HklJ3TmqnX,2,Hy4R2oRqKQ,Hy4R2oRqKQ,More explanation of the method can improve the paper,"This paper proposes to improve deep variational canonical correlation analysis (VCCA, Bi-VCCA) by 1) applying adversarial autoencoders (Makhzani et al. ICLR 2016) to model the encoding from multiple data views (X, Y, XY) to the latent representation (Z); and 2) introducing q(z|x,y) to explicitly encode the joint distribution of two views X,Y. The proposed approach, called adversarial canonical correlation analysis (ACCA), is essentially the application of adversarial autoencoder to multiple data views. Experiments on benchmark datasets, including the MNIST left right halved dataset, MNIST noisy dataset, and Wisconsin X-ray microbeam database, show the proposed ACCA result in higher dependence (measured by the normalized HSIC) between two data views compared to Bi-VCCA. + +This paper is well motivated. Since adversarial autoencoder aims to improve based on VAE, it's natural to make use of adversarial autoencoder to improve the original VCCA. The advantage of ACCA is well supported by the experimental result. + +In ACCA_NoCV, does the author use a Gaussian prior? If so, could the author provide more intuition to explain why ACCA_NoCV would outperform Bi-VCCA, which 1) also use a Gaussian prior; and 2) also does not use the complementary view XY? Why would adversarial training improve the result? + +In ACCA, does the form of the prior distribution have to be specified in advance, such as Gaussian or the Gaussian mixture? Are the parameters of the prior learned during the training? 
+ +When comparing the performance of different models, besides normalized HSIC, which is a quite recent approach, does the author compute the log-likelihood on the test set for Bi-VCCA and different variants of ACCA? Which model can achieve the highest test log-likelihood? + +According to equation (6), in principle, only q(z|x,y) is needed to approximate the true posterior distribution p(z|x,y). Did the author try to remove the first two terms in the right hand side of Equation (11), i.e., the expectation w.r.t. q_x(z) and q_y(z), and see how the model performance was affected? + +Does adversarial training introduce longer training time compared to the Bi-VCCA?",6,4.0,ICLR2019 +SygAAIO3n7,3,HJx38iC5KX,HJx38iC5KX,This paper lacks sufficient novelty,"In this paper, the author(s) propose a method, invariant feature learning under optimal classifier constrains (IFLOC), which maintains accuracy while improving domain-invariance. Here is a list of suggestions that will help the author(s) to improve this paper. +1.The paper explains the necessity and effectiveness of the method from the theoretical and experimental aspects, but the paper does not support the innovation point enough, and the explanation is too simple. +2.In this paper, Figure3-(b) shows that the classification accuracy of IFLOC-abl method decreases a lot when γ is taken to 0. Figure3-(c) shows that the domain invariance of IFLOC-abl method becomes significantly worse when γ is 10. The author(s) should explain the reasons in detail. +3. The lack of analysis on domain-class dependency of each dataset makes the analysis of experimental results weak. +",4,5.0,ICLR2019 +HklYjjdaKH,1,HkxU2pNYPH,HkxU2pNYPH,Official Blind Review #2,"This paper studies the problem of data-to-text generation so that the generated text stays truthful to the data source. The idea of the paper is use a learned confidence score as to how much the the encoder-decoder is paying attention to the source. The paper includes several components, 1. unconditioned language model to incorporate the confidence score, 2. use calibration techniques to adjust the output probability; 3. variational bayes objective to learn the confidence score. + +The paper has good motivations and is quite well-written. The problem is of great pragmatic interest. In the experimental part, the authors demonstrate the effectiveness of the proposed algorithm. + +1. For training part, regarding the language model and variational bayes objective being trained jointly, does it have convergence problem? What is the motivation of not training them jointly? +2. Will the code be released and the human evaluation be published? +3. There are some importance baseline missing, such as [1], [2], [3] + +[1] Marcheggiani, Diego, and Laura Perez-Beltrachini. ""Deep graph convolutional encoders for structured data to text generation."" arXiv preprint arXiv:1810.09995 (2018). +[2] Ratish Puduppully, Li Dong, and Mirella Lapata. ""Data-to-text Generation with Entity Modeling."" arXiv preprint arXiv:1906.03221 (2019). +[3] Ma, Shuming, et al. ""Key Fact as Pivot: A Two-Stage Model for Low Resource Table-to-Text Generation."" arXiv preprint arXiv:1908.03067 (2019). +",6,,ICLR2020 +rJgJLYo39H,4,H1xauR4Kvr,H1xauR4Kvr,Official Blind Review #4,"In this work, the authors develop the discriminative jackknife (DJ), which is a novel way to compute estimates of predictive uncertainty. 
This is an important open question in machine learning and the authors have made a substantial contribution towards answering the question of ""can you trust a model?"" DJ constructs frequentist confidence intervals via a posthoc procedure. Throughout, the authors provide excellent background and exposition. They develop an exact construction of the DJ confidence intervals in Section 3.1. This is an intuitive approach that the authors explain well. Next, they explain and then develop the concept of higher order influence functions. They do a great job of communicating this concept. Section 3.4 provides the theoretical guarantees for DJ. The related work section is extensive and thorough. The authors have thoughtful experiments that demonstrate positive attributes of DJ. + +I suggest that this paper is weak accepted. On synthetic and real data, the DJ empirically works well, based on Figure 4 and Table 2. In addition, the theoretical exposition is very clear and compelling. The intuition provided in Sections 3.1 and 3.2 helps readers really understand what's going on, then 3.3 and 3.4 give theoretical justifications of the utility of DJ. + +However, I have a few suggestions for improvement that lead to the ""weak"" acceptance. First, I'll cover minor quibbles, then more major points. + +In Figure 1: At first, the blue dots and blue shading were not clear to me. In the legend, maybe explain that the blue shading indicates coverage and the blue dots indicate regions of higher/lower discrimination. + +In Equation 6: Looks like there is a misplaced parentheses. I think the last two terms of the equations should be Q(Vn-)+Q(Vn+) not Q(Vn-+Q(Vn+)) + +In Equation 7, and throughout: I think a better notation for the function would be \mathcal{I}^{(1)}_{\hat{\theta}} rather than \mathcal{I}^{(1)}_{\theta} since the influence function is a derivative with respect to the optimal parameters. + +Below equation 8: you should probably have an additional k exponent in the numerator of the kth order influence term, i.e. = \frac{\del^k \hat{\theta}_{i, \epsilon}}{\del \epsilon^k} + +In Theorem 1: could you calculate this without \grad L(D, \theta), since L a function of \ell ? Maybe mention this in the appendix + +It could be nice for an appendix study on how approximating the Hessian impacts performance for cases when we can compute the Hessian exactly. + +Table 1 could be expanded to include comparisons on computational bottlenecks and if there's retraining in these other methods. + +In Figure 4, please indicate the order of the IF used in the DJ procedure. DJ(m=?)? + +Major issues: + +Why weren't the other jackknife procedures used as baselines as well? I realize DJ has advantages compared to them, but an apples to apples comparison would be useful. For some researchers, LOO CV might not be prohibitive. This could be a chance to really sell your method: if it does well enough compared to more expensive LOO jackknife procedures, that would be a compelling reason to choose DJ. + +Could you please check this reference and let us know if it substantial is different from Influence functions that you develop? ""Higher order influence functions and minimax estimation of nonlinear functionals"" Robins 2008 DOI: 10.1214/193940307000000527 Robins et al. 
develop a way to compute higher order influence functions, which you do claim you're the first to do ""to the best of your knowledge."" +",6,,ICLR2020 +Bycm6ytgf,2,B1CQGfZ0b,B1CQGfZ0b,"Interesting formulation, but execution lets the paper down","This paper presents a method for choosing a subset of examples on which to run a constraint solver +in order to solve program synthesis problems. This problem is basically active learning for +programming by example, but the considerations are slightly different than in standard active +learning. The assumption here is that labels (aka outputs) are easily available for all possible +inputs, but we don't want to give a constraint solver all the input-output examples, because it will +slow down the solver's execution. + +The main baseline technique CEGIS (counterexample-guided inductive synthesis) addresses this problem +by starting with a small set of examples, solving a constraint problem to get a hypothesis program, +then looking for ""counterexamples"" where the hypothesis program is incorrect. + +This paper instead proposes to learn a surrogate function for choosing which examples to select. The +paper isn't presented in exactly these terms, but the idea is to consider a uniform distribution +over programs and a zero-one likelihood for input-output examples (so observations of I/O examples +just eliminate inconsistent programs). We can then compute a posterior distribution over programs +and form a predictive distribution over the output for all the remaining possible inputs. The paper +suggests always adding the I/O example that is least likely under this predictive distribution +(i.e., the one that is most ""surprising""). + +Forming the predictive distribution explicitly is intractable, so the paper suggests training a +neural net to map from a subset of inputs to the predictive distribution over outputs. Results show +that the approach is a bit faster than CEGIS in a synthetic drawing domain. + +The paper starts off strong. There is a start at an interesting idea here, and I appreciate the +thorough treatment of the background, including CEGIS and submodularity as a motivation for doing +greedy active learning, although I'd also appreciate a discussion of relationships between this approach +and what is done in the active learning literature.Once getting into the details of the proposed approach, +the quality takes a downturn, unfortunately. + +Main issues: +- It's not generally scalable to build a neural network whose size scales with the number +of possible inputs. I can't see how this approach would be tractable in more standard program +synthesis domains where inputs might be lists of arrays or strings, for example. It seems that this +approach only works due to the peculiarities of the formulation of the only task that is considered, +in which the program maps a pixel location in 32x32 images to a binary value. + +- It's odd to write ""we do not suggest a specific neural network architecture for the +middle layers, one should seelect whichever architecture that is appropriate for the domain at +hand."" Not only is it impossible to reproduce a paper without any architectural details, but the +result is then that Fig 3 essentially says inputs -> ""magic"" -> outputs. Given that I don't even +think the representation of inputs and outputs is practical in general, I don't see what the +contribution is here. + +- This paper is poor in the reproducibility category. 
The architecture is never described, +it is light on details of the training objective, it's not entirely clear what the DSL used in the +experiments is (is Figure 1 the DSL used in experiments), and it's not totally clear how the random +images were generated (I assume values for the holes in Figure 1 were sampled from some +distribution, and then the program was executed to generate the data?). + +- Experiments are only presented in one domain, and it has some peculiarities relative to +more standard program synthesis tasks (e.g., it's tractable to enumerate all possible inputs). It'd +be stronger if the approach could also be demonstrated in another domain. + +- Technical point: it's not clear to me that the training procedure as described is consistent +with the desired objective in sec 3.3. Question for the authors: in the limit of infinite training +data and model capacity, will the neural network training lead to a model that will reproduce the +probabilities in 3.3? + +Typos: +- The paper needs a cleanup pass for grammar, typos, and remnants like ""Figure blah shows our +neural network architecture"" on page 5. + +Overall: There's the start of an interesting idea here, but I don't think the quality is high enough +to warrant publication at this time. +",5,4.0,ICLR2018 +rylFpJr_3Q,1,ryxxCiRqYX,ryxxCiRqYX,"Interesting paper, should be accepted","This paper presents a very interesting interpretation of the neural network architecture. + +I think what is remarkable is that the author presents the general results (beyond the dense layer) including a convolutional layer by using the higher-order tensor operation. +Also, this research gives us new insight into the network architecture, and have the potential which leads to many interesting future directions. +So I think this work has significant value for the community. + +The paper is clearly written and easy to follow in the meaning that the statement is clear and enough validation is shown. (I found some part of the proof are hard to follow.) + +\questions +In the experiment when you mention about ""embed solvers as a replacement to their corresponding blocks of layers"", I wonder how they are implemented. About the feedforward propagation, I guess that for example, the prox operator is applied multiple times to the input, but I cannot consider what happens about the backpropagation of the loss. + +In the experiment, the author mentioned that ""what happens if the algorithm is applied for multiple iterations?"". From this, I guess the author iterate the corresponding algorithms several times, but actually how many times were the iterations or are there any criterion to stop the algorithm? + +\minor comments +The definition of \lambda_max below Eq(3) are not shown, thus should be added.",8,1.0,ICLR2019 +HJ91THweG,1,S1v4N2l0-,S1v4N2l0-,"Interesting discovery, good results, but not a lot of content.","The paper proposes a simple classification task for learning feature extractors without requiring manual annotations: predicting one of four rotations that the image has been subjected to: by 0, 90, 180 or 270º. Then the paper shows that pre-training on this task leads to state-of-the-art results on a number of popular benchmarks for object recognition, when training classifiers on top of the resulting representation. + +This is a useful discovery, because generating the rotated images is trivial to implement by anyone. It is a special case of the approach by Agrawal et al 2015, with more efficiency. 
+ +On the negative side, this line of work would benefit from demonstrating concrete benefits. The performance obtained by pre-training with rotations is still inferior to performance obtained by pre-training with ImageNet, and we do have ImageNet so there is no reason not to use it. It would be important to come up with tasks for which there is not one ImageNet, then techniques such as that proposed in the paper would be necessary. However rotations are somewhat specific to images. There may be opportunities with some type of medical data. + +Additionally, the scope of the paper is a little bit restricted, there is not that much to take home besides the the following information: ""predicting rotations seems to require a lot of object category recognition"". + + + +",6,5.0,ICLR2018 +YVyW-r8ie9,3,D9I3drBz4UC,D9I3drBz4UC,"This paper proposed a simple but effective method to solve long-tailed recognition. It finds out that current methods suffer high bias and variance. To relieve this problem, it propose to ensemble several experts to make predictions. It proposes a diversiy loss to guarantee the diversity among experts and learn an expert assignment module to turn on/off an expert when predicting. The proposed method of this paper is general and significantly outperforms the state-of-the-art method by 5% to 7%.","This paper proposed a simple but effective method which significantly outperforms the state-of-the-art method by 5% to 7%. The experiments are adequate and rigorous. Besides these , the writing is clear and easy to understand. Totally, this paper is a good work +Pros: +1. The writing is clear. +2. The view of bias and variance is novel and interesting. +3. The experiments are adequate and rigorous. +4. The performance of proposed method is very strong. +5. The proposed method is simple, effective and general. +Cons: +1. The procedure of test is not clear enough. For example, how to turn on/off an expert? By a threshold? +2. The description of the part about shared, indepent, and recuded extractor is not clear enough.",7,4.0,ICLR2021 +wFytLGTzGrA,4,#NAME?,#NAME?,My review,"Summary: + +This paper proposes a new approach for meta-RL. The paper claims that the proposed method reduces variance and bias of the meta-gradient estimation by only a few samples. In addition, this paper claims that their method is more interpretable. + +My comments: + +There are lots of unknown, unwarranted claims about this paper in addition to no thorough experiments and comparison with previous works : + +1. Paper claimed that their proposed method not only reduces the variance of meta-gradient but also reduces bias. Reading through this method and experiments, I don't see any theoretical or empirical justification why that should be the case. + +2. The method claims this method has a more interpretable meta-gradient. Again, there is nothing in this paper that verifies this claim. +3. Paper proposed to use fewer samples for the adaptation phase. I don't understand why this can help at all. +4. Experiments are incomplete and there is no rigorous evaluation to analyze this method. +5. The proposed context in this paper has been already proposed in previous work, not sure what is new here. +6. There are lots of things that need to defined like trail, few-shot, etc. + +In summary, this paper is not ready at all and needs lots of works. 
Hence, I'd recommend this paper to be rejected.",2,5.0,ICLR2021 +H1aEoGAgG,2,BJ_wN01C-,BJ_wN01C-,"Interesting algorithm to training with limited memory, but needs some additional relationships to existing work.","The authors provide a novel, interesting, and simple algorithm capable of training with limited memory. The algorithm is well-motivated and clearly explained, and empirical evidence suggests that the algorithm works well. However, the paper needs additional examination in how the algorithm can deal with larger data inputs and outputs. Second, the relationship to existing work needs to be explained better. + +Pro: +The algorithm is clearly explained, well-motivated, and empirically supported. + +Con: +The relationship to stochastic gradient markov chain monte carlo needs to be explained better. In particular, the update form was first introduced in [1], the annealing scheme was analyzed in [2], and the reflection step was introduced in [3]. These relationships need to be explained clearly. +The evidence is presented on very small input data. With something like natural images, the parameterization is much larger and with more data, the number of total parameters is much larger. Is there any evidence that the proposed algorithm could continue performing comparatively as the total number of parameters in state-of-the-art networks increases? This would require a smaller ratio of included parameters. + +[1] Welling, M. and Teh, Y.W., 2011. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11)(pp. 681-688). + +[2] Chen, C., Carlson, D., Gan, Z., Li, C. and Carin, L., 2016, May. Bridging the gap between stochastic gradient MCMC and stochastic optimization. In Artificial Intelligence and Statistics(pp. 1051-1060). + +[3] Patterson, S. and Teh, Y.W., 2013. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems (pp. 3102-3110). + +",5,5.0,ICLR2018 +KL3fCuB2Zt,1,Z_3x5eFk1l-,Z_3x5eFk1l-,"This paper presents a novel ADversarial Meta-Learner (ADML), the claimed first work that tackles adversarial samples in meta-learning. In its core algorithm, a given model is diverged simultaneously by a set of clean samples and a set of perturbed counterparts, to generate two inner models, which are then meta-updated in an adversarial manner using new sets of clean/adversarial samples.","Summarization of the contribution: + +This paper presents a novel ADversarial Meta-Learner (ADML), the claimed first work that tackles adversarial samples in meta-learning. In its core algorithm, a given model is diverged simultaneously by a set of clean samples and a set of perturbed counterparts to generate two inner models, which are then meta-updated in an adversarial manner using new sets of clean/adversarial samples. + + +Strengths: + +The paper addressed an important topic in meta-learning, i.e., meta-learning with adversarial samples. On two datasets, the experiments, especially the ablation studies (page 6), are carefully designed, and the results were also discussed + in detail (page 7-8). + + +Weaknesses: + +1. The paper gives no sensitivity analysis of the hyper-parameters. For example, on page 4, \alphas and \betas were set at different values but no reason was given. + +2. There is neither open-sourced code nor plan for releasing the code in the paper. Thus, it is questionable whether the proposed method is reproducible. + +3. 
The idea of combining meta-learning with adversarial defense has already been proposed and well studied, such as in ""Adversarial Attacks on Graph Neural Networks via Meta Learning, ICLR 2019"".

4. The experiments are not thoroughly conducted. MiniImageNet and CIFAR100 are both small-scale datasets with only 60k samples each. Indeed, image classification is an important task and a benchmark for meta-learning. Still, it would be more informative to test the proposed method on multiple types of data, say graph data as mentioned above.

5. Meta-learning, by its definition, is a learning framework, which is not necessarily tantamount to few-shot/one-shot learning, and is different from the latter in many ways. However, the whole experiment section focuses on few-shot learning using classic few-shot learning tasks. This gives the impression that the title ""adversarial meta-learning"" is somewhat misleading. Actually, a recent work, ""Adversarially Robust Few-Shot Learning: A Meta-Learning Approach, NeurIPS 2020"", adopted nearly the same setting and datasets and achieved superior results.

6. Another main problem is that there is, unfortunately, no theoretical analysis of the proposed algorithm, only some intuition behind its design. It would be interesting to dive deeper, mathematically, to show how the algorithm works under the hood and to be more confident about where and when it holds true.


Recommendation and reason: Weak Reject.

Overall, this is a good paper. However, it is a little below the standard of ICLR, considering its novelty, experiments, writing, and depth of analysis.


Additional comments:

1. The language of the paper needs to be polished and refined.
2. Page 1: it would be better to rephrase the sentence about MAML to give a clearer description of it.
3. Page 3: each task is a 5-way classification task; it would be better to explain what a ""5-way"" task is for readers who are not familiar with few-shot learning.
4. Page 6: the paper considers several attack mechanisms, including FGSM, FFGSM, RFGSM, and RPGD. But in reality, the attacker may not divulge the mechanism to be used. So can this method be attack-mechanism-agnostic, i.e., resistant to an attack without knowing the mechanism in advance?",5,4.0,ICLR2021
S1xJZx1i2X,2,rkl3-hA5Y7,rkl3-hA5Y7,Back to the past,"This paper is very interesting as it seems to bring the clock back to Holographic Reduced Representations (HRRs) and their role in Deep Learning. It is an important paper, as it is always important to learn from the past. HRRs were introduced as a form of representation that is invertible. There are two important aspects of this compositional representation: base vectors are generally drawn from a multivariate Gaussian distribution, and the vector composition operation is the circular convolution. In this paper, it is not clear why random vectors have not been used. It seems that everything is based on the fact that orthonormality is imposed with a regularization function. But how can this regularization function preserve the properties of the vectors such that, when these vectors are composed, the properties are preserved?

Moreover, the sentence ""this is computationally infeasible due to the vast number of unique chunks"" is not completely true, as HRRs have been used to represent trees in ""Distributed Tree Kernels"" by modifying the composition operation into a shuffled circular convolution. 
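To make the composition operation concrete for readers less familiar with HRRs, here is a minimal sketch of circular-convolution binding in the classic HRR style. This is generic illustrative code, not the authors' implementation; the dimension, the Gaussian scaling, and the use of numpy are my own assumptions.

import numpy as np

def bind(a, b):
    # HRR composition: circular convolution, computed via the FFT
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def unbind(c, a):
    # approximate decoding: bind with the involution of a (circular correlation)
    a_inv = np.concatenate(([a[0]], a[:0:-1]))
    return bind(c, a_inv)

d = 1024
rng = np.random.default_rng(0)
a, b = rng.normal(0.0, 1.0 / np.sqrt(d), size=(2, d))  # i.i.d. Gaussian base vectors
c = bind(a, b)
print(np.corrcoef(unbind(c, a), b)[0, 1])  # clearly positive: b is approximately recoverable from the trace

The point of the sketch is simply that the algebraic properties the paper relies on (approximate invertibility and fixed dimensionality under composition) come from the distribution of the base vectors, which is why the question about replacing random vectors with a learned, regularized orthonormality constraint matters. 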
",5,4.0,ICLR2019 +rJlD3X8n3Q,1,rJleN20qK7,rJleN20qK7,"A well written paper with thorough experimental evaluation, but lacks novelty.","Summary: +This paper presents a Two-Timescale Network (TTN) that enables linear methods to be used to learn values. On the slow timescale non-linear features are learned using a surrogate loss. On the fast timescale, a value function is estimated as a linear function of those features. It appears to be a single network, where one head drives the representation and the second head learns the values. They investigate multiple surrogate losses and end up using the MSTDE for its simplicity, even though it provides worse value estimates than MSPBE as detailed in their experiments. They provide convergence results - regular two-timescale stochastic approximation results from Borkar, for the two-timescale procedure and provide empirical evidence for the benefits of this method compared to other non-linear value function approximation methods. + +Clarity and Quality: +The paper is well written in general, the mathematics seems to be sound and the experimental results appear to be thorough. + +Originality: +Using two different heads, one to drive the representation and the second to learn the values appears to be an architectural detail. The surrogate loss to learn the features coupled with a linear policy evaluation algorithm appear to be novel, but does not warrant, in my opinion, the novelty necessary for publication at ICLR. + +The theoretical results appear to be a straightforward application of Borkar’s two-timescale stochastic approximation algorithm to this architecture to get convergence. This therefore, does not appear to be a novel contribution. + +You state after equaltion (3) that non-linear function classes do not have a closed form solution. However, it seems that the paper Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation does indeed have a closed form solution for non-linear function approximators when minimizing the MSPBE (albeit making a linearity assumption, which is something your work seems to make as well). + +The work done in the control setting appears to be very similar to the experiments performed in the paper: Shallow Updates for Deep Reinforcement Learning. + +Significance: +Overall, I think that the paper is well written and the experimental evaluation is thorough. However, the novelty is lacking as it appears to be training using a multi-headed approach (which exists) and the convergence results appear to be a straightforward application of Borkars two-timescale proof. The novelty therefore appears to be using a surrogate loss function for training the features which does not possess the sufficient novelty in my opinion for ICLR. + +I would suggest the authors' detail why their two-timescale approach is different from that of Borkars. Or additionally add some performance guarantee to the convergence results to extend the theory. This would make for a much stronger paper.",6,4.0,ICLR2019 +Skxg0CjUtH,1,HJe88xBKPr,HJe88xBKPr,Official Blind Review #4,"This paper is about training deep models with 8-bit floating point numbers. The authors use an enhanced loss scaling method and stochastic rounding method to stabilize training. They do experiments on image classification and NLP tasks. + +The paper is clearly written. However, I don’t think this paper passes the bar of ICLR. This paper lacks innovation and insightful analysis. + +1.Sec. 3.1 proposes enhanced loss scaling. 
Loss scaling is a heuristic to train low-precision neural networks. The authors train 8-bit GNMT with a changing scaling factor. However, this looks like some manually tuned result for GNMT only. I doubt if this generalizes to other models. Besides, there is no equation or algorithm flowchart to demonstrate their method. It’s not very readable. + +2.The logic of Sec. 3.2 is quite confusing. The authors first empirically show that the performance of ResNet-50 significantly drops with 8-bit training. Then they show the sum of the square of the weights in ResNet-50 is high at the beginning. With this observation, they claim it demonstrates the drawback of ‘rounding-to-nearest-even’. I cannot see the connection between the norm of weights and the rounding technique. Moreover, the stochastic rounding has already been used in 8-bit training.[1] + +3.The setting in the experiment section is not stated clearly. For example, what’s the hyper-parameter for loss scaling? Another question is the gradient. In Sec. 3, just above Fig. 1, the authors claim the weight update is performed in full-precision. In contrast, they claim the gradient is 8-bit in table 3. If the update is full-precision, [2] is an important baseline. + +Small suggestions: +1.For Fig. 6, I suggest the authors to smooth the loss curves to avoid overlap of two curves. +2.There are two ‘with’s in the last paragraph of page 7. + +Reference: +[1]Wang N, Choi J, Brand D, et al. Training deep neural networks with 8-bit floating point numbers[C]//Advances in neural information processing systems. 2018: 7675-7684. +[2]Banner R, Hubara I, Hoffer E, et al. Scalable methods for 8-bit training of neural networks[C]//Advances in Neural Information Processing Systems. 2018: 5145-5153.",1,,ICLR2020 +HJRB8Nugf,1,Byt3oJ-0W,Byt3oJ-0W,The paper utilizes finite approximation of the Sinkhorn operator to describe how one can construct a neural network for learning from permutation valued training data. A probabilistic view of permutation via injection of Gumbel noise is also presented in the paper.,"Quality: The paper is built on solid theoretical grounds and supplemented by experimental demonstrations. Specifically, the justification for using the Sinkhorn operator is given by theorem 1 with proof given in the appendix. Because the theoretical limit is unachievable, the authors propose to truncate the Sinkhorn operator at level $L$. The effect of approximation for the truncation level $L$ as well as the effect of temperature $\tau$ are demonstrated nicely through figures 1 and 2(a). The paper also presents a nice probabilistic approach to permutation learning, where the doubly stochastic matrix arises from Gumbel matching distribution. + +Clarity: The paper has a good flow, starting out with the theoretical foundation, description of how to construct the network, followed by the probabilistic formulation. However, I found some of the notation used to be a bit confusing. + +1. The notation $l$ appears in Section 2 to denote the number of iterations of Sinkhorn operator. In Section 3, the notation $l$ appears as $g_l$, where in this case, it refers to the layers in the neural network. This led me to believe that there is one Sinkhorn operator for each layer of neural network. But after reading the paper a few times, it seemed to me that the Sinkhorn operator is used only at the end, just before the final output step (the part where it says the truncation level was set to $L=20$ for all of the experiments confirmed this). 
If I'm correct in my understanding, perhaps different notation need to be used for the layers in the NN and the Sinkhorn operator. Additionally, it would have been nice to see a figure of the entire network architecture, at least for one of the applications considered in the paper. + +2. The distinction between $g$ and $g_l$ was also a bit unclear. Because the input to $M$ (and $S$) is a square matrix, the function $g$ seems to be carrying out the task of preparing the final output of the neural network into the input formate accepted by the Sinkhorn operator. However, $g$ is stated as ""the output of the computations involving $g_l$"". I found this statement to be a bit unclear and did not really describe what $g$ does; of course my understanding may be incorrect so a clarification on this statement would be helpful. + +Originality: I think there is enough novelty to warrant publication. The paper does build on a set of previous works, in particular Sinkhorn operator, which achieves continuous relaxation for permutation valued variables. However, the paper proposes how this operator can be used with standard neural network architectures for learning permutation valued latent variable. The probabilistic approach also seems novel. The applications are interesting, in particular, it is always nice to see a machine learning method applied to a unique application; in this case from computational neuroscience. + +Other comments: + +1. What are the differences between this paper and the paper by Adams and Zemel (2011)? Adams and Zemel also seems to propose Sinkhorn operator for neural network. Although they focus only on the document ranking problem, it would be good to hear the authors' view on what differentiates their work from Adams and Zemel. + +2. As pointed out in the paper, there is a concurrent work: DeepPermNet. Few comments regarding the difference between their work and this work would also be helpful as well. + +Significance: The Sinkhorn network proposed in the paper is useful as demonstrated in the experiments. The methodology appears to be straight forward to implement using the existing software libraries, which should help increase its usability. + +The significance of the paper can greatly improve if the methodology is applied to other popular machine learning applications such as document ranking, image matching, DNA sequence alignment, and etc. I wonder how difficult it is to extend this methodology to bipartite matching problem with uneven number of objects in each partition, which is the case for document ranking. And for problems such as image matching (e.g., matching landmark points), where each point is associated with a feature (e.g., SIFT), how would one formulate such problem in this setting? +",8,4.0,ICLR2018 +rye0nwomaQ,3,rJgz8sA5F7,rJgz8sA5F7,Not good enough,"The work proposes a structure that mimics progressive nets. Maybe the main difference from progressive nets is that backwards connection from the new features to the old features in layer 2 are not 0 out. This could cause interference, however is solved by using the task ID to not evaluate those new features when going back to a previous task. I think this is a technical detail, that does not provide any explicit advantage or disadvantage over progressive nets. + +Employing GANs/VAE to predict task id also can be seen as not an ideal choice. In particular the GAN network will suffer from catastrophic forgetting, which is solved (if I understood correctly) by training the GAN with data from all tasks. 
This makes one wonder: if we can afford to access data from all tasks to learn the GAN, then why not the classification model too?

I think an alternative might be something like the Forget Me Not Process published and used in the original work with EWC.

Unfortunately, due to the presence of these previous works and the lack of a more thorough comparison with other existing approaches, the work should not be accepted to ICLR.",4,3.0,ICLR2019
wKuu6ncyTL7,2,#NAME?,#NAME?,A simple and effective method for semi-supervised semantic segmentation,"Summary:
This paper focuses on the problem of semi-supervised semantic segmentation, where fewer pixel-level annotations are used to train the network. A new one-stage training framework is proposed that includes localization cue generation, pseudo-label refinement, and the training of semantic segmentation. Inspired by recent success in semi-supervised learning (SSL), a novel calibrated fusion strategy is proposed to incorporate the concept of consistency training with data augmentation into the framework. Experiments on the PASCAL VOC and MSCOCO benchmarks validate the effectiveness of the proposed method.

Pros:
+ The proposed one-stage training framework is elegant compared with two-stage methods in this area, which include one step for pseudo-label generation and another step for refinement and then semantic segmentation training.
+ The newly designed calibrated fusion strategy nicely incorporates the concept of consistency training with data augmentation into the same framework.
+ It achieves a new state of the art on both the PASCAL VOC and MSCOCO benchmarks compared with recent semi-supervised semantic segmentation methods.

Questions:
- CCT (Ouali et al., 2020) includes consistency training with perturbations, which can be treated as a kind of data augmentation on features. I'm wondering if the authors can provide some insight into why the proposed method achieves better performance than CCT when both include consistency training and data augmentation in their designs.
- In Table 3, I suggest including the segmentation framework used by each method in the table. In early works, an older version of DeepLab is usually treated as the standard. I understand that using DeepLab v3 is a fair comparison with CCT. It would be good to make this information clear in the table.
- It is also suggested to report the performance on the PASCAL VOC test set, as this is common practice in this area (although CCT does not do so).
- Since the unlabeled data training branch does not rely on any pixel-level annotations, I'm wondering if the proposed method can also work under the weakly-supervised setting, where no pixel-level annotations are available during training.",8,5.0,ICLR2021
UK6HHtyxGdc,1,loe6h28yoq,loe6h28yoq,Minimal Theoretical Contribution; Little empirical value; Existence of much stronger results,"Existence of much stronger results:
I don't get why majority voting is claimed to be the ""state-of-the-art"" technique. If I'm understanding it correctly, the majority vote technique can only handle a number of corrupted points up to O(K), K being the number of voters. Furthermore, since the voters more or less split the dataset, in order to maintain the accuracy of each voter, the number of voters can't be too large, and is usually O(1). Therefore, this majority vote approach can only tolerate O(1) corrupted points. 
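To make the scaling argument concrete, here is a back-of-the-envelope sketch of the partition-and-vote certificate (my own simplified notation, ignoring tie-breaking; this is not code from the reviewed paper): each poisoned training point lands in at most one shard, so it can flip at most one of the K votes, and the prediction is certified only while the attacker cannot close half of the vote gap.

def certified_poisons(votes_top, votes_runner_up):
    # each poisoned training point corrupts at most one shard, hence flips at most one vote
    return max((votes_top - votes_runner_up) // 2, 0)

print(certified_poisons(60, 40))  # e.g. K = 100 voters splitting the data: only 10 poisoned points tolerated
print(certified_poisons(34, 33))  # a near-tie certifies essentially nothing

Since the vote gap can never exceed K, and K has to stay small to keep each voter accurate, the certificate is bounded by a small constant number of poisoned points, which is exactly the O(1) limitation described above. 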
+ +More specifically, for the case of kNN, the certified accuracy (Theorem 2, ) becomes vacuous as soon as the number of corrupted points $e$ becomes greater than $k$. + +On the other hand, there are techniques developed from the robust statistics community that can be robust against an $\epsilon$-fraction of corruption points, that is, if there are a total of $N$ training points, it allows $\epsilon N$ number of points to be corrupted. For example, Sever [1], a recently developed robust supervised learning algorithm, guarantees $O(\sqrt{\epsilon})$ generalization error under $\epsilon$-fraction of arbitrary corruption. Such a guarantee is much stronger than the ones majority voting approaches are able to achieve. + +Thus, I'm having trouble appreciating the contribution of this paper given the existence of much stronger results. + +Relevance to the field: +While prior approaches like DPA also suffers from the same weak/trivial guarantee, they are at least meta-algorithms that allows one to plug in any base learners depending on the application. The method developed in this paper, however, only works on kNN. And let's be honest, not many modern ML applications use kNN with the slightest chance. So I don't see much empirical value nor any significant theoretical contribution. + +[1] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Jacob Steinhardt, Alistair Stewart. Sever: A Robust Meta-Algorithm for Stochastic Optimization. +",3,4.0,ICLR2021 +S1ly24CaKr,2,HkxwmRVtwH,HkxwmRVtwH,Official Blind Review #2,"**Summary**: This paper proposes a hierarchical Bayesian approach to hyper-networks by placing a Gaussian process prior over the latent representation for each weight. A stochastic variational inference scheme is then proposed to infer the posterior over both the Gaussian process and the weights themselves. Experiments are performed on toy regression, classification, (edit: post rebuttal) and transfer learning tasks, as well as an uncertainty quantification experiment on MNIST. + +post rebuttal (noticed recently): Many apologies for updating the review the day before the deadline; however, I recently remembered that Kronecker inference is often used in variational methods - particularly within the vein of literature of deep kernel learning. Indeed, structure exploiting SVI was proposed in Stochastic Variational Deep Kernel Learning, https://arxiv.org/pdf/1611.00336.pdf, and this method is currently the default in Gpytorch: https://github.com/cornellius-gp/gpytorch/tree/master/examples/08_Deep_Kernel_Learning . +Furthermore, Kronecker inference for non-Gaussian likelihoods for Laplace approximations was proposed back in 2015: http://proceedings.mlr.press/v37/flaxman15.pdf. +I am not updating my score because it would be unfair; however, the record should be set somewhat straight here. + +post rebuttal: Thank you for the many clarifications and detailed responses. I'm now satisfied with their many changes and and tend to accept this paper despite the experimental results being somewhat limited. I would really encourage the authors to fix the color schemes (please less black and more brighter colors) on their decision boundaries plots however. + +**tldr**: While I appreciate the concept of this paper, I tend to reject this paper because I find the experimental results to be on too small scale of datasets. Specifically, I would like to see either a larger scale problem being solved with this kind of approach or a tough to model applied problem that is solved with this approach. 
+ +**Originality**: As far as I can tell, this seems to be a novel approach to hyper-networks. Neural processes (Garnelo et al; 2018) propose a somewhat similar approach to training – with a latent process over some stored weight space. However, even that is quite distinct from the method proposed in this paper, and I tend to prefer this approach. + +**Quality**: I really appreciate the merging of neural network and Gaussian process methods; however, tragically, I do wonder if the proposed approach combines the worst of both worlds – the necessity of architecture search for neural networks with the choice of kernel function (as illustrated in Figure 5). +If the method is truly kernel dependent, is it also architecture dependent? That is, is it robust to different settings of nonlinearities and depths? + +Active learning experiment: While I appreciate the comparison here, it seems like here standard HMC should be trainable over well-designed priors on these architectures. So why not include a comparison instead of just MFVI? + +**Significance**: Unfortunately, I think that the experiments section is just a bit too limited to warrant acceptance right now. This is despite the fact that I really do appreciate the thoroughness and thoughtfulness of the experiments as they are. + +Specifically, in Section 6.2 why is the metaGP prior only applied to the last layer of the network? If as I suspect, it is due to the complexity and difficulty of inference, that makes the method doubly tough to use in practice. With that being said, to only have experiments on the last layer implies that one should compare to Bayesian logistic regression and linear regression on the last layer of neural networks (e.g Perrone et al, 2018 and Riquelme et al, 2018). Experiments with other methods that combine Gaussian processes with representations on the final layer (e.g. Wilson et al, 2015) are also probably worth running. + +Figure 4 is a very well-done experiment, if a bit tough to read. I’d suggest that the out of distribution examples get their own figure, with the in distribution examples going into the appendix. I’d also suggest computing the expected calibration error (Naeini et al, 2015) for in and out of distribution examples on the test sets for both MNIST and K-MNIST in order to have quantitative results on the entire test set. + +To recommend acceptance, I’d really have to see experiments on either a CIFAR sized dataset for classification or a larger scale regression experiment. A larger dataset on either transfer learning (after all you do have a meta-representation over functions that the NN can learn), a larger active learning experiment, or semi-supervised learning. + +**Clarity**: Overall, the paper is well-written and mostly easy to follow. The meat of the paper is found in Section 4, which I found a bit difficult to follow. + +(edit: post rebuttal. This concern is somewhat resolved due to the field not being well developed in this area, although it is a useful place to possibly extend the method in the future.) My primary concern here is that the prior ends up becoming Kronecker structured (after Eq. 7), so it isn’t clear to my why dense matrices and dense variational bounds have to be derived in this setting. Can one not follow the lead of the Gaussian process literature (e.g. Saatci 2012, Wilson & Nickisch, 2015) to exploit the Kronecker structure here to make computation of the log likelihoods fast? +(edit: post rebuttal. This concern is somewhat resolved.) 
As a result, it’s not immediately clear to me why a diagonal approximation (Eq. 10) is even necessary? +Furthermore, this may be a setting where iterative methods (e.g conjugate gradients and Lanczos decompositions as in Pleiss et al, 2018) for the predictive means and variances may shine and be fast. +I do agree that the approximation in Figure 2 does seem to be relatively accurate, although I would ask the authors to compute a relative error for that plot if possible. Additionally, what is the strange high off diagonal correlations in the marginal covariances? + +(edit: post rebuttal. Thank you for the clarifications here.) Finally, I was a bit confused by the effect of adding the input dependent kernel in Section 3; this seems to make the weights much more complicated to model – now each data point has its own set of weights and therefore, we might have to store considerably more weight matrices over time. Could the authors perform a set of experiments showing the necessity of this kernel matrix in the rebuttal? + +**Minor Comments**: +- Above Eq. 9, “splitted” should be split. +- Figure 3: could the data points be plotted in a brighter fashion? On a dark background, they are quick tough to see. Additionally, what is the difference between the two levels of classification plots? + + +References: + +Naeini, et al. Obtaining Well Calibrated Probabilities by Bayesian Binning, AAAI, 2015. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4410090/ + +Perrone, V, et al. Scalable Hyperparameter Transfer Learning, NeurIPS, 2018. http://papers.nips.cc/paper/7917-scalable-hyperparameter-transfer-learning + +Pleiss, G, et al. Constant Time Predictive Distributions for Gaussian Processes, ICML, 2018. https://arxiv.org/abs/1803.06058 + +Riquelme, C, Tucker, G, Snoek, J. Deep Bayesian Bandits Showdown, ICLR, 2018. https://arxiv.org/abs/1802.09127 + +Saatci, Y. Scalable Inference for Structured Gaussian process models, PhD Thesis, U. of Cambridge, 2011. http://mlg.eng.cam.ac.uk/pub/pdf/Saa11.pdf + +Wilson, AG and Nickisch, H. Kernel Interpolation for Scalable Structured Gaussian Processes, ICML, 2015. http://proceedings.mlr.press/v37/wilson15.pdf + +Wilson, AG, et al. Deep Kernel Learning, AISTATS, 2015. https://arxiv.org/abs/1511.02222 +",6,,ICLR2020 +Bke_mOmC2m,3,SJlpM3RqKQ,SJlpM3RqKQ,The paper presents some new approaches for communication efficient Federated Learning that allows for training of large models on heterogeneous edge devices.,"The paper presents some new approaches for communication efficient Federated Learning (FL) that allows for training of large models on heterogeneous edge devices. In FL, heterogeneous edge devices have access to potentially non-iid samples of data points and try to jointly learn a model by averaging their local models at a parameter server (the cloud). As the bandwidth of the up/downlink-link may be limited communication overheads may become the bottleneck during FL. Moreover, due to the heterogeneity of the hardware, large models may be hard to train on small devices. Due to that, there are several recent approaches that aim to minimize communication via methods of quantization, which also aim to allow for smaller models via methods of compression and model quantization. 
+ +In this paper, the authors suggest a combination of two methods to reduce communication and allow for large model training by 1) using a lossy compressed model when that is communicated from the cloud to the edge devices, and 2) subsampling the gradients, a form of dropout, at the edge device side that allows for an overall smaller model update. The novelty of either of those techniques is quite limited as individually they have been suggested before, but the combination of both of them is interesting. + +The paper is overall well written, however there are two aspects that make the contribution lacking in novelty. First of all, the presented methods are a combination of existing techniques, that although interesting to combine together, are neither theoretically analyzed nor extensively tested. The model/update quantization technique has been used in the past extensively [eg 1-3]. Then, the “federated dropout” can be seen as a “coordinate descent” type of a technique, i.e., randomly zeroing out gradient elements per iteration. + +Since this is a more experimental paper, the setup tested is quite limited in its comparisons. For example, one would expect to see extensive comparisons with methods for quantizing gradients, eg QSGD, or Terngrad, and combinations of that with DeepCompression. Although the authors do make an effort to experiment with a different set of hyperparameters (dropout probability, quantization levels, etc), a comparison with state of the art methods is lacking. + +Overall, although the combination of the presented ideas has some merit, the lack of extensive experiments that would compare it with the state of the art is not convincing, and the overall effectiveness of this method is unclear at this point. + +[1] https://arxiv.org/pdf/1510.00149.pdf +[2] https://arxiv.org/pdf/1803.03383.pdf +[4] https://arxiv.org/pdf/1610.05492.pdf",4,5.0,ICLR2019 +S1eQziLmFB,1,rJxWxxSYvB,rJxWxxSYvB,Official Blind Review #2,"The paper introduces a training mechanism for spiking neural nets that employs a causal inference technique, called RDD, for adjustment of backward spiking weights. This technique induces the backward influence strengths to be reciprocal to the forward ones, bringing desirable symmetry properties. + +Pros: + * The relationship between causal inference and biologically plausible learning is very interesting. This relationship is also important and impactful for the machine learning community, as we are on the quest of new deep learning technologies. + + * Application of the RDD method to spiking neural net training is novel. The reciprocal relationship of the causal effect to the synaptic strength is a very intuitive and elegant solution to the weight transport problem. + +Cons: + * From the reported results, it is not possible to decide whether RDD really outperforms Feedback Alignment (FA). The comparison is performed on only two data sets and each algorithm is better on one. Could the authors report results on at least two more data sets (however small or simple) during the rebuttal? + + * Fig and Table 1 report the same outcome. One of the two need to be removed. + +Further Questions: + * The Conv Net illustrated in Fig 2 panel A shares its weights with the biologically plausible net on panel B. Further, these two nets communicate for pre-training. How does the paper then isolate the contribution of the biologically plausible net to the prediction accuracy from the vanilla ConvNet? 
What would happen if we trained only the LIF net without a contact with the conv net? + + * Eq. 1 proposes induction of symmetry to solve the weight transform. At the extreme, this regularizer would make W and Y identical, boiling down to a vanilla artificial neural net, which the ML community already knows wella nd performs with excellence. Would not having the biologically implausible artificial neural model as the extreme solution contradict with the goal of biologically plausible learning? This would in the end make one conclude that the biological brain only performs a broken gradient descent. + +Overall, this is a decent piece of work with some potential. My initial vote is a weak reject, as I am at present missing sufficient evidence that the improved symmetry properties introduced by the causal inference scheme also brings an accuracy improvement over the vanilla feedback alignment method. I am open to improve to an accept if this evidence is provided and my aforementioned concerns primarily on the role of ConvNet are properly addressed during rebuttal. + + +-- +Post-rebuttal: My only major concern was the lack of sufficient empirical evidence to support the idea. The updated version of the manuscript has properly addressed this issue by reporting results on additional data sets. The authors have also given enlightening clarifications to some of the open points I have raised earlier. Hence, I'm happy to increase my score.",6,,ICLR2020 +VIqLBw4ck5J,2,Ovp8dvB8IBH,Ovp8dvB8IBH,Anonymous review,"This paper presents a method that uses artificial augmentation of data as Negative (aka OOD) samples to improve various computer vision tasks, including generation, unsupervised learning on images and videos. + +Prons: +- The paper is very well written. +- Experiments are comprehensive across different tasks +- The usage of data augmentation seems interesting but with some questions (see below). +- It designs losses for both GANs and contrastive representation learning. +- Code is provided. + +Cons: +- Augmentation has been proven in GANs to provide benefits through consistency training (e.g. CR-GAN, ICLR2020, Image Augmentations for GAN Training). These samples are used as ""positive"" samples that should generate consistent predictions. The most famous mixup is also treated as ""positive"" samples for training. So the augmentation usage here is a bit counterintuitive to me, because you show the opposite conclusion. Is that because only particular augmentation can be used as negative samples, e.g. Jiasaw? The answer to this question is critical. However, the paper does not mention/ study much. +- Advanced self-supervised (contrastive) learning reply on strong augmentation, how negative samples can adapt to these methods? How do we categorize augmentation types used for general cases or NDA cases? Any insights on what kind of augmentations are useful for NDA. For example, in figure 9, the paper proposes to push samples and its jigsaw version away. However, these two pairs share strong local visual contents of objects (just like an image and its crop parts) that usually contrastive learning wants to pull them together. The proposed method tries to push them away. Any insights why it should work? +- If justifications of these questions can be sufficient, I think it can be a strong paper. +",5,4.0,ICLR2021 +TRFgf94PeS_,4,oBmpWzJTCa4,oBmpWzJTCa4,A Complex Framework that Deals with Safety-Critical Problems,"The paper proposes a framework that learns the dynamic of safety-critical systems. 
The framework makes use of ideas from active learning, meta-learning, and reinforcement learning. As far as I understand, the framework seeks to learn an acquisition function that can evaluate the information gain from a certain action taken at a certain state. Acquisition function is a function that identifies an action the agent desires to take from a collection of unvisited states. The learning of such an acquisition function is formulated as a reinforcement learning problem, trained in a meta-learning fashion. I have the following comments: + +1. The proposed method is evaluated on two problems, ADMETS and damaged aircraft. Both problems are solved using synthetic data. While it is understandable that the framework is trained on simulated data, is there any justification with respect to the quality of the data simulator used in these two problems? Why performing well on these simulated data would be an indication of the proposed method can work reasonably well in practice? + +2. Since the proposed framework is trained in a meta-learning fashion, is there any quantification on how fast/well (e.g. wrt. sample complexity) the trained policy adapting to new environments in test time? + +3. While the paper takes safety during the learning of the dynamics as a critical issue, is there any justification with respect to the percentage of safety achieved by the proposed method? For example, why an 87% safety reported in Figure 5 achieved by the proposed method is a good number? + +4. The key components of the proposed framework (e.g. meta-learning, reinforcement learning, active learning, safety) make sense to address the technical challenges in learning system dynamics in safety-critical domains. It is not very clear to me if the novelty of the proposed framework is appropriately highlighted. The paper can benefit from a better highlight of the novelty of the proposed method and why such novel contributions matter.",5,2.0,ICLR2021 +ms9ZpSTvUz,4,9MdLwggYa02,9MdLwggYa02,Good paper,"#### Summary +The paper provides a new variant of PBT which utilizes ideas from differential evolution and cross-over. The original PBT and even initiator PBT do not perform crossover on the hyper-parameters, and insufficient cross-over may cause PBT to perform greedy in the initial phases which ends up with a suboptimal convergence. The investigation of better cross-over in PBT is itself an interesting research direction and the authors demonstrated its effectiveness in standard benchmarks and data augmentation tasks. The improvements of ROMUL-PBT are also helpful to the community since PBT has been applied in a variety of real world applications. + +#### Pros +1. Quality: The paper quality is in general good. The experiments are well designed and the results are good. So the experiments clearly supports the argument that differential evolution helps PBT. +2. Clarity: The paper is well written and easy to follow. The organization is also clear. +3. Originality: I think that adapting ideas from differential evolution to PBT is new, even though differential evolution itself is not something new. +4. The paper provides some benchmarking of PBT related algorithms in image classification, language modeling and data augmentation which is good for the community to understand these approaches. + + +#### Cons +1. Significance: the improvements over existing methods seem slight. The experiments do not provide sensitivity analysis so it is a bit hard to conclude whether the results are statistically significant. 
But at the same time, the proposed method does show promise. +2. As a thorough evaluation purpose, it would be interesting to see how the proposed methods work in large set of hyperparameters (magnitude of 10-100). + +#### Questions +1. PBT needs to use validation loss to obtain fitness. Is your result evaluated on the validation data or the test data? If only evaluating on the validation data, the result may not reveal potential overfitting to the validation set. So it would be nice to have results on a held-out test set.",6,5.0,ICLR2021 +dXlZgiWlRFs,3,LkFG3lB13U5,LkFG3lB13U5,The relationship between the proposed FedAdagrad/FedYogi/FedAdam and FedSGD,"This paper proposes several federated variants of adaptive stochastic gradient methods. Moreover, the convergence rates of the proposed algorithms are also provided. Based on the current submission, The reviewer has several concerns: + +1. The adaptive learning rates in FedAdagrad/FedYogi/FedAdam are all related to parameter $\tau$. According to Theorem 1-2, Corollary 1-2, $\tau$ needs to set as a large constant $G/L$. When $\tau$ is sufficiently large, FedAdagrad/FedYogi/FedAdam is actually reduced to FedSGD. Can the authors give several comments on this parameter? + +2. The second question is about the generalization ability of federated adaptive stochastic gradient methods. It has been demonstrated in [1] that adaptive SGD generalizes poorly compared with SGD. A natural question is that "" does this dilemma still exists in federated stochastic gradient methods? "" + +[1] Wilson AC, Roelofs R, Stern M, Srebro N, Recht B. The marginal value of adaptive gradient methods in machine learning. In Advances in neural information processing systems 2017 (pp. 4148-4158).",6,5.0,ICLR2021 +rkBmo9ryM,1,BJE-4xW0W,BJE-4xW0W,"nice idea, presentation can be improved","In their paper ""CausalGAN: Learning Causal implicit Generative Models with adv. training"" the authors address the following issue: Given a causal structure between ""labels"" of an image (e.g. gender, mustache, smiling, etc.), one tries to learn a causal model between these variables and the image itself from observational data. Here, the image is considered to be an effect of all the labels. Such a causal model allows us to not only sample from conditional observational distributions, but also from intervention distributions. These tasks are clearly different, as nicely shown by the authors' example of ""do(mustache = 1)"" versus ""given mustache = 1"" (a sample from the latter distribution contains only men). The paper does not aim at learning causal structure from data (as clearly stated by the authors). The example images look convincing to me. + +I like the idea of this paper. IMO, it is a very nice, clean, and useful approach of combining causality and the expressive power of neural networks. The paper has the potential of conveying the message of causality into the ICLR community and thereby trigger other ideas in that area. For me, it is not easy to judge the novelty of the approach, but the authors list related works, none of which seems to solve the same task. The presentation of the paper, however, should be improved significantly before publication. (In fact, because of the presentation of the paper, I was hesitating whether I should suggest acceptance.) Below, I give some examples (and suggest improvements), but there are many others. There is a risk that in its current state the paper will not generate much impact, and that would be a pity. 
I would therefore like to ask the authors to put a lot of effort into improving the presentation of the paper. + + +- I believe that I understand the authors' intention of the caption of Fig. 1, but ""samples outside the dataset"" is a misleading formulation. Any reasonable model does more than just reproducing the data points. I find the argumentation the authors give in Figure 6 much sharper. Even better: add the expression ""P(male = 1 | mustache = 1) = 1"". Then, the difference is crystal clear. +- The difference between Figures 1, 4, and 6 could be clarified. +- The list of ""prior work on learning causal graphs"" seems a bit random. I would add Spirtes et al 2000, Heckermann et al 1999, Peters et al 2016, and Chickering et al 2002. +- Male -> Bald does not make much sense causally (it should be Gender -> Baldness)... Aha, now I understand: The authors seem to switch between ""Gender"" and ""Male"" being random variables. Make this consistent, please. +- There are many typos and comma mistakes. +- I would introduce the do-notation much earlier. The paragraph on p. 2 is now written without do-notation (""intervening Mustache = 1 would not change the distribution""). But this way, the statements are at least very confusing (which one is ""the distribution""?). +- I would get rid of the concept of CiGM. To me, it seems that this is a causal model with a neural network (NN) modeling the functions that appear in the SCM. This means, it's ""just"" using NNs as a model class. Instead, one could just say that one wants to learn a causal model and the proposed procedure is called CausalGAN? (This would also clarify the paper's contribution.) +- many realizations = one sample (not samples), I think. +- Fig 1: which model is used to generate the conditional sample? +- The notation changes between E and N and Z for the noises. I believe that N is supposed to be the noise in the SCM, but then maybe it should not be called E at the beginning. +- I believe Prop 1 (as it is stated) is wrong. For a reference, see Peters, Janzing, Scholkopf: Elements of Causal Inference: Foundations and Learning Algorithms (available as pdf), Definition 6.32. One requires the strict positivity of the densities (to properly define conditionals). Also, I believe the Z should be a vector, not a set. +- Below eq. (1), I am not sure what the V in P_V refers to. +- The concept of data probability density function seems weird to me. Either it is referring to the fitted model, then it's a bad name, or it's an empirical distribution, then there is no pdf, but a pmf. +- Many subscripts are used without explanation. r -> real? g -> generating? G -> generating? Sometimes, no subscripts are used (e.g., Fig 4 or figures in Sec. 8.13) +- I would get rid of Theorem 1 and explain it in words for the following reasons. (1) What is an ""informal"" theorem? (2) It refers to equations appearing much later. (3) It is stated again later as Theorem 2. +- Also: the name P_g does not appear anywhere else in the theorem, I think. +- Furthermore, I would reformulate the theorem. The main point is that the intervention distributions are correct (this fact seems to be there, but is ""hidden"" in the CIGN notation in the corollary). +- Re. the formulation in Thm 2: is it clear that there is a unique global optimum (my intuition would say there could be several), thus: better write ""_a_ global minimum""? +- Fig. 3 was not very clear to me. I suggest to put more information into its caption. 
+- In particular, why is the dataset not used for the causal controller? I thought, that it should model the joint (empirical) distribution over the labels, and this is part of the dataset. Am I missing sth? +- IMO, the structure of the paper can be improved. Currently, Section 3 is called ""Background"" which does not say much. Section 4 contains CIGMs, Section 5 Causal GANs, 5.1. Causal Controller, 5.2. CausalGAN, 5.2.1. Architecture (which the causal controller is part of) etc. An alternative could be: +Sec 1: Introduction +Sec 1.1: Related Work +Sec 2: Causal Models +Sec 2.1: Causal Models using Generative Models (old: CIGM) +Sec 3: Causal GANs +Sec 3.1: Architecture (including controller) +Sec 3.2: loss functions +... +Sec 4: Empricial Results (old: Sec. 6: Results) +- ""Causal Graph 1"" is not a proper reference (it's Fig 23 I guess). Also, it is quite important for the paper, I think it should be in the main part. +- There are different references to the ""Appendix"", ""Suppl. Material"", or ""Sec. 8"" -- please be consistent (and try to avoid ambiguity by being more specific -- the appendix contains ~20 pages). Have I missed the reference to the proof of Thm 2? +- 8.1. contains copy-paste from the main text. +- ""proposition from Goodfellow"" -> please be more precise +- What is Fig 8 used for? Is it not sufficient to have and discuss Fig 23? +- IMO, Section 5.3. should be rewritten (also, maybe include another reference for BEGAN). +- There is a reference to Lemma 15. However, I have not found that lemma. +- I think it's quite interesting that the framework seems to also allow answering counterfactual questions for realizations that have been sampled from the model, see Fig 16. This is the case since for the generated realizations, the noise values are known. The authors may think about including a comment on that issue. +- Since this paper's main proposal is a methodological one, I would make the publication conditional on the fact that code is released. + + +",7,3.0,ICLR2018 +GWwQiRnz6_-,3,VErQxgyrbfn,VErQxgyrbfn,Convex Regularization behind Neural Reconstruction,"The authors propose a convex formulation for training a 2-layer neural network for reconstruction which should make training easier. + +It’s nice to see some results in this domain which are provably convergent, especially since 2 layer NNs are universal approximations. I’m cautiously optimistic that this could open the door to more practically relevant results. + +However, the NNs experimented with and the experiments done are relatively simplistic, so it’s hard to grasp how far away this is from practical results. It would also be nice if more effort was spent on contextualising these results within the wider scope of convex methods (e.g. infinite width 2 layer networks). + +* The polynomial complexity on its own isn’t that interesting. The input dimension is in the hundreds of thousands, so even quadratic complexity would be highly limiting. It would probably be nice to make it clear that this is mostly of theoretical interest. Perhaps even write out a real estimate for l in the article. +* The main result is quite interesting even outside of regression. Would it be possible to extend it to other convex loss functions? +* Stating that the dual is easier to interpret than the primal is debatable. The primal is also locally linear around almost all points. +* Please make all images in figure 5 equally large. +* Is there a good reason results are not reported on the full FastMRI dataset? 
It would certainly make comparisons much easier and would contextualize the results. The current setup is not really reproducible. +* Overall I’m missing baselines to understand how relevant the results are. Adding a simple U-Net to the MRI case would disambiguate if this is a relaxation which is of theoretical importance, or if it’s actually giving good results. The images on their own are kinda useless for this. +* The “interpretable reconstruction” part is honestly somewhat debatable. Showing me a few hundred filters might give a tiny bit of insight, but I wouldn’t say it lets me interpret the reconstruction properly. +* Figure 1b is irritatingly not vertically centered +* Images in fig 3 suffer from similar issues. Overall images should be saved in exactly the resolution they are stored in (e.g. 28x28 for MNIST), this is especially important in articles whose whole purpose is to show imaging results. Also avoid linear interpolation between pixels. +* Has the number of sign patterns been ablated? +",6,4.0,ICLR2021 +BJlTuVBRFr,2,BkgYPREtPr,BkgYPREtPr,Official Blind Review #2,"This paper proposes to represent a Hamiltonian model of a physical system by a neural network. The parameters are then adjusted, so that the observations are considered maximally likely under a probabilistic model. The novelty is to consider a symplectic Leapfrog integration scheme for the Hamiltonian system, which is known to conserve important quantities such as volume in the state space. The proposed approach is shown to outperform the recent work ""Hamiltonian Neural Networks"" by a large margin on mass-spring chain dynamics and three body systems. The approach can even handle stiff dynamical systems such as bouncing billiards. + +Overall, the work is solid contribution and a reasonable improvement over the recent work on HNNs is demonstrated. Therefore I recommend acceptance of the paper. However, I have some fundamental doubts on the motivation on this line of works. This might be, because I'm not too familiar with the subject, and I'm willing to increase my score if the doubts are cleared. + +In the shown examples, to best of my knowledge, the ""exact"" Hamiltonians describing the physics of the system are well-known. Therefore, I'm unsure what is the advantage of trying to learn physical laws, that are already well understood. The paper argues that the learned Hamiltonian will correct for errors in the discretization, but one could instead use a better integration scheme or a finer time-discretization, based on classical theory which has been developed over the last 50 years which comes with strong convergence guarantees and error bounds. I would have liked to see a stronger motivation, why it is interesting to learn an Hamiltonian of a system, where the exact Hamiltonian is already known. It would also be enlightening to see some plots, which illustrate how ""far"" the learned Hamiltonian is from the analytical one. + +Of course, one might argue that the ultimate goal is to have a learning based approach discover physical laws so far unknown to humans, just from observations. But it is unclear why the inductive bias that the observations are generated by a Hamiltonian might be reasonable. It could very well be, that the law cannot be described by a Hamiltonian system. 
+ +From a high-level point of view, one might even argue that it is not too surprising that one can fit a parametrized Hamiltonian to observations generated by a Hamiltonian system better than a general purpose function approximator without such an inductive bias or better than a system based on a naive/unsuitable non-conservative integrator. + +As a remark, often the exact Hamiltonian is known to be (strictly) convex. I'm wondering whether convex function approximators such as convex neural networks could provide an even stronger inductive bias. But it might be that a general purpose RNN can account better for the discretization errors. ",8,,ICLR2020 +rJlRry6x2Q,2,HJMXTsCqYQ,HJMXTsCqYQ,Novelty is limited. No comparison with SOTA models,"This paper proposes to improve the chemical compound generation by the Bayes optimization strategy, not by the new models. +The main proposal is to use the acquisition that switches the function based on the violation of a constraint, estimated via a BNN. + +I understand that the objective function, J_{comp}^{QED} is newly developed by the authors, but not intensively examined in the experiments. +The EIC, and the switching acquisition function is developed by (Schonlau+ 1998; Gelbard, 2015). +So I judge the technical novelty is somewhat limited. + +It is unfortunate that the paper lacks intensive experimental comparisons with ""model assumption approaches"". +My concern is that the baseline is rely on the SMILE strings. +It is well known that the string-based generators are much weaker than the graph-based generators. +In fact, the baseline model is referred as ""CVAE"" in (Jing+, 2018) and showed very low scores against other models. + +Thus, we cannot tell that these graph-based, ""model-assumption"" approaches are truly degraded in terms of the validity and the variety of generated molecules, +compared to those generated by the proposed method. +In that sense, preferable experimental setting is that to +test whether the constrained Bayesian optimization can boost the performance of the graph-based SOTA models. + + ++ Showing that we can improve the validity the modification of the acquisition functions +- Technical novelty is limited. +- No comparison with SOTA models in ""graph-based, model assumption approaches"". +",4,3.0,ICLR2019 +adb2Dse6KU6,1,kmqjgSNXby,kmqjgSNXby,Good results by applying autoregressive dynamics models to batch policy evaluation/optimization,"#### Summary + +The authors consider the usage of autoregressive dynamics models for batch model-based RL, where state-variable/reward predictions are performed sequentially conditioned on previously-predicted variables. Extensive numerical results are provided in several continuous domains for both policy evaluation and optimization problems. The results showcase the effectiveness of autoregressive models and, in particular, their superiority over standard feed-forward models. + +#### Pros + +- The paper is very well-written and easy to follow. The experiments are described with sufficient details to understand the results +- The usage of these autoregressive models for model-based RL is, to my knowledge, novel +- The paper presents extensive experiments on several challenging domains. The results are convincing and significant. In particular, they show that autoregressive models are superior to feedforward ones + +#### Cons + +- The paper's sole contribution seems to be empirical since autoregressive models are (as acknowledged) not novel, though their application to this setting is. 
+- While the empirical results are very convincing, I did not find much intuition on where this big improvement over feedforward models comes from (see detailed comments below). +- The ordering of the state variables might be a limitation (again, see below). + +#### Detailed comments + +1. As mentioned above, I did not find much intuition on the better performances of autoregressive models vs feedforward ones. As I am not entirely familiar with the system dynamics of the considered domains, do you think that they possess any property which makes autoregressive models more suitable than feedforward ones (e.g., strong correlations between next-state variables)? Aren't the transition dynamics deterministic in most of the considered domains? + +2. Since the reward in most of the considered domains is (I suppose) a function of state, action, and next-state, could it be that one of the reasons behind the worse performance of feedforward models is that they try to predict the reward as a function of state-action only? Would their performance change if they explicitly modeled the reward as a function of s,a,s'? + +3. Related to the previous point, the autoregressive model naturally predicts a reward as a function of s,a,s' since r is considered as the (n+1)-th state component. But what if we re-ordered the state variables with r as the first component instead of the last one? Would the performance change? + +4. More generally, do you think that the ordering of the state variables might be a limitation? For instance, could there be an ordering of these variables that makes the model perform well and one that makes it perform poorly? While in, e.g., image/text generation problems where autoregressive models are applied we have a natural ordering between the variables involved (e.g., by space or time), here there seems to be no particular relationship between state variables with similar index. Maybe some additional experiments could help in clarifying whether this could be a limitation or not. + +Some minor comments/questions: + +- In Eq. 2, should the product be up to H-1? +- Before Sec. 3, a citation for ""behavioral cloning"" could be added +- Sec. 5.3: the FQE acronym was not introduced +- Fig. 4: what is ""r"" above each plot? + +",7,3.0,ICLR2021 +Sklo1V_znQ,1,rJgMlhRctm,rJgMlhRctm,Excellent paper including many cutting edge techniques ,"The paper is well written and flow well. The only thing I would like to see added is an elaboration of +""run a semantic parsing module to translate a question into an executable program"". How to do semantic parsing is far from obvious. This topic needs at least a paragraph of its own. + +This is not a requirement but an opportunity, can you explain how counting work? I think you have it at the standard level of the magic of DNN but some digging into the mechanism would be appreciated. + +In concluding maybe you can speculate how far this method can go. Compositionality? Implicit relations inferred from words and behavior? Application to video with words? ",9,5.0,ICLR2019 +ZyChtxDdnz6,4,B9t708KMr9d,B9t708KMr9d,Official review,"The paper presents a novel unified model that jointly harnesses the power of graph convolutional networks and label propagation algorithms based on the unified message passing framework. The UniMP first employs graph Transformer networks to jointly propagate both feature and label information. Then, to avoid label leakage, a masked label prediction strategy is employed. 
+ +Pros: +* The presented method shows strong empirical performance on the open graph benchmark dataset. +* The whole framework is simple and the idea is easy to follow. + +Cons: +* What is the label leakage problem? It is not clear to me (1) why label will be leaked during the joint learning process and (2) what the outcome does label leakage bring. +* The writing of this paper is poor. The authors are suggested to polish their paper. Please see minor comments below. +* Although the proposed UniMP achieves state-of-the-art performance, the experiments are not convincing enough. + * Experimental results are not consistent, c.f. Table 4~6 and Table 7. It seems that the standalone Transformer even surpasses UniMP on ogbn-products and ogbn-arxiv. + * It seems that the hyper-parameter specifications vary greatly across the three datasets. To me the residual connection is helpful when stacking many layers (Li et al., 2019; Chen et al., 2020), while in this paper the number of layers is relatively low. A sensitivity analysis on the network depth is necessary to demonstrate the impact of the residual connection. + * When compared with GAT, the main differences are (1) different implementations of self-attention mechanisms and (2) whether to adopt gated residual connections. However, no ablation studies are provided to demonstrate the impact of these two independent components. Especially, as shown in Table 7 and Figure 3, given $X$,$A$,$\hat{Y}$, transformer outperforms GAT. The authors are expected to elaborate on which component (self-attention implementation or gated residual connection) brings the improvement. + +Minor comments: +* Abstract: we adopt a Graph Transformer jointly [using] label embedding? +* Abstract: UniMP ... and be empirical powerful -> is empirical powerful +* Page 2: there are different -> they are different +* Mathematical notations are severely abused; for example, hidden_size should be represented by $f$, and how $\hat{Y}_e$ is transformed from $\hat{Y}$ is not clear. +",4,4.0,ICLR2021 +Syedv6PB3X,1,SkeJ6iR9Km,SkeJ6iR9Km,Interpretable VAE with sparse coding,"This paper presents variational sparse coding (VSC). VSC combines variational autoencoder (VAE) with sparse coding by putting a sparse-inducing prior -- the spike and slap prior -- on the latent code z. In doing so, VSC is capable of producing sparse latent code, utilizing the latent representation more efficiently regardless of the total dimension of the latent code, meanwhile offers better interpretability. To perform traceable inference, a recognition model with the same mixture structure as the spike and slap prior is used to produce the approximate posterior. Experimental results on both MNIST and Fashion-MNIST show that even though VSC performs comparably worse than VAE in terms of ELBO, the representation it learns is more robust in terms of the total latent dimension in a downstream classification task. Additionally, the authors show that VSC provides better interpretability by interpolating the latent code and find that some dimensions correspond to certain characteristics of the data. + +Overall, the paper is clearly written and easy to follow. VSC is reasonably motivated and the idea behind it is quite straightforward. 
Technical-wise, the paper is relatively incremental -- all of the building blocks for performing tractable inference are standard: Since the posterior is intractable for nonlinear sparse coding, a recognition network is used; the prior is spike and slap, thus the recognition network will output parameters in a similar mixture structure with both a spike and a slap component; to apply reparametrization trick on the non-differentiable latent code, a continuous relaxation, similar to the one used in concrete distribution/Gamble trick, is applied to approximate the step selection function with a controllable ""temperature"" parameter. Overall, the novelty is not the strong suit of the paper. I do like the idea of VSC and its ability to learn interpretable latent features for complex non-linear models though. I have two major comments regarding the execution of the experiment that I hope the authors could address: + +1. It is understandable that VSC is not able to achieve the same level of ELBO with VAE, as is quite common in models which trade off performance with interpretability. However, one attractive property of VAE is its ability to produce relatively realistic samples right from the prior, since its latent space is fairly smooth. It is not clear to me if VSC has the same property -- my guess is probably not, judging from the interpolation results currently presented in the paper. It would be interesting if the authors could comment on this and maybe include some examples to illustrate it. + +2. As is known in some recent literature, e.g. Alemi et al. Fixing a broken ELBO (2018), VAE can be easily trained to simply ignore the latent representation, hence produce terrible performance on a downstream classification task. I don't know exactly how the data is processed, but on MNIST, an accuracy of less than 90% means it is quite bad (I can get >90% with PCA + logistic regression). I wonder if the authors have explored the idea of learning better representation by including a scalar in front of the KL term -- or if VSC is more robust to this problem of ignoring latent code. + +Minor comments: + +A potential relevant reference: Ainsworth et al., Interpretable VAEs for nonlinear group factor analysis (2018). +",5,4.0,ICLR2019 +BJxv6enuhm,1,r1xlvi0qYm,r1xlvi0qYm,[Updated] Original title: ACT is a crucial missing baseline.,"This paper deals with Memory Augmnted Neural Networks (MANN) and introduces an algorithm which allows full writes to the dense memory to be only exectued every L timesteps. The controller produces a hidden output at most timestps, whih is appended to a cache. Every L steps, soft attention is used to combine this cache of N hidden states to a single one, and then this is used as the input hidden state for the controller, with the outputs performing a write in the full memory M, along with clearing the cache. + +The authors first derive ""Uniform Writing"" (UW) which updates the memory at regular intervals instead of every timestep. The derivation is based on the ""contribution"" which is norm of the gradient of some input timestep to some hidden state (potentially at a different timestep). I am not clear on whether this terminology for the quantity is novel, if this is the case maybe the authors should state this more clearly. UW says that if all timesteps are equally important, and only D writes can be made in a sequence of length T, then writes should be done every T/(D+1) steps. 
I have not checked the proof in detail but this seems reasonable that it would maximise the contribution quantity introduced. I am less clear on whether this is obviously the right thing to do - sometimes this value is referred to in relation to information, but that term does not strictly seem to be being used in the information theory sense (no mention of bits or nats anywhere). Regardless, as the authors point out, in real problems there are obviously timesteps which have less or no useful information, and clearly UW is mostly defined in order to build towards CUW. + +CUW expands on UW by adding the cache of different hidden states, and using soft attention over them. This feels like a reasonable step, although I would presume there are times when the L hidden states were collected over timesteps with no information, and so the resulting write is not that useful, and times when all of hte L timesteps contain different useful information. In these circumstances it seems like the problem of getting the *useful* information into the memory is still present, as the single write done with the averaged hidden state will need to contain lots of information, which may be more ideal written with several timesteps. + +The experiments are well described and overall the paper seems reproducable. The standard toy datasets of copy / reverse / sinusoid are used. The results are interesting - regular DNC with memory size 50 performs surprisingly badly on clean Sinusoid, my guess would be that with hyperparameter tuning this could be improved upon. I'm not sure that using exactly the same hyperparameters for a wide variety of models is appropriate - even with optimizers like Adam and RMSProp, I would want to see at least some sweeping for the best hyperparams, and then graphs like figure 3 should show error bars averaged across multiple runs with the best per-model hyperparameters. However, The DNC with CUW seems to perform well across all synthetic tasks. + +There is no mention of Adaptive Computation Time/ACT (Graves, https://arxiv.org/abs/1603.08983) throughout the paper, which is surprising considering Alex Graves' models form two of the baselines used throughout the paper. ACT aims to execute an RNN a variable number of times, usually to do >1 timestep of processing for a single timestep of input. In the context of this paper, I believe it could be adapted to do either zero or one steps of computation per timestep, and that would yield a very comparable network where the LSTM controller always executes, and writes to the memory only happen sometimes. Given that it allows a learned process to decide whether to write, as opposed to having a fixed L which separates full writes, this should have the potential to outperform CUW, as it could learn that at certain times, writes must happen at every step. In my view ACT is attempting to solve essentially the same problem as this paper, so it should either be included as a baseline, or the manuscript should be updated to explain why this is not an appropriate comparison. + + +I think this is an interesting paper, trying to make progress on an important problem. The results look good, but I can only give a borderline score due to missing ACT numbers, and a few other unclear points. The addition of ACT experiments, and error bars on certain results, would change my mind here. + + +Notes: + +""No solution has been proposed to help MANNs handle ultra long sequence"" - (Rae et al 2016) is an attempt to do this, by improving the complexity of reads / writes. 
This allows bigger memory and longer sequences to be processed. + +""Current MANNS only support dense writing"" - presumably this means dense as in 'every timestep', but this terminology is overloaded - you could consider NTM / DNC as doing dense writing, and then work of Rae et al 2016 doing sparse writing. + +In my experience training these kind of RNNs can have reasonably high variance across seeds - figures 2 & 3 should have error bars, and especially Table 4 as that contains the most important results. Getting 99 percent accuracy when previous SOTA is only 0.1% lower is only really meaningful if the standard deviation across seeds is very small. + +Appendix A: the 'by induction' result - I believe there is an error, it should be: + +h_t = \sigma_{i=1}^t U_{t-i}W x_i + C + +As W is applied to inputs, before the repeated applications of U? I believe the rest of the derivation still holds the same, after the correction. + +",7,4.0,ICLR2019 +HkVGxmGVx,2,Hy8X3aKee,Hy8X3aKee,Unique angle for modeling heterogeneous sequences,"Because the authors provided no further responses to reviewer feedback, I maintained my original review score. + +----- + +This paper takes a unique approach to the modeling of heterogeneous sequence data. They first symbolize continuous inputs using a previously described approach (histograms or maximum entropy), the result being a multichannel discrete sequence (of symbolized time series or originally categorical data) of ""characters."" They then investigate three different approaches to learning an embedding of the characters at each time step (which can be thought of as a ""word""): +1) Concatenate characters into a ""word"" and then apply standard lookup-based embeddings from language modeling (WDE) +2) Embed each character independently and then sum over the embeddings (SCE) +3) Embed each character as a scalar and concatenate the scalar embeddings (ICE) +The resulting embeddings can be used as inputs to any architecture, e.g., LSTM. The paper applies these methods primarily to event detection tasks, such as hard drive failures and seizures in EEG data. Empirical results largely suggest the a recurrent model combined with symbolization/embedding outperforms a comparable recurrent model applied to raw data. Results are inconclusive as to which embedding layer works best. + +Strengths: +- The different embedding approaches, while simple, are designed to tackle a very interesting problem where the input consists of multivariate discrete sequences, which makes it different from standard language modeling and related domains. The proposed approaches offer several different interesting perspectives on how to approach this problem. +- The empirical results suggest that symbolizing the continuous input space can improve results for some problems. This is an interesting possibility as it enables the direct application of a variety of language modeling tools (e.g., embeddings). + +Weaknesses: +- The LSTMs (one layer each of 8, 16, and 15 cells, respectively) used in the three experiments sound *very* under capacity given the complexity of the tasks and the sizes of the data sets (tens to hundreds of thousands of sequences). That might explain both the relatively small gap between the LSTMs and logistic regression *and* the improvement of the embedding-based LSTMs. Hypothetically, if quantizing the inputs is really useful, the raw data LSTMs should be able to learn this transformation, but if they are under capacity, they might not be able to dos. 
What is more, using the same architecture (# layers, # units, etc.) for very different kinds of inputs (raw, WdE, SCE, ICE, hand-engineered features) is poor methodology. Obviously, hyperparameters should be tuned independently for each type of input. +- The experiments omit obvious baselines, such as trying to directly learn an embedding of the continuous inputs. +- The experimental results offer an incomplete, mixed conclusion. First, no one embedding approach performs best across all tasks and metrics, and the authors offer no insights into why this might be. Second, the current set of experiments are not sufficiently thorough to conclude that quantization and embedding is superior to working with the raw data. +- The temporal weighting section appears out of place: it is unrelated to the core of the paper (quantizing and embedding continuous inputs), and there are no experiments to demonstrate its impact on performance. +- The paper omits a large number of related works: anything by Eamonn Keogh's lab (e.g., Symbolic Aggregate approXimation or SAX), work on modifying loss functions for RNN classifiers (Dai and Le. Semi-supervised sequence learning. NIPS 2015; Lipton and Kale, et al. Learning to Diagnose with LSTM Recurrent Neural Networks. ICLR 2016), work on embedding non-traditional discrete sequential inputs (Choi, et al. Multi-layer Representation Learning for Medical Concepts. KDD 2016). + +This is an interesting direction for research on time series modeling with neural nets, and the current work is a good first step. The authors need to perform more thorough experiments to test their hypotheses (i.e., that embedding helps performance). My intuition is that a continuous embedding layer + proper hyperparameter tuning will work just as well. If quantization proves to be beneficial, then I encourage them to pursue some direction that eliminates the need for ad hoc quantization, perhaps some kind of differentiable clustering layer?",3,4.0,ICLR2017 +rJgOwFa1oB,3,BJlJVCEYDB,BJlJVCEYDB,Official Blind Review #3,"This paper presents a computational model of motivation for Q learning and relates it to biological models of motivation. Motivation is presented to the agent as a component of its inputs, and is encoded in a vectorised reward function where each component of the reward is weighted. This approach is explored in three domains: a modified four-room domain where each room represents a different reward in the reward vector, a route planning problem, and a pavlovian conditioning example where neuronal activations are compared to mice undergoing a similar conditioning. + +Review Summary: +I am uncertain of the neuroscientific contributions of this paper. From a machine learning perspective, this paper has insufficient details to assess both the experimental contributions and proposed formulation of motivation. It is unclear from the discussion of biological forms of motivation, and from the experimental elaboration of these ideas, that the proposed model of motivation is a novel contribution. For these reasons, I suggest a reject. + +The Four Rooms Experiment: + +In the four-rooms problem, the agent is provided with a one-hot encoding representing which cell it the agent is located in within the grid-world. The reward given to the agent is a combination of the reward signal from the environment (a one-hot vector where the activation is dependent on the room occupied by the agent) and the motivation vector, which is a weighting of the rooms. 
One agent is given access to the weighting vector mu in its state vector: the motivation is concatenated to the position, encoding the weighting of the rooms at any given time-step. The non-motivated agent does not have access to mu in its state, although its reward is weighted as the motivated agent’s is. The issue with this example is that the non-motivated agent does not have access to the information required to learn a value-function suitable to solve this problem. By not giving the motivation vector to non-motivated agent, the problem has become a partially observable problem, and the comparison is now between a partially observable and fully observable setting, rather than a commentary on the difference between learning with and without motivation. + +In places, the claims made go beyond the results presented. How do we know that the non-motivated network is engaging in a ""non-motivated delay binge""? We certainly can see that the agent acquires an average reward of 1, but it is not evident from this detail alone that the agent is engaging in the behaviour that the paper claims. + +Moreover, the network was trained 41 times for different values of the motivation parameter theta. Counting out the points in figure 2, it would suggest that the sweep was over 41 values of theta, which leaves me wondering if the results represent a single independent trial, or whether the results are averaged over multiple trials. Looking at the top-right hand corner I see a single yellow dot (non-motivated agent) presented in line with blue (motivated agent) suggesting that the point is possibly an outlier. Given this outlier, I’m led believe that the graph represents a single independent trial. A single trial is insufficient to draw conclusions about the behaviour of an agent. + +The Path Routing Experiment: + +In the second experiment, where a population of agents is presented in fig 5, it is claimed that on 82% of the trials, the agent was able to find the shortest path. Looking at the figure itself, at the final depicted iteration, all of the points are presented in a different colour and labelled “shortest path”. The graph suggests that 100% of the agents found the shortest path. The claim is made that for the remaining 18% of the agents, the agents found close to the shortest path—a point not evident in the figures presented. + + +Pavlovian Conditioning Experiment: + +In the third experiment, shouldn’t Q(s) be V(s)? In this setting, the agent is not learning the value of a state action pair, but rather the value of a state. Moreover, the value is described as Q(t), where t is the time-step in the trial; however, elsewhere in the text it is mentioned that the state is not simply t, but contains also the motivation value mu. + +The third experiment does not have enough detail to interpret the results. It is unclear how many trials there were for both of the prediction settings. It is unclear whether the problem described is a continuing problem or a terminating prediction problem—i.e., whether after the conditioned stimulus and unconditioned stimulus are presented to the agent, does the time-step (and thus the state) reset to 0, or does time continue incrementing? If it is a terminating prediction problem, it is unclear whether the conditioned stimulus and unconditioned stimulus were delivered on the same time-steps for each independent trial. 
If I am interpreting the state-construction correctly, the state is incrementing by one on each time-step; this problem is effectively a Markov Reward Process where the agent transitions from one state to the next until time stops with no ability to transition to previous states. + +In both the terminating and continuing cases, the choice of inputs is unusual. What was the motivation for using the time-step as part of the state construction? + +How is the conditioned stimulus formulated in this setting? It is mentioned that it is a function of time, but there are no additional details. + +From reading the text, it is unclear whether fig 7b/c presents activations over multiple independent trials or a single trial. + +General Thoughts on Framing: + +This paper introduces non-standard terms without defining them first. For example, TD error is introduced as Reward Prediction Error, or RPE: a term that is not typically used in the Reinforcement Learning literature. To my understanding, there is a hypothesis about RPE in the brain in the cognitive science community; however, the connection between this idea in the cognitive science literature and its relation to RL methods is not immediately clear. + +Temporal Difference learning is incorrectly referred to as ""Time Difference"" learning (pg 2). + +Notes on technical details: + +- The discounting function gamma should be 0<= gamma <=1, rather than just <=1. + +- discounting not only prevents the sum of future rewards from diverging, but also plays an important role in determining the behaviour of an agent---i.e., the preference for short-term versus long-term rewards. + +- pg 2 ""the motivation is a slowly changing variable, that is not affected substantially by an average action"" -- it is not clear from the context what an average action is. + +- Why is the reward r(s|a), as opposed to r(s,a)? + +Notes on paper structure: + +- There are some odd choices in the structure of this paper. For instance, the second section---before the mathematical framing of the paper has been presented---is the results section. + +- In some sentences, citations are added where no claim is being made; it is not clear what the relevance of the citation is, or what the citation is supporting. E.g., “We chose to use a recurrent neural network (RNN) as a basis for our model” following with a citation for Sutton & Barto, 1987. + +- In some sentences, citations are not added where substantial claims are being made. E.g, “The recurrent network structure in this Pavlovian conditioning is compatible with the conventional models of working memory”. This claim is made, but it is never made clear what the conventional computational models of working memory are, or how they fit into the computational approaches proposed. + +- Unfortunately, a number of readers in the machine learning community might be unfamiliar with pavlovian conditioning and classical conditioning. Taking the time to unpack these ideas and contextualise them for the audience might help readers understand the paper and its relevance. + +- Figure 7B may benefit from displaying not just the predicted values V(s), but a plot of the prediction over time in comparison to the true expected return. +",1,,ICLR2020 +HkgBnP1b2r,3,S1xJikHtDH,S1xJikHtDH,Official Blind Review #4,"In this paper the authors propose a method for improving ""model compatibility"" in GANs. 
For this reason they add to the loss of the generation procedure a term that depends on the maximum mean discrepancy between the following datasets: (1) the output of a classifier with input the real dataset, (2) the output of the same classifier with input GAN-generated samples. They authors show that in essentially all the datasets they tried, the model compatibility of the produces generator is increased after adding the aforementioned cost, while the visual quality of the data is not decreased. + +Strengths: + +- The low model compatibility of GANs is a very important disadvantage and hence improving this aspect of GANs is a relevant problem. + +Weaknesses - Comments: +A. The increase in the model compatibility is very mild. Especially in CIFAR-10, the increase in very small. + +B. In MNIST the increase is larger than CIFAR-10 but the initial model compatibility using vanilla GANs is smaller. The reason might be that for MNIST much simpler classification algorithms have been used. This maybe suggests that the proposed method affects more the model compatibility of methods that achieve lower model compatibility in before the addition of the extra cost term. + +Minor Comments: +1. In equation (1) it looks strange that the summation is over A but A does not appear at all in the summand. I suggest you replace h and h' with A(D) and A(D') so that this is clear. +2. In Theorem 2, \hat{L}_G is used but for the proof the authors have replaces \hat{M} with M. There should be a comment for that. In general I believe that Theorem 2 is almost trivial and does not add value to this clearly experimental paper.",3,,ICLR2020 +ryx5svN5hm,1,HJgODj05KX,HJgODj05KX,see review,"The paper talks about a method to combine preconditioning at the per feature level and Nesterov-like acceleration for SGD optimization. + +The explanation of the method in Section 3 should be self-contained. The main result, computational context, etc., are poorly described, so that it would not be easily understandable to a non-expert. + +What was the reason for the choice of the mini batch size of 128. I would guess that you would actually see interesting differences for the method by varying this parameter. + +How does this compare with the FLAG method of Chen et al from AISTATS, which is motivated by similar issues and addresses similar concerns, obtaining stronger results as far as I can tell? + +The figures and captions and inserts are extremely hard to read, so much so that I have to trust it when the authors tell me that their results are better. + +The empirical evaluation for ""convex problems"" is for LS regression. Hmm. Is there not a better convex problem that can be used to illustrate the strength and weaknesses of the method. If not, why don't you compare to a state-of-the-art least squares solver. + +For the empirical results, what looks particularly interesting is some tradeoffs, e.g, a slower initial convergence, that are shown. Given the limited scope of the empirical evaluations, it's difficult to tell whether there is much to argue for the method. But those tradeoffs are seen in other contexts, e.g., with subsampled second order methods, and it would be good to understand those tradeoffs, since that might point to where and if a methods such as this is useful. 
+ +The conclusions in the conclusion are overly broad.",5,3.0,ICLR2019 +wYTSIU0m2rs,4,3FkrodAXdk,3FkrodAXdk,"This paper attempts to introduce new diversity measures for ensembles that better correlates with accuracy, thus enabling an effective selection strategy of the components. ","The major issue with this paper is that the definitions of the new diversity measures HQ are not properly laid out, and an explanation/intuition why they would correlate with accuracy is not provided. I understand that the algorithm is provided in the Appendix, but it comes with barely any explanation still. The explanations and motivation in 2.2. are vague, contain undefined terms and concepts, and they are completely unclear. The idea of focal model (which seems to be crucial) is never explained. Furthermore, it's unclear why this paper focuses on deep learning models. Do the proposed diversity measures work only for deep learning models? Why? Would they work also on other models, and regardless of the models used for the ensemble components? What is about these measures that make them correlate with accuracy? + +- The authors should clarify early on in the paper that the focus of this work is on ensembles for supervised learning. Work has been done also on selective ensemble techniques for unsupervised learning (e.g. clustering or anomaly detection), but this is not the scope of this work. + +- I find the discussion in Section 2 quite lengthy and trivial. It's well known that enumerating all subsets of an ensemble of a given size is unfeasible. I would shorten it to leave space to core explanations. + +- Results of Table 1: the authors state that the mean threshold is computed on 1013 candidate deep ensembles. Which deep models were used? The authors need also to clarify that M=10 (this becomes clear only later) in this experiment, and how the individual subsets were generated (randomly?). State upfront what's the purpose of this experiment. + +- Figure 1: I don't think (c) is showing much of ""a sharper trend in terms of the relationship between ensemble diversity and ensemble accuracy"" as claimed by the authors. For a given diversity value below the threshold, there is still a range of accuracies being obtained by the ensembles with that diversity level. So I am not sure what's the added value of plot (c). I understand the authors are trying to motivate their choice of comparing ensembles of the same size, but this plot does not serve the purpose well. + +- When introducing the concept of focal model in 2.2: ""..., we introduce the concept of focal model to obtain the set of negative samples for computing the diversity scores of ensembles by taking each member model in turn as the focal model."" This statement is totally obscure to me. What are the negative samples? How is a focal model selected. The explanations in this paragraph are not useful. + +- The HQ (alpha) method does not work that well. It's comparable to standard diversity measures on CIFAR-10, and worse than standard diversity measures on Cora. This raises the question whether it's basically the extra step of clustering that gives an advantage to HQ (alpha + K). + +",3,5.0,ICLR2021 +Sklg9J-a37,3,rJgP7hR5YQ,rJgP7hR5YQ,"interesting paper, but missed quantitative analysis and comparisons.","[Overview] + +In this paper, the authors studied the problem of composition and decomposition of GANs. 
Motivated by the observations that images are naturally composed of multiple layouts, the authors proposed a new framework to study the compositional image generation and its decomposition by defining several tasks. On those various tasks, the authors demonstrate the possibility of the proposed model to composing image components and decompose the images afterwards. These results are interesting and insightful to some extent. + +[Strengthes] + +1. The authors proposed a framework for compose images from components and decompose the images into components. Based on this new framework, the authors tried different settings, by fixing the learning of one or more modules in the model. The experiments on various tasks are appreciated. + +2. In the experiments, the authors tried both image and text to demonstrate the concepts in this paper. Moreover, some qualitative results are presented. + +[Weaknesses] + +1. The authors performed multiple experiments regarding various tasks defined in this paper.However, I can hardly find any quantitative evaluation for the results. It is not clear to me that how the quality of the composed images and the decomposed components from images are. I would suggest the authors derive some metric to measure quality quantitatively, provide some statistics on the whole datasets. + +2. In this paper, the authors proposed multiple tasks in terms of which parts are fixed and known in the training process. However, dominated by so many different tasks, the core idea is losses in the paper. From the paper, I cannot get the core idea the authors want to deliver. I would suggest the authors focus on one certain task and perform more qualitative and quantitative analysis and comparisons, as also mentioned above. + +3. The proposed model has several tricky parts. First, the number of components are pre-determined. However, in realistic cases, the number of components are unknown, and thus how many component generators should be used is ill-posed. Second, the composing operation is simple and tricky. Such a simple composing operation make it hard to adapt to some more complicated data, such as cifar10 or so. Thirdly, almost all tasks need some components known. Even for the Task 4, c is known, and the model performs poorly for generating the disentangled components. + +4. The authors missed one very relevant paper: + +LR-GAN: Layered Recursive Generative Adversarial Networks for Image Generation. Yang et al. + +In the above paper, the authors proposed an end-to-end model for generating images with background and foreground compositionally. It can be applied to a number of realistic datasets. Regardless of the decomposition part in this paper, the proposed method in the above paper seems to be clearly superior to the composition part in this paper considering this paper fails on Task 4. The authors should give credit to the above paper (even the synthesized MNIST dataset looks similar ) and pay some efforts to explain the advantages in comparison it. + +[Summary] + +This paper proposed a new framework to study the compositionally of images during generation and decomposition. Through several experiments on various tasks, the authors presented some interesting results and provided some insights on the potentials and difficulties in this direction. However, as pointed above, I think this paper lacks enough experimental analysis and comparison. Its core idea hard to capture. 
Also, it missed a comparison to some related work.",4,5.0,ICLR2019 +EM6koFBcsuU,2,hsFN92eQEla,hsFN92eQEla,Empirical Evaluation of Cross Entropy vs Square Loss ,"The paper compares cross-entropy and squared losses on a wide range of tasks. The focus primarily is on thorough empirical evaluation of these two losses on NLP, Vision and Speech tasks. The paper shows that squared loss is better or competitive with cross-entropy loss in most cases. Most of the experiments and comparisons seem to be well done; parameters, setups etc. are well explained. + +I believe a few additional tasks and settings would have helped put a better picture of the comparison. +For speech, the application of the two losses in tasks beyond ASR might give more insights. Similar for vision, classification tasks other than image classification (on MNIST/CIFAR and Imagenet) might be useful. + +Additional learning settings might also be useful. For example, there is considerable amount of work (e.g R1 and R2 below) on noisy label learning, where the loss function (often cross entropy losses) is modified for noisy label condition. What do we expect for these two losses in this noisy label setting ? Essentially what can we expect in learning settings beyond the supervised training in vanilla form. + +For square loss, scaling is done for a large number of classes. The motivation behind it is not very clear. Also, it is perhaps worth showing results for squared loss with softmax outputs. Especially for the large number of classes. Can softmax be advantageous in this case for performance, even though computationally it will be worse than the scaling mechanism applied. + +R1. Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels +R2. Normalized Loss Functions for Deep Learning with Noisy Labels +",6,4.0,ICLR2021 +Hyx08Tcp2Q,3,ByleB2CcKm,ByleB2CcKm,Learning procedural abstractions and evaluating discrete latent temporal structure,"In ""Learning procedural abstractions and evaluating discrete latent temporal structure"" the authors develop a hierarchical Bayesian model for patterns across time in video data. They also introduce new metrics for understanding structure in time series (completeness and homogeneity). This work is appropriate for ICLR. They provide some applications to robotics, suggesting that this could be used to teach robots to act in environments by learning from videos. + +This manuscript paid quite close attention to quality of segmentation, in which actions in videos are decomposed into component parts. It is quite hard to determine groundtruth in such situations and many metrics abound, and so a thorough discussion and comparison of metrics is useful. + +The state of the art for Bayesian hierarchical models for segmentation is Fox et al., which is referenced heavily by this work (including the use of test data prepared in Fox et al.) I wonder why the authors drop the Bayesian nonparametric nature of the hierarchy in the section ""Modeling realizations in each time-series"" (i.e., for Fox et al., the first unnumbered equation in this section would have had arbitrary s). + +I found that the experiments were quite thorough, with many methods and metrics compared. However, I found the details of the model to be quite sparse, for example it's unclear how Figure 5 is that much different from Fox et al. But, overall I found this to be a strong paper. 
+",6,3.0,ICLR2019 +PiMZzwLC_9a,4,ZHJlKWN57EQ,ZHJlKWN57EQ,Serious problems with the approach,"I think the use of QPyTorch for the experiments here invalidates the results since the intermediate matrix multiplies are done in single precision (FP32), and so are more optimistic than a pure 16-bit implementation. (This is both according to the authors Sec 4, experiment setup; and according to the QPyTorch paper arxiv:1910.04540, Sec 3 intro.) For these kinds of experiments to be meaningful, they have to be done on native 16-bit hardware which luckily is becoming more common, e.g., Google's TPUs or the newer NVIDIA GPUs. + +There are two other problems. First, it is not clear how stochastic rounding would be implemented in hardware. Doing it for every MAC operation could likely be even more expensive than just doing 32-bit MAC operations, since it involves the generation of random numbers, division, etc. Second, Kahan summation takes up twice the weight storage, so a more detailed calculation is needed to compare any hardware/energy savings to use that instead of just 32-bit. + +As an aside, it may be interesting in Figure 1 to zoom in on the initial part of training to understand where the difference between 32-bit and standard 16-bit comes from in early training since at that point, the gradients are generally larger than later on in training. +",3,4.0,ICLR2021 +r1egbGCth7,1,B1fA3oActQ,B1fA3oActQ,"Review of ""GraphSeq2Seq: Graph-Sequence-to-Sequence for Neural Machine Translation""","This paper proposes a method for combining the Graph2Seq and Seq2Seq models into a unified model that captures the benefits of both. The paper thoroughly describes in series of experiments that demonstrate that the authors' proposed method outperforms several of the other NMT methods on a few translation tasks. + +I like the synthesis of methods that the authors' present. It is a logical and practical implementation that seems to provide solid benefits over the existing state of the art. I think that many NMT researchers will find this work interesting. + +Table 4 begs the question, ""How does one choose the number of highway layers?"" I presume that the results in that table are from the test data set. Using the hold out data set, which number gives the best value? + +The paper's readability suffers from poor grammar in some places. This fact may discourage some readers. + +The authors should fix the missing parentheses in Eqns. (6)-(9).",6,3.0,ICLR2019 +zjeBkkuNAcT,3,3Wp8HM2CNdR,3Wp8HM2CNdR,"A simple and clean method, with very weak experiments. ","This paper proposed an MSE loss function with whitening (W-MSE) for self-supervised representation learning. The motivation is to reduce the demand of negative examples in contrastive representation learning. The proposed W-MSE loss function is compared with popular contrastive loss on a few benchmarks. + +Contrastive learning is a popular topic in the self-supervised learning domain. Most of the existing methods rely on negative samps to avoid trivial solutions. This paper proposed a simple and clean solution to tackle this problem, by using whitening in the loss term. This paper is also well explained and illustrated. + +The main weakness of the paper is the experiments. Based on the results, I am not convinced that the proposed W-MSE is effective. Here are more comments: + +(1) The experimental results are pretty close to the existing contrastive/BYOL baselines. On CIFAR-10 and STL-10, results are saturated thus the diff is very minor. 
On more challenging CIFAR-100 and Tiny ImageNet, the results are mixed when compared to BYOL. + +(2) Given that the results are very close, what are the other benefits of using W-MSE loss? For example, is the training speed faster than the other methods (contrastive, BYOL) with negative sampling? It would be nice to include such results to demonstrate the effectiveness of W-MSE. + +(3) This paper claims that without negative sampling, W-MSE loss encourages the use of more positive pairs in the batch. I'm wondering if the authors have tried more positive pairs beyond 4 in the paper. + +(4) I cannot understand the motivation of ablating the popular methods with Euclidean distance. Cosine similarity is just simple and it costs nothing compared to Euclidean distance. I think ablating this from existing contrastive loss is not sufficient to show the effectiveness of BYOL. + +(5) It would be nice to include a few comparable numbers from literature directly, instead of reproducing their methods. For example, this paper used ResNet18 while the BYOL paper used ResNet50. This paper reports results on Tiny ImageNet while the existing methods report on ImageNet. Note that I do not penalize this paper for this point, as computational cost is high for using larger architecture and larger dataset. But it would be good to have a comparable baseline in the middle ground, e.g. ResNet50 architecture on CIFAR-100 dataset, where BYOL reports 78.4% top-1 accuracy with linear eval. + +======Post Rebuttal Update====== + +I would like to thank the authors for their rebuttal, which has addressed part of my concerns. After reading the authors' rebuttal and other reviewers' comments, I'm still concerned on the weak baselines and mixed results in this paper. Unfortunately, I will keep my rating. + +Concrete suggestions to improve this paper in the future: + +(1) Strongly recommended: use ResNet50 instead of ResNet18 for the small scale experiments, in this way you get directly the numbers from the literature (e.g. BYOL on CIFAR-100); + +(2) Nice to have: for the expensive ImageNet experiments, it would be nice to get comparable results using the smallest comparable architecture (e.g. ResNet50) from the literature. SimCLR claims that ""With 128 TPU v3 cores, it takes ∼1.5 hours to train our ResNet-50 with a batch size of 4096 for 100 epochs"" and MoCo claims that ""For IN-1M, we use a mini-batch size of 256 in 8 GPUs, ..., train for 200 epochs ..., taking ∼53 hours training ResNet-50.""",5,4.0,ICLR2021 +H1xNwye0YH,1,rkxuWaVYDB,rkxuWaVYDB,Official Blind Review #1,"This paper investigates the design of adversarial policies (where the action of the adversarial agent corresponds to a perturbation in the state perceived by the primary agent). In particular it focuses on the problem of learning so-called optimal adversarial policies, using reinforcement learning. + +I am perplexed by this paper for a few reasons: +1) What is the real motivation for this work? The intro argues “casting optimal attacks is crucial when assessing the robustness of RL policies, since ideally, the agent should learn and apply policies that resist *any* possible attack”. If the goal is to have agents that are robust to *any* attacks, then they cannot be robust just to so-called optimal attacks. And so what is really the use of learning so-called optimal attacks? +2) The notion itself of “optimal” attack is not clear. The paper does not properly discuss this. 
It quickly proposes one possible definition (p.4): “the adversary wishes to minimize the agent’s average cumulative reward”. This is indeed an interesting setting, and happens to have been studied extensively in game-theoretic multi-agent systems, but the paper does not make much connection with that literature (apart from a brief mention at bottom of p.2 / top of p.3), so it’s not clear what is new here compared to this. It’s also not discussed whether it would ever be worthwhile considering other notions of optimality for the adversary, and what would be the properties of those. + +So overall, while I find the general area of this work to be potentially interesting, the current framing is not well motivated enough, and not sufficiently differentiated from other work in robust MDPs and multi-agent RL to make a strong contribution yet. + +More minor comments: +- P.3: “very different setting where the adversary has a direct impact on the system” => Clarify what are the implications of this in terms of framework, theory, algorithm. +- P.4: You assume a valid Euclidean distance for the perturbed state. Is this valid in most MDP benchmarks? How is this implemented for the domains in the experiments? What is the action space considered? Do you always assume a continuous action space for the attacker? +- P.5: “we can simply not maintain distributions over actions” -> Why not? Given the definition of perturbation, this seems feasible. +- P.5: Eqn 4 is defined for a very specific adversarial reward function. Did you consider others? Is the gradient always easy to derive? +- P.6: Eqn (5) & (6): What is “R” here? +- P.7: Figure 1, top right plot. Seems here that the loss is above 0 for small \epsilon. Is this surprising? Actually improving the policy? +- P.7: What happens if you consider even greater \epsilon? I assume the loss is greater. But then the perturbation would be more detectable? How do you think about balancing those 2 requirements of adversarial attacks? How should we formalize detectability in this setting? +- Fig.2: Bottom plots are too small to read. +- Sec.6: Can you compare to multi-agent baselines, e.g. Morimoto & Doya 2005. +- P.8: “We also show that Lipschitz policies have desirable robustness properties.” Can you be more specific about where this is shown formally? Or are you extrapolating from the fact that discrete mountain car suffers more loss than continuous mountain car? I would suggest making that claim more carefully. + +",3,,ICLR2020 +HJSCsaf4g,2,r1yjkAtxe,r1yjkAtxe,Variant of an SMDP model for skill learning ,"The framework of semi-Markov decision processes (SDMPs) has been long used to model skill learning and temporal abstraction in reinforcement learning. This paper proposes a variant of such a model called a semi-aggregated MDP model. + +The formalism of SAMDP is not defined clearly enough to merit serious attention. The approach is quasi heuristic and explained through examples rather than clear definition. The work also lacks sufficient theoretical rigor. + +Simple experiments are proposed using 2D grid worlds to demonstrate skills. Grid worlds have served their purpose long enough in reinforcement learning, and it is time to retire them. More realistic domains are now routinely used and should be used in this paper as well. 
+ +",4,5.0,ICLR2017 +rkxMwMsFn7,1,HyzdRiR9Y7,HyzdRiR9Y7,"Good paper, contribution moderate, experiments promising","My summary: A new model, the UT, is based on the Transformer model, with added recurrence and dynamic halting of the recurrence. The UT should unite the computational universality properties of Neural Turing Machines and Neural GPU with good performance on disparate language and algorithmic tasks. + +(I have read your author feedback and have modified my rating according to my understanding.) + +Review: +The paper is well written and proofread, concrete and clear. The model is quite clearly explained, especially with the additional space of the supplementary material, appendices A and B (note fig 4 is less good quality than fig 2 for some reason) -- I’m fine with the use of the Supp Mat for this purpose. + +The experiments have been conducted well, and demonstrate a wide range of tasks, which seems to suggest that the UT has pretty general purpose. The range of algorithmic tasks is limited, e.g. compared to the NTM paper. +I miss any experimental details at all on training. +I miss a comparison to Neural GPU and Stack RNN in 3.1, 3.2. + +I miss a proof that the UT is computationally equivalent to a Turing machine. It does not have externally addressable, shared memory like a tape, and I’m not sure how to transpose read/write heads either. + +The argument that the UT offers a good balance between inductive bias and expressivity is weak, though it may be the best one can hope for of a statistical model in a way. I note that in 3.1, the Transformer overfits, while it seems to underfit in 3.3 (lower LM and RC accuracy, higher LM perplexity), while the UT fare well, which suggests that the UT hits the balance better than the Transformer, at least. + +From the point of view of network structure, it seems natural to lift further constraints on the model: +why should width of intermediate layers be exactly equal to sequence length? +why should all hidden state vectors be size $d$, the size of the embeddings chosen at the first layer, which might be chosen out of purely practical reasons like the availability of pre-trained word embeddings? + +What is the contribution of this work? It starts from the Transformer, the ACT idea for dynamic halting in recurrent nets, the need for models fit for algorithmic tasks. +The UT’s building blocks are near-identical to the Transformers (and the paper is upfront and does a good job of explaining these similarities, fortunately) +- cf eq1-5: residuals, multi-headed self attention, and layer norm around all this. +- shared weights among all such units +- encoder-decoder architecture +- autoregressive decoder with teacher forcing +- decoder units like the encoder’s but with extra layer of attention to final output of encoder +- coordinate embeddings +The authors may correct me, but I believe that the UT with FC layers is exactly identical to the Transformer described in Vaswani 2017 for T=6. +So this paper introduces the idea of varying T, interprets it as a form of recurrence, and adds dynamic halting with ACT to that. Interestingly, the recurrence is not over sequence positions here. +This contribution is not major, on the other hand the experimental validation suggests the model is promising. 
+ +Typos and writing suggestions +above eq 8: masked such that -> masked so that +eq 8: dimensions of O and H^T are incompatible: d*V, m*d; to evacuate the notation issue for transposition, cf footnote 1, here and elsewhere, you could use either ${^t A}$ or $A^\top$ or $A^\intercal$. You could also write $t=T$ instead of just $T$. +sec3.3 line -1: designed such that -> designed so that +Towards the beginning of the paper, it may be useful to stabilise terminology for $t$: depth (as opposed to width for $m$), time steps, recurrence dimension, revisions, refinements + +",8,4.0,ICLR2019 +SylNE5kstr,2,Syejj0NYvr,Syejj0NYvr,Official Blind Review #3,"This is an interesting work proposing a new robust training method using the adversarial example generated from adversarial interpolation. The experimental results seem surprisingly promising. The ablation studies show that both image and label interpolating help the robustness improvement. + +I think it is important to provide a running time comparison between the proposed method and SOTA robust training method such as Madry's. Since the feature extractor is implemented by excluding the last layer for the logits, the backprop goes through almost the entire model. It seems that the proposed interpolating method has a similar amount of computation as PGD, so the training should take similar time as Madry's if it can converge quickly. + +Also, there are too many works on robustness defense that have been proven ineffective (consider the works by Carlini). Since this is a new way of robust training and there is no certified guarantee, I would be very conservative and suggest the authors refer the checklist in [1] to evaluate the effectiveness of the defense more thoroughly to convince the readers that it really works. Especially, a robustness evaluation under adaptive attacks is necessary. In other words, if the attacker knows the strategy used by the defender, it may be possible to break the model. PGD and CW are non-adaptive since no defender information is provided. + +[1] Carlini, Nicholas, et al. ""On evaluating adversarial robustness."" arXiv preprint arXiv:1902.06705 (2019).",3,,ICLR2020 +xzmW6W8H10O,4,xgGS6PmzNq6,xgGS6PmzNq6,Review of dyadic fairness,"This paper discusses fairness problems in graph embedding for link prediction to mitigate problems such as graph segregation related to sensitive attributes. The paper uses a variational graph autoencoder to learn an embedding, and predicts links based on distance in this embedded space. The notion of dyadic fairness refers to statistical constraints on link prediction between/across sensitive attribute groups, such as parity for positive predictions or false negatives. The learning algorithm updates a weighted adjacency matrix with nonzero entries preserving the observed adjacency structure of the graph, while optimizing edge weights to tradeoff between utility and dyadic fairness of edges predicted from the weights. The paper compares the proposed method to a baseline and a competing fair graph embedding method on several example datasets to show good performance on utility, fairness, and diversity of inter/intra-group link predictions. + +I recommend accepting this paper because it focuses on a problem where there is relatively little previous work, introduces a novel approach, and shows empirically that the approach is competitive. I have a few questions or comments which, if addressed, could improve my rating of the paper. + +0. 
Is it really necessary to constrain the nonzero entries of the adjacency matrix based on observed edges? The paper suggests ""adding fictitious links might mislead the directions of message passing,"" but is there any reason to believe the resulting errors would be any worse than errors from other sources of uncertainty in the problem? It seems possible to me, a priori, that an algorithm without this constraint might achieve both better utility and fairness under some conditions on some examples. This might be worth exploring or mentioning, even if the present paper makes the design choice of keeping this constraint. + +1. It is not clear to me exactly how the theoretical findings motivate the algorithm. Perhaps this would be clearer with a better explanation related to my previous point about the structural constraint. + +2. In the theoretical analysis, immediately after Corollary 4.1, the paper states ""Multiple layers of GNNs can be reasoned out similarly."" This seems like quite a stretch without providing further justification. + +3. It is a little confusing to determine the meaning of ""proportion"" on the horizontal axis in Figure 2. + +4. For the experiments, it's not clear to me why the category of an article should be considered a sensitive attribute in citation networks. Perhaps there is an explanation of the fairness concern in citation networks that would make this more obvious. Instead of so many citation networks, the experiments might be improved with social networks having sensitive attributes like race, religion, or political identification, where there are social science reasons to be especially concerned about social network segregation. I believe there is less reason to believe there will be a high degree of network segregation on the basis of gender than on the basis of those other sensitive attributes, so I would suggest not limiting the sensitive attribute to only gender in the social network examples.",7,3.0,ICLR2021 +B171xj6eM,3,S14EogZAZ,S14EogZAZ,"Nice goal augmentation in state representation for DQN with, unfortunately incomplete and quite preliminary","The authors use a variant of deep RL to solve a simplified 2d physical stacking task. To accommodate different goal stacking states the authors extend the state representation of DQN. The input to the network is the current state of the environment as represented by the 2d projection of the objects in the simulated grid world and a representation of the goal state in the same projection space. The reward function in its basic form rewards only the correctly finished model. A number of heuristics are used to augment this reward function so as to provide shaping rewards along the way and speed up learning. The learnt policy is evaluated on the successful assembly of the target stack and on a distance measure between the stack specified as goal and the actual stack. + +Currently, I don’t understand from the manuscript, how DQN is actually trained. Are all different tasks used on a single network? If so, is it surprising that the network performs worse than when augmenting the state representation with the goal? Or are separate DQNs trained for multiple tasks? + +The definition of value function at the bottom of page 4 uses the definition for continual tasks but in the current setting the tasks are naturally episodic. This should be reflected by the definition. 
+ +It would be good if the authors could comment on any classic research in RL augmenting the state representation with the goal state and any recent related developments, e.g. multi-task RL or the likes of Dosovitskiy & Koltun “Learning to act by predicting the future”. + +It would be helpful do obtain more information about the navigation task, especially a plot of sorts would be helpful. Currently, it is particularly difficult to judge exactly what the authors did. + +How physically “rich” is this environment compared to some of the cited work, e.g. Yildirim et al. or Battaglia et al:? + +Overall it feels as if this is an interesting project but that it is not yet ready for publication. ",5,3.0,ICLR2018 +qYadRr6LzR0,1,kBVJ2NtiY-,kBVJ2NtiY-,Cool idea but a bit limited in a number of ways. I did not find the empirical results fully convincing. ,"This paper introduces an algorithm, called deep reward learning by simulating the past (deep RLSP), that seeks to infer a reward function by looking at states in demonstration data. An example of this described in the paper is an environment with a vase: if demonstration data shows an intact vase in the presence of an embodied agent then breaking the vase is unlikely to be the intended behavior. Otherwise the vase would already be broken in the demo. + +To achieve this, the paper assumes a Boltzmann distribution on the demonstration policy and a reward function that is linear in some pre-trained state features. The paper then derives a gradient of the log probability of a demonstration state. The gradient estimator involves simulating a possible past from a demonstration state (using a learned inverse policy and inverse transition function) and then simulating forward from the possible past (using the policy and a simulator) The gradient is then the difference between features counts from the backward and forward simulations. + +The paper is generally clearly written and works on a crucial problem in reinforcement learning, namely how to specify a human preference without resorting to tedious reward engineering. Novel, scalable approaches to this problem would certainly be of interest to the ICLR community. The primary technical contribution of the paper is the derivation of the gradient estimator which is correct. + +I find the idea of the paper very interesting and the results showing meaningful behavior emerge from a single demonstration are quite nice. However I think the paper is limited in a number of ways: +- It requires access to a pretrained state representation +- It requires access to a simulator of the environment which requires being able to reset the environment to arbitrary states. This seems quite limiting for real world applications. Worryingly, appendix D states that learning a dynamics model was attempted by the authors but failed to yield good results. +- I think the choice of evaluation environments is a little odd and simplistic. I think environments more aligned with the eventual application areas for a method such as Deep RLSP would make the paper much more compelling. Given the motivation of the paper, I think perhaps manipulation environments where a robot arm interacts with multiple objects could be an interesting choice. +- From the empirical results, it is not clear that Deep RLSP works substantially better than the simple average features baseline. + +Overall I think the paper has the potential to be a good paper but could still be substantially improved and I'm leaning towards rejection. 
+ +Minor comments and questions for the authors: +- I'm curious how you choose the gradient magnitude threshold? Does Deep RLSP fail without the curriculum? Could you provide an ablation that shows the effect of using a curriculum? +- I would also be interested in an ablation of the cosine-similarity weighting heuristic. +- I think the phrase recent work in the abstract could use a reference. +- I'm a bit confused by the description of the environment suite by Shah et al. in appendix A, in particular the different rewards. Could you clarify and expand the description a bit? +",5,4.0,ICLR2021 +HJlhOgWZp7,3,H1g6osRcFQ,H1g6osRcFQ,Simple technique with few assumptions for policy transfer. Questions regarding performance and novelty. ,"This paper introduces a simple technique to transfer policies between domains by learning a policy that's parametrized by domain randomization parameters. During transfer CMA-ES is used to find the best parameters for the target domain. + +Questions/remarks: +- If I understand correctly, a rollout of a policy during transfer (i.e. an episode) contains 2000 samples. Hence, 50000 samples in the target environment corresponds to 25 episodes. Is this correct? Does fine-tuning essentially consists of performing 25 rollouts in the target domain? +- It seems that for some tasks, there is almost no finetuning happening whereas SO-CMA still outperforms domain randomization (Robust) significantly? How can this be explained? For example, the quadruped task (Fig 6a) has no improvement for the SO-CMA method, yet it is significantly better than the domain randomization result. It seems that during the first episodes of finetuning, domain randomization and SO-CMA should be nearly equivalent (since CMA-ES will be randomly picking parameters mu). A very similar situation can be seen in Fig 5a +- Following up on my previous question: fig 4a does show the expected behavior (domain randomization and SO-CMA starting around the same value). However, in this case your method does not outperform domain randomization. Any idea as to why this is the case? +- It's difficult to understand how good/bad the performance of the various methods are without an oracle for comparison (i.e. just run PPO in the target environment). +- It seems that the algorithm in this work is almost identical to Hardware Conditioned Policies for Multi-Robot (Tao Chen et al. NIPS 2018), specifically section 5.2 in that paper seems very similar. Please comment. + +Minor remarks: +- fig 5.a y-axis starts at 500 instead of 0. +- The reward for halfcheetah seems low, but this might be due to the custom setup.",7,4.0,ICLR2019 +TVaSAdtEzHH,4,xF5r3dVeaEl,xF5r3dVeaEl,Well-executed study of learning opponent models with a VAE,"This paper addresses interactions in a multi-agent setting without access to the policies of other agents (termed opponents). Each agent has access to local observations only but is still required to cooperate with opponents (which use either heuristic or stochastic RL-trained policies). In each instance of the environment, the agent considered interacts with single opponent, drawn from a fixed set. The idea is then to learn an embedding of each opponent, which is derived from local observations of the agent, and trained by predicting opponent observations, rewards and actions which are made available at training time. + +Overall, I find this to be a well-executed study of learning opponent models with a VAE. 
The write-up is clear and easy to follow, and the experiments show good performance of the method, along with a base- and topline. Ablations on VAE training targets are provided, as well as analysis regarding the learned opponent embeddings and VAE decoder performance. I'm however a bit surprised by the poor performance of the NOM baseline on the speaker-listener environment. The reward is dense, so it seems the main task for the listener (mapping the the available 5-bit message to the respective goal location) should be solvable as-is? + +Some suggestions for further improvement: +- I found Figure 1 confusing as I understand that the actual opponent policies do not have access to the latent variable Z, and that three items are predicted from the latent space: the opponent's actions, observations and rewards. +- For LBF in Figure 3, I would suggest that the authors expand upon the fact that there is no discernible structure in the opponent embeddings. It would have been interesting to see how well the method works if agent and food locations were randomized in the LBF environment so that agents also have to cooperate in exploration. +- I would also suggest to clearly state the dimensionality of the learned embeddings in Appendix D -- from the current text I would assume it is 128? +- In the conclusion, you state: ""LIOM is agnostic to the type of interactions in the environment (cooperative, competitive, mixed) and can model an arbitrary number of opponents simultaneously."" The method presented appears to only support a single opponent per environment instance, so it would be good to clarify this statement.",6,3.0,ICLR2021 +BK_EjD3dSY,4,JI2TGOehNT0,JI2TGOehNT0,Review,"This paper introduces two different interpretations of free energy minimization as a form of behavior cloning and reinforcement learning. + +Strength: +This approach seems to have significant gains on the environments evaluated. +The approach appears novel to my knowledge. + +Weaknesses: +I found that I was confused by the presentation of section 3.1. In particular, I think the authors should clarify the difference between prior and posterior policies in both the RL and imitation learning setting as they appear to be different. + +Why does equation 17 to 18 follow? Isn't the posterior policy different than the policy prior, leading to the likelihood to the next state this is distinct from the prior probability of the next state? + +For equation 22 to 23, is the assumption that the likelihood of the next state is proportional to inverse exponentiated reward? I think the statement should be said in the text. + +How does the approach compare with other approaches that encourage entropy in the policy? Such as something like soft Q learning? Or some type of curiosity? + +Can this approach be evaluated on more realistic environments other than deepmind control? + +Post Rebuttal Update: + +Due to the remaining confusion among reviewers about the equations in the manuscript, I maintain my score. +",5,2.0,ICLR2021 +wFozcC8PTnN,3,_PzOsP37P4T,_PzOsP37P4T,"Simple idea of using Gumbel-Softmax to sample from discrete distributions, experimental part can be expanded and improved","**summary** +the paper proposes a method of learning discrete approximate posterior distributions with potentially unbounded support. The idea is to truncate it to a finite set of states and use Gumbel-Softmax relaxation for samples from the truncated distribution. +The approach is illustrated on VAE example as well as topic modelling. 
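For concreteness, here is a minimal sketch of how I read the core sampling step. This is my own illustration rather than the authors' code: I assume a particular discrete posterior (a Poisson truncated to its first K support points) and PyTorch's gumbel_softmax for the relaxation; the paper's actual parameterisation may differ.

```python
import torch
import torch.nn.functional as F

def truncated_relaxed_sample(log_rate, K=50, tau=0.5):
    # Unnormalised log-probabilities of a Poisson on the truncated support {0, ..., K-1}:
    # log p(k) = k * log(rate) - lgamma(k + 1), up to an additive constant that softmax absorbs.
    k = torch.arange(K, dtype=torch.float32)
    logits = k * log_rate - torch.lgamma(k + 1.0)
    # Gumbel-Softmax relaxation: a soft one-hot vector over the K truncated states.
    soft_one_hot = F.gumbel_softmax(logits, tau=tau, hard=False)
    # Relaxed sample; gradients flow back to log_rate through the relaxation.
    return (soft_one_hot * k).sum()

sample = truncated_relaxed_sample(torch.tensor(0.7, requires_grad=True))
sample.backward()
```

If this reading is right, each relaxed sample costs a single softmax over the K truncated logits, which is consistent with the computational efficiency usually attributed to Gumbel-Softmax.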
+ +**pros** +The idea of using Gumbel-Softmax to relax samples from another discrete distribution isn't new, however the application of this idea to distribution discussed in this paper does look novel. Gumbel-Softmax estimator is known to be computationally efficient and performing well in many situations, so it's perhaps not surprising that it performs well in the current context. Experimental results in topic modelling look very impressive. + +**cons** +* I found it hard to follow the experimental section. In Section 7.2 the authors reference Figurnov et al (2018) for the setup however it seems that the reference only discusses continuous distributions. It is not clear what is the form of the prior and approximating posterior used in VAE experiments. This makes it hard to evaluate the paper. +* I found the proofs that proposed approximation converges to true samples to be of little value: they do not give quantitative bounds on what the truncation region should be for given sample quality. +* I think that comparison with StoRB and UnOrd estimators using only m=1 is not enough, it is possible that increasing m, although adding more computation, will significantly improve overall performance of the final LL. + +comments: +* My suggestion for improving the paper would be to provide much more experimental details, regarding the setup, datasets and baselines in the main text of the paper. At the same time the theoretical part can be reduced given that the idea is very straightforward. +",5,3.0,ICLR2021 +rkeiA3I19H,3,r1eiu2VtwH,r1eiu2VtwH,Official Blind Review #3,"This paper introduces a new method to make ensembles of decision trees differentiable, and trainable with (stochastic) gradient descent. The proposed technique relies on the concept of ""oblivious decision trees"", which are a kind of decision trees that use the same classifier (i.e. a feature and threshold) for all the nodes that have the same depth. This means that for an oblivious decision tree of depth d, only d classifiers are learned. Said otherwise, an oblivious decision tree is a classifier that split the data using d splitting features, giving a decision table of size 2^d. To make oblivious decision trees differentiable, the authors propose to learn linear classifiers using all the features, but add a sparsity inducing operator on the weights of the classifiers (the entmax transformation). Similarly, the step function used to split the data is replaced by a continuous version (here a binary entmax transformation). Finally, the decision function is obtained by taking the outer product of all the scores of the classifiers: [c_1(x), 1-c_1(x)] o [c_2(x), 1-c_2(x)] ... This ""choice"" operator transforms the d dimensional vectors of the classifier scores to a 2^d dimensional vector. Another interpretation of the proposed ""differentiable oblivious decision trees"" is a two layer neural network, with sparsity on the weights of the first layer, +and an activation function combining the entmax transformation and the outer product operator. The authors then propose to combine multiple differentiable decision trees in one layer, giving the neural decision oblivious ensemble (NODE). Finally, several NODE layers can be combined in a dense net fashion, to obtain a deep decision tree model. The proposed method is evaluated on 6 datasets (half classification, half regression), and compared to existing decision tree methods such as XGBoost or CatBoost, as well as feed forward neural networks. 
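To make sure I followed the construction, I wrote out the choice operator described above as a small numpy sketch (my own illustration, not code from the paper): given the d per-depth classifier responses c_i(x) in [0, 1], the 2^d-dimensional output is the iterated outer product of [c_i, 1 - c_i].

```python
import numpy as np

def choice_operator(c):
    # c: d classifier responses in [0, 1], one per depth of the oblivious tree.
    # Returns the 2^d vector whose entry for a leaf (b_1, ..., b_d) is
    # prod_i (c_i if b_i else 1 - c_i); the entries are non-negative and sum to 1.
    out = np.array([1.0])
    for ci in c:
        out = np.outer(out, np.array([ci, 1.0 - ci])).ravel()
    return out

# d = 3 soft splits -> a distribution over the 8 entries of the decision table.
print(choice_operator(np.array([0.9, 0.2, 0.6])))
```

If I understand correctly, the tree response is then a weighted average of the 2^d decision-table entries under this vector, which is what makes the whole ensemble differentiable.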
+ +The paper is clearly written, ideas are well presented, and it is easy to follow the derivation of the method. As a minor comment, I would suggest to the authors to give more details on the EntMax method, as it is quite important for the method, but not really introduced in the paper. The proposed algorithm is sound, and a nice way to make decision trees differentiable. One concern that I have though, is that it seems that NODE are close to fully connected neural networks, with sparsity on the weights. Indeed, I think that there are two ingredients in the paper to derive the method: adding sparsity to the weights and the outer product operator (as described in the previous paragraph). In particular, the improvement over vanilla feed forward neural networks seem small in the experimental section. I thus believe that it would be interesting to study if both two differences with feed forward networks are important, or if only is enough to get better results. + +To conclude, I believe that this is a well written paper, proposing a differentiable version of decision trees which is interesting. However, the proposed method relies on existing techniques, such as EntMax, and I wonder if the (relatively small) improvement compared to feed forward network comes from these. I believe that it would thus be interesting to compare the method with feed forward network with sparsity on the weights. For now, I am putting a weak reject decision, but I am willing to reconsider my rating based on the author response. + +Questions to the authors: +(1) do you use the same data preprocessing for all methods (quantile transform)? +(2) would it make sense to evaluate the effects of each the entmax and the outer product operator separately in the context of fully connected networks? +",3,,ICLR2020 +oQTzbffAw0,4,xtKFuhfK1tK,xtKFuhfK1tK,Importance weighted sampling for GNN distributed training,"This work considers the challenge of distributed training for GNNs. The approach is a locality-aware importance weighted sampling procedure. I was not given much time to read the paper but it seems like a decent contribution, albeit too minor of a contribution to the existing literature to be considered a bonafide research paper. + +### Quality + +- There is nothing clearly wrong with the paper. I did not have time to go through all the equations but I can believe the approach. + +### Clarity + +- The writing is clear and the approach is well described. + +### Originality + +There is not a whole lot of originality. Prior work (appropriately cited) has consider a similar approach. The main difference is the locality of sampling to avoid communication. + +### Significance of this work + +- Important topic, not exciting as a research paper. + +### Pros + +- Scalability of GNNs is a very important topic that deserves more attention. + +- The writing is good; the reader can quickly understand the approach and the main points of the paper. It also helps that the approach is well-known and relatively simple. + +### Cons + +- Experiments: For a work studying distributed training over graphs with ""billions of nodes"", it is certainly disappointing to see that the datasets contain up to 1.1M nodes. + +- The work seems like a direct application of importance weighting and stratified sampling to sampling the neighborhood of node in a GNN. Locality-aware importance sampling is a common approach used in industry, and rather trivial as a method. 
+ + +### Other comments + +- Regarding the bound V, it would be nice to get a sense of its magnitude. It does not look very efficient. What if we performed a push sampling operation (where the node that has x_j will sample it with probability ||x_j||^2 and push it to the servers that need it) rather than the proposed pull sampling (where each node requests the samples)? That way we don't need to guess or bound the value of ||x_j||^2. Just a quick thought. + +------- + +The rebuttal did not meaningfully addressed my concerns. + +Apologies to the authors for not providing a reference for my comment on approaches for reducing communication in graph-optimization methods being widely known in industry. GraphLab is an example (https://en.wikipedia.org/wiki/GraphLab) + +- Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin and J. Hellerstein. GraphLab: A New Framework for Parallel Machine Learning. In the 26th Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, USA, 2010 +- Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin and Joseph M. Hellerstein (2012). ""Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud."" Proceedings of Very Large Data Bases (PVLDB). +- Joseph Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, Carlos Guestrin (2012). ""PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs."" Proceedings of Operating Systems Design and Implementation (OSDI). + +",5,3.0,ICLR2021 +ByL6nRZVg,2,BkV4VS9ll,BkV4VS9ll,,"I did enjoy reading some of the introductions and background, in particular that of reminding readers of popular papers from the late 1980s and early 1990s. The idea of the proposal is straight forward: remove neurons based on the estimated change in the loss function from the packpropagation estimate with either first or second order backpropagation. The results are as expected that the first order method is worse then the second order method which in turn is worse than the brute force method. + +However, there are many reasons why I think that this work is not appropriate for ICLR. For one, there is now a much stronger comprehension of weight decay algorithms and their relation to Bayesian priors which has not been mentioned at all. I would think that any work in this regime would require at least some comments about this. Furthermore, there are many statements in the text that are not necessarily true, in particular in light of deep networks with modern regularization methods. For example, the authors state that the most accurate method is what they call brute-force. However, this assumes that the effects of each neurons are independent which might not be the case. So the serial order of removal is not necessarily the best. + +I also still think that this paper is unnecessarily long and the idea and the results could have been delivered in a much compressed way. I also don’t think just writing a Q&A section is not enough, and the points should be included in the paper. +",3,4.0,ICLR2017 +H1lSw16ch7,3,S1ecm2C9K7,S1ecm2C9K7,"Interesting result, need more comparison in the experiment section, need more explaining of related work","In this paper, the authors studied bias amplification. They showed in some situations bias is unavoidable; however, there exist some situations in which bias is a consequence of weak features (features with low influence to the classifier and high variance). 
Therefore, they used some feature selection methods to remove weak features; by removing weak features, they reduced the bias substantially while maintaining accuracy (In many cases they even improved accuracy). Showing that weak features cause bias is very interesting, especially in their real-world dataset in which they improved bias and accuracy simultaneously. + + +My main concerns about this paper are its related work and its writing. +Authors did a great job in reviewing related work for bias amplification in NLP or vision. +However, they studied bias amplification in binary classification, in particular, they looked at GNB; and they did not review the related work about bias in GNB. I think it is clear that using MAP causes bias amplification. Therefore, I think changing theorem 1 to a proposition and shifting the focus of the paper to section 2.2 would be better. Right now, I found feature orientation and feature asymmetry section confusing and hard to understand. In the paper, the authors claimed bias is a consequence of gradient descent’s inductive bias, but they did not expound on the reasoning behind this claim. Although the authors ran their model on many datasets, there is no comparison with previous work. So it is hard to understand the significance of their work. It is also not clear why they don’t compare their model with \ell_1 regularization in CIFAR. + + +Minor: + +Paper has some typos that can be resolved. +Citations have some errors, for example, Some of the references in the text does not have the year, One paper has been cited twice in two different ways, For more than two authors you should use et al., sometimes \citet and \citep are used instead of each other. +Authors sometimes refer to the real-world experiment without first explaining the data which I found confusing.",6,4.0,ICLR2019 +S1O4Hinlf,3,HkmaTz-0W,HkmaTz-0W,"Throughout visualisation, this paper investigates the ""flat vs sharp dilemma"", the non convexity of the loss surface and the so-called optimisation paths. Nice plots but I would have appreciated a deeper treatment of observations."," +* In the ""flat vs sharp"" dilemma, the experiments display that the dilemma, if any, is subtle. Table 1 does not necessarily contradict this view. It would be a good idea to put the test results directly on Fig. 4 as it does not ease reading currently (and postpone ResNet-56 in the appendix). + +How was Figure 5 computed ? It is said that *a* random direction was used from each minimiser to plot the loss, so how the 2D directions obtained ? + +* On the convexity vs non-convexity (Sec. 6), it is interesting to see how pushing the Id through the net changes the look of the loss for deep nets. The difference VGG - ResNets is also interesting, but it would have been interesting to see how this affects the current state of the art in understanding deep learning, something that was done for the ""flat vs sharp"" dilemma, but is lacking here. For example, does this observation that the local curvature of the loss around minima is different for ResNets and VGG allows to interpret the difference in their performances ? + +* On optimisation paths, the choice of PCA directions is wise compared to random projections, and results are nice as plotted. 
There is however a phenomenon I would have liked to be discussed, the fact that the leading eigenvector captures so much variability, which perhaps signals that optimisation happens in a very low dimensional subspace for the experiments carried, and could be useful for optimisation algorithms (you trade dimension d for a much smaller ""effective"" d', you only have to figure out a generating system for this subspace and carry out optimisation inside). Can this be related to the ""flat vs sharp"" dilemma ? I would suppose that flatness tends to increase the variability captured by leading eigenvectors ? + + +Typoes: + +Legend of Figure 2: red lines are error -> red lines are accuracy +Table 1: test accuracy -> test error +Before 6.2: architecture effects -> architecture affects",5,3.0,ICLR2018 +iTGQtmUBFBA,3,389rLpWoOlG,389rLpWoOlG,"Good idea, but flawed execution","The paper presents an empirical comparison of different approaches for data +labeling. The authors describe their experimental setup and findings, making +recommendations for when to use what approach in practice. + +The authors reference their own anonymous work throughout the paper as +justification for the presented investigation and its parameters. This is +problematic as the reviewers are now unable to confirm that the presented +investigation is well-grounded. + +The authors evaluate their approaches on only six datasets. It is unclear to +what extent the results generalize, in particular as no detailed results per +dataset are given. There could be significant differences between the different +types of datasets, but not enough data is presented to judge. This matters in +particular with respect to the recommendations the authors make at the end of +the paper. + +Some details of the experimental setup are unclear. The authors say that they +measure F1 score, but then refer to accuracy (e.g. in Figure 1). Which measure +was used? The experimental setup describes six datasets, but the results text +refers to seven. The results presented in Table 1 and Figure 1 seem to disagree +with Table 2 -- LabelSpreadingKNN is the highest-ranked algorithm, but +UncertaintySampling performs better in terms of all the statistics presented in +Table 1. The same is true for the second set of experiments (Tables 3 and 4). +For the first set of experiments it is unclear what fraction of labels were +missing. + +It is unclear why the Bradley-Terry model was used here to compare outcomes. +There are multiple other methods to judge how and whether paired distributions +differ. It appears that only ranks were used for this comparison and not the +actual performance numbers. + +Finally, all methods evaluated by the authors have hyperparameters that need to +be set. It is unclear how the authors chose the particular values they used in +the experiments, and tuning them for best performance may have a major impact on +their performance and the rankings. Conclusions from untuned methods are +unlikely to generalize. + +There are numerous typos and grammatical mistakes throughout the paper.",4,4.0,ICLR2021 +B1gcmtH15B,2,rJg851rYwH,rJg851rYwH,Official Blind Review #2,"Overall, this work empirically evaluates different techniques used in privacy learning and suggest useful methods to stabilize or improve performance. + +Detail comments: + +Strength: +Despite the progress of privacy-preserving learning in theory, there are few works providing learning details for better training. 
Especially, considering the instability in perturbation-based private algorithms, e.g., most DP ones, the work could be valuable in the sense of practice. + +Weakness: +As far as empirical research, the compared techniques are too few. What if we use those less popular techniques, for example, RMSprop optimization method? + +The model capacity of neural networks, especially deep networks, has some non-trivial relation to the number of filters or the number parameters. It is important to quantify such relation. A good reference might be [A]. Briefly, the generalization performance may not be monotonic against the number of parameters. + +The baselines are not enough. Of course, Abadi et al.’s work is outstanding in handling the privacy learning of deep networks. It has been further developed by the following researchers. For example, [B] and [C]. Does the conclusion still hold for these algorithms? + +[A] Neyshabur, B., Bhojanapalli, S., Mcallester, D., & Srebro, N. (2017). Exploring Generalization in Deep Learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 30 (pp. 5947–5956). +[B] Yu, L., Liu, L., Pu, C., Gursoy, M. E., & Truex, S. (2019). Differentially Private Model Publishing for Deep Learning. Proceedings of 40th IEEE Symposium on Security and Privacy. +[C] Phan, N., Vu, M. N., Liu, Y., Jin, R., Dou, D., Wu, X., & Thai, M. T. (2019). Heterogeneous Gaussian Mechanism: Preserving Differential Privacy in Deep Learning with Provable Robustness. Proceedings of the Twenty-Eighth International Joint Conference on Artificial ",3,,ICLR2020 +HJgJ3w4LhQ,1,r1lpx3A9K7,r1lpx3A9K7,Performs worse than adversarial training,"This paper presents a new adversarial defense based on ""cleaning"" images using a round trip through a bidirectional gan. Specifically, an image is cleaned by mapping it to latent space and back to image space using a bidirectional gan. To encourage the bidirectional gan to focus on the semantic properties, and ignore the noise, the gan is trained to maximize the mutual information between z and x, similar to the info gan. + +Pros: + 1. The paper presents a novel (as far as I am aware) way to defend against adversarial attacks by cleaning images using a round trip in a bidirectional gan + +Cons: + 1. The method performs significantly worse than existing techniques, specifically adversarial training. + a. The authors argue ""Although better than FBGAN, adversarial training has its limitation: if the attack method is harder than the one used in training(PGD is harder than FGSM), or the perturbation is larger, then the defense may totally fail. FBGAN is effective and consistent for any given classifier, regardless of the attack method or perturbation."" + b. I do not buy their argument, however, because one can simply apply the strongest defense (PGD 0.3 in their results) and this outperforms their method in *all* attack scenarios. And if someone comes out with a new stronger attack there's no guarantee their method will be strong defense against that method + 2. The paper is not written that well. Even though the technique itself is very simple, I was unable to understand it from the introduction, and didn't really understand what they were doing until I reached the 4th page of the paper. 
+ + +Missing citation: +PixelDefend: Leveraging Generative Models to Understand and Defend against Adversarial Examples (ICLR 2018) +",3,4.0,ICLR2019 +B1x_HUU8cH,4,Hke-WTVtwr,Hke-WTVtwr,Official Blind Review #3,"This paper makes present an original way to encode the position of the token when encoding them in a sequence. The classical additive encoding of positions creates several issues, such as the lack of flexibility when dealing with pooling layers, and the authors refer it as the position-independence problem. + +Instead, the proposed approach is based on the encoding of a term-specific frequency (through the complex argument) and modulus in the complex-space, applied once per embedding dimension. This enables the embedding of a word to be dependent on the position in a non-linear manner. The intuition is similar to the use of complex numbers in signal analysis. + +Sorry this is not scientifically, but I have to mention that I find the axiomatic derivation of the approach simply beautiful. It is amazing to find such a simple formula from two obvious properties that someone would want from a positional encoding: Position-free offset transformation and boundedness to handle arbitrary length. + +The fact that the offset does not have positive effect is interesting, and the discussion about it is limited. I would assume it is due to some redundancy in the other two parameters, but more experiences would be needed. + +The rest of the paper shows impressive results, both for text-classification and for machine translation, with a clear comparison with the state-of-the-art. The gains are really significant, providing a clear validation of the approach. + +In short, it is quite rare to find such a clear and simple idea with so much empirical gains. I would love to meet the authors once the review period is over. +",8,,ICLR2020 +Skxh_3ORFS,1,r1ln504YvH,r1ln504YvH,Official Blind Review #2,"Summary: +This work proposes a method to perform clustering on a time-series data for prediction purposes, unlike the classical clustering where it is done in an unsupervised manner. The authors use an encoder (RNN) to process the time-series medical records, a selector to sample the cluster label for each encoding, and a predictor to predict the labels based on the selected cluster centroids. Since the sampling process prevents the authors from using back-prop, they employ an actor-critic method. + +Strengths: +- Although some would argue otherwise, patient similarity has some promise to be useful in clinical practice. +- The proposed method clearly outperformed various baselines in terms of clustering (Table 1). +- Table 3 and Figure 3 show that the proposed method can capture heterogeneous subgroups of the dataset. + +Concerns: +- I'm not a clustering expert, but I'm skeptical this is the first work to combine clustering and supervised prediction using an RL technique. +- It is unclear what it means to train the embedding dictionary. Are there trainable parameters in the embedding dictionary? It seems that all it does is calculate the mean of the z_t's (i.e. centroid) in each cluster. Or do you take the centroid embeddings and put that through a feed-forward network of some sort? +- The effect of Embedding Separation Loss (Eq.5) seems quite limited. According to Table 2, it doesn't seem to help much. And contrary to the authors' claim, using \beta increase the number of activated clusters from 8 to 8.4. 
+- Most importantly, the central theme of this work is combining clustering with prediction labels for the downstream prediction task. But the authors do not compare the prediction performance of the proposed method with other clustering method or ""patient similarity"" methods, or even simple supervised models. The only prediction performance metric to be found is Table 2 from the ablation study.",3,,ICLR2020 +Hyl_lXQF3X,1,HkgqFiAcFm,HkgqFiAcFm,"Fun, albeit incremental paper","Summary + +This paper derives a new policy gradient method for when continuous actions are transformed by a +normalization step, a process called angular policy gradients (APG). A generalization based on +a certain class of transformations is presented. The method is an instance of a +Rao-Blackwellization process and hence reduces variance. + + +Detailed comments + +I enjoyed the concept and, while relatively niche, appreciated the work done here and do believe it has clear applications. I am not convinced that the measure theoretic perspective is always +necessary to convey the insights, although I appreciate the desire for technical correctness. Still, +appealing to measure theory does reduces readership, and I encourage the authors to keep this in +mind as they revise the text. + +Generally speaking it seems like a lot of technicalities for a relatively simple result: +marginalizing a distribution onto a lower-dimensional surface. + +The paper positions itself generally as dealing with arbitrary transformations T, but really is +about angular transformations (e.g. Definition 3.1). The generalization is relatively +straightforward and was not too surprising given the APG theory. The paper would gain in clarity +if its scope was narrowed. + +It's hard for me to judge of the experimental results of section 5.3, given that there are no other +benchmarks or provided reference paper. As a whole, I see APG as providing a minor benefit over PG. + +Def 4.4: ""a notion of Fisher information"" -- maybe ""variant"" is better than ""notion"", which implies there are different kinds of Fisher information +Def 3.1 mu is overloaded: parameter or measure? +4.4, law of total variation -- define + + +Overall + +This was a fun, albeit incremental paper. The method is unlikely to set new SOTA, but I appreciated +the appeal to measure theory to formalize some of the concepts. + + +Questions + +What does E_{pi|s} refer to in Eqn 4.1? +Can you clarify what it means for the map T to be a sufficient statistic for theta? (Theorem 4.6) +Experiment 5.1: Why would we expect APG with a 2d Gaussian to perform better than a 1d Gaussian +on the angle? + + +Suggestions + +Paragraph 2 of section 3 seems like the key to the whole paper -- I would make it more prominent. +I would include a short 'measure theory' appendix or equivalent reference for the lay reader. + +I wonder if the paper's main aim is not actually to bring measure theory to the study of policy +gradients, which would be a laudable goal in and of itself. ICLR may not in this case be the right +venue (nor are the current results substantial enough to justify this) but I do encourage authors to +consider this avenue, e.g. in a journal paper. + += Revised after rebuttal = + +I thank the authors for their response. I think this work deserves to be published, in particular because it presents a reasonably straightforward result that others will benefit from. However, I do encourage further work to +1) Provide stronger empirical results (these are not too convincing). 
+2) Beware of overstating: the argument that the framework is broadly applicable is not that useful, given that it's a lot of work to derive closed-form marginalized estimators. +",7,4.0,ICLR2019 +BkxEvnt337,2,rye7knCqK7,rye7knCqK7,"In this work, the authors propose an interesting gating scheme allowing agents to communicate in an multi-agent RL setting. ","From a methodological perspective, this paper describes a simple bu clever learning architecture with individual agents able to decide when to communicate through a learned gating mechanism. Each agent is an LSTM able to decide at each time point which aspects of its internal state should be exposed to other agents through this gating mechanism. The presentation of this method is clear to a level that should allows the reader to implement this him/herself. It would be great if the code associated to this could be released but the presentation allows for reproducibility. + +The experiments are interesting as well. Experimental results are presented on 3 problems and compared with known baselines from the academic community. The obtained results do show the merit of the approach. That being said, while the experimental results are extensive, there are places that could benefit from more clarity. For instance, I have found section 4.2 a bit dry. For instance, I had to read the plots caption and the text several times to map get at the deductions made in 4.2. Given the importance of gating in this work, I recommend expanding on this a bit (if space allows it). Small note: in the caption for Figure 3, on the fourth line, did you mean (f) instead of (d) when arguing that agents stop communicating once they reach the prey ( or am I missing something here)? Also, would it be possible to provide more insights on why IC3Net is doing better than CommNet except for the Combat-10Mv3Ze task (last table before the conclusion, what makes this task harder for IC3Net)? Another observation is on the variance terms that are reported for IC3Net. They are often (not always but definitely in the last table before the conclusion) quite higher when compared to the values associated with the baselines. Can this be explained? Another small thing: please add captions to your tables (at least a table number; I think that Table 2 does not have a caption). + + +Overall, the paper is well written, interesting. Addressing the questions raised above would definitely help me and probably the eventual readers better appreciate its quality. ",6,3.0,ICLR2019 +tmL3qrHYaB8,4,pBDwTjmdDo,pBDwTjmdDo,Official Blind Review #3,"This paper presents a new method called Fourier temporal state embedding. The motivation for this approach is unclear and should be appropriately justified. In the abstract and introduction, the claim appears to be that previous methods are not time nor memory-efficient, and therefore FTSE is proposed. But this is obviously not true. So it is unclear what this new approach offers compared to the state-of-the-art. In Table 1, why not report performance for more standard baselines like CTDNE? The clarity and writing of this work require significant improvement. There are many incomplete and incorrect sentences throughout the paper that make it difficult if not impossible to understand. In the problem formulation, CTDG and DTDG were originally introduced in the CTDNE paper, but instead a more recent 2020 paper is referenced. Many of the ideas are never fully explained properly. 
The labels in nearly all the figures need to be appropriately sized, as they are impossible to read. + +The motivation of this work needs to be stated clearly. The contribution and differences between existing work need to be clarified and discussed appropriately as well. In addition, some related work on temporal network representation learning is missing and should be appropriately discussed. + +Overall, the approach is interesting, the results seem promising, but more work is needed to better position and motivate it. Additional experiments and details would further strengthen it as well. +",5,5.0,ICLR2021 +SygoXqOR5r,3,Bke89JBtvB,Bke89JBtvB,Official Blind Review #1,"This paper's focus is on conditional channel-gated networks. Conventional ConvNets process images by computing all the filters, which can be redundant since not all the filters are necessary for a given image. To eliminate this redundancy, this work aims at computing a channel gating on-the-fly, to determine what filters can be turned off. The core contribution of the paper is to propose a ""batch-shaping"" technique that regularizes the channel gating to follow a beta distribution. Such regularization forces channel gates to either switch on or off. Combined with l_0 regularization, the proposed training technique improves the performance of channel gating: ResNet trained with this technique can achieve higher accuracy with lower theoretical MACs. + +Overall, the paper proposes a simple yet effective trick for training gated networks. The paper is well written, and experiments are sufficient in demonstrating the effectiveness of the method. + +The main concern for the paper is whether such granular control on the convolution filters can be practically useful. For Conventional ConvNets whose computation is fixed regardless of the input, scheduling the computation on the hardware static and therefore can be easily optimized. When it comes to dynamic networks, especially at such a granular level, it is not clear whether the theoretical complexity reduction can directly translate to actual efficiency (such as latency) improvement. In section 5.2, the author mentions "" We simulated this for the GPU in the same table."". Can you elaborate on how you ""simulated"" the GPU time? How is the simulation done? How well does it predict the actual implementation? Can you implement an efficient kernel for this and show the actual speedup? For the CPU runtime, can you explain in more detail the experimental setting? Can you report the actual latency improvement against theoretical FLOP reduction? For the result in Table 1, why the result of the original ResNet50 is not reported? ",6,,ICLR2020 +SJeod5c_6X,4,BygNqoR9tm,BygNqoR9tm,Good motivation but empirical evidence shows limited improvements.,"The paper introduces a new cost function for training Wasserstein Autoencoders that combines reconstruction error with Sinkhorn distance on the latent space. Authors provide nice theoretical motivation, yet empirical results seem incremental and do not fully support the effectiveness of this approach. + +Pros: +- Theorem 3.1 (although trivial) provides motivation for optimizing Wasserstein distance in the latent space in WAEs. +- Theorem 3.2 shows sufficiency of optimization over deterministic encoders in WAEs. +- The proposed SAE virtually does not favor any prior and can preserve some aspects of geometry of the original space. + +Cons: +- It is unclear why Sinkhorn algorithm would provide better estimate of Wasserstein distance than e.g. 
adversarial WGANGP (which would be a variant of GAN-WAE). Sinkhorn convergence is discussed only in terms of sample size and smoothing regularizer, not in the context of batch training. +- Quantitative results are on par or marginally better than other methods, they also lack some comparisons (see details below). +- There is no comparison to relevant models outside VAE scope, e.g. ALI [4]. + +The novelty of this paper is combining WAEs with Sinkhorn algorithm. Overall, it has potential, but the proposed method would probably require clearer evaluation. + +Detailed issues: +- Notation for posterior seems somewhat inconsistent and misleading, namely push-forward G#P_Z = P_G, while Q#P_X = Q_Z. +- It is unclear why MMD or GAN losses on WAS's latent space are referred to as heuristics, each of these constitutes a divergence in the same way as the proposed Sinkhorn distance. +- FID scores for MNIST are incomparable due to the use of own network; using LeNet has been proposed [3]. +- It is unclear what ‘Empirical lower bounds’ for MMD mentioned in Table 1. caption mean, as unbiased MMD estimator (e.g. [2]) is available. On the other hand, FID is known to be biased [3], so test-set FID should be provided for comparison. +- Table 2. lacks comparison of SAE with normal prior even though a) authors note that MMDs are incomparable with different priors, b) SAEs is claimed to be prior-agnostic, c) in such setting MMD-WAE might be advantageous [1]. Again, no test-set FID scores. +- Samples in Figure 2 too small. +- MMD lacks citation (e.g. [2]). + +Typos: +p.6 line 3 construcetion -> construction +p.6 line 30 Hypersherical -> Hyperspherical +P.8 line 1 this a sign -> this is a sign + +[1] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schölkopf. Wasserstein Auto-Encoders. ICLR 2018. +[2] Arthur Gretton, Karsten M. Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alex J. Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13, 2012a. +[3] Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, Arthur Gretton. Demystifying MMD GANs. ICLR 2018. +[4] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky and Aaron Courville. Adversarially Learned Inference. ICLR 2017 +",5,4.0,ICLR2019 +r1eSaciL27,1,B1ePui0ctQ,B1ePui0ctQ,"learning binary weight neural networks using a structured variational approximation, gradients estimated using modified reinforce","Summary: The paper considers a variational inference strategy for learning neural networks with binary weights. In particular, the paper proposes using a structured recognition model to parameterise the variational distribution, which couples the weights in different layers/filters in a non-trivial way. The gradient of the expected likelihood term in the variational lower bound is estimated using the REINFORCE estimator. This paper adjusts this estimator to use the gradient of the log-likelihood wrt the samples. Experiments on several image classification tasks are provided. + +evaluation: + +pros: +- the idea of the proposed approach is interesting: using variational inference for binary weight neural networks. While recent work on VI for discrete variables only focused on discrete latent variable models, this work shows how VI can be used for binary neural networks. + +cons: +- the writing, in my opinion, needs to be improved [see my comments below]. The VI presentation is cluttered and the justification of using the pseudo-reward for reinforce is not clear. 
+- the experimental results are mixed and it's not clear to me how to interpret them/compare to the baselines -- what is the goal here: computational efficiency, compression or accuracy? + +Some specific questions/comments: + ++ What is the input of the policy/recognition network? It's not clear from the paper whether this includes the inputs of the current batch or outputs or both? If so, how are variable batch sizes handled? What is the input to this network at test time? In contrast to generative models/VAEs, the weights here are global parameters and it's not clear to me these should be varied for different data batches. + ++ related to the question above: how is prediction handled at test time? Say the parameters of the variational distribution over weights are generated using the recognition network, then 100 weights are sampled given these parameters which then give 100 predictions -- should these be then averaged out to get the final prediction? I'm not quite sure I understand why the paper chose to *pick the best one* out of 100 predictions and the justification/criterion for this procedure. + ++ The writing is not very clear at places, and it does not help that the references being merged with the text. I'm also not sure about some of the technical jargons/terms used in the papers: +- reinforcement learning: is this really a reinforcement learning problem? If you tackle this problem from a pure variational perspective, reinforce is used to obtain the gradient of the expected log-likelihood wrt the variational parameters. But instead of using the log likelihood, a learning signal that depends on the gradient of the log-likelihood is used. +- concrete weights -- what are these? I assume they are just binary weights sampled from the variational approximation. +- middle of page 3: p(w|X, Y) = p_\theta(w): this is not precise as p_\theta(w) is only an approximation to the exact posterior, which then allows us to lower bound the log marginal likelihood. ""common practice in modern variational approximation"": This is the standard way of deriving the lower bound and has been used for many years. + ++ the reinforce estimator tends to have high variances since it does not make use of the gradient of the function in the expectation. This paper adjusts the vanilla estimator with a learning signal that involves the gradient. Could you comment on the bias/variance trade-off of the resulting estimator? Much of recent literature on learning discrete variables, as far as I understand, propose ways to not to have to use the vanilla reinforce, for example Concrete, Relax or rebar, albeit the focus on latent variable models. + ++ model selection and uncertainty measure: the paper mentions these potential advantages of the proposed approach over deterministic binarisation schemes, but does not fully explore and test these. + +",5,4.0,ICLR2019 +HJx3fjF_cr,3,Hkx6p6EFDr,Hkx6p6EFDr,Official Blind Review #5,"This paper proposes Equivariant Entity-Relationship Networks, the class of parameter-sharing neural networks derived from the entity-relationship model. + +Strengths of the paper: +1. The paper is well-written and well-structured. +2. Representative examples, e.g., the Entity-Relationship diagram in Figure 1, are used to demonstrate the proposed algorithms. +3. Detailed proofs for some equations are provided for better understanding the proposed equivariant entity-relationship networks. + +Weaknesses of the papers: +1. No effective baselines are used for comparisons in the experiments. 
Are there state-of-the-art algorithms that have been proposed by other researchers to be used as baselines in the experiments? +2. No effective real-world datasets are used in the experiments. The authors only take synthesized toy dataset in their experiments. Are there other real-world datasets to be used in the experiments? +3. In terms of missing record prediction, why do the authors embed, e.g., the COURSE, in this way but not the other ways? What are the motivations of embedding like this? Are there other embedding techniques, e.g., Matrix Factorization and Skip-gram frameworks like that in Word2VEC, can be used for your purposes? +",3,,ICLR2020 +HyxiMQrDh7,1,S1grRoR9tQ,S1grRoR9tQ,The proposed SGLD-SA algorithm with its convergence properties is interesting,"* The proposed SGLD-SA algorithm, together with its convergence properties, is very interesting. The introduction of step size $w^{k}$ is very similar to the ""convex combination rule"" in (Zhang & Brand 2017) to guarantee convergence. + +* It seems that this paper only introduced Bayesian inference in the output layers. It would be more interesting to have a complete Bayesian model for the full network including the inner and activation layers. + +* This paper imposed spike-and-slab prior on the weight vector which can yield sparse connectivity. Similar ideas have been explored to compress the model size of deep networks (Lobacheva, Chirkova and Vetrov 2017; Louizos, Ullrich and Welling 2017 ). It would make this paper stronger to compare the sparsification and compression properties with the above work. + +* In equation (11) there is a summation from $\beta_{p+1}$ to $\beta_{p+u}$. I wonder where this term comes from, as I thought $\beta$ is a vector of dimension $p$. + +Reference: +Zhang, Ziming, and Matthew Brand. ""Convergent block coordinate descent for training tikhonov regularized deep neural networks."" Advances in Neural Information Processing Systems. 2017. + +Lobacheva, Ekaterina, Nadezhda Chirkova, and Dmitry Vetrov. ""Bayesian Sparsification of Recurrent Neural Networks."" arXiv preprint arXiv:1708.00077 (2017). + +Louizos, Christos, Karen Ullrich, and Max Welling. ""Bayesian compression for deep learning."" Advances in Neural Information Processing Systems. 2017. + +",6,5.0,ICLR2019 +Hyemr5VVtB,1,Syx79eBKwr,Syx79eBKwr,Official Blind Review #1,"The paper gives a big picture view on training objectives used to obtain static and contextualized word embeddings. This is very handy since classical static word embeddings, such as SGNS and GloVe, have been studied theoretically in a number of works (e.g., Levy and Goldberg, 2014; Arora et al., 2016; Hashimoto et al., 2016; Gittens et al., 2017; Allen and Hospedales, 2019; Assylbekov and Takhanov, 2019), but not much has been done for the modern contextualized embedding models such ELMo and BERT - I personally know only the work of Wang and Cho (2019), and please correct me if I am wrong. + +""There is nothing as practical as a good theory"", and the authors confirm this statement: their theory suggests them to modify the training objective of the masked language modeling in a certain way and this modification proves to benefit the embeddings in general when evaluated on standard tasks. + +I don't have any major issues to raise. A minor comment is that the mutual information I(., .) being a function of two variables suddenly became a function of a single variable in Eq. 
(1) and in the text which precedes it.",8,,ICLR2020 +S1PF1UKxG,1,HktK4BeCZ,HktK4BeCZ," The paper relates the theories of Mean Field Games and Reinforcement Learning within the classic context of Markov Decision Processes. The method suggested uses inverse RL to learn both the reward function and the forward dynamics of the MFG from data, and its effectiveness is demonstrated on social media data. ","The paper considers the problem of representing and learning the behavior of a large population of agents, in an attempt to construct an effective predictive model of the behavior. The main concern is with large populations where it is not possible to represent each agent individually, hence the need to use a population level description. The main contribution of the paper is in relating the theories of Mean Field Games (MFG) and Reinforcement Learning (RL) within the classic context of Markov Decision Processes (MDPs). The method suggested uses inverse RL to learn both the reward function and the forward dynamics of the MFG from data, and its effectiveness is demonstrated on social media data. +The paper contributes along three lines, covering theory, algorithm and experiment. The theoretical contribution begins by transforming a continuous time MFG formulation to a discrete time formulation (proposition 1), and then relates the MFG to an associated MDP problem. The first contribution seems rather straightforward and appears to have been done previously, while the second is interesting, yet simple to prove. However, Theorem 2 sets the stage for an algorithm developed in section 4 of the paper that suggests an RL solution to the MFG problem. The key insight here is that solving an optimization problem on an MDP of a single agent is equivalent to solving the inference problem of the (population-level) MFG. Practically, this leads to learning a reward function from demonstrations using a maximum likelihood approach, where the reward is represented using a deep neural network, and the policy is learned through an actor-critic algorithm, based on gradient descent with respect to the policy parameters. The algorithm provides an improvement over previous approaches limited to toy problems with artificially created reward functions. Finally, the approach is demonstrated on real-world social data with the aim of recovering the reward function and predicting the future trajectory. The results compare favorably with two baselines, vector auto-regression and recurrent neural networks. +I have found the paper to be interesting, and, although I am not an expert in MFGs, novel and well-articulated. Moreover, it appears to hold promise for modeling social media in general. I would appreciate clarification on several issues which would improve the presentability of the results. +1) The authors discuss on p. 6 variance reduction techniques. I would appreciate a more complete description or, at least, a more precise reference than to a complete paper. +2) The experimental results use state that “Although the set of topics differ semantically each day, indexing topics in order of decreasing initial popularity suffices for identifying the topic sets across all days.” This statement is unclear to me and I would appreciate a more detailed explanation. 
+3) The authors make the following statement: “ … learning the MFG model required only the initial population distribution of each day in the training set, while VAR and RNN used the distributions over all hours of each day.” Please clarify the distinction and between the algorithms here. In general, details are missing about how the VAR and RNN were run. +4) The approach uses expert demonstration (line 7 in Algorithm 1). It was not clear to me how this is done in the experiment. +",8,3.0,ICLR2018 +UgfxVCRYdv,1,tkAtoZkcUnm,tkAtoZkcUnm,Good paper overall; would have liked more discussion on the assumptions,"The paper proposes neural thompson sampling (TS) - a method to run TS without assuming that the reward is a linear function of the context, as is generally assumed in literature. This is not the first paper to use neural networks for TS, however existing papers either a) used TS only in the last layer, or b) maintained uncertainty over the weights and sampled the entire neural network. This paper instead maintains a single network that computes the mean of the reward distribution of an arm. + +The paper also is the first paper to provide regret guarantees for a neural TS algorithm. Experiments show that their algorithm performs better than other baselines. + +My concern with the theoretical results is a missing discussion on their utility with respect to the assumptions. The necessary assumption for all results in the paper is Condition 4.1, which assumes that m, the width of the network is larger than T^6 L^6 K^6. With T as the horizon, this assumes a neural network width of 10^18 even for a modest horizon of 1000 (as used in the experiments). I don't think the experiments used this width in their implementation. I would like if the authors point out this disconnect for the benefit of the readers, and have a discussion section. I believe this assumption may be necessitated by the use of NTK. + +Second, the algorithm uses the function g to model the variance of the distribution, but I did not find any discussion. Is g assumed to be known? If not, how is it learned? + +I like that the authors study the delayed reward experiments as it is often the case in practical situations. What will also be useful is to discuss the implementation complexity (computation required to decide the next arm to be sampled) of various algorithms (ideally through a plot). Some algorithms may be faster than others, and readers can use this plot to make an informed choice. + +",7,3.0,ICLR2021 +SkNrPRFgM,3,B1suU-bAW,B1suU-bAW,"This paper proposes a covariate aware tensor embedding for text corpora that learns a shared embedding and how different contexts can modify the embedding. The authors show the method recovers interpretable latent embeddings from two text corpora, however, some of the experimental results seem less convincing.","This paper presents an embedding algorithm for text corpora that allows known +covariates, e.g. author information, to modify a shared embedding to take context +into account. The method is an extension of the GloVe method and in the case of +a single covariate value the proposed method reduces to GloVe. The covariate-dependent +embeddings are diagonal scalings of the shared embedding. The authors demonstrate +the method on a corpus of books by various authors and on a corpus of subreddits. +Though not technically difficult, the extension of GloVe to covariate-dependent +embeddings is very interesting and well motivated. 
Some of the experimental results +do a good job of demonstrating the advantages of the models. However, some of the +experiments are not obvious that the model is really doing a good job. + +I have some small qualms with the presentation of the method. First, using the term +""size m"" for the number of values that the covariate can take is a bit misleading. +Usually the size of a covariate would be the dimensionality. These would be the same +if the covariate is one hot coded, however, this isn't obvious in the paper right now. +Additionally, v_i and c_k live in R^d, however, it's not really explained what +'d' is, is it the number of 'topics', or something else? Additionally, the functional +form chosen for f() in the objective was chosen to match previous work but with no +explanation as to why that's a reasonable form to choose. Finally, the authors +say toward the end of Section 2 that ""A careful comparision shows that this +approximation is precisely that which is implied by equation 4, as desired"". This is +cryptic, just show us that this is the case. + +Regarding the experiments there needs to be more discussion about how the +different model parameters were determined. The authors say ""... and after tuning +our algorithm to emged this dataset, ..."", but this isn't enough. What type of +tuning did you do to choose in particular the latent dimensionality and the +learning rate? I will detail concerns for the specific experiments below. + +Section 4.1: +- How does held-out data fit into the plot? + +Section 4.2: +- For the second embedding, what exactly was the algorithm trained on? Just the + book, or the whole corpus? +- What is the reader supposed to take away from Table 1? Are higher or lower + values better? Maybe highlight the best scores for each column. + + +Section 4.3: +- Many of these distributions don't look sparse. +- There is a terminology problem in this section. Coordinates in a vector are + not sparse, the vector itself is sparse if there are many zeros, but + coordinates are either zero or not zero. The authors' use of 'sparse' when + they mean 'zero' is really confusing. +- Due to the weird sparsity terminology Table 1 is very confusing. Based on how + the authors use 'sparse' I think that Table 1 shows the fraction of zeros in + the learned embedding vectors. But if so, then these vectors aren't sparse at all + as most values are non-zero. + +Section 5.1: +- I don't agree with the authors that the topics in Table 3 are interpretable. + As such, I think it's a reach to claim the model is learning interpretable topics. + This isn't necessarily a problem, it's fine for models to not do everything well, + but it's a stretch for the authors to claim that these results are a positive + aspect of the model. The results in Section 5.2 seem to make a lot of sense and + show the big contribution of the model. + +Section 5.3: +- What is the ""a : b :: c : d"" notation? +",5,4.0,ICLR2018 +isTA35yheEL,3,6FqKiVAdI3Y,6FqKiVAdI3Y,Superlative work,"This works motivates the use of a factorized critic for multi-agent policy gradient. The technique is well-motivated, and the exposition anticipates and answers readers' likely concerns. The experiment section is well-organized, supports the paper's major claims, and is empirically compelling. + +The policy improvement claims in section 4.1.2 are initially unintuitive, but ultimately are intelligible as an agent-block-coordinate local optimality statement. 
However this reviewer is not clear on the quality of these local optima (i.e., when do we get ""trapped""?). For example, is it possible to design a task where the local optima are all very poor? Of course, the experiment section indicates many benchmark tasks are amenable to this decomposition; but perhaps reasoning about this would help in (re)defining multi-agent problems to encourage success, e.g., it would be interesting if adding actions that communicate information directly between agents mitigates the local optima problem. + + + + ",9,4.0,ICLR2021 +rylwfsfRFH,2,SkxcZCNKDS,SkxcZCNKDS,Official Blind Review #1,"Summary : +The paper discusses the use of maximum entropy in Reinforcement Learning. Specifically, it relates the solution of the maximum entropy RL problem to the solutions of two different settings, 1) a ‘meta-POMDP’ regret minimization problem and 2) a ‘robust reward control’ problem. Both cases follow with simple experiments. + +I feel the paper could have been written more clearly. There seem to be too many definitions and descriptive examples that diverge the attention of the reader from the main problem setting. There are quite a bit of grammatical errors in the paper, making it even harder to follow. With these many definitions in the text, it is hard to make out the actual contributions of the work. Moreover, the experiments are restricted to the bandit setting and do not provide any empirical evidence on the MDP centered theory. Overall, although the paper does well in motivating the problem, the lack of rigorous experiments and poorly structured writing advocate for a weak rejection. + + +Comments/questions: +- Can the authors comment on why it makes intuitive sense to study the meta-POMDP and robust reward control problem settings together? I see the commonality being the reward variability, but is there something else? +- If one wants to solve the meta-POMDP through max entropy RL, how general/strong is the assumption that we are given access to the target trajectory belief? +- In the goal reaching meta-POMDP, it makes sense to only have the final state distribution in the definition. What does the action taken in the final state signify? +- It would be more intuitive to note the optimal solution as pi* and not pi (Lemma 4.1). +- In the meta-POMDP, does the task change after every meta-episode? +- I think it would be better to have separate, consistently named subsections devoted to defining the two problem settings and then move on to proving equivalence with the max entropy case. +",3,,ICLR2020 +XJlSSILJ9Mh,2,LtgEkhLScK3,LtgEkhLScK3,Interesting empirical findings for the DRL community,"I would like to thank the authors of ""Probabilistic Mixture-of-Experts for Efficient Deep Reinforcement Learning "" for their valuable submission. + +Summary of the paper +- +The paper proposes an end-to-end method to train probabilistic mixture-of-experts policies in RL agents. They show that the approach can be applied in the context of popular on-policy and off-policy algorithms, and that it compares favourably (performance and sampling efficiency) to using the same algorithms to train the corresponding unimodal policies. Furthermore they perform an empirically analysis of the individual resulting components and of the impact of these on exploration, + +Assessment +- + +-- The positives -- + +The proposed approach is sound and seems to work well in practice. 
+The empirical evaluation is extensive - the fact that the paper evaluates the proposed approach in combination with different baseline algorithms makes the findings more robust. The analysis is overall quite insightful, especially in terms of understanding the diversity of individual components and the impact of backprop-max vs backprop-all. + +-- The concerns -- + +The paper notes that the mixture of experts seems especially beneficial in high dimensional problems (such as continuous control). This seems an important claim, but it is not clearly backed up. It would be useful to make this statement more quantitative by plotting the improvement in performance over the baseline as a function of the number of dimensions. + +The paper notes that the mixture of experts might help by improving exploration. It would be therefore interesting to include a parameter study showing the performance of the baseline algorithm for different amounts of entropy regularisation. This would help to compare the proposed approach to a simpler way of tuning exploration, assess whether a mixture of experts delivers further benefits on top of this (e.g. by providing “deeper” exploration), and allow the reader to compare the sensitivity to the parameter K of the proposed approach to the sensitivity of the baseline to the weight of the entropy regularisation. + +Suggestions +-- + +* Figure 6 could be made more readable by plotting a parameter study instead of a bunch of learning curves, e.g. plot AUC as a function of K. +* It would be helpful for the authors to better discuss the relation, similarity and difference between the proposed approach and popular HRL approaches, in order to better assess the novelty of the method, and to ensure that it is placed in the appropriate context. + +Finally, the paper could use one more pass general pass to ensure the writing is fully correct and make it as readable as possible. Please also fix the following typos: +- or without explicit probabilistic representation → the sentence doesn’t connect to the previous one +- Is our method outperform? → does our method outperform?",6,4.0,ICLR2021 +lFIJQGogQj,4,UV9kN3S4uTZ,UV9kN3S4uTZ,Improvement on NRI but confusing and limiting design choices,"This paper builds on Kipf et al. (2018)’s Neural Relational Inference. In particular, this work introduces a latent variable model which treats the interactions (i.e. relations) between different agents as dynamic and time-varying. As in NRI, the interaction variable between any two agents is conditioned on the history of those agents’ states. An agent’s future state is conditioned on its history of states as well as its interaction variables with other agents. + +The results from DYARI are interesting but I am concerned about the intricacies of setting the “inference period” to be aligned with the “dynamic period.” I would have expected that choosing a smaller inference period (e.g. half the dynamic period) should lead to little to no loss in performance. The authors ascribe the observed loss in performance to “the extra uncertainty introduced by estimating more latent variables.” I’m not totally convinced. + +Physical systems like springs have an inherent temporal invariance which you don’t seem to be exploiting. Did you consider (a) processing timesteps in a moving window of size T (i.e. z^{ij}_t is conditioned only on the last T observations) or (b) using a recurrent encoder to process observations sequentially, rather than process all timesteps at once with the PSPNet encoder? 
That might help with the inference challenge you're facing. + +More questions and requests for clarification: +- What if the inference period is out of sync with the dynamic period (e.g. 3 steps versus 5 steps)? Does the model still perform reasonably? +- Why not model relations as continuous latents (setting aside the fact that NRI used discrete variables)? That way you can model negative interaction (e.g. competition), zero interaction, as well as the magnitude of interaction. +- I don’t understand Figure 8 at all. Consider the two blue players who stay on the leftmost side of the basketball court. They have a dotted blue line between them in all the “coordination” plots and a dotted red line between them in all the “competition” plots. Could you please explain what's going on? I would expect only one relation to be on at any time step. +- In Table 6, you show average pooling leads to higher accuracy but also a higher MSE. Shouldn't you have a lower MSE with a higher accuracy? +- Could you please add error bars for your reported scores over independent runs? +- What value of Beta did you use for your loss? What was the effect of varying this hyperparameter? + +Minor: +- In equation 2, I assume the density p(x^(i)_{t+1} | x^(i)_{< t}, z^{ij}_t) was meant to be conditioned on N-1 latents not just a single j? In fact, j is undefined in the equation at that point. +- Figure 2 shows the hidden interaction nodes z^{ij}_t are not conditioned on any other nodes. That is not true from equation 2: each z^{ij}_t seems conditioned on agent i and j’s past trajectories. +- Could you please use x_{\leq t} instead of x_{< t} to denote states up to and including time t?",4,3.0,ICLR2021 +6h4lSPNYlri,3,jYVY_piet7m,jYVY_piet7m,Interesting findings,"This paper's main topic is the actual usability of the current status of non-auto-regressive translation (NAT) models. + +Although previous papers have reported that the NAT models can achieve the same performance level with auto-regressive translation models while the decoding speed is much faster, like two to five times, this paper points out that it deeply relies on the batch size and computation environment. +This is a proper investigation for the community since some researchers might believe that NAT is always faster than standard auto-regressive models and became an excellent alternative to them. + +The ideas of inducing skip-AT and skip-MT are really unique and somewhat innovative (since, I guess, no other researchers hardly think to employ such skip-decoding architecture). +Basically, this paper has several new findings that should be shared in the community for developing better technologies. + + + + +The following are the questions/concerns of this paper. + +1, + +""IR-NAT heavily relies on small batch size and GPU, which is rarely mentioned in prior studies."" +I think this is an excellent investigation. However, this paper does not tell readers why this observation happens. +Please explain why the current NAT models are not suitable to work on CPUs and large batches. + + +2, + +The intention of the statement, ""which indicates that the balanced distribution of deterministic tokens is necessary"" is unclear. Please elaborate on what the authors try to tell by this statement. + + +3, + +The proposed method consists of many new components. +The authors provided the results of an ablation study in Table 5. +This is a really nice analysis. +However, the performance differences in −FT, −RPR, and −MixDistill are somewhat marginal. 
+Actually, we can easily observe such 0.3-0.4 BLEU score difference by just changing random seeds for Transformer models. +Are there any statistically significant differences among them? Or any reasonable evidence that supports the difference? + + +4, + +This is just a comment, and appreciate having the author's words. + +There is an opinion that fully tuned implementation for standard auto-regressive models outperforms both decoding speed and accuracy. See the following presentation slide on WNGT-2020 (such as P33): +https://kheafield.com/papers/edinburgh/wngt20overview_slides.pdf + +What would happen for the proposed method if we compared them on such a highly-tuned implementation?",6,5.0,ICLR2021 +HyePr6MAKB,3,SkeuipVKDH,SkeuipVKDH,Official Blind Review #1," +The novelty of this paper is adding an extra regularization term to the objective of beta-TCVAE (a VAE that regularizes total correlation), based on the discovery that low TC(z) does not necessarily mean low TC(mu). The added term enforces sample and mean representations stay close. + +The authors' idea is understandable at a coarse resolution. However, the authors explain the mathematics poorly. Explanations of lots of variables and notions are missing. For example in Theorem 1, what is ""j""? what is \sigma_j? In Section 4, the simplification of notations lead to more difficulties to understand the formulas. In ""x_n"", is n the index of a sample or a dimension? The notations of variables are also confusing. Boldface lowercase letters should be used for vectors, and plain letter should be used for scalars. In Equation 4, what are D and k? + +It is nice to see, in the given experimental results, that latent representations of RTCVAE are less correlated in comparison with FactorVAE in Figures 6 and 7. However, the authors should show some generated examples through latent variable traversal to qualitatively demonstrate the potential advantages of the proposed improvement. + +Minors: Section. X -> Section X + +",3,,ICLR2020 +jMeAGbTm_D3,1,EVV259WQuFG,EVV259WQuFG,Official Blind Review #1,"This work addresses two main challenges of span-extraction style machine reading comprehension (MRC) tasks: how to evaluate the syntactic completeness of predicted answers and how to utilize the rich context of long documents. To handle such challenges, Question Rewritten Verifier (QRV) and Hierarchical Attention Network (HAN) are proposed respectively. The former uses a question-rewritten method to replace the interrogatives in the question with the answer span for further verification. The latter adopts a hierarchical multi-granularity (sentence, segment, and paragraph) cross-attention to extract and utilize the context information of long paragraphs. Compared with the strong baselines, both the verifiers and their combination achieved relatively significant accuracy improvement on three mainstream span-extraction MRC tasks: SQuAD2.0, NewsQA, and TriviaQA. + +------------------------------------------- +Strengths: + +1. The idea of bringing the answer back to the question for further validation is sound and it is reasonable for humans to do this process to verify the candidate answer in real-world practice. + +2. The question rewritten strategy is simple and effective, which brings improvements. HAN also handles the problem of long sequence well. + +3. The overall method achieves state-of-the-art results. The significance test shows significant improvements over baselines. + +------------------------------------------- +Weaknesses: + +1. 
The design of the training target (loss) in QRV is complex and not interpretable enough. There are many loss functions. How about their contributions to the final performance? + +2. There is no test result reported for SQuAD2.0, though it is possible to obtain the results without making it public. Therefore, the clarity, “Due to anonymous issues, we have not submitted our results in an anonymous way to obtain results on the hidden test set.”, is not quite convincing. + +3. The improvement of accuracy is mainly reflected in the questions of HasAns, which has no obvious contribution to the recognition accuracy of NoAns, which is one of the main challenges of the current MRC tasks. + +------------------------------------------- +Questions: + +1. (Section 1 page 2 line 26) The paragraphs are divided into segments, with fixed length (e.g.,512 tokens with strides such as 128 or 256) and then divides the segment into sentences. So when dividing the paragraph, what if the dividing point is in the middle of a sentence? Would the incomplete sentence be discard?If not, how to further divide the segment to sentence level? Further clarification of the process would be beneficial. + +2. (Section 2.1 page 3 line 8) When failing to find the alignment, the answer text is attached at the left-hand side of the question. It obviously damages the sentence structure. So will this affect the judgment of the model in the following process? In another word, would it have an impact on the performance of the final model (Increase or decrease) if question-written including subsequent loss calculations were not done on such questions? + +3. (Section 2.2 page 4 line 17) Multiple losses are employed, but the paper did not distinguish the practical effectiveness of each loss. My concern is whether each of the objectives is necessary since the experiment results in Table 3 has verified that $l’_{3}$ does not significantly improve the accuracy of the model. Would the authors further verify the contribution of other losses to model performance (except for apparently indispensable l1 and l2)? + +4. (Figure 1 (b)) Do the tunable CLMs of sentence-level and segment level share parameters? Besides, may the authors list the number of parameters of each model (QRV, HAN, and Combination)? + +---------------------------------- +Minor Issues: + +The citation format is not consistent, please check the usage of \citep{} and \citet{}. +",6,5.0,ICLR2021 +JGicoZ1Rsmx,1,rmd-D7h_2zP,rmd-D7h_2zP,Review,"[Summary] +In this paper, the authors proposed a multidomain state-tracking model that leverages the relationship among different domain-slot pairs. This is done by leveraging the full-attention step over the [CLS] special token and by providing all the domain-slot pairs as a special token to a pre-trained language model (Figure 2 is very clear). To predict the value of the slot $D_{i,j}$, the author concatenates the representation of the [CLS] token, share among all the domain-slots, and the $D_{i,j}$, provided as input, and use a gating mechanism, by only using $D_{i,j}$ representation, to decide whether require as value (i.e., prediction) or not (e.g. None). \ + +The authors experimented using ALBERT (Lan et al., 2019) as a pre-trained language model, on the well-known benchmark MultiWoZ 2.0 (Budzianowski et al., 2018) 2.1 (Eric et al., 2019). The authors studied different format to represent $D_{i,j}$ DS-merge (i.e., one token per domain-slot) and DS-split (i.e., one token per slot and one per domain, thus more scalable). 
The reported performance is state-of-the-art at the time of submission. + +[Pros] +- The paper reads well and it is easy to follow for people working on Task-Oriented dialogue. +- The proposed method is simple and effective, and it would be easy to reproduce. + +[Cons] +- The idea of using domain-pairs as input to a large pre-trained model is not novel (Wu et al., 2019; Zhang et al., 2019; Lee et al., 2019), as also pointed out by the authors, but the authors do not explicitly clarify this in the methodology section, leading the reader to believe that the domain-pairs is their own contribution. Same for the slot-gate (Wu et al., 2019) +- The authors claim to learn relations between slots, but the analysis section is very thin and it just shows an ablation by masking the attention between the slot. Two points: why not just removing the [CLS] token instead of removing the attention, and why just using on ALBERTA large. For instance, the authors said ""For this experiment, we used the ALBERT configuration of large-v2, +for faster experimentation"" which is contradictory since large-v2 is the slowest to run I guess. Can the authors show this ablation for all the model size? +- Although, MWoZ is the current benchmark for DST in ToDs, there are also other datasets for this task that can be considered (e.g., Schema Guided Dialogue (SGD) (Rastogi et.al. 2019)) + + +[Reason to Reject] +The main contribution of this paper is very thin, adding the [CLS] token as input, and the main technical contribution is not well explored (missing an in-depth ablation). + +[Reason to Accept] +State-of-the-art performance at the submission time. To be noted, (Mehri et.al. 2020) reported better performance in MWoZ and other datasets, but this paper was released after the ICLR submission deadline. + +[Question] +- Can the authors show the ablation for all the model size? + +[Suggestion] +- Figure 4 is very hard to read. I suggest to better format the dialogue. + +",4,5.0,ICLR2021 +SJewMf7q2m,2,BJgolhR9Km,BJgolhR9Km,Experiments are not convincing,"This paper proposes an infinity norm variant of the RBF as the activation function of neural networks. The authors demonstrate that the proposed unit is less sensitive to the out-liar generated by adversarial attacks, and the experimental results on MNIST confirmed the robustness of the proposed method against several gradient-based attacks. + +Intuitively, the idea should work well against the features of adversarial examples which are far from the center of the cluster of ""normal"" features. However, the experiments are not convincing enough to show this point, and the entire method looks like a simple gradient mask technique. In my opinion, two types of experiments should be further considered: + +1. Pseudo-gradient-based attacks. Since the networks are trained using Pseudo gradients, all the attacks utilized in this paper should be pseudo-gradient-based as well. + +2. Black-Box attacks which do not rely on the information provided by gradients, such as transferable adversarial examples. + +Furthermore, the robustness revealed on the ""noise"" attack is interesting, I wish the authors could provide an analysis of the effects on feature distributions using different types of attacks.",5,3.0,ICLR2019 +5I-AHiXrR9,1,mWnfMrd9JLr,mWnfMrd9JLr,review,"Pros: +1. this work propose to learn a manifold prior, by doing so, it can be used to improve the generation and representation quality. +Intrinsic dimension is applied in the method. +2. 
to fix the ill-defined KL, the authors proposed to use a ""bridge"" distribution, so that Q and P can have overlap on their support. + + +Cons: +even though I love the idea of this work, in the experiment section, the authors fail to compare their method with other flow-based method with quantitative results. +Since the authors claim that their method can improve the generation and representation quality, without any comparison, it lacks of evidence.",5,5.0,ICLR2021 +UaanPP5rnWY,4,a-xFK8Ymz5J,a-xFK8Ymz5J,An interesting paper showing good results on applying denoising diffusion processes to spectrogram inversion. ,"The paper develops a speech synthesis model using denoising diffusion processes, a generative model framework recently demonstrated in image generation (Ho et al. 2020). The application is straightforward and there is little if any theoretical difference from the Ho et al. paper. I didn't check the proofs included in the appendix, but they along with the learning and sampling procedure seem to be already developed in (Ho et al. 2020). The authors should take care to be very clear about the mathematical developments that are directly taken from the prior literature, and what developments are introduced in this paper. +Nevertheless the experiments on this application is valuable, and makes significant progress on a problem that has proven surprisingly difficult to solve in an efficient way. +The experiments and demos are convincing, and the results could be considered highly competitive in conditional generation and state of the art for class-conditional and non-conditional generation. + +The writing at times could use improvement. In the abstract, line 1, ""we propose DiffWave, a versatile Diffusion probabilistic model for conditional and unconditional Waveform generation"". I don't like this style of capitalizing things in a sentence that are not proper nouns. If you want to introduce an abbreviation derived from a term or phrase, a widely accepted conventional method is to italicize the phrase and define the acronym the first time it is used, as in ""\emph{diffusion waveform} (DiffWave) model"". Later you have ""DiffWave produces high-fidelity audios in Different Waveform generation tasks"": Why is ""Different Waveform"" capitalized? DiffWave has already been defined relative to ""diffusion waveform"". If this is supposed to be cute, it's not. +Defining the acronym in the abstract is OK, but not necessary. In any case, you still have to define it again in the body of the paper, since the abstract is considered a standalone summary of your document. +Also ""audios"" is not a word. Please use ""audio signals"". This is repeated throughout the paper. +Other examples: +""This avoids the ... issues *stemmed from the joint training"" stemmed --> stemming +"" for generating very long waveform"" : waveform --> waveforms + + +",7,5.0,ICLR2021 +HyxhrAGRtB,3,HJluEeHKwH,HJluEeHKwH,Official Blind Review #1,"This paper proposes a differentiable variant of the Cross-Entropy method and shows its use for a continuous control task. +- It introduces 4 hyper-parameters and it is not clear how robust the method is to these. +- Although the idea is interesting, I think the paper needs a more rigorous experimental comparison with previous work and other methods. +Detailed review below: +- The abstract should mention clearly that the proposed method allows you to differentiate through argmin operation and can be used for end to end learning. 
Similarly, please reframe parts of the introduction to make it more accessible to a general reader. For example, in the introduction, ""approximation adds significant definition and structure to an otherwise..."". This statement requires more context to make it useful. Similarly, ""smooth top-k operation"" is not clear. +- Is there a way to guarantee that the solution found by (D)CEM is a reasonable approximation to the argmin. For unrolled gradient descent, this can be done by looking at the gradient wrt x. +- It might be more useful to explain CEM before the related work section or just moving the related work to the end. +- Section 3: If the paper is about CEM, please give some motivation and details rather than just citing De Boer, 2005. +- There is a notation clash between \pi for the sort and policy later in the paper. Similarly, ""t"" is for both for the iterations of CEM and the time-stamp in the control problem. +- I don't understand how Proposition 1 adds to the paper. This is a standard thing. Similarly for Proposition 3. +- Isn't there an easier way to make the top-k operation soft - by sampling without replacement proportional to the probabilities? Please justify this design decision. Similarly, how is the temperature \tau chosen in practice? +- Please explain the paragraph: ""Equation 4 is a convex optimization layer and... GPU-amenable.."" Isn't this critical to the overall scalability of this method? +- - How are the hyper-parameters for CEM chosen - the function g(.), the value of k, \tau, T chosen in practice. If the criticism of GD is that it overfits to the hyper-parameters - learning rate and the number of steps, why isn't this a problem with (D)CEM. +- Section 4: Since you're comparing against unrolled GD, please formally state what the method is. +- Section 4.2: How is the structure of Z decided, that is how do you fix the space for searching for the policy in the Z space? +- There are other methods that auto-encode the policy u_1:H to search the space. How does the proposed method compare to these methods? This is important to disentangle the effect of GD vs CEM and that of just searching in a more tractable space of policies. +- Section 5.1: How is the number of optimizer steps (=10) decided? Also, how is the learning rate for GD picked. Is the performance of unrolled GD worse for all values of \eta, even after a grid-search over the learning rates? +- For Section 5.2, please compare to baselines mentioned in the paper. Also, there needs to be an ablation/robustness study for the DCEM method. + + + + +",3,,ICLR2020 +rJxXwY25nX,2,B1x0enCcK7,B1x0enCcK7,An ad-hoc method for shape generation," +This paper proposed a 3D shape generation model. The model is essentially an auto-encoder. The authors explored a new way of interpolation among encoded latent vectors, and drew connections to object functionality. + +The paper is, unfortunately, clearly below the bar of ICLR in many ways. It’s technically incremental: the paper doesn’t propose a new model; it instead suggests new way of interpolating the latent vectors for shape generation. The incremental technical innovation is not well-motivated or justified, either: the definitions of new concepts such as ‘functional essence’ and ‘importance vector’ are ad-hoc. The results are poor, much worse compared with the state-of-the-art shape synthesis methods. The writing and organization can also be improved. 
For example, the main idea should be emphasized first in the method section, and the detailed network architecture can be saved for a separate subsection or supplementary material. + +It’s good that the authors are looking into the direction of modeling shape functionality. This is an importance area that is currently less explored. I suggest the authors look into the rich literature of geometry modeling in the computer graphics and vision community, and improve the paper by drawing inspiration from the latest progress there. +",3,4.0,ICLR2019 +A49FqJHZEG,4,rmd-D7h_2zP,rmd-D7h_2zP,This paper studies the problem of multi-domain dialogue state tracking which is challenging. It proposes a bert based model for multi-domain dialogue that effectively models the relationship among domain-slot pairs. The experimental results show the model achieve sota performance on the MultiWOZ-2.1 dataset.,"Pros +• This paper incorporates the multi-domain domain-slot pairs into the bert input so that the relations between sentences, domain-slot are modeled. +Cons +• It’s better to experiment on more dataset to prove the method’s effectiveness. +Comments +• This paper https://arxiv.org/abs/2006.01554 seems get better performance on the same dataset. +• In the domain split/merge method, does the domain/slot used as vocabulary tag or word embedding? +• The slot-gate classifier’s output is fed into the slot-value classifier, how does this affect the performance? +",5,3.0,ICLR2021 +IXFJcbejoak,5,7qmQNB6Wn_B,7qmQNB6Wn_B,Novel idea with some clarity and technical issues,"This paper considers the exploration efficiency issues in off-policy deep reinforcement learning (DRL). The authors identify a sample efficiency limitation in the classical entropy regularization, which does not take into account the existing samples in the replay buffer. To avoid repeated sampling of previously seen scenarios/actions, the authors propose to replace the current policy in the entropy term with a mixture of the empirical policy estimation from the replay buffer and the current policy, and term this approach as sample-aware entropy regularization. The authors then propose a theoretical algorithm called sample-aware entropy regularized policy iteration, which is a generalization of the soft policy iteration (SPI) algorithm, and show that it converges assuming that the empirical policy estimation is fixed. A practical algorithm based on the sample-aware entropy regularized policy iteration, called Diversity Actor-Critic (DAC), is then proposed. This algorithm is a generalization of the well-known soft actor-critic (SAC) algorithm. Finally, numerical experiments show that DAC outperforms SAC and other SOTA RL algorithms, and some ablation studies are also provided to demonstrate the effect of hyper-parameter choices in DAC. + +In general, the approach is novel to my knowledge and the high level idea of using mixed policies in the entropy regularization to avoid repeated sampling and encourage unseen scenarios/actions is also interesting and reasonable. However, there are some clarity and technical issues that should be addressed and improved, as listed below: +1. The authors study finite horizon MDPs, for which the optimal policy should be non-stationary in general. However, the authors only consider stationary policies. Instead, the authors should either change the underlying setting to infinite horizon MDPs or consider non-stationary policies. +2. In (2), $s_t$ should be replaced by an arbitrary $s$ in the state space. 
Otherwise there may be contradicting definitions of the policy $q$ if $s_t$ and $s_{t’}$ are equal for some two different timestamps $t$ and $t’$. And in (3), it is better to write the $q_{\rm target}^{\pi,\alpha}$ in the entropy term as $q_{\rm target}^{\pi,\alpha}(\cdot|s_t)$, to be consistent with (1). +3. It’s not very clear why the authors propose to estimate $R^{\pi,\alpha}$ with some (neural network) parametrized $R^{\alpha}$. The authors mention that one can only estimate $R^{\pi_{\rm old},\alpha}$ for the previous policy $\pi_{\rm old}$ in practice. However, since in $R^{\pi,\alpha}$, all the quantities including $\pi$, $q$ and $\alpha$ are known, I’m confused why one cannot evaluate it directly. On a related point, it’s not very clear why the estimation procedure for $\eta$ (the parameter of $R^{\alpha}$) using hat $J_{R^{\alpha}}(\eta)$ makes sense. The form of hat $J_{R^{\alpha}}(\eta)$ looks like an entropy term extracted from the $J_{\pi_{\rm old}}$ function, but it’s unclear why maximizing it gives a good estimation of $R^{\pi,\alpha}$. Some more explanations are needed. +4. There seem to be several errors (at least inaccuracies) in the proof of Theorem 1 (in the Appendix). Firstly, in the proof of Lemma 1, the term “correctly estimates” is not very accurate, and should be simply stated as something like “equals”. Also, it’s not very clear when the assumption $R^{\alpha}\in(0,1)$ can be guaranteed (e.g., using Gaussian/soft-max policies?). Secondly, in the main proof of Theorem 1, convergence of $Q^{\pi_i}$ to some $Q^{\star}$ is correct, but this does not immediately imply convergence of $J_{\pi_i}$, let alone the convergence of $\pi_i$ to some policy $\pi^\star$. On a related point, the proof for the optimality of $\pi^\star$ in terms of $J$ is not clear. In particular, it is not clear why (7) and Lemma 2 implies the chained inequality $J_{\pi_{\rm new}}(\pi_{\rm new})\geq J_{\pi_{\rm old}}(\pi_{\rm new})\geq J_{\pi_{\rm old}}(\pi_{\rm old})$. I understand that the authors may feel that the proofs are similar to that of SPI, but indeed there are several significant differences (e.g., the definitions of $\pi_{\rm new}$ and $J_{\pi}$). More rigorous proofs are needed for these claims. +5. In Section 5, it is unclear why the authors need to include the parameter $c$, how to choose it and what it serves for. Some additional explanations are needed. +6. On a high level, the eventual goal of the paper is not clearly stated. From the experiments, it seems that the average episode reward is the actual goal of concern. However, the problem setting and the theoretical results (Theorem 1) seem to indicate that the problem of concern is the discounted entropy regularized reward. Some discussion about this is needed. + +Finally, here are some more minor comments and suggestions: +1. In the analysis of the sample-aware entropy regularized policy iteration, the authors assume that $q$ is fixed. However, in practice, especially in the long run (as concerned in the analysis), such an assumption will not hold (even in just an approximate sense). Can you still obtain some sort of convergence when taking into account the $q$ changes? +2. Why do you need to divide the reward and entropy regularization term in $Q^{\pi}$ by $\beta$? +3. It’s better to write out the “binary entropy function $H$"" explicitly for clarity. +4. At the beginning of Section 4.3, “propoed” should be “proposed”, and In Section 5, “a function $s_t$” should be “a function of $s_t$”. +5. 
Some high level explanations on why the $(1-\alpha)$ term can also be dropped in (8) will be helpful. +6. The theoretical results only show that the algorithm converges, which is already guaranteed by SPI. Is there any possibility to show that there is also some theoretical improvement? + +So in short, the paper proposes an interesting modification of the max-entropy regularization framework, but contains several technical and clarity issues. Hence I think it is not yet ready for publication in its current form. ",5,4.0,ICLR2021 +TUuLieKGGGu,3,Wga_hrCa3P3,Wga_hrCa3P3,"This paper presents a method for conditional text generation tasks that aims to over the ""exposure bias"" problem through contrastive learning where negative examples are generated by adding small perturbations to the input sequence to minimize its conditional likelihood, and positive examples are generated by adding large perturbations while enforcing it to have a high conditional likelihood. ","This paper presents a method for conditional text generation tasks that aims to over the ""exposure bias"" problem through contrastive learning where negative examples are generated by adding small perturbations to the input sequence to minimize its conditional likelihood, and positive examples are generated by adding large perturbations while enforcing it to have a high conditional likelihood. Experimental results on machine translation, text summarization and question generation show the effectiveness of the proposed approach. + +My only concern is that compare to MLE, the improvements either on Table 1 or on Table 2 are relative small. The study in the paper by Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, Laurent Charlin, Language GANs Falling Short, ICLR 2020 shows that the ""exposure bias"" problem for text generation by MLE appears to be less of an issue, and simple ""temperature sweep"" in the softmax significantly boosts the performance and gives pretty good results that beat all language GANs. So I think in the experiments, all results should be compared using the trick of ""temperature sweep"". Moreover, if diversity is an issue, the results should be compared in the quality-diversity space as did in Language GANs Falling Short paper. Hopefully the authors can address my concern in the rebuttal period. ",6,3.0,ICLR2021 +#NAME?,4,ZHkbzSR56jA,ZHkbzSR56jA,"A decent paper, but a number of important points are not carefully discussed in it.","This paper studies distributed learning in the presence of Byzantine workers in the asynchronous setting. Its main contributions include generalization of the existing literature on Byzantine fault tolerance in distributed learning to incorporate the case of asynchronous learning. This generalization involves an algorithm, convergence analysis for the algorithm, and experimental results. While the results presented in the paper appear to be correct, I would like the authors to focus on the following points during their revision. + +1. In Section 2.2, the definition of Byzantine worker, it is not clear why the worker is being indexed with $k_t$? What is the meaning of $t$ in this usage of the worker index? +2. The writing of the paper could use some proofreading. Some of the sentences are hard to parse on first read, while some other sentences suffer from grammatical errors. 
As a specific example, I could not understand the meaning of ""Only when all buffers have got changed since the last SGD step, ..."" in Section 3.1 until I reread the main parts of the paper. +3. Related to the previous point, the discussion in Section 3.1 in general is hard to parse because of the notation and could benefit from revision. Also, $m$ in this section is undefined up to this point in time and it is not clear what it means. +4. A number of aggregation functions have been proposed in prior works (see e.g. Adversary-resilient distributed and decentralized statistical inference and machine learning: An overview of recent advances under the Byzantine threat model). Do all of these previous aggregation functions satisfy the characterization of Section 3.2? It would be helpful to have some discussion of this. +5. Theorem 1 and Theorem 2 leave something to be desired. Since the task is to engage in distributed learning, one expects to see some sort of speedup from the fact that $n$ workers are being used to divide up the work. However this speed-up does not seem to be coming up in the analysis or the discussion. In the absence of such a speed-up, it is not clear if the authors are really providing guarantees that are useful for distributed learning. +6. It would be useful to discuss the impact of heavily delayed workers on the algorithm. What if the sum of the number of heavily delayed workers and Byzantine workers exceeds $r$? +7. The plots corresponding to the experimental results are too small and should be modified to have bigger font and size. + +***Post-discussion period comments*** +The authors have done an adequate job of responding to my queries and have also revised the paper in light of the comments of all the reviewers. While the paper could always be improved, I believe it is now above the threshold of acceptance and it should be accepted into the program, if possible. I am raising my score for this paper in light of the discussion and the revised paper.",7,3.0,ICLR2021 +IVYcb3BAOWR,4,9z_dNsC4B5t,9z_dNsC4B5t,"Interesting approach, some improvements to the paper (experiments) needed","The authors propose a method for cross domain few-shot classification that learns to generate domain specific data statistics from very few training examples for domain independent batch normalisation. +They propose to train small auxiliary networks that generate data statistics for normalisation. Networks are trained within a meta-learning framework using a KL divergence loss, which enforces estimated statistics on small support/training examples to match statistics from query sets where more data is available. + + +STRENGTHS + +The paper is well written and motivated. The proposed approach is, for the most part, easy to follow and understand. The approach benefits from its simplicity and versatility (e.g. it is not tied to a specific FSL method) and promising performance is obtained. + +The authors provide a very large set of experiments in multiple scenarios, and definitely demonstrates a strong effort in evaluating their approach. + + +WEAKNESSES + +Unfortunately, despite the fact that a significant amount of time was dedicated to evaluate the method, the experimental section needs substantial modifications. This is mainly due to the presentation of the experiments, as well as some key missing comparisons. +Regarding presentation, too many implementation details and descriptions of the experiments are missing from the main paper (and in certain cases, missing altogether). 
Regarding datasets and implementation, it is ok to provide non essential details in the appendix (especially considering the large number of datasets considered). However, no information is provided at all in the main paper, which makes it very difficult to understand the setting of the experiments. For example, in Table 1 and Figure 2, it is not known on which datasets experiments are run, and it is never mentioned in the main text (for table 1) what FSL method is employed. +In addition, Table 2 experiments comparing MetaNorm to different approaches sorely lacks description. The approach is compared to 9 different algorithms, none of which are given a description or reference to learn more about the method. It is therefore impossible, besides guessing, to know what the model is being compared to. This issue is also noticeable in Table 3-5, in particular with a baseline in Table 4 that is never described. +Finally, experiments in Figure 2 would be a lot more interesting if compared to standard methods. It would be interesting to see how the proposed strategy allows to be more sample efficient and reach stronger performance than transductive batch norm in situations where sample size is the smallest. As this is one of the cited main limitations of TBN, this experiment is highly important2- With regards to related work, I would suggest to move the section after the introduction, where it provides much better context to facilitate method comprehension, in particular regarding the description of the TBN strategy. + +Authors should also comment on how their work, and in particular FSL domain generalisation setting, relates to cross domain few-shot learning works, +e.g. Tseng et al ICLR 2020, Cross-Domain Few-Shot Classification via Learned Feature-Wise Transformation +It is currently presented as a completely new approach to FSL, and appears to ignore past cross domain FSL works. Please provide additional context regarding such works. + + +RECOMMENDATION + +In summary, the authors propose an interesting, simple strategy for more robust BN that can be of interest to the community. I would strongly recommend that the authors make the following modifications to strengthen their paper: + +1- Reorganise and expand on the experimental evaluation to provide necessary details and make the main paper self-contained, even if it requires moving an experiment to the supplementary materials. +2- Move the related works section after the introduction, to provide additional context before delving into the method +3- Relate the proposed work to past work on cross-domain FSL and potentially tone down claims that few-shot cross domain learning is a completely new problem investigated here. +4- Please provide a comparison to standard TBN in Figure 2 + + +Additional suggestions + +5- If possible, please provide an overview figure of the proposed method. +6- In result tables, please sort methods according to their overall performance, and correct bolding in table 2 (protonet 5-way-5 shot MetaNorm is not best performing method) and highlight setting where methods have very similar performance (as in Table 14) for clarity. +7- Provide a clear list of contributions at the end of the introduction section +8- While KL divergence and hypernetworks are well known terms, it would make the paper more accessible to add a sentence (and equation in the case of the KL divergence) or two describing the terms. 
In particular, hypernetworks are generally used to characterise weight generators for entire architectures and might lead to confusion. +9- It could be nice to provide more attention to how this work relates to conditional batch norm works, and whether they can be complementary. +",6,4.0,ICLR2021 +B1e25dCt2m,2,H1xQSjCqFQ,H1xQSjCqFQ,Dropout,"This is an interesting idea that seems to do better than regular dropout. +However, the experiment seem a bit artificial, starting with less modern network designs (VGG) that can benefit from adding dropout. State of the art computer vision networks don't seem to need dropout so much, so the impact of the paper is unclear. + +Section 4.4: How does this compare to state-of-the-art network compression techniques? (Deep compression, etc) +",5,3.0,ICLR2019 +SkevyWznYr,2,BJgqQ6NYvB,BJgqQ6NYvB,Official Blind Review #3,"Summary: +- key problem: neural architecture search (NAS) to improve both accuracy and runtime efficiency of deep nets for semantic segmentation; +- contributions: 1) a novel NAS search space leveraging multi-resolution branches, efficient operators (""zoomed convolutions""), and parametrized expansion ratios, 2) a decomposition and normalization of the latency objective to avoid a bias towards very fast but weak architectures, 3) a natural extension of the optimization problem to simultaneously search for teacher and student architectures in one pass, 4) a novel state-of-the-art efficient architecture (FasterSeg) found by the aforementioned NAS algorithm, 5) a detailed experimental evaluation on 3 datasets and an ablative analysis quantifying the benefits of the aforementioned contributions. + +Recommendation: Accept. + +Key reason 1: solid experimental results backing the claims. +- When compared to related efficient architectures, the proposed method results in competitive accuracy at significantly higher frame rates. +- This is validated on Cityscapes, CamVid, and BDD with the architecture found on Cityscapes. +- The resulting architecture (FasterSeg) is actually interpretable and makes sense, extending the handcrafted architectures used as inspiration. +- The ablative analysis shows that the numerous individual contributions are significant, esp. the multi-branch formulation and student co-searching. + +Key reason 2: well-motivated method with a collection of multiple novel contributions that are interesting and practical. +- The multi-resolution branches formulation is simple and extends typical NAS focusing on single paths through the supernet. +- Teacher/student co-searching via learning two sets of architectures in one supernet seems novel, simple, and effective. Always picking the largest expansion ratios for the teacher and applying a distillation loss in addition to the latency loss for the student is sensible and seems to beat the standard pruning approach at no significant extra cost during NAS. +- The zoomed convolution operator seems like a novel efficient alternative to (expensive) dilated convolutions. Although it is very simple (bilinear downsampling -> 3x3 conv -> bilinear upsampling), it is not commonly used as an operator (as far as I know), and yet is found to be a key part of the final architecture (Table 7 appendix I) due to its low latency. The closest related operator / block I could think of might be blocks found in stacked hourglass networks (Newell et al). +- The optimization of the expansion ratios using the Gumbel-Softmax trick is interesting, although this is also explored in the very recent paper by Shaw et al. 
2019 (possibly the closest related work that should be discussed in a bit more depth in Section 2); +- Decomposing and normalizing the latency objective to avoid ""architecture collapse"" (convergence to anemic architectures stemming from certain architectural factors dominating latency) is principled and effective. +- Caveat regarding novelty: I could not find the ideas proposed here in the literature, but its hard to be sure due to 1) the recent explosion of NAS papers, 2) the simplicity of certain ideas (e.g., ""zoomed convolutions""). + + +Additional Feedback: +- how is the student trained after NAS? Is the teacher first retrained from scratch? Is the student retrained from scratch on the teacher (after NAS or retraining)? in general, more details on what happens after co-searching would be helpful; +- ""human designed CNN architectures achieve superior accuracy performance nowadays"": this is a surprising statement considering the cited NAS papers report performance improvements (e.g., Zoph and Le 2016); +- missing reference also using multi-scale NAS for efficient and accurate semantic segmentation: ""Searching for Efficient Multi-Scale Architectures for Dense Image Prediction"", Chen et al, NeurIPS 2018; +- missing reference on NAS for efficient semantic segmentation that also uses distillation: ""Fast neural architecture search of compact semantic segmentation models via auxiliary cells"", Nekrasov et al, CVPR 2019; +- missing reference on joint NAS and quantization: ""Joint Neural Architecture Search and Quantization"", Chen et al, arxiv 2018; +- ""we choose a sequential search space (rather than a directed acyclic graph of nodes (Liu et al., 2018b)), i.e., convolutional layers are sequentially stacked in our network"": ""stacked"" is confusing here; +- ""we allow each cell to be individually searchable across the whole search space"": what do you mean? Anything beyond each cell containing different operators after learning? +- if \alpha = \beta in eq. 6 of appendix C, then w and hence Target(m) does not depend on latency, isn't this a typo? +- ""Gumbel-Max"" is typically called ""Gumbel-Softmax"" (cf. ""Categorical Reparameterization with Gumbel-Softmax"", Jang et al, ICLR'17); +- typos: ""find them contribute"", ""the closet competitor"", ""is popular dense predictions"".",8,,ICLR2020 +KRuWUiIK1Qa,4,V5j-jdoDDP,V5j-jdoDDP,Interesting and noval (AFAICT) combination of integrated gradients and SMT that leverages strengths of each. Some choices need more justification and explanation.,"## Summary + +This paper presents a method to encode the minimal input feature discovery problem -- finding the minimal set of features in a input that is necessary for a prediction -- into a form that can is amenable to satisfiability modulo theory (SMT) solvers. In particular they first use the integrated gradients methods to score first-layer neurons on the degree to which they influence the prediction. Then, they produce and solve an SMT problem that finds the minimal mask that changes these influential neurons. They demonstrate their approach on several problems. + +## Review + +Overall I thought this was an interesting paper with practical utility. 
+ +- The formulation is interesting and is a novel balance of quite different methodologies with useful results +- The paper is clear and fairly well written, but some higher level intuition about the approach would help +- I'd like to see some more justification for focus on the first layer, and experiments (described below) + +In section 3.2 you mention that you use IG to ""score the neurons in order of relevance by treating the first layer activations as an input to the subsequent network"". It's not clear to me whether the $d$ in the function $F: R^{a\times b \times c} \to [0, 1]^d$ is the set of all nodes in the neural network, just the first layer, or just the final layer? It seems the latter is the case, based on Equation 2. + +More generally, it's not clear to me what privileges the first layer in this work (Eq 2). My understanding is that +1. simply that restricting attention to the first layer allows SMT to be applicable +2. You use IG to integrate information from all layers, and by restricting Eq 2 to $D^k$ you are effectively combining both methods + +This leads to an experimental question: do your explanations improve if you include more than one layer? This seems like something that is easily testable, at least on small examples. + +Writing-wise, some of the terms could be more clearly defined. For instance in the definition of $F$ above, I am left guessing as to what $a$, $b$, $c$ and $d$ are, and assume they are simply placeholders. Similarly, sometimes we have $N_\theta(x)$ and sometimes $N_\theta(X)$.",7,3.0,ICLR2021 +HJl0oTyFhX,1,rke8ZhCcFQ,rke8ZhCcFQ,Nice but straightforward idea to attack graph CNNs; paper not always well-written,"The main idea of this paper is that a 'realistic' way to attack GCNs is by adding fake nodes. The authors go on to show that this is not just a realistic way of doing it but it can be done in a straightforward way (both attacks to minimize classification accuracy and GAN-like attacks to make fake nodes look just like real ones). + +The idea is neat and the experiments suggest that it works, but what comes later in the paper is mostly rather straightforward so I doubt whether it is sufficient for ICLR. I write ""mostly"" because one crucial part is not straightforward but is on the contrary, incomprehensible to me. In Eq (3) (and all later equations), shouldn't X' rather than X be inside the formula on the right? Otherwise it seems that the right hand side doesn't even depend on X' (or X_{fake}). +But if I plug in X', then the dimensions for weight matrices W^0 and W^1 (which actually are never properly introduced in the paper!) don't match any more. So what happens? To calculate J you really need some extra components in W0 and W1. Admittedly I am not an expert here, but I figure that with a bit more explanation I should have been able to understand this. Now it remains quite unclear...and I can't accept the paper like this. + +Relatedly, it is then also unclear what exactly happens in the experiments: do you *retrain* the network/weights or do you re-use the weights you already had learned for the 'clean' graph? + +All in all: +PRO: +- basic idea is neat +CON: +- development is partially straightforward, partially incomprehensible. + +(I might increase my score if you can explain how eq (3) and later really work, but the point that things remain rather straightforward remains). 
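+To make the dimension question above concrete, here is the shape check I had in mind under the standard two-layer GCN parameterization (an assumption on my part; the paper's Eq (3) may be set up differently): $Z = \hat{A}\,\sigma(\hat{A} X W^0) W^1$ with $\hat{A} \in \mathbb{R}^{N \times N}$, $X \in \mathbb{R}^{N \times F}$, $W^0 \in \mathbb{R}^{F \times h}$ and $W^1 \in \mathbb{R}^{h \times C}$. Adding $k$ fake nodes enlarges $\hat{A}$ to $(N+k) \times (N+k)$ and $X'$ to $(N+k) \times F$ while leaving the shapes of $W^0$ and $W^1$ unchanged, so in this formulation the product is still well-defined; if the dimensions really do not match in Eq (3), that would mean J is defined differently from this, which is exactly the kind of clarification I am asking for. 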
",3,2.0,ICLR2019 +S1eTrxCosm,1,SJzR2iRcK7,SJzR2iRcK7,"Lacks citation to similar works in Statistics, topic modelling ...","The work is a special case of density estimation problems in Statistics, with a use of conditional independence assumptions to learn the joint distribution of nodes. While the work appears to be impressive, such ideas have typically been used in Statistics and machine learning very widely over the years(Belief Propagation, Topic modeling with anchor words assumptions etc...). This work could be easily extended to multi-class classifications where each node belongs to multiple classes. It would be interesting to know the authors' thoughts on that. The hard classification rule in the paper seems to be too restrictive to be of use in practical scenarios, and soft classification would be a useful pragmatic alternative. +",5,4.0,ICLR2019 +HygS97ittS,1,rylrI1HtPr,rylrI1HtPr,Official Blind Review #2,"This paper addresses the super-resolution problem. The key is to use pixel co-occurrence-based loss metric. The idea is very straightforward. But the description could be clearer. For example, what is the spatial size of P (\bar P)? How does it influence the optimization? + +Equation (2): There are four loss functions on the right hand. How are the loss defined? + +How is the GAN used? + +In experiments, there is no evidence showing the benefit from the pixel Co-occurrence + +There is a lack of much details. Given the current presentation, I cannot judge if the quality reaches the ICLR bar. ",1,,ICLR2020 +HJelZJgJ9B,3,S1lXnhVKPr,S1lXnhVKPr,Official Blind Review #3,"This paper tackles the problem of data-parallel (synchronous) distributed SGD, to optimize the (finite) sum of N non-convex, possibly different (in the so-called non-identical case), loss functions. This paper focuses on improving the communication efficiency compared to several existing methods tackling this problem. + +To that end, the authors contribute: +· A novel algorithm and its asymptotic communication complexity. +· The proof that the common metric of the sum over the training steps of the expected squared norm of the gradient at the average of the N parameters is bounded above. +· Experimental results (training loss function of epoch number) comparing this algorithm with 2 existing ones, solving 3 problems under reasonable settings. + + The training time per epoch of VRL-SGD is claimed to be identical to the one of Local SGD, as the algorithm only have minor differences. + +- strengths of the paper: + +· The main paper is very easy to follow. +· Good effort to give intuitions on why VLR-SGD can improve the convergence rate of existing algorithms. + Such an effort is to be highlighted. +· No obvious mistake in the main. I have not thoroughly checked the full proof though. + +- weaknesses of the paper: + +· The algorithm, while having differences, is quite reminiscent of Elastic Averaging SGD (EASGD) [1]. + Indeed in both algorithms the model update at the workers consists in both descending the local gradient plus descending toward some ""moving-average""obtained through averaging all the local models. + In EASGD, this ""moving-average"" is common to every worker and the master, which updates it every k steps. + In this paper, each worker has its own ""moving-average"", which update computations are different than in EASGD as the use the instant average of the workers' models instead of the previous ""moving-average"". 
+ +[1]Sixin Zhang, Anna Choromanska, Yann LeCun, Deep learning with Elastic Averaging SGD, NeurIPS, 2015 + +- Questions I would like the authors to respond to during the rebuttal: + +· Could Elastic Averaging SGD (in particular their fastest variant EAMSGD) be applied as-is to solve the non-identical, non-convex optimization problem at hand? + Despite the authors of EASGD not studying their algorithm in the non-identical case, following what is done in the intuition part of VRL-SGD (in particular Equation (8)), it seems that the update rule of the ""moving-average"" in EASGD is then equivalent to having a momentumSGD with dampening (instead of the ""generalized SGD form"" obtained with the approach of VRL-SGD). Hence my question. + +I suggest acceptance. However I'm willing to change my opinion after reading other more qualified reviewers in the sub-area of variance-reduction techniques. + +note: If EASGD was to be sound in the non-identical case as well, my decision would not change much.",6,,ICLR2020 +H1l4PVq4KH,1,SJxjPxSYDH,SJxjPxSYDH,Official Blind Review #2,"The paper devises a pipeline that aims to address catastrophic forgetting in continual learning (CL) by the well-known generative replay (GR) technique. The key ingredient of the pipeline is a modern variational auto-encoder (VAE) that is trained with class labels with respect to a mutual information maximization criterion. + +The paper does not follow a smooth story line, where an open research question is presented and a solution to this problem is developed in steps. The flowchart in Fig 1 is rather a system design consisting of many components, the functionality of which is not clearly described and existence of which is not justified. This complex flowchart does not even describe the complete task. It is in the end plugged into a continual learning algorithm which also performs domain transformation. All of these pieces are very well-known methods (e.g. VAEs, conditional VAEs, CL, catastrophic forgetting, domain transformation) in the literature and this paper puts them together in a straightforward way. Hence, I kindly do not think the outcome is truly a research result. It is more system engineering than science. + +The next submission of the paper could choose one or few of these pieces as target research problems and develop a thoroughly analyzed novel technical solution for them. If this solution can be proven to improve a valuable metric (e.g. accuracy, interpretability, theoretical understanding, or computational efficiency) of a setup, it is then worthwhile being published. + +Minor: The abstract could be improved by providing more clear pointers to the presented novelty.",1,,ICLR2020 +9a3sNQhJchC,1,2NU7a9AHo-6,2NU7a9AHo-6,An interesting proposition but work is too premature,"This paper examines the positive-unlabeled (PU) learning setting, and recommends the usage of the area under the lift curve, or AUL, as an unbiased estimate of the AUL under the fully labeled setting. This justifies proposing to use an AUL-optimization algorithm to train binary classifiers. + +Although I am not entirely sure this is fitting for a conference focused on representation learning, this is a question of interest in machine learning at large. + +I think this is an interesting proposition, supported by the experimental results. However, I do have a number of concerns that make me think this work is too premature and I lean towards rejection. 
+ +1) I do not understand where the bias of $AUC^{PU}$ comes from in the equation below equation (3). Indeed, it is not obvious how it is derived from (3), and Jain et al. (2017) obtain +$$ + (\beta-\alpha) (AUC - \frac12) + \frac12. +$$ + +2) I disagree with the bound on $P(|AUL - AUL^{PU}| \geq \epsilon)$. Indeed, applying Chebychev's inequality, I think this should be upper bounded by $\frac{\text{Var}}{\epsilon^2}$ and not $\frac{\text{Var}}{\epsilon}$. Hence the final bound should be +$$ + P(|AUL - AUL^{PU}| \geq \epsilon) \leq \frac{1 - \beta}{4 n^L \epsilon^2}. +$$ +This is unfortunate, because taking $\epsilon = 1\%$, and for example $\beta = 0.5$, +$$ + \frac{1-\beta}{4 \epsilon^2} = 1250, +$$ +so the sentence ""Hundreds of labeled samples can reduce the error to an acceptable level"" does not hold any longer -- it's now ""tens of thousands of samples"". +On the numerical experiments of Table 1, you can indeed check that for the Airfol data set, with $\beta = 0.1$, +$$ + \frac{1-\beta}{4 n^L \epsilon^2} = \frac{1-\beta}{4 \alpha \beta n \epsilon^2} \approx \frac{1.8 \times 10^{-3}}{\epsilon^2}, +$$ +so $P(|AUL - AUL^{PU}| \geq 10^{-2})$ can only be bounded by 18, and indeed $MAE_{AUL}=0.019$. + +3) Although some of the papers that are cited also use UCI data sets, I am wondering whether the values of $\alpha$ that are being tested are realistic. Indeed, it is my impression that PU settings occur mostly when the relative proportion of positive examples in the data is really small (fraud or outlier detection, malicious URL detection, drug discovery), which is why it makes sense to treat the unlabeled as negative, but here $\alpha$ is between 0.3 and 0.9. + +5) The AUL-optimization algorithm is an interesting proposal, and I think the paper would greatly benefit from having more details about this algorithm, even if it is ""only"" an adaptation of the algorithm of Sakai et al. (2018). + +6) On Figure 3, while for some of the data sets (Pageblock, Concrete) it is obvious that PU_AUL is performing much better than PU_AUC, it is not clear for all of them (in particular Landset, Anuran, Abalone, Airfoil) whether the difference in performance is significative. I would recommand plotting error bars on multiple cross-validation runs and/or computing statistical tests. + +In addition, I have several minor comments. + +1) The work is conducted under the assumption that labeled examples are selected completely at random among the positives and I think this should be mentioned explicitly in the abstract and Introduction. Indeed, this assumption may not hold in practice, for example in biological applications (the molecules or interactions labeled as positives are those that have been biologically investigated, and these investigations did not occur uniformly at random). + +2) I don't think the proof that $\sigma^2$ is less than 1/4 is correct. The result is known as Popoviciu's inequality on variances, and if you want to prove it you should use +$$ + \mathbb{E}[(t - \bar{t})^2] \leq \mathbb{E}[(t - \bar{t})^2 - t(t-1)], +$$ +which is correct because $t(t-1) < 0$. The equation that is given in the paper, +$$ + \mathbb{E}[(t - \bar{t})^2] \leq \mathbb{E}[(t - \bar{t})^2 - (0-\bar{t})(\bar{t}-1)], +$$ +is not true because $(0-\bar{t})(\bar{t}-1) > 0$. + +3) There are a number of typos in the text. ""online advertise"", ""Select Completely at Random"", etc. +The definition of fpr in Section 2, paragraph ""ROC"", should be +$$ + \frac{n^{FP}}{n^N}. 
+$$ + +4) In Section 4.2, I find it odd to compare $|AUC^{est}-AUC|$ to $|AUL^{PU}-AUL|$, because $AUC$ and $AUL$ don't take the same values. I think these two quantities should be normalized by $AUC$ and $AUL$ respectively, reporting $|AUC^{est}-AUC|/AUC$ and $|AUL^{PU}-AUL|/AUL.$ The same holds, of course, for $AUC^{KM1}$ and $AUC^{KM2}$. +I note that it does not seem to change the conclusions that can be drawn from Table 1. + +5) In biological and drug discovery applications, it has become rather mainstream to evaluate the performance of PU-learning algorithms using the cumulative distribution function of the ranking of positive samples among all test samples in a leave-one-out setting as in Mordelet and Vert (2011). I think it would be warranted to discuss this method as well. +Another relevant reference seems to be Jiang et al. (2020). + +References +Mordelet and Vert, BMC Bioinformatics 2011, 12:389 http://www.biomedcentral.com/1471-2105/12/389 +Liwei Jiang, Dan Li, Qisheng Wang, Shuai Wang, Songtao Wang. Improving Positive Unlabeled Learning: Practical AUL Estimation and New Training Method for Extremely Imbalanced Data Sets. https://arxiv.org/abs/2004.09820 + + +",3,3.0,ICLR2021 +XCN2gwRGFVt,2,H6ZWlQrPGS2,H6ZWlQrPGS2,Providing a heuristic method to speed up low-precision training ,"This paper proposed one simple method called partial pre-training to speed up the training of binary neural networks (BNN). The pros and cons are as follows: + +Pros: +1. The partial pre-training method is simple and easy to implement; +2. For standard binary optimizer like the straight-through-estimator (STE), the method improves the training speed to some extent; + +Cons: +1. The main concern of the proposed method is kind of heuristic and there is a lack of theoretical explanation, whether rigorous or not, why this method works. + +2. As described in Section 7 by the author themselves, there are several apparent limitations of current evaluation, e.g., several dimensions of the hyper-parameters are not explored, other Binary optimizers are not considered, different learning rate schedules, etc. As a result, it is unconvinced that the partial pre-training could universally improve the speed as a general method. In addition, the improvement of speed-up are not very apparent especially for ResNet-20 and ResNet-34 as shown in Fig. 1 and Fig. 2, i.e., it took approximately the same time to reach the final precision even though the proposed method achieves higher accuracy before saturation. Given the inadequate evaluations and lack of theoretical explanations, this might be due to unfair comparison. + +3. There is a lack of explanation of the contradiction result with previous result proposed in Alizadeh et al (2019), which dismissed the approach of pre-training. Is there good explanation of such an opposite result? It would be better to show results of different split between full-precision training and low-precision training. + +4. Regarding pre-training for BNN, there are some related works from the Bayesian perspective. In Shayer et al. (2018), they used the result of full-precision training as the prior for the binary training, which improves the final result, as opposed to Alizadeh et al (2019). The Bayesian perspective provides an explanation of the effectiveness of a good prior. In Meng et al. (2020), they showed that STE could be viewed as Bayesian and obtained good result even with a uniform prior. 
Also, the posterior obtained after full-training (binary) could be used as prior to enable continual learning, which shows effectiveness of the prior. Given the above results, since the authors demonstrate that partial pre-training can increase the speed for STE, does this imply that partial pre-training provides a better prior than full pre-training? If so, why? + +Shayer, O., Levi, D., and Fetaya, E. Learning discrete weights using the local reparameterization trick. ICLR, 2018. +Meng, X, Bachmann. R., Khan. E. Training Binary Neural Networks using the Bayesian Learning Rule. ICML, 2020. + + + + + +",4,5.0,ICLR2021 +r1ggr0FY3Q,1,BkMXkhA5Fm,BkMXkhA5Fm,"Potentially impactful work, but lack clarity","The work releases a large-scale multimodal dataset recorded from the X-Plane simulation, as a benchmark dataset to compare various representation learning algorithms for reinforcement learning. The authors also proposed an evaluation framework based on some simple supervised learning tasks and disentanglement scores. The authors then implemented and compared several representation learning algorithms using this dataset and evaluation framework. + +pros: +1. Releasing this dataset as a benchmark for comparing representation learning algorithms can potentially impact the community greatly; +2. The authors combined several existing work on measuring representation learning algorithms and proposed an evaluation framework to evaluate the quality of learned representation using supervised learning tasks and disentanglement scores; +3. The authors implemented an extended list of representation learning algorithms and compared them on the dataset; + +cons: +1. the paper lacks clarification and guideline to convince the readers of the usefulness of the dataset and the evaluation framework. The authors spent almost half of the space explaining different existing representation learning algorithms. A more convincing story would be to find a few well-established representation learning algorithms to corroborate on the reliability of the dataset and the evaluation metrics; +2. More details should be put into describing the dataset. It is not clear why this dataset is particularly suited for evaluating representation learning in the context of reinforcement learning. Do the authors have insight on the difficulty of the task? While having multi-modality is appreciated, it might worth thinking a separate dataset focusing on a single modality, e.g., image; +3. Given that the authors designed the dataset for evaluating representation learning for reinforcement learning, it is worth evaluating these algorithms on solving the main task using some standard RL techniques on top of the learned representations. +4. Table 4 is difficult to parse. ",5,4.0,ICLR2019 +ry2CUfcxz,1,Bym0cU1CZ,Bym0cU1CZ,Strong results from a simple idea,"The authors use a distant supervision technique to add dialogue act tags as a conditioning factor for generating responses in open-domain dialogues. In their evaluations, this approach, and one that additionally uses policy gradient RL with discourse-level objectives to fine-tune the dialogue act predictions, outperform past models for human-scored response quality and conversation engagement. +While this is a fairly straightforward idea with a long history, the authors claim to be the first to use dialogue act prediction for open-domain (rather than task-driven) dialogue. 
If that claim to originality is not contested, and the authors provide additional assurances to confirm the correctness of the implementations used for baseline models, this article fills an important gap in open-domain dialogue research and suggests a fruitful future for structured prediction in deep learning-based dialogue systems. + +Some points: +1. The introduction uses ""scalability"" throughout to mean something closer to ""ability to generalize."" Consider revising the wording here. +2. The dialogue act tag set used in the paper is not original to Ivanovic (2005) but derives, with modifications, from the tag set constructed for the DAMSL project (Jurafsky et al., 1997; Stolcke et al., 2000). It's probably worth citing some of this early work that pioneered the use of dialogue acts in NLP, since they discuss motivations for building DA corpora. +3. In Section 2.1, the authors don't explicitly mention existing DA-annotated corpora or discuss specifically why they are not sufficient (is there e.g. a dataset that would be ideal for the purposes of this paper except that it isn't large enough?) +3. The authors appear to consider only one option (selecting the top predicted dialogue act, then conditioning the response generator on this DA) among many for inference-time search over the joint DA-response space. A more comprehensive search strategy (e.g. selecting the top K dialogue acts, then evaluating several responses for each DA) might lead to higher response diversity. +4. The description of the RL approach in Section 3.2 was fairly terse and included a number of ad-hoc choices. If these choices (like the dialogue termination conditions) are motivated by previous work, they should be cited. Examples (perhaps in the appendix) might also be helpful for the reader to understand that the chosen termination conditions or relevance metrics are reasonable. +5. The comparison against previous work is missing some assurances I'd like to see. While directly citing the codebases you used or built off of is fantastic, it's also important to give the reader confidence that the implementations you're comparing to are the same as those used in the original papers, such as by mentioning that you can replicate or confirm quantitative results from the papers you're comparing to. Without that there could always be the chance that something is missing from the implementation of e.g. RL-S2S that you're using for comparison. +6. Table 5 is not described in the main text, so it isn't clear what the different potential outputs of e.g. the RL-DAGM system result from (my guess: conditioning the response generation on the top 3 predicted dialogue acts?) +7. A simple way to improve the paper's clarity for readers would be to break up some of the very long paragraphs, especially in later sections. It's fine if that pushes the paper somewhat over the 8th page. +8. A consistent focus on human evaluation, as found in this paper, is probably the right approach for contemporary dialogue research. +9. The examples provided in the appendix are great. It would be helpful to have confirmation that they were selected randomly (rather than cherry-picked).",7,3.0,ICLR2018 +s1YkYEeSYYG,5,#NAME?,#NAME?,Review 5,"Summary +---------- + +This paper presents an approach to meta-RL based on combining gradient-based updating with recurrence-based meta-learning. The approach is based on combining gradient-based policy updating with recurrence-based updating of the value function. 
The authors evaluate the method of simple standard meta-RL benchmarks. + + +Comments +---------- + +Overall, the direction of the paper is interesting but the paper has numerous shortcomings. + +First, the proposed method is a fairly straightforward combination of existing techniques. In particular, the approach consists of a fairly simple combination of gradient-based and recurrence-based methods. While this is not necessarily a limitation, it necessitates a thorough set of experiments to justify the combination of approaches. This is the second limitation of the paper: the experimental evaluation is very limited. The authors test of very simple benchmark problems compared to recent work in meta-RL. Moreover, there are few comparisons to baseline methods (especially PEARL) and the authors should include ablation experiments in which they examine the performance of their method relative to strictly gradient-based and strictly recurrence-based methods, using the SAC algorithm as an underlying algorithm. + +Finally, the writing of the paper is extremely sloppy. Trials is misspelled as trails throughout the paper. There are also numerous typos, such as ""gardient"", ""meta-reinforce"", ""Substitue"", ""apdated"", etc. The presentation of the algorithm is quite unclear and the discussion of related work is quite limited. + +Overall, the paper should be tested on a wider range of environments and against more competitive baselines to warrant acceptance. The authors should also improve the presentation of the paper to improve clarity. +",3,3.0,ICLR2021 +r1xTId76KH,2,HJlxIJBFDr,HJlxIJBFDr,Official Blind Review #3,"This paper studies the theory of sample efficiency in reinforcement learning, which is of great importance and has a potentially large audience. + +The strong points of the paper: +1. This paper proposed a new algorithm stochastic variance reduced policy gradient algorithms. This paper establishes better sample complexity compared with existing work. The key part of the proposed algorithm for variance reduction is to have step-wise importance weights to deal with the inconsistency caused by varying trajectory distribution. +2. This paper provides experimental results verifies the efficiency and effectiveness of the proposed algorithm. +3. In addition, parameter-based exploration extension is discussed in the appendix, which enjoys the same order of sample complexity under mild assumptions and gives better empirical performance. +4. This paper is easy to follow. In particular, there are a lot of discussions comparing this work with existing work. + +The weak points of the paper: +1. In section 3, it is not quite clear how the reference policy is defined, and the \theta^s is not clearly defined when s >= 1. +2. In the main part of the paper, the discussion in Remark 4.6 and the following Corollary 4.7 is not quite clear. + + +Some minor comments of the paper: +1. In introduce page 3, We note that a recent work by .... by a fator of H. --> Here H should be defined as the Horizon. +2. There is one additional parenthesis in Theorem 4.5. +3. In Corollary 4.7, T is not defined. +",8,,ICLR2020 +BylNfGini7,1,B14ejsA5YQ,B14ejsA5YQ,Interesting approach,"In the manuscript entitled ""Neural Causal Discovery with Learnable Input Noise"" the authors describe a method for automated causal inference under the scenario of a stream of temporally structured random variables (with no missingness and a look-back window of given size). 
The proposed approach combines a novel measure of the importance of fidelty in each variable to predictive accuracy of the future system state (""learnable noise risk"") with a flexible functional approximation (neural network). Although the setting (informative temporal data) is relatively restricted with respect to the general problem of causal inference, this is not unreasonable given the proposed direction of application to automated reasoning in machine learning. The simulation and real data experiments are interesting and seem well applied. + +A concern I have is that the manuscript as it stands is positioned somewhere between two distinct fields (sparse learning/feature selection, and causal inference for counterfactual estimation/decision making), but doesn't entirely illustrate its relationship to either. In particular, the derived criterion is comparable to other sparsity-inducing penalities on variable inclusion in machine learning models; although it has motivation in causality it is not exclusively derived from this position, so one might wonder how alternative sparsity penalities might perform on the same challenge. Likewise, it is not well explained what is the value of the learnt relationships, and how uncertainty and errors in the causal learning are relevant to the downstream use of the learnt model. In the ordinary feature selection regime one is concerned simply with improving the predictive capacity of models: e.g. a non-linear model might be fit using just the causal variables that might out-perform both a linear model and a non-linear model fit using all variables. Here the end goal is less clear; this is understandable in the sense that the work is positioned as a piece in a grand objective, but it would seem valuable to nevertheless describe some concrete example(s) to elucidate this aspect of the algorithm (use case / error effects downstream). ",8,4.0,ICLR2019 +rkxmuiLnYH,1,HJxTgeBtDr,HJxTgeBtDr,Official Blind Review #3," +The manuscript proposes an evaluation methodology to obtain deeper insights regarding the strength and weaknesses of different methods on different datasets. The method considers a set of methods addressing the task of Named Entity Recognition (NER) as case study. In addition, it proposes a set of attribute-based criteria, i.e. bucketization strategies, under which the dataset can be divided and analyzed in order to highlight different properties of the evaluated methods. + +As said earlier, the manuscript proposes an evaluation methodology to obtain deeper insights regarding the strength and weaknesses of different methods on different datasets. The characteristic of being able to provided deeper insights on strength/weaknesses and relevant factors on the inner-workings of a given method is +something very desirable for every evaluation. As such, in my opinion, the ""interpretable"" tag associate to the proposed method is somewhat out of place. Having said that, I would recommend removing the ""interpretable"" tag and stress the contribution of this manuscript as an evaluation protocol. + +In Section 4.2, for the R-Bucket strategy it is stated as having the requirement of discrete and finite attributes. Based on the equations of the other two strategies (R-bucket and F-bucket), it seems that they also have the requirement of having discrete attributes. Is this indeed the case? if so, it should be explicitly indicated. 
+Having said that, this raises another question: Is this protocol exclusive to tasks/problems with explicit discrete attributes? + +The goal of this manuscript is to propose a general evaluation protocol for NLP tasks. +However, it seems to be somewhat tailored to the NER task. My question is: How well the proposed method generalizes to other NLP tasks without attributes? Similarly, how well the proposed bucketization strategies generalize beyond the NER task? Perhaps the generalization characteristics and limitations of the proposed evaluation methodology should be explicitly discussed in the manuscript. + +Last paragraph of Section 4.2 summarizes ideas that were just presented. It feels somewhat redundant. I suggest removing in in favor of extending the existing discussions and analysis. + +I may consider upgrading my initial rating based on on the feedback given to my questions/doubts. +",3,,ICLR2020 +BygndE8qhX,1,BJgy-n0cK7,BJgy-n0cK7,Encouraging results but main idea is not novel and some baselines are missing,"# Paper summary +This paper advances a method for accelerating semantic segmentation on video content at higher resolutions. Semantic segmentation is typically performed over single images, while there is un-used redundancy between neighbouring frames. The authors propose exploiting this redundancy and leverage block motion vectors from MPEG H.264 video codec which encodes residual content between keyframes. The block motion vectors from H264 are here used to propagate feature maps from keyframes to neighbouring non-keyframe frames (in both temporal directions) avoiding thus an additional full forward pass through the network and integrate this in the training pipeline. Experimental results on CamVid and Cityscapes show that the proposed method gets competitive results while saving computational time. + + +# Paper strengths +- This paper addresses a problem of interest for both academic and industrial purposes. +- The paper is clearly written and the authors argument well their contributions, adding relevant plots and qualitative results where necessary. +- The two-way interpolation with block motion vectors and the fusion of interpolated features are novel and seem effective. +- The experimental results, in particular for the two-way BMV interpolation, are encouraging. + + +# Paper weaknesses + +- The idea of using Block Motion Vectors from compressed videos (x264, xvid) to capture motion with low-cost has been previously proposed and studied by Kantorov and Laptev [i] in the context of human action recognition. Flow vectors are obtained with bilinear interpolation from motion blocks between neighbouring frames. Vectors are then encoded in Fisher vectors and not used with CNNs as done in this paper. In both works, block motion vectors are used as low-cost alternatives to dense optical flow. I would suggest to cite this work and discuss similarities and differences. + + +- Regarding the evaluation of the method, some recent methods dealing with video semantic segmentation, also using ResNet101 as backbone, are missing, e.g. low latency video semantic segmentation[ii]. Pioneer Clockwork convnets are also a worthy baseline in particular in terms of computational time (results and running times on CityScapes are shown in [ii]). It would be useful to include and compare against them. + +- In Section 4.1.2 page 7 the authors mention a few recent single-frame models ((Yu et al. (2017); Chen et al. (2017); Lin et al. 
(2017); Bilinski & Prisacariu (2018)) as SOTA methods and the current method is competitive with them. However I do not see the results from the mentioned papers in the referenced Figures. Is this intended? + +- On a more general note related to this family of approaches, I feel that their evaluation is usually not fully eloquent. Authors compare against similar pipelines for static processing and show gains in terms of computation time. The backbone architecture, ResNet-101 is already costly for high-resolution inputs to begin with and avoiding a full-forward pass brings quite some gains (though a part of this gain is subsequently attenuated by the latency caused by the batch processing of the videos). There are recent works in semantic segmentation that focus on architectures with less FLOPs or memory requirements than ResNet101, e.g. Dilated ResNets [iii], LinkNet[iv]. So it could be expected that image-based pipelines to be getting similar or better performance in less time. I expect the computational gain on such architectures when using the proposed video processing method to be lower than for ResNet101, and it would make the decision of switching to video processing or staying with frame-based predictions more complex. +The advantage of static image processing is simpler processing pipelines at test time without extra parameters to tune. It would be interesting and useful to compare with such approaches on more even grounds. + + +# Conclusion +This paper takes on an interesting problem and achieves interesting results. The use of Block Motion Vectors has been proposed before in [i] and the main novelty of the paper remains only the interpolation of feature maps using BMVC. The experimental section is missing some recent related methods to benchmark against. +This work has several strong and weak points. I'm currently on the fence regarding my decision. For now I'm rating this work between Weak Reject and Borderline + +# References + +[i] V. Kantorov and I. Laptev, Efficient feature extraction, aggregation and classification for action recognition, CVPR 2014 +[ii] Y. Li et al., Low-Latency Video Semantic Segmentation, CVPR 2018 +[iii] F. Yu et al., Dilated Residual Networks, CVPR 2017 +[iv] A. Chaurasia and E. Culurciello, LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation, arXiv 2017 +",5,4.0,ICLR2019 +rJgdkwO6nQ,2,BJfvknCqFQ,BJfvknCqFQ,some interesting observations,"The paper states that basic transformation (translation and rotation) can easily fool a neural network in image classification tasks. Thus, image classification models are actually more vulnerable than people thought. The message conveyed by the paper is clear and easy to get. The experiments are natural and interesting. Some interesting points: + --The model trained with data augmentation that covers the attack space does not alleviate the problem sufficiently. + --Gradient descent does not provide strong attack, but grid search does. This may be due to the high non-concavity, compared to the small perturbation case. + +One possible question is the novelty, as this idea is so simple that probably many people have observed similar phenomenon--but have not experimented that extensively. +Also, there are some related works that also show the vulnerability under spatial transformations. But some are concurrent works to 1st version of the paper (though published), so I tend to not to judge it by those works. + +Other comments: +1. 
page 3 in the paragraph starting with ‘We implement …’, the author chooses a differentiable bilinear interpolation routine. However, the interpolation method is not shown or explained. +2. In term of transformation, scaling and reflecting are also transformations. It should be straightforward to check the robustness with respect to them. Comments? +3. Header in tables is vague. Like ‘Natural’ or ‘Original’, etc. More description of the Header under tables is helpful. +4. For CIFAR10 and especially for ImageNet dataset, Aug30 and Aug40 models showed lower accuracy than No Crop model on Nat test set. This is little strange because data augmentation (such as random rotation) is commonly used strategy to improve test accuracy. I think this might mean that the model is not trained enough and underfitted, maybe because excessive data augmentation lowered the training speed. +",6,2.0,ICLR2019 +H1Dw9y51z,1,SknC0bW0-,SknC0bW0-,Neat work of low novelty.,"This paper studies hyperparameter-optimization by Bayesian optimization, using the Knowledge Gradient framework and allowing the Bayesian optimizer to tune fideltiy against cost. + +There’s nothing majorly wrong with this paper, but there’s also not much that is exciting about it. As the authors point out very clearly in Table 1, this setting has been addressed by several previous groups of authors. This paper does tick a previously unoccupied box in the problem-type-vs-algorithm matrix, but all the necessary steps are relatively straightforward. + +The empirical results look good in comparison to the competing methods, but I suspsect an author of those competitors could find a way to make their own method look better in those plots, too. + +In short: This is a neat paper, but it’s novelty is low. I don't think it would be a problem if this paper were accepted, but there are probably other, more groundbreaking papers in the batch. + +Minor question: Why are there no results for 8-cfKG and Hyperband in Figure 2 for SVHN?",5,4.0,ICLR2018 +BkgacWb0cS,3,Syg-ET4FPS,Syg-ET4FPS,Official Blind Review #4,"Review for ""Posterior Sampling for Multi-Agent Reinforcement Learning"". + +The paper proposes a sample-efficient way to compute a Nash equilibrium of an extensive form game. The algorithm works by maintaining a probability distribution over the chance player / reward pair (i.e. an environment model). + +I give a weak recommendation to accept the paper. Although I haven't checked the proofs in detail, the premise seems to be sound - the authors extend model-based exploration results from MDPs to games. The essence of the argument seems to be that the model of the chance player becomes close to d^\star quickly enough to get a sub-linear bound. + +The main complaints I have about the paper concern clarity. + +1. The paper is very densely written. This isn't necessarily bad, but it makes the paper a bit hard to understand. It would benefit the manuscript greatly to provide a figure which shows how the algorithm works for a small toy game. There is space left in the paper, so even a one-page figure would fit in. The figure should show all the major quantities: d, \sigma, u. + +2. The meaning of the quantity \mathcal{G}_T^i should be more thoroughly described, given it is important in the proof. + +3. You define a game with N players, but the algorithm works with 2. + +4. Do you really need all the notations in section 2.1? Why not just define the ones used in the algorithm? + +5. Can you discuss how large the constants \xi can become in practice? 
The definition of \xi^i seems to be different on page 10 and in Theorem 1 - please disambiguate. + +I ask the authors to add a figure and address the issues above. + +I am not an expert in this sub-field so I may have missed aspects of the paper. + +Minor points: +- In Figure 1, please say that ""default"" is your algorithm. +- ""optimal in the face of uncertainty"" => ""optimism in the face of uncertainty"" ",6,,ICLR2020 +6syN9kB7Xyi,3,OLrVttqVt2,OLrVttqVt2,"An online algorithm for targeted poisoning, with theoretical guarantees","**Paper summary** +The paper proposes an algorithm, that works in an online fashion, for targeted poisoning attacks. If the loss function is convex, then the algorithm is guaranteed to converge to the target as the number of poisoned samples increases. The paper claims that this is the first model-targeted attack which has theoretical guarantees. The lower bound provided is interesting in the sense that it can give a lower bound on the number of samples needed to reach the target model from the current model. + +**Strengths** +1. The paper proposes a new algorithm that works by progressively adding poisoned points to the dataset. Theoretical guarantee (Theorem 4.1) is also provided for the algorithm. The proof idea is well explained and I liked the connection with online learning and how the algorithm is reduced to follow-the-leader algorithm. +2. The algorithm works by progressively adding more points and hence it can stop as soon as it is close to the target. This means that it does not need a predecided budget and if it is indeed optimal, then it would use the minimum number of poisoned points to reach the target. +3. On SVMs, the experiments show that the attack is almost optimal in the sense that it matches the theoretical lower bound. + +**Concerns** +1. Theorem 4.2 is a lower bound for the minimum number of samples needed to reach the target model exactly. Can we say something about reaching the target model approximately (say up to distance $\epsilon$)? I think this can be achieved by the following optimization in Theorem 4.2: $$\inf_{\theta':\|\theta'-\theta_p\|\leq\epsilon} \sup_\theta c(\theta,\theta')=\frac{L(\theta';\mathcal{D}_c)-L(\theta;\mathcal{D}_c)+NC_R(R(\theta_p)-R(\theta))}{\sup(l(\theta;x,y)-l(\theta';x,y))+C_R(R(\theta)-R(\theta')))}.$$ + +Further, how much would the lower bound decrease (as a function of $\epsilon$) if we are indeed interested in only reaching the target model approximately? + +2. In the experiments, the lower bound is computed for the number of samples needed to reach the model induced after poisoning. This can be different from the target model. Hence, shouldn't the lower bound be computed for the number of samples needed to reach the target model? I think that would be the true lower bound for poisoning. However, I believe that the two numbers should be close. + +3. The experiments are performed only for linear SVMs. It would be more interesting to see if the attack also works well on deep neural networks. This is also important because the paper claims that the existing algorithms get stuck in bad local minima. Thus, it should be checked if this algorithm also gets stuck in bad local minima. + +**Score justification** + +Model poisoning is a very real threat in modern machine learning. Strong attacks provide good insights for developing strong defenses. I believe the attack proposed in this paper is strong and theoretically backed. 
Further, the paper also provides lower bound on the minimum amount of poisoning needed to reach the target model. However, I have some concerns regarding the attack's success on deep neural networks. ",6,3.0,ICLR2021 +r1xMfg2LKS,2,rkedXgrKDH,rkedXgrKDH,Official Blind Review #1,"This submission proposes an alternative way to lower bound the trajectory growth through random networks. It generalizes to a variety of weights distributions. For example, the authors showcase their approach on sparse-Gaussian, sparse-uniform, and sparse-discrete-valued random nets and prove that trajectory growth can be exponential in depth with these distributions, with the sparsity appearing in the base of the exponential. + +I give an initial rating of weak accept because (1) the paper is well written and well-organized. (2) the numerical simulation results support the claims and proofs. (3) the investigation on sparsely connected networks seems timely. However, I'm not an expert in this area. It also seems that most derivation and insights are from previous literature Raghu 2017, which makes the contribution of this submission limited. + +I have a question which may be invalid. For Figure 3, the observed expectation matches perfectly with the the lower bound for all three distributions. This seems amazing, have the authors try with other dataset or other settings to do this experiment? Did it always match perfectly? ",6,,ICLR2020 +B1etdm5itr,1,HylhuTEtwr,HylhuTEtwr,Official Blind Review #1,"Summary: This paper studies alternative priors for VAEs, comparing Normal distributions (diagonal and full covariance) against Student-t’s (diagonal and full covariance). In particular, the paper is concerned with posterior collapse---i.e. posterior remains at the prior, limiting the model’s ability to reconstruct the data. Experiments are performed on a synthetic 2D dataset, ‘Gaussian ovals,’ and on OMNIGLOT. Results primarily take the form of visualizations of the reconstructed data and MSE / SSIM numbers. + +Pros: Systematic study and investigation of alternative priors for deep generative models is an under-studied area. Moreover, heavy-tailed priors such as the student-t---while widely successful for robust regression---have not been explored as extensively for latent variable models, to the best of my knowledge. This paper makes some steps towards solving these open problems. + +Cons: I have two primary critiques of the paper: (i) the experimental hypothesis is unclear, (ii) no engagement with the results of Mathieu et al. [ICML 2019], who also study Student-t priors for reconstruction (and disentanglement). + +Regarding (i), the paper seems to be testing two hypotheses simultaneously: the effect of diagonal vs full covariance matrices and exponential-tailed vs heavy-tailed priors. The latter seems more crucial for purposes of reconstruction according to Figure 3 (since the diagonal St-t has good reconstruction). Yet a proper study of the effect of tails vs reconstruction would report results as the degree of freedom parameter is gradually increased (as this controls the tails directly). No careful ablation study of this sort is performed. For comparison, see Figure 2 of Mathieu et al. [ICML 2019]. Moreover, the text simply calls the student-t a “weakly informative prior,” but it needs to be much more specific about what characteristics of the student-t are crucial. If all we require is something “weakly informative,” why aren’t alternatives like a diffuse Gaussian or uniform also considered? 
+ +Regarding (ii), Mathieu et al. [ICML 2019] also study priors formed by products of student-t marginals, but their work is not cited. Mathieu et al. [ICML 2019] also show that, due to student-t’s not being rotationally invariant (unlike the diagonal Gaussian), they improve disentanglement with only a minor degradation of reconstruction. As this work also studies reconstruction in student-t VAEs, it should include some discussion of Mathieu et al. [ICML 2019]’s results---if not direct engagement with their hypotheses. + +Final Evaluation: While I like the general motivation for this work, there are no clear experimental hypotheses being tested in the experiments. + +Mathieu, E., Rainforth, T., Narayanaswamy, S. and Teh, Y.W., 2018. Disentangling Disentanglement in Variational Autoencoders. ICML 2019.",1,,ICLR2020 +7RZ46SAoRDl,1,vnlqCDH1b6n,vnlqCDH1b6n,Official Blind Review ,"Summary: +The paper is motivated by the need for a better trade-off between the reconstruction and disentanglement performance of an autoencoder. The proposed solution is to use KL as a latent regularizer in the framework of Wassestain autoencoders, which allows for a natural interpretation of total correlation. + + +The paper reads well, all related work and relevant background concepts are nicely integrated throughout the text. The experiments are exhaustive and the results show competitive performance wrt disentanglement while improving reconstruction/modeling of the AEs. + +If a dataset is of dynamical nature, how difficult would it be to extend the current version of TCWAE to dynamical systems? Do the authors have any intuition/hint on what should change to make their method applicable to dynamical setups? Significantly changing the probabilistic model or modifying only the and encoder/decoder architecture could suffice? + +Minor: +- Consider changing the naming of the baselines either in tables or figures to make them consistent Chen et al (2018) -> TCVAE Kim & Mnih (2018) -> factorVAE.",8,4.0,ICLR2021 +Byg4aCzE5S,2,HJxV5yHYwB,HJxV5yHYwB,Official Blind Review #3,"Thank the authors for the response. I agree with R2 that the paper lacks comparisons with previous works. I will stick to my previous decision. +---------------------------------------- +Summary +This paper presents a new approach for single-objective reinforcement learning by preferencing multi-objective reinforcement learning. The general idea is to first figure out a few important objectives, add some helper-objectives to the original problem, and learn the weights for each individual objective by trying to keep the same order as Pareto dominance. This paper has potential, but I lean to vote for rejecting this paper now, since it is still not ready. I might change my score based on the reviews from other reviewers. +Strengths +- The idea is novel. Learning weights for each objective by keeping the order as Pareto dominance is an interesting idea to me. +Weaknesses +- The lack of experiments. The authors tested their method in only one scenario, which makes me feel unsafe. Only testing on one simple scenario does not demonstrate the effectiveness. The authors are supposed to test their method on more (complex) scenarios to show the effectiveness of their method. +Possible Improvements +As mentioned before, the proposed method can be tested on more scenarios (e.g., Deep Sea Treasure, SuperMario, etc.).",3,,ICLR2020 +BygYHZ3F3m,2,rke41hC5Km,rke41hC5Km,This is an interesting paper on the application of GAN in generating order data. 
The evaluation and assumptions used in the paper need further justifications.,"The objective of this paper is to use GAN for generating the order stream of stock market data. The novelty of the paper is the formulation of the order stream and the use of GAN for generating the stock data. This is a paper for the application of GAN and there are limited contribution to the technical aspect of machine learning. The paper is clearly written. There are two main assumptions used in the paper; one is the Markov chain and the second one is the stationary distribution. In real case, both assumptions are unlikely to be satisfied. The orders are mostly affected by many external factors and financial data are known to be non-stationary. The authors may have to justify these assumptions. + +Another issue is the evaluation of the results. The paper uses five statistics to evaluate the generated data. What we can conclude is that the generated data follow similar statistics with the real data. But we do not know if the generated data offer extra information about the market. The paper has used synthetic data in the experiments. So it means that we could have models that generate data that look like real data. If so, what are the benefits of using GAN to generate the data ? + +",5,4.0,ICLR2019 +y1onO5W1NGG,2,E3Ys6a1NTGT,E3Ys6a1NTGT,needs better organization and some missing related work,"Summary: + +The paper proposes a theoretical framework for analyzing the error of reinforcement learning algorithms in a fixed dataset policy optimization (FDPO) setting. In such settings, data has been collected by a single policy that may not be optimal and the learner puts together a model or value function that will have explicit or implicit uncertainty in areas where the data is not dense enough. The authors provide bounds connecting the uncertainty to the loss. They then show that explicitly pessimistic algorithms that fill in the uncertainty with the worst case can minimize the worst case error. Similarly, proximal algorithms that attempt to adhere to the collection policy (as often the case in model-free batch RL) have improved error compared to a naive approach but not as good as an explicitly pessimistic approach. + + +Review: + +The paper provides a general description of the pessimism performance bounds. The theorems appear to be correct and the reasoning sound. I also like the connection to the proximal approach, which is how most model-free batch RL algorithms approach the problem (by sampling close to the collection policy). + +However, the paper does need some improvement. Specifically, a connection should be made to more existing literature on pessimism in safe, batch, or apprenticeship RL. In addition, the paper spends a lot of time on definitions and notation that are not explicitly used while the most interesting empirical results are relegated to the appendix, which seems backwards. + +On the connections to the literature, the idea of using pessimism in situations where you are learning from a dataset collected by a non-optimal teacher has been investigated in previous works in apprenticeship RL: +http://proceedings.mlr.press/v125/cohen20a/cohen20a.pdf +or +https://papers.nips.cc/paper/4240-blending-autonomous-exploration-and-apprenticeship-learning.pdf + +Specifically, the first (Bayesian) paper explicitly reasons about the worst of all possible worlds mentioned in the current submission and seems to have a lot of overlap in the theory. Can the authors distinguish their results from Cohen et al.? 
The second paper is an example where model-learning agents keep track of the uncertainty in their learned transition and reward functions and use pessimism to fill in uncertainty. So the idea here is not quite new and better connections to this literature need to be made. + +The other issue with the paper is its organization and writing. The theoretical results, while general, are not particularly complicated and don’t seem to warrant the amount of notation and definitions on pages 1-3. Specifically, the bandit example isn’t really mentioned in the paper but the figure takes up a lot of valuable space. Over a full page is used to define basic MDP and dataset terms that are widely known and commonly used. The footnotes are whole paragraphs that seem to be just asides. Finally, the grid word results are presented in a figure without any real associated text except for some generalities about what algorithms worked well, Meanwhile, the most interesting and novel contributions of the paper, including the concrete algorithms for applying pessimistic learning, and the empirical analysis on Atari games, are stashed in the (very long) appendix. I strongly suggest the authors reorganize the paper to highlight these strengths instead of notation and footnotes that are tangential to the paper. +",6,4.0,ICLR2021 +ByeojHyTYB,1,rkl8dlHYvB,rkl8dlHYvB,Official Blind Review #3,"This paper describes a method for segmenting 3D point clouds of objects into component parts, with a focus on generalizing part groupings to novel object categories unseen during training. In order to improve generalization, the paper argues for limiting the influence of global context, and therefore seeks to build compact parts in a bottom-up fashion by iterative merging of superpixel-like point subsets. This is achieved by defining a RL merge policy, using merge and termination scores formed by a combination of explicitly trained part purity (each part should comprise one true part), and policy-trained pair comparison network. The system is evaluated using PartNet, using three categories for training and the rest for testing, showing strong performance relative to baselines. + +The system is described well, and shows good performance on a nicely motivated task. A few more ablations would have been nice to see (in questions below), as might more qualitative results. Overall, the method is presented and evaluated convincingly. + + +Questions: + +* What is the effect of the purity score regression? Since the policy network is trained using a pair-comparison module anyway, what happens if the explicit purity score supervision is removed? + +* What if the ""rectifier"" module is made larger (with or without purity module), e.g. the same size as the termination network? Does this improve or overfit to the training categories? + +* Sec 5.3 mentions ""segmentation levels for different categories may not share consistent part granularity .... Thus, ... we train three networks corresponding to three levels of segmentation for training categories"". While it makes sense to have three networks for the three levels (each have different termination points, and perhaps even merge paths), I don't see how this follows from the levels being inconsistent between categories. In fact, it seems just the opposite, that if the levels are inconsistent, this could pose a problem when a part at one level for one category is ""missing"" from the other category, due to level numbers not coinciding. 
Or, is this actually not a problem because on the three training categories selected, the levels are in fact consistent? + +* Can termination be integrated into the policy network or policy itself? + + +A couple typos I noticed: + +p.5 ""In consequences,"" --> ""As a consequence,"" +p.11 ""in-balanced"" --> ""unbalanced"" +",8,,ICLR2020 +H1xFSl8rFr,1,H1l_gA4KvH,H1l_gA4KvH,Official Blind Review #2,"This paper proposes a solution to overcome the challenges due to the black-box nature of physical constraints that are involved in the design of nano-porous templates with optimal thermoelectric efficiency. + +Unfortunately, I cannot comment on the overall scientific contribution of the paper, as I do not possess the expertise to judge it accurately. My expertise is so outside of this field that I will rely on the judgement of the other reviewers, whom I hope will have more experience and will better know the literature. + +I can only report that the proposed method does not seem to be a particularly good approximation to the zeroth-order method since the mean values for kappa and sigma in table 2 are quite a bit worse than those obtained with the baseline. Of course, the proposed approach is quite a bit faster. However, the paper does not provide a sense of whether these values are actually useful. In practice, would one want to wait longer to get a better quality result, or are the numbers obtained with the proposed approach usable? + +Also, could the proposed approach be applied to other problems? It would be great to see at least one or two other areas where this could be applied to, since I doubt the general ICLR audience is well-versed in nano-porous templates. ",3,,ICLR2020 +SJCjWdiJG,1,rJhR_pxCZ,rJhR_pxCZ,"Review: Interesting hybrid model, but weak experiments (MNIST only)"," +Summary + +This paper proposes a hybrid model (C+VAE)---a variational autoencoder (VAE) composed with a differentiable decision tree (DDT)---and an accompanying training scheme. Firstly, the prior is specified as a mixture distribution with one component per class (SVAE). During training, the ELBO’s KL term uses the component that corresponds to the known label. Secondly, the DDT’s leaves are parametrized with the encoder distribution q(z|x), and thus gradient information flows back through the DDT into the posterior approximations in order to make them more discriminative. Lastly, the VAE and DDT are trained together by alternating optimization of each component (plus a ridge penalty on the decoder means). Experiments are performed on MNIST, demonstrating tree classification performance, (supervised) neg. log likelihood performance, and latent space interpretability via the DDT. + + +Evaluation + +Pros: Giving the VAE discriminative capabilities is an interesting line of research, and this paper provides another take on tree-based VAEs, which are challenging to define given the discrete nature of the former and continuous nature of the latter. Thus, I applaud the authors for combining the two in a way that admits efficient training. Moreover, I like the qualitative experiment (Figure 2) in which the tree is used to vary a latent dimension to change the digit’s class. I can see this being used for dataset augmentation or adversarial example generation, for instance. + +Cons: An indefensible flaw in the work is that the model is evaluated on only MNIST. As there is no strong theory in the paper, this limited experimental evaluation is reason enough for rejection. 
Yet, moreover, the negative log likelihood comparison (Table 2) is not an informative comparison, as it speaks only to the power of adding supervision. Lastly, I do not think the interpretability provided by the decision tree is as great as the authors seem to claim. Decision trees provide rich and interpretable structure only when each input feature has clear semantics. However, in this case, the latent space is being used as input to the tree. As the decision tree, then, is merely learning hard, class-based partitioning rules for the latent space, I do not see how the tree is representing anything especially revealing. Taking Figure 2 as an example (which I do like the end result of), I could generate similar results with a black-box classifier by using gradients to perturb the latent ‘4’ mean into a latent ‘7’ mean (a la DeepDream). I could then identify the influential dimension(s) by taking the largest absolute values in the gradient vector. Maybe there is another use case in which a decision tree is superior; I’m just saying Section 4.3 doesn’t convince me to the extent that was promised earlier in the paper (and by the title). + +Comment: It's easier to make a latent variable model interpretable when the latent variables are given clear semantics in the model definition, in my opinion. Otherwise, the semantics of the latent space become too entangled. Could you, somehow, force the tree to encode an identifiable attribute at each node, which would then force that attribute to be encoded in a certain dimension of latent space? +",3,5.0,ICLR2018 +B1KZkIqxG,3,ryHM_fbA-,ryHM_fbA-,No comparison against recent SOTA in text representation,"This paper proposes using CNNs with a skip-gram like objective as a fast way to output document embeddings and much faster compared to skip-thought and RNN type models. + +While the problem is an important one, the paper only compares speed with the RNN-type model and doesn't make any inference speed comparison with paragraph vectors (the main competing baseline in the paper). Paragraph vectors are also parallelizable so it's not obvious that this method would be superior to it. The paper in the introduction also states that doc2vec is trained using localized contexts (5 to 10 words) and never sees the whole document. If this was the case then paragraph vectors wouldn't work when representing a whole document, which it already does as can be seen in table 2. + +The paper also fails to compare with the significant amount of existing literature on state of the art document embeddings. Many of these are likely to be faster than the method described in the paper. For example: + + +Arora, S., Liang, Y., & Ma, T. A simple but tough-to-beat baseline for sentence embeddings. ICLR 2017. +Chen, M. Efficient vector representation for documents through corruption. ICLR 2017. +",2,5.0,ICLR2018 +mPDGGP_pb8_,4,Zbc-ue9p_rE,Zbc-ue9p_rE,The work presents an elegant framework to refine samples generated from a generative model.,"Pros: +* The proposed framework seems principled and practical. Propagating generated samples to follow the data distribution by simulating the gradient flow of the f-divergence to the data distribution is reasonable, and the required quantity for simulation, i.e. the density ratio between the sample and data distributions, is readily given by the discriminator in GAN training. +* The presentation follows a clear logic flow, and related works are clearly connected. Experiment shows promising results. 
+ +Cons: +* Some statements can be made more precise and rigorous, to my knowledge. + - ""The discriminator is trained to maximize this distance"": although the discriminator is involved in a minimax optimization problem, it is the distribution-dependent optimal discriminator that defines a distance between two distributions. The minimax objective may not be a distance between two distributions given an arbitrary discriminator. + - Eq. (1). To my knowledge, on a metric space there is no formal definition of __gradient__. Even the concept of a tangent vector is not defined on a metric space. Formally, a tangent vector involves a differential structure, so the space is often required to be a manifold. To define a gradient, the space is further required to be a Riemannian manifold. What can be defined on a metric space is a __gradient flow__, i.e. __curves__ that capture the intuition of decreasing a given function as steeply as possible, and there are several formal descriptions of this intuition, e.g. the minimizing movement scheme. But the curves cannot be described using tangent vectors $x'(t)$ and gradients, as presented in Eq. (1). + - Eq. (3) is specific to the 2-Wasserstein space. +* On Lemma 3.2. +The result is based on the rule of change of variables, Eq. (25). But if $g$ is not required to be injective, the right hand side of Eq. (25) needs to be multiplied by the number of $z$'s that make $x = g(z)$ [Federer, 1969, ""Geometric Measure Theory"", Thm. 3.2.5]. So the lemma may need to be adjusted accordingly. +* It would be preferable to cite the specific theorem/statement from the works by Villani (2008) and Ambrosio et al. (2008), since they are huge books. +* On the method. +Although the method asymptotically guarantees that the sample distribution will converge to the data distribution, an individual sample may not converge and may keep traversing the support of the data distribution. In other words, there exist nontrivial (non-zero) dynamics that keep the data distribution stationary/invariant. Some examples in Fig. (2) already show this behavior to some extent, where all the samples along the evolution seem realistic and differ in e.g. color or orientation. So is there a method to determine when to stop the evolution? Also, is it a problem that the evolution may change some attributes of the original sample? + +=== EDIT: post rebuttal === + +Thanks for the response and for addressing the issues, which make this a more serious research paper.",7,4.0,ICLR2021 +o67CDKifq89,3,dgd4EJqsbW5,dgd4EJqsbW5,The algorithm is promising but the theoretical foundation is inaccurate,"This paper aims to address an important question in reinforcement learning: policy learning from high-dimensional sensory observations. The authors propose an algorithm for Learning Controllable Embedding (LCE) based on policy iteration in the latent space. The authors provide a theorem that shows how the policy performance in latent-space policy improvement depends on the learned representation, and they develop three algorithmic variations that attempt to maximize the theoretical lower bounds. In the experiments, the proposed algorithm CARL shows improved performance when compared with other LCE baseline algorithms. + +While I'm not particularly familiar with the field of LCE, I think learning a representation that is suitable for policy improvement is an interesting idea. The readability of this paper is also pretty good, which can be difficult to get right because of the correspondence between the original space and the latent space. 
Overall, the paper is easy to follow. + +While I do think Algorithm 1 is reasonable, I found that its theoretical foundation, namely Theorem 1, is incorrect. In the proof of Theorem 7 on p15 in the appendix, I do not think the implication T^2 VE(x) < T VE(x) + \gamma Delta(x) for all x would hold. Because the Bellman operator contracts in the L-inf norm, a basic inequality would rather take the form T^2 VE(x) < T VE(x) + \gamma sup_y Delta(y). In addition to this, another minor error appears in the first equation on pg 16, where I believe the correct right hand side would be 1/(1-gamma) sup_y Delta(y), without the gamma dependency. + +However, a bound that depends on the L-inf norm would be quite bad for Theorem 1, and the current data collection process in Alg 1 is not sufficient for minimizing it. I think it might be possible to avoid the L-inf bound by instead using an expected error based on the policy's rollout distribution. However, this would substantially change the theoretical results, and perhaps the motivation or details of the algorithm design. Therefore, I do not think the paper is ready for acceptance at the current stage without a large revision. If the authors can address this question properly, I would raise my score. + +Beyond the flaw in the theory, there are some parts that could benefit from clarification: +1. In the offline CARL, how does the algorithm address the issue of out-of-distribution error due to using a batch dataset? +2. The authors argue many times in the paper that the loss here is different from PCC, but they never explain whether the choice here is better (or in which way). +3. In line 4 of Alg 1, how do we ensure that such a pi exists? +4. What is the definition of ""compatible reward function"" in the last paragraph on p4? +5. For completeness of presentation, please include the definition of the curvature loss. + + + + + +",6,4.0,ICLR2021 +rkgNCYHK2Q,1,B1fysiAqK7,B1fysiAqK7,A vaguely described idea without enough improvements,"## Summary + +This work presents a probabilistic training method for binary neural networks with stochastic versions of batch normalization and max pooling. By sampling from the weight distribution, an ensemble of binary neural networks can further improve the performance. In the experimental section, the authors compare the proposed PBNet with the Binarized NN (Hubara et al., 2016) on two image datasets (MNIST and CIFAR-10). + +In general, the paper is poorly written and short on details. The idea behind the paper is not novel: stochastic binarization and the (local) reparametrization trick were used to train binary (quantized) neural networks in previous works. The empirical results are not significant. + +## Detail comments + +Issues with the training algorithm of the stochastic neural network +The authors did not give details of the training method and only vaguely mentioned the variational optimization framework (Staines & Barber, 2012). I do not understand equation 1. Since B is binary, the left part of equation 2 is a combinatorial optimization problem. If B is sampled during training, the gradient would suffer from high variance. + +Issues with propagating distributions throughout the network +Equation 3 is based on the assumption that the activations are random variables from a Bernoulli distribution. In equation 4, the activations of the current layer become random variables from a Gaussian distribution. How do the activations propagate further? 
+Issues with ternary Neural Networks in section 2.4 +For a ternary NN, the weights will come from a multinomial distribution, which I think will break the assumption used in equation 3. + +Issues with empirical evidence +Since the activations are sampled in PBNET-S, a more appropriate baseline would be the BNN with stochastic binarization (Hubara et al., 2016), which achieved 89.85% accuracy on CIFAR-10. This means that the proposed methods do not show any significant improvement. By the way, BNN with stochastic binarization (Hubara et al., 2016) can also allow for ensemble predictions to improve performance. +",3,3.0,ICLR2019 +SJCjWdiJG,1,rJhR_pxCZ,rJhR_pxCZ,Review,"The paper presents a multi-task, multi-domain model based on deep neural networks. The proposed model is able to take inputs from various domains (image, text, speech) and to solve multiple tasks, such as image captioning, machine translation or speech recognition. The proposed model is composed of several feature learning blocks (one for each input type) and of an encoder and an auto-regressive decoder, which are domain-agnostic. The model is evaluated on 8 different tasks and is compared with a model trained separately on each task, showing improvements on each task. + +The paper is well written and easy to follow. 
+ +The contributions of the paper are novel and significant. The approach of having one model able to perform well on completely different tasks and type of input is very interesting and inspiring. The experiments clearly show the viability of the approach and give interesting insights. This is surely an important step towards more general deep learning models. + +Comments: + +* In the introduction where the 8 databases are presented, the tasks should also be explained clearly, as several domains are involved and the reader might not be familiar with the task linked to each database. Moreover, some databases could be used for different tasks, such as WSJ or ImageNet. + +* The training procedure of the model is not explained in the paper. What is the cost function and what is the strategy to train on multiple tasks ? The paper should at least outline the strategy. + +* The experiments are sufficient to demonstrate the viability of the approach, but the experimental setup is not clear. Specifically, there is an issue about the speech recognition part of the experiment. It is not clear what the task exactly is: continuous speech recognition, isolated word recognition ? The metrics used in Table 1 are also not clear, they should be explained in the text. Also, if the task is continuous speech recognition, the WER (word error rate) metric should be used. Information about the detailed setup is also lacking, specifically which test and development sets are used (the WSJ corpus has several sets). + +* Using raw waveforms as audio modality is very interesting, but this approach is not standard for speech recognition, some references should be provided, such as: +P. Golik, Z. Tuske, R. Schluter, H. Ney, Convolutional Neural Networks for Acoustic Modeling of Raw Time Signal in LVCSR, in: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2015, pp. 26–30. +D. Palaz, M. Magimai Doss and R. Collobert, (2015, April). Convolutional neural networks-based continuous speech recognition using raw speech signal. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on (pp. 4295-4299). IEEE. +T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals. Learning the Speech Front-end With Raw Waveform CLDNNs. Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2015. + +Revised Review: +The main idea of the paper is very interesting and the work presented is impressive. However, I tend to agree with Reviewer2, as a more comprehensive analysis should be presented to show that the network is not simply multiplexing tasks. The experiments are interesting, except for the WSJ speech task, which is almost meaningless. Indeed, it is not clear what the network has learned given the metrics presented, as the WER on WSJ should be around 5% for speech recognition. +I thus suggest to either drop the speech experiment, or the modify the network to do continuous speech recognition. A simpler speech task such as Keyword Spotting could also be investigated. +",6,3.0,ICLR2018 +rkgznMOCnX,3,r1e13s05YX,r1e13s05YX,Interesting approach with good results on synthetic tasks,"This paper presents an approach, called EstiNet, to train a hybrid models which uses both neural networks and black-box functions. The key idea is that, during training, a neural network can be used to approximate the functionality of the black-box functions, which makes the whole system end-to-end differentiable. 
At test time, the true black-box functions are still used. The training objective composes two parts: L_bbf, the loss for approximating the black-box function and L_target, the loss for the end-to-end goal. They tried different variations of when to train the black-box function approximator or not. It is shown to outperform the baselines like end-to-end differentiable model or NALU over 4 synthetic tasks in sample efficiency. There are some analysis about how the entropy loss and label smoothing helps with the gradient flow. + +The proposed model is interesting, and is shown to be effective in the synthetic tasks. The paper is well-written and easy to follow. However, some of the experiment details are missing or scattered in the text, which might make it hard for the readers to reproduce the result. I think it helps to have the experimental details (number of examples, number of offline pretraining steps, size of the neural network, etc) organized in some tables (could be put in the appendix). + +Two main concerns about how generally applicable is the proposed approach: + +1. It helps to show how L_target depends on L_bbf, or how good the approximation of the black-box function has to be to make the approach applicable. For example, some functions, such as sorting, are hard to approximate by neural network in a generalizable way, so in those cases, is it still possible to apply the proposed approach? + +2.The proposed approach can be better justified by discussing some potential real world applications. Two closely related applications I can think of are visual question answering and semantic parsing. However, it is hard to find good black-box functions for VQA and people often learn them as neural networks, and the functions in semantic parsing often need to interact with a database or knowledge graph, which is hard to approximate with a neural network. + +Some minor issues: + +Table 3 isn’t very informative since k=2 and k=3 provides very similar results. It would help to show how large k needs to be for the performance to severely degrade. + +Missing references: + +The Related Works section only reviewed some reinforcement learning work on synthetic tasks. However, with some bootstrapping, RL is also shown to achieve state-of-the-art performance on visual question answering and semantic parsing tasks (Johnson et al, 2017; Liang et al, 2018), which might be good to include here. + +Johnson, J., Hariharan, B., van der Maaten, L., Hoffman, J., Fei-Fei, L., Zitnick, C. L., & Girshick, R. B. (2017, May). Inferring and Executing Programs for Visual Reasoning. In ICCV (pp. 3008-3017). +Liang, C., Norouzi, M., Berant, J., Le, Q., & Lao, N. (2018). Memory augmented policy optimization for program synthesis with generalization. arXiv preprint arXiv:1807.02322. +",7,3.0,ICLR2019 +S1eyoo0F9B,2,ryl3ygHYDB,ryl3ygHYDB,Official Blind Review #4,"This paper proposes a new magnitude-based pruning method (and a few variants) by extending the single-layer distortion minimization problem to multi-layer cases so that the correlation between layers is taken into account. Particularly, the authors take into account the weight tensors of neighboring layers in addition to the original layer. The proposed algorithm looks promising and interesting. Empirically, the authors show that the proposed method consistently outperforms the standard magnitude-based pruning method. + +Overall, the paper is well-written and I think the algorithm is novel. Therefore, I've given the score of 6. 
+ +Comments: +(1) It seems obvious that the proposed method would increase the computation cost, but the authors didn't give any discussion or results on that. +(2) Although the main focus of the paper is magnitude-based pruning, I think the authors should include one baseline of Hessian-based pruning methods for comparison. As I know, the computation overhead of Hessian-based methods (e.g., OBD) is relatively small for the networks used in this paper. In particular, Hessian-based pruning methods can also be interpreted as a distortion minimization problem but in a different space/metric. So I wonder if the authors can extend LAP to Hessian-based pruning methods. +(3) The authors introduced LAP with deep linear networks. However, the details of LAP in non-linear network are missing. I encourage the authors to fill in the details in section 2.2 in the next revision. +(4) Currently, all experiments are done on CIFAR-10 dataset. I wonder if the author can include one more dataset. For example, comparison between MP and LAP on Tiny-ImageNet or even ImageNet. As I know, experiments of ImageNet can fit into a 4-GPU server for magnitude-based methods.",6,,ICLR2020 +B1_Afzy-f,3,SyF7Erp6W,SyF7Erp6W,"Unusual, perhaps creative paper, very cryptically presented-- hard to evaluate fairly for this reviewer","For me, this paper is such a combination of unusual (in the combinations of ideas that it presents) and cryptic (in its presentation) that I have found it exceedingly hard to evaluate fairly. For example, Section 4 is very unclear to me. Its relationship to Section 3.3 is also very unclear to me. + +Before that point in the paper, there are many concepts and ideas alluded to, some described with less clarity than others, but the overall focus is unclear to me and the relationship to the actual algorithms and implementation is also unclear to me. That relationship (conceptual motivation --> implementation) is exactly what would be needed in order to fully justify the inclusion (in an ICLR paper) of so much wide-ranging philosophical/conceptual discussion in the paper. + +My educated guess is that other reviewers at ICLR may have related concerns. In its current state, it therefore does not feel like an appropriate paper for this particular conference. If the authors do feel that the content itself is indeed a good fit, then my strong recommendation would be to begin by re-writing it so that it makes complete sense to someone with a ""standard"" machine learning background. The current presentation just makes it very hard to assess, at least for this reviewer. + +If other reviewers are more easily able to read and understand this paper, then I will be glad to defer to their assessments of it and will happily retract my own.",3,1.0,ICLR2018 +z9QrRdvNCvq,2,r7L91opmsr,r7L91opmsr,Official Review 2,"Summary: +The submission suggests a new variational bound for sequential latent variable models. Unlike previous work that optimize this bound using ‘standard’ particle filters with unbiased resampling, the new bound is constructed based on a partial rejection control step and uses a dice enterprise for sampling the ancestor variables. + +Positives: +The combination of partial rejection control and dice enterprise for variational inference is new and interesting. Particle filters with partial rejection control have been used before for constructing (biased) bounds based on the marginal likelihood. 
However, using a dice-enterprise step allows for a new unbiased bound which makes it possible to consider a lower bound on the log-likelihood via variational ideas that can be optimized with standard techniques. Empirical experiments suggest that the method outperforms previous work. + +Negatives: +Does the complexity of the new bound not scale linearly with K (while K=1 for FIVO)? This seems to be not accounted for in the experiments. Choosing larger N=16 also has a better performance in the FIVO paper. + +Recommendation: +I vote for acceptance of the paper. However, I think that the experimental section should be improved. + +Comments: +Variational bounds can also be constructed by targeting a smoothing distribution (Lawson et al, 2019) and particle filters with complexity N^2 based on a marginal Fisher identity have been suggested (POYIADJIS et al, 2011) for parameter estimation that avoid estimator variances scaling quadratically in time. I was wondering if there is a connection between such filters and the method suggested here, particularly for K=N? +Can you explain the connection between the variance of the estimator for the normalizing constant obtained from particle filters and the tightness of the variational bound in more details? +Are the signal-to-noise gradient issues for large N or K? +How do the methods in the experiments compare for a larger number of particles? +Is there some useful practical advice on choosing the ratio N/K and gamma? + +",7,4.0,ICLR2021 +CAtwMihcqEW,2,PXDdWQDBsCG,PXDdWQDBsCG,Official Blind Review #2 ,"Summary: +- In this paper, the authors try to learn robust models for visual recognition and propose two defense methods, Edge-guided Adversarial Training and GAN-based Shape Defense (GSD), to use shape bias and background subtraction to strengthen the model robustness. + +However, I have still some concerns below: +- In summary, the paper is hard to follow and the writting is not clear, such as the detailed motivation of the proposed methods and the structure of this paper. +- For the experiments, a big dataset is needed, such as CIFAR-100. In addition, the results are not convincing, i.e., the evaluation on FGSM and PGD attack is not enough, some gradient-free attacks are needed. ",4,4.0,ICLR2021 +oYzQEkzGzpb,1,a3wKPZpGtCF,a3wKPZpGtCF,Some neat tricks but very tough to follow,"Summary: +This paper studies Lyapunov chaos in learning algorithms for matrix games. It appears to extend earlier work by Cheung and Piliouras to more general-sum settings with the conclusion that in these more common settings the learning algorithms considered exhibit chaos. The paper also presents an interesting notion of matrix domination which is a necessary and sufficient condition for chaos, and also a linear programming approach for the purpose of identifying chaotic games. + +Strengths: +* very nice example before section 3.2 +* thm 7 is very interesting +* (8) is also very clever + +Weaknesses (in rough order of appearance): +* constantly referring to ""AI/ML"" - just pick one +* ""measurement errors in real economies"" - why ""economies"" all of a sudden? +* ""Nash equilibrium is not achievable in general."" - this is patently false. There is a wealth of recent literature on learning algorithms which stably approach Nash equilibria, see, e.g., https://arxiv.org/pdf/1901.00838 +* the whole motivation with roundoff errors reads very naive... these issues have been studied for decades and are taught in standard courses on numerical methods. 
A good reference which covers these issues with rigor would be the classic book by James Demmel (who is also, coincidentally, known for his work on LAPACK). It is also worth pointing out the extensive work behind (and Turing award for) the development of the IEEE floating point standard. My point being that problems of algorithmic stability arising from finite-precision computation have been (and continue to be) studied extensively, and much theory absolutely considers the presence of computation errors. +* I do not follow footnote 1 +* repeated references to ""dual space"" - what is this? It does not appear to be defined +* (1) is confusing. Some prose description would be good. +* The definitions following ""convex regularizer function"" are unclear. +* I found (4) onward (through the end of pg. 5) very confusing and didn't really follow +* Notation of ""u_i"" in Def. 3 is not introduced before +* In 3.2.2 I don't understand why the ""iff"" property above doesn't apply +* The end at pg. 8 is extremely abrupt. Please offer further interpretation of the results (e.g., what should the reader take away from Thm 11?) and a proper conclusion for the paper. +* [throughout] there are numerous syntax/semantic errors that should be corrected before publication. I suggest employing a copy editor. +* [high level comment] matrix games are a nice theoretical starting point for game theory, but I have yet to see them used in realistic settings. I'm sure that theoretical results may be very interesting, but practically speaking I have difficulty seeing the motivation. I would suggest providing stronger motivation for the reader early on. +* [high level comment] there are *no* experiments at all. Please show at least some numerical examples, if for no other reason than to convince skeptics of the theoretical results. +* [high level comment] numerous times, the paper referred to something like a definition that did not appear until much later in the paper. this is a little awkward for the reader + +Overall: +While this paper does offer some interesting ideas, I cannot recommend publication at this time. I have pointed out a number of directions in which the manuscript could be improved; I hope that the authors are able to clarify these points in later revision.",7,4.0,ICLR2021 +w79Veac2Xyw,1,65MxtdJwEnl,65MxtdJwEnl,"Seems novel and motivated, but should be more approachable.","### **Summary and Contributions of Paper** +This paper proposes a new method for computing Neural CDEs via the signature transform, which transforms a path integral into log signatures, i.e. a collection of iterated integrals. Then standard ODE tools are applied to each piecewise log signature. + +### **Strengths** +- The writing quality is rigorous. +- The approach seems motivated and based on a clever mathematical trick via the signature transform. +- Experiments are convincing and sound. +- Appendix provides proof of the approximation properties of the clipped-term signature transform (which originally requires infinite basis for exact approximation) + +### **Weaknesses** +- The signature transform seems somewhat esoteric and nonstandard for readers without specific knowledge in this field. It would be very good if the authors could give more intuitive/pictorial views of this transform (I needed to read online surveys multiple times to understand the intuition behind this). 
For instance, I read this in detail: https://arxiv.org/pdf/1905.08494.pdf (NeurIPS 2019), which provides a much cleaner explanation of the signature transform, but also demonstrates that an entire paper is needed to simply explain the method. +- While the authors claim that there is an ease of implementation via pre-existing tools, the larger bottleneck seems to be actually understanding the method itself (which seems to also be a function of how the paper treats this material). While I have no doubt that this work would be great for a very mathematically minded community, I am unsure of its merits for the ICLR conference community. I think the authors should provide more high level overview of the signature transform, and keep the strict math in the appendix. + +I am not an expert on these types of methods, so my confidence will not be as high, but I believe that this paper contributes via its insight with the signature transform, and thus my rating is marginally above the acceptance threshold. + +If the authors could perhaps make the work more approachable, I would be happy to raise my score. +",6,4.0,ICLR2021 +ByeQ23xaKB,1,H1gB4RVKvB,H1gB4RVKvB,Official Blind Review #1,"The paper introduces a complex hierarchical recurrent model for contour detection loosely inspired by the organization of cortical circuits. Their model performs state-of-the-art on sample-limited versions of popular contour detection (BSDS500) and cell segmentation (SNEMI3D) datasets, and it reproduces the well-known tilt illusion when transfer-learning orientation estimation. Interestingly, ""untraining"" the tilt illusion degrades performance on contour detection. + +Strengths: ++ State-of-the-art performance on data-limited contour detection tasks ++ Nice illustration of how the network refines its predictions over time ++ Demonstrates a contextual visual illusion in a task-trained neural network ++ Shows that tilt illusion is actually necessary for optimal performance in their model + +Weaknesses: +- Architecture seems very complicated (unnecessarily so?) +- No ablation studies showing the usefulness of various model components +- Title seems a bit overly general given the quite specific result +- Not clear whether their results support their interpretation of the function of illusions + +It's a relatively straightforward paper that is easy to follow and has a clear result that is both interesting and novel. Thus, I'm generally very supportive of the paper. + +There are a couple of weaknesses summarised above and detailed below that I would love to see addressed, but none of them is overly critical: + +1. Unfortunately the paper suffers from the same issue as the original work on fGRU, which it's based on: The fGRU architecture seems overly complicated and its numerous details and design choices not well motivated. Ablation studies showing which components are really necessary are missing. While this was understandable for the original paper, which introduced a novel approach, one would hope that follow-up work would subsequently get rid of some of the slack and simplify the architecture to the minimum that's really required. + +2. The title suggests that the paper explains the function of contextual illusions in general, but the paper actually ""just"" shows that one contextual illusion emerges when one trains a biologically inspired model on one particular task. I suggest aligning the title better with the actual contribution. + +3. 
(somewhat philosophical) The paper does not really answer the question posed in the abstract, does it? Do visual illusions reflect basic limitations of the visual system or do they correspond to corner cases of neural computations that are efficient in everyday settings? The authors seem argue for the second possibility. But if that was the case, wouldn't one expect other systems trained on the same tasks to also exhibit these illusions? It seems to me as if their results might suggest quite the opposite: Because only brain-like architectures exhibit this illusion, and because only they are hurt by ""unlearning"" the illusion, this visual illusion may reflect a basic limitation of how the visual system solves the task. I think it would be great if the author could comment on this point and clarify their reasoning in the paper.",8,,ICLR2020 +r1lI1xpkam,4,SJg6nj09F7,SJg6nj09F7,Possibly useful malware detector but unclear paper and uncharacterized black box labels in dataset,"This paper attempts to train a predictor of whether software is malware. Previous studies have emulated potential malware for a fixed number of executed instructions, which risks both false negatives (haven’t yet reached the dangerous payload) and false positives (malware signal may be lost amidst too many other operations). This paper proposes using deep reinforcement learning over a limited action space: continue executing a program or halt, combined with an “event classifier” which predicts whether individual parts of the program consist of malware. The inputs at each time step are one of 114 high level “events” which correspond to related API invocations (e.g. multiple functions for creating a file). One limitation seems to be that their dataset is limited only to events considered by a ""production malware engine"", so their evaluation is limited only to the benefit of early stopping (rather than continuing longer than the baseline malware engine). They evaluate a variety of recurrent neural networks for classifying malware and show that all significantly underperform the “production antimalware engine”. Integrating the event classifier within an adaptive execution control, trained by DQN, improves significantly over the RNN methods. + +It might be my lack of familiarity with the domain but I found this paper very confusing. The labeling procedure (the ""production malware engine”) was left entirely unspecified, making it hard to understand whether it’s an appropriate ground-truth and also whether the DRL model’s performance is usable for real-world malware detection. + +Also, the baseline models used an already fairly complicated architecture (Figure 3) and it would have been useful to see the performance of simple heuristics and simpler models. ",5,2.0,ICLR2019 +h-_YbptgpZK,2,okT7QRhSYBw,okT7QRhSYBw,The motivation of improving reproducibility is not convinced.,"This paper proposes the Anti-Distillation method to encourage prediction diversity in an ensemble model, in order to improve reproducibility. As I understand, the reproducibility defined in this paper refers to the prediction variance w.r.t. the random factors during training, e.g., SGD. + +However, a trivial but complete reproducibility can be achieved by simply fixing the random seeds during training (without affecting model performance). Then why we would prefer the (incomplete) reproducibility induced by Anti-Distillation? 
If the reproducibility is the metric, then a single model or an ensemble model with fixed random seeds would trivially be the best model. + +Besides, the experiments are done on MNIST and a private dataset. I suggest the authors evaluating their method on public datasets like CIFAR or ImageNet, where there are many existing baselines to compare the model performance.",3,3.0,ICLR2021 +Byg5OXlgcr,3,rkeeoeHYvr,rkeeoeHYvr,Official Blind Review #1,"Motivated by recent development of attack/defense methods addressing the vulnerability of deep CNN classifiers for images, this paper proposes an attack framework for adversarial text generation, in which an autoencoder is employed to map discrete text to a high-dimensional continuous latent space, standard iterative optimization based attack method is performed in the continuous latent space to generate adversarial latent embeddings, and a decoder generates adversarial text from the adversarial embeddings. Different generation strategies of perturbing latent embeddings at sentence level or masked word level are both explored. Adversarial text generation can take either a form of appending an adversarial sentence or a form of scattering adversarial words into different specified positions. Experiments on both sentiment classification and question answering show that the proposed attack framework outperforms some baselines. Human evaluations are also conducted. + +Pros: + +This paper is well-written overall. Extensive experiments are performed. + +Many human studies comparing different adversarial text generation strategies and evaluating adversarial text for sentiment classification/question answering are conducted. + +Cons: + +1) Although the studied problem in this paper is interesting, the technical innovation is very limited. All the techniques are standard or known. + +2) There are two major issues: lacking a rigorous metric of human unnoticeability and lacking justification of the advantage of the tree-based autoencoder. I think the first issue is a major problem that renders all the claims in this paper questionable. The metrics used to define adversarial images for deep CNN classifiers are indeed valid and produce unnoticeable images for human observers. But in this paper, the adversarial attack is performed in the latent embedding space, and there is no explicit constraint enforced on the output text. It’s unconvincing that this approach will generate adversarial text that seems negligible to humans. Therefore, the studied problem in this paper has a completely different nature from the one for CNN image classifiers and it is hard to convince readers that the proposed framework generates adversarial text legitimate to human readers. + +3) It is unclear why tree-structured LSTM instead of a standard LSTM/GRU should be chosen in this framework for adversarial text generation. If this architecture is preferred, sufficient ablation studies should be conducted. + +4) In section 3.3, the description about adversarial attacks at word level is unclear. More detailed loss function and algorithms along with equations should be provided. + +5) In section 5.2, it is unclear that the majority answers on the adversarial text will, respectively, match the majority answers on the original text. Moreover, it seems that there is a large performance drop from original text to adversarial text. Therefore, it is valid to argue that whether the proposed framework can generate legitimate adversarial text to human readers or not. 
+ +6) It’s better to include many examples of generated adversarial text in the appendix. + +7) Missing training details: It is unclear how the model architectures are chosen, and learning rate, optimizer, training epochs etc. are also missing. All these training details should be included in the appendix. + +8) Minor: Figure 1: ""Append an initial sentence..."", section 3: ""map discrete text into a high dimensional..."", section 3.2.2: ""Different from attacking sentiment analysis..."" .... + +In summary, the research direction of adversarial text generation studied in this paper is interesting and promising. However, some technical details are questionable, and the produced results without rigorous metrics seem to be unconvincing. +",3,,ICLR2020 +ByeWt-9S2Q,2,H1lo3sC9KX,H1lo3sC9KX,"missing references, theory is not novel, experiments are not sufficient","The paper proposes an algorithm to restrict the staleness in ASGD (asynchronous SGD), and also provides theoretical analysis. This is an interesting and important topic. However, I do not feel that this paper solves the fundamental issue - the staleness will be still very larger or some workers need to stay idle for a long time in the proposed algorithm if there exists some extremely slow worker. To me, the proposed algorithm is more or less just one implementation of ASGD, rather than a new algorithm. The key trick in the algorithm is collecting all workers' gradients in the master machine and update them at once, while hard limiting the number of updates in each worker. The theoretical analysis is not brand new. The +line 6 in Algorithm 1 makes the delay a random variable related to the speed of a worker. The faster a worker is, the larger the tau is, which invalidates the assumption implicitly used in the theoretical analysis. + +The experiment is done with up to 4 workers, which is not sufficient to validate the advantages of the proposed algorithm compared to state of the art ASGD algorithms. The comparison to other ASGD implementations is also missing, such as Hogwild! and Allreduce. + +In addition, I am so surprised that this paper only have 10 references (the last one is duplicated). The literature review is quite shallow and many important work about ASGD are missing, e.g., + +- Parallel and distributed computation: numerical methods, 1989. +- Distributed delayed stochastic optimization, NIPS 2011. +- Hogwild!, NIPS 2011 +- Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization, NIPS 2015 +- An asynchronous mini-batch algorithm for regularized stochastic optimization, 2016.",4,4.0,ICLR2019 +HyehHMVFnm,2,ByeMB3Act7,ByeMB3Act7,"Fast and accurate approximation to softmax, but more in-depth analysis results would be required","This paper presents an approximation to the softmax function to reduce the computational cost at inference time and the proposed approach is evaluated on language modeling and machine translation tasks. The main idea of the proposed approach is to pick a subset of the most probable outputs on which exact softmax is performed to sample top-k targets. The proposed method, namely Learning to Screen (L2S), learns jointly context vector clustering and candidate subsets in an end-to-end fashion, so that it enables to achieve competitive performance. + +The authors carried out NMT experiments over the vocabulary size of 25K. It would be interesting if the authors provide a result on speed-up of L2S over full softmax with respect to the vocabulary size. 
Also, the performance of L2S on larger vocabularies such as 80K or 100K needs to be discussed. + +Any quantitative examples regarding the clustering parameters and label sets would be helpful. +L2S is designed to learn to screen a few words, but no example of the screening part is provided in the paper.",6,3.0,ICLR2019 +BkxudvgP27,1,Byx83s09Km,Byx83s09Km,The main input of this paper is to combine Information Direct Sampling and Distributional Reinforcement Learning for handling heteroscedasticity of noise in Reinforcement Learning.,"This paper investigates sophistical exploration approaches for reinforcement learning. Motivated by the fact that most of bandit algorithms do not handle heteroscedasticity of noise, the authors built on Information Direct Sampling and on Distributional Reinforcement Learning to propose a new exploration algorithm family. Two versions of the exploration strategy are evaluated against the state-of-the-art on Atari games: DQN-IDS for homoscedatic noise and C51-IDS for heteroscedastic noise. + +The paper is well-written. The background section provides the clues to understand the approach. In IDS, the selected action is the one that minimizes the ratio between a squared conservative estimate of the regret and the information gain. Following (Ktischner and Krause 2018), the authors propose to use \log(1+\sigma^2_t(a)/\rho^2(a)) as the information gain function, which corresponds to a Gaussian prior, where \sigma^2_t is the variance of the parametric estimate of E[R(a)] and \rho^2(a) is the variance of R(a). \sigma^2_t is evaluated by bootstrap (Boostrapped DQN). Where the paper becomes very interesting is that recent works on distributional RL allow to evaluate \rho^2(a). This is the main input of this paper: combining two recent approaches for handling heteroscedasticity of noise in Reinforcement Learning. + +Major concern: +While the approach is appealing for handling heteroscedastic noise, the use of a normalized variance (eq 9) and a lower bound of variance (page 7) reveal that the approach needs some tuning which is not theoretically founded. +This is problematic since in reinforcement learning, the environment is usually assumed to be unknown. What are the results when the lower bound of the variance is not used? When the variance of Z(a) is low, the variance of the parametric estimate should be low also. It is not the case? + + +Minor concerns: + +The color codes of Figure 1 are unclear. The color of curves in subfigures (b) (c) (d) corresponds to the color code of IDS. + +The way in which \rho^2(s,a) is computed in algorithm 1 is not precisely described. In particular page 6, the equation \rho^2(s,a)=Var(Z_k(s,a)) raises some questions: Is \rho evaluated for a particular bootstrap k or is \rho is averaged over the K bootstraps ? +_____________________________________________________________________________________________________________________________________________ + +I read the answers of authors. I increased my rating. +",7,4.0,ICLR2019 +Hkw5qHb4x,1,HJStZKqel,HJStZKqel,"Final Review: Fine idea but very basic tasks, weak baselines, and misleading presentation","The authors explore the idea of life-long learning in the context of program generation. + +The main weakness of this paper is that it mixes a few issues without showing strong results on any of them. The test tasks are about program generation, but these are toy tasks even by the low standards of deep-learning for program generation (except for the MATH task, they are limited to 2x2 grid). 
Even on MATH, the authors train and discuss generalization from 2-digit expressions -- these are very short, so the conclusiveness of the experiment is unclear. The main point of the paper is supposed to be transfer learning though. Unluckily, the authors do not compare to other transfer learning models (e.g., ""Progressive Neural Networks"") nor do they test on tasks that were previously used by others. We find that only testing on a newly-created task with a weak baseline is not sufficient for ICLR acceptance. + +After clarifying comments from the authors and more experiments (see the discussion above), I'm now convinced that the authors mostly measure overfitting, which in their model is prevented because the model is hand-fitted to the task. While the idea might still be valid and interesting, many harder and much more diverse experiments are needed to verify it. I consider this paper a clear rejection at present.",2,5.0,ICLR2017 +5S6TPwJeDVd,1,vsU0efpivw,vsU0efpivw,"Nice idea, but limiting the cardinality of active sets of the Shapley modules seems to be restrictive ","########################################################################## + +Summary: + +The paper proposes to incorporate Shapley values as latent representations in deep models. Specifically, the paper constructs Shallow SHAPNETs that computes the exact Shapley values. The paper also constructs Deep SHAPNETs that maintain the missingness and accuracy properties of Shapley values. The effectiveness of the proposed SHAPNETs is demonstrated through experiments on synthetic and real-world data. + +Overall, it seems to be a good idea to incorporate Shapley values into deep models and the proposed method seems to be reasonable. The empirical results have demonstrated the usefulness of the proposed method. The paper is also well-written and technically sound. I have some comments as detailed below. + +########################################################################## + +Comments: + +- The main challenge for Shapley values is its computational complexity. The paper overcomes this challenge by forcing the active set of all Shapley modules to be size=2. While mitigating the computational challenge, this would limit the representation power of the model. The authors showed that this is not a big issue by providing a comparison in Table 1 on three datasets (Synthetic, Yeast, Breast Cancer). However, all three datasets are low dimensional and do not require a high representation power of the model. Therefore, I am not quite convinced that the proposed SHAPNETs have satisfactory representation power. A comparison of Deep SHAPNETs and the DNN models on the three high-dimensional image datasets (MNIST, FashionMNIST, Cifar-10) would better answer this question. + +- In principle, we can tradeoff between the representation power and the computational efficiency by varying the size of the active sets of the Shapley modules. Have the authors considered a comparison of SHAPNETs with different active set sizes? + +- In Table 1, Shallow SHAPNET has a better performance than the Deep SHAPNET. Why is that? ",6,3.0,ICLR2021 +31hZ1qoeM-H,1,vT0NSQlTA,vT0NSQlTA,"Review 1: Flawed evaluation, needs significant updates","---- Summary ---- + +The paper proposes LOVE, an adaptation of DOVE (Seyde’20) to latent variable predictive models (Seyde’20 only condsidered predictive models without latent variables). 
Seyde’20 proposes to use a generalization of Upper Confidence Bound to deep model-based RL, by training an ensemble of models and value functions, and training a policy to maximize mean + variance of this ensemble (similarly to Lowrey’19). The submission empirically demonstrates that tuning the learning rate and number of training steps per environment steps of Dreamer (Hafner’20) improves sample efficiency, using an ensemble of predictive models further improves data efficiency slightly (on cartpole and walker tasks), while on top of that the proposed exploration method slightly improves sample efficiency on the Hopper and sparse Cartpole tasks. + +---- Decision ---- + +The submission contains little technical novelty over prior work of Seyde (2020). The experimental results are weak, but somewhat justify the claims as there is a slight but consistent improvement on some tasks. However, the paper suffers from a major flaw in the empirical evaluation. Figure 3 and the relevant discussion describe LOVE as significantly outperforming the Dreamer baseline. This difference is largely due to the fact LOVE uses a different learning rate and number of epochs, which improves sample efficiency. The paper graciously provides the comparison to the fairly tuned baseline in the appendix as Figure 9, confirming this. The fairly tuned baseline needs to be moved from the appendix to the main paper, and the contribution section and the discussion of the experiments need to be rewritten accordingly. If this is provided, I will reevaluate the paper. In the current state of the paper, I am unable to consider its merits on the basis of this flaw. + +---- Strengths ---- + +The paper is technically correct (except for the flaw explained above), and proposes a promising approach to a relevant problem of exploration in RL from images. The experimental results indicate that the proposed method could be effective. + +---- Weaknesses ---- + +The major flaw of the paper is described earlier. In addition, there are two other major issues with experimental evaluation. + +The experimental evaluation of the paper is rather weak. The proposed exploration method only improves performance in 2 out of 8 environments. This might be because the other environments do not require sophisticated exploration, in which case the method needs to be tested on more than 2 relevant environments. Sparse-reward versions of the evaluated environments can be easily designed and would be suitable for evaluating the method. + +The second major issue is that the method is not evaluated against any competing exploration baselines, even though the paper cites multiple prior works on this. For instance, the paper claims that methods based on information gain or improving knowledge of the dynamics will not explore as efficiently as the proposed method. Both of these claims need to be empirically evaluated or toned down. + +---- Additional comments ---- + +The related work section is missing the following papers: +- Ball’20 is a model-based RL method that uses UCB for exploration +- Sekar’20 is a model-based RL method that uses latent ensemble variance for task-agnostic exploration + +Ball’20, Ready Policy One: World Building Through Active Learning + +Sekar’20, Planning to Explore via Self-Supervised World Models + +## Update + +The new sparse tasks and comparison to Dreamer + Curious improve the paper and address some of my concerns. 
Specifically, a sizable improvement due to exploration is now seen on 3 tasks, Hopper, Cartpole Sparse, and Cheetah Sparse. The new maze task is also more challenging than the bug trap task. + +--- Final Decision --- + +After the significant improvements in the experimental evaluation, I believe the paper provides a reasonable case for the proposed latent UCB method. It also provides an interesting discussion on the advantages of UCB-style methods, and an interesting observation that optimistic reward-based exploration can be effectively used even in absence of (positive) rewards. Even though the experimental evaluation of prior work on exploration is still rather lacking, I believe that these contributions are enough for the paper to be interesting to the ICLR audience. I raise my score to 6. + +--- Remaining weaknesses --- + +The experimental evaluation in the paper is still quite lacking in terms of baselines, making it impossible to judge whether the paper actually works better than prior work. + +First, the proposed method contains two improvements, model ensembling, and optimistic exploration, but doesn't go much in-depth analyzing either of these improvements, instead trying to focus on both at the same time. This makes comparison to prior exploration methods hard because the proposed method receives an additional boost due to an ensemble of dynamics (the paper conveniently quantifies this boost in the LVE method, and it is shown to be rather large). For a more fair comparison, the ensemble of dynamics might be ablated (leaving only the ensemble of value functions), or the competing baselines could also be built on top of LVE. + +Second, the paper only compares against one competing exploration method, Dreamer + Curious. There has been a large amount of proposed exploration methods, and it would be appropriate to evaluate the proposed UCB method against at least a few of them. For instance, the paper could compare against similar value function ensemble techniques (Osband'16, Lowrey'18, Seyde'20), or other cited work (Ostrovski'17, Pathak'17). Burda'18 is not cited, but perhaps should be compared against. All these methods can be relatively easily implemented on top of LVE for a fair comparison. + +Burda'18, Exploration by random network distillation. + +--- Additional comments --- + +Would be great to clarify what is the observation space for the bug trap and maze tasks. For instance, you could add observations and predictions for these tasks to the appendix.",6,5.0,ICLR2021 +Bkl5ifgIT7,2,Syx9rnRcYm,Syx9rnRcYm,"Interesting comparison between different SOTA CNN for UAV trail guidance, but seems weak in clarifying novelty.","The paper initiates a comparison between different SOTA convolutional neural networks for UAV trail guidance with the goal of finding a better motion control for drones. They use a simulator (but not a physical UAV) to perform their experiments, which consisted on evaluating tuned versions of Inception-Resnet and MobileNet models using the IDSIA dataset, achieving good results in the path generated. + +I think that the authors have perform an interesting evaluation framework, although not novel enough according to the literature. It is also great that the authors have included an explicit enumeration of all the dimensions relevant for their analysis (which are sometimes neglected), namely, computational cost, power consumption, inference time and robustness, apart from accuracy. 
+ +However, I think the paper is not very well polished: there are quite a lot of grammatical, typing and aesthetic errors. Furthermore, the analysis performed is an A+B approach from previous works (Giusti et al.2016, and Smolyanskiy et al, 2017) and, thus, it is hard to find the novelty here, since similar comparisons have been already performed. Therefore, the paper needs major improvements in terms of clarity regarding the motivations in the introduction. + +Also, one third of the paper is devoted to the software and hardware architecture used in the study, which I think it would be better fitted in an appendix section as it is of no added scientific value. Another weakpoint is that the authors were unable to run their DNN models on a physical drone in real time due to a hardware bug... I think the paper would benefit from a more robust (real) experimentation since, as they are, the presented results and experiments are far from conclusive.",3,2.0,ICLR2019 +B1zK_4_Ve,3,B16Jem9xe,B16Jem9xe,Review,"I just noticed I submitted my review as a pre-review question - sorry about this. Here it is again, with a few more thoughts added... + +The authors present a great and - as far as I can tell - accurate and honest overview of the emerging theory about GANs from a likelihood ratio estimation/divergence minimisation perspective. It is well written and a good read, and one I would recommend to people who would like to get involved in GANs. + +My main problem with this submission is that it is hard as a reviewer to pin down what precisely the novelty is - beyond perhaps articulating these views better than other papers have done in the past. A sentence from the paper ""But it has left us unsatisfied since we have not gained the insight needed to choose between them.” summarises my feeling about this paper: this is a nice 'unifying review’ type paper that - for me - lacks a novel insight. + +In summary, my assessment is mixed: I think this is a great paper, I enjoyed reading it. I was left a bit disappointed by the lack of novel insight, or a singular key new idea which you often expect in conference presentations, and this is why I’m not highly confident about this as a conference submission (and hence my low score) I am open to be convinced either way. + +Detailed comments: + +I think the authors should probably discuss the connection of Eq. (13) to KLIEP: Kullback-Leibler Importance Estimation by Shugiyama and colleagues. + +I don’t quite see how the part with equation (13) and (14) fit into the flow of the paper. By this point the authors have established the view that GANs are about estimating likelihood ratios - and then using these likelihood ratios to improve the generator. These paragraphs read like: we also tried to derive another particular formulation for doing this but we failed to do it in a practical way. + +There is a typo in spelling Csiszar divergence + +Equation (15) is known (to me) as Least Squares Importance Estimation by Kanamori et al (2009). A variant of least-squares likelihood estimation uses the kernel trick, and finds a function from an RKHS that best represents the likelihood ratio between the two distributions in a least squares sense. I think it would be interesting to think about how this function is related to the witness function commonly used in MMD and what the properties of this function are compared to the witness function - perhaps showing the two things for simple distributions. 
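To make the comparison concrete, here is a minimal numpy sketch of the least-squares density-ratio idea I am referring to (my own illustration of Kanamori et al.'s uLSIF estimator, not code from the paper; the Gaussian kernel, the choice of basis centers and the regularisation constant are arbitrary assumptions on my part):

    import numpy as np

    def fit_density_ratio(x_num, x_den, centers, sigma=1.0, lam=1e-3):
        # Gaussian kernel features evaluated at a fixed set of basis centers
        def phi(x):
            sq_dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            return np.exp(-sq_dists / (2.0 * sigma ** 2))
        Phi_den = phi(x_den)
        H = Phi_den.T @ Phi_den / len(x_den)   # second moment under the denominator density
        h = phi(x_num).mean(axis=0)            # first moment under the numerator density
        theta = np.linalg.solve(H + lam * np.eye(len(centers)), h)
        return lambda x: phi(x) @ theta        # pointwise estimate of p_num(x) / p_den(x)

Plotting this estimate next to the MMD witness function for a pair of simple 1-d densities would, I think, already make the comparison I have in mind.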
+ +I have stumbled upon the work of Sugiyama and collaborators on direct density ratio estimation before, and I found that work very insightful. Generally, while some of this work is cited in this paper, I felt that the authors could do more to highlight the great work of this group, who have made highly significant contributions to density ratio estimation, albeit with a different goal in mind. + +On likelihood ratio estimation: some methods approximate the likelihood ratio directly (such as least-squares importance estimation), some can be thought of more as approximating the log of this quantity (logistic regression, denoising autoencoders). An unbiased estimate of the ratio will provide a biased estimate of the logarithm and vice versa. To me it feels like estimating the log of the ratio directly is more useful, and in more generality estimating the convex function of the ratio which is used to define the f-divergence seems like a good approach. Could the authors comment on this? + +I think the hypothesis testing angle is oversold in the paper. I’m not sure what additional insight is gained by mixing in some hypothesis testing terminology. Other than using quantities that appear in hypothesis testing as tests statistics, his work does not really talk about hypothesis testing, nor does it use any tools from the hypothesis testing literature. In this sense, this paper is in contrast with Sutherland et al (in review for ICLR) who do borrow concepts from two-sample testing to optimise hyperparameters of the divergence used.",6,4.0,ICLR2017 +B1gLTsN2FB,1,H1gcw1HYPr,H1gcw1HYPr,Official Blind Review #2," +This paper proposes AlignNet, a bipartite graph network that learns to match to sets of objects. AlignNet has a slot-wise object-based memory that associates an index with each unique object and can discover new and re-appearing objects. Experiments are conducted on a symbolic dataset. + +I do not think the paper meets the acceptance threshold, and recommend for weak rejection. While the paper proposes an interesting architecture to address the alignment problem, it has noticeable flaws in its experimental designs. + +First, all the experiments are conducted on toy symbolic datasets, where the alignment problem is rather easy to solve. On the other hand, real-world scenarios can be far more complicated. For example, the appearance of the same object can change due to lighting and distance, and it is unreasonable to assume that their features would remain static (apart from simple uniform noises). In addition, the paper only compares against hand-crafted similarity measures (MSE and cosine). It is unfair to compare learned methods only to hand-crafted methods. As a reasonable and fair comparison, the paper should also compare AlignNet against learned similarity measures (such as a neural network supervised with ground-truth labels for alignments). + +The toy dataset and simple baselines in this paper raise doubts on whether the proposed method is applicable to more complex scenarios (such as aligning two sets of objects in natural images through their appearance features).",3,,ICLR2020 +rylIcVsaYH,1,SylUiREKvB,SylUiREKvB,Official Blind Review #3,"This paper proposes the variational hyper RNN (VHRNN), which extends the previous variational RNN (VRNN) by learning the parameters of RNN using a hyper RNN. VRHNN is tested and compared with VRNN on synthetic and real datasets. The authors report superior performance parameter efficiency over VRNN. 
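For readers less familiar with hyper-networks, the core construction (as I understand it, and ignoring the variational part) is that a small auxiliary RNN produces the weights of the main RNN at every step. A rough, purely illustrative Python sketch with made-up shapes, not the authors' code:

    import numpy as np

    def hyper_rnn_step(x_t, h_main, h_hyper, W_hyper, U_hyper, out_proj):
        # the small hyper-RNN has ordinary, fixed weights
        h_hyper = np.tanh(W_hyper @ h_hyper + U_hyper @ x_t)
        # its state is mapped to the parameters of the main cell for this step
        n, d = h_main.shape[0], x_t.shape[0]
        params = out_proj @ h_hyper
        W_main = params[:n * n].reshape(n, n)
        U_main = params[n * n:].reshape(n, d)
        h_main = np.tanh(W_main @ h_main + U_main @ x_t)
        return h_main, h_hyper

In practice hyper-networks usually emit low-rank or per-unit scaling corrections rather than full weight matrices, but the sketch conveys where the extra flexibility and the claimed parameter efficiency would have to come from.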
+ +The performance of VHRNN is promising and certainly better than the previous VRNN for some applications. However, the VHRNN is constructed by a straight-forward combination of existing techniques and hence the technical contribution of this paper is marginal. + +Although Section 4 is entitled as systematic generalization analysis of VHRNN, the reported results are only for the specific structures of VHRNN and VRNN. Isn’t it useless to present results for the VRNN with a latent dimension of 4, at least as a sanity check? + +Fig. 2 and the texts referring to it discuss the KL divergence between the prior and the variational posterior. While the FIBO is mainly used as the objective in this paper, is the ELBO enough if the authors care the simultaneous low reconstruction error and low KL divergence? + +It is unclear and explained little if the comparison using parameter count is fair for VHRNN and VRNN since they have different structures. + +It would be nicer to discuss for which kind of time-series VRNN is enough. + +Minor comments: +The caption of Figure 1 is too close to the main texts. +Eq. (4) is overlapping with texts. +Can the equations at the bottom of p.3 be explained with an illustration? +",6,,ICLR2020 +rkximJxaKH,1,HJepXaVYDr,HJepXaVYDr,Official Blind Review #1,"This paper proposes two algorithms for the non-convex concave AUC maximization problem, along with theoretical analysis. Experiments show the proposed methods are effective, especially in data imbalanced scenarios. + +Strengths: + +This paper might be useful and interesting to related research, which overcomes some limitations in previous works such as: 1. the convex assumptions; 2. only considering simple models like linear models; 3. the need of extra memory to store/maintain samples. The proposed method extends existing works to a non-convex setting, which can be applied to deep neural networks, and is applicable for batch-learning and online learning. + +The proposed methods achieve better experimental results, especially in the data imbalanced scenarios, which is a real problem that may arise in many scenarios. The paper provides theoretical analysis on the proposed methods, based on Assumption 1, and inspired by the PL condition. + +Weaknesses: + +I think some comparisons with AdaGrad and related methods should be performed in experiments. Since PPD-AdaGrad is “AdaGrad style”. + +The assumptions seem a bit unclear. What does the first assumption in Assumption 1 imply? + +Minor Comments: +1. Since the experiments label the first 5/50 classes as negative, and the last 5/50 classes as positive for CIFAR10/CIFAR100, is it possible to provide results on experiments that label in the opposite way (or randomly label 5/50 classes) and add these results in the paper/appendix? Just to make results more convincing and reduce some potential dataset influences. + +2. Is it possible to provide some results on more imbalanced positive-negative ratio like 20:1? + +3. Is it possible to provide some comparison in terms of actual time, like learning curves with time as x-axis? + +4. In the multi-class problems, why are the lower layers shared while last layer separated? + +5. Since the extension to multi-classes problems are mentioned in the paper. I like to see some experimental results on this setting. + +6. How do the proposed methods perform on models other than NN? + +7. I think there is a typo on Page 4, the definition of AUC definition: the latter y should be -1. 
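One last remark for context: the saddle-point formulation I assume the two algorithms build on is the squared-loss AUC surrogate of Ying et al. (2016), with the linear score replaced by a deep network h_w (apologies if the paper actually starts from a different reformulation):

\min_{w,a,b}\ \max_{\alpha}\ \mathbb{E}\big[(1-p)(h_w(x)-a)^2\,\mathbf{1}[y=1] + p\,(h_w(x)-b)^2\,\mathbf{1}[y=-1] + 2(1+\alpha)\big(p\,h_w(x)\,\mathbf{1}[y=-1] - (1-p)\,h_w(x)\,\mathbf{1}[y=1]\big) - p(1-p)\,\alpha^2\big]

with p = Pr(y=1). The objective is a simple quadratic (hence concave) in alpha but non-convex in w once h_w is a deep network, which I take to be the setting the PL-type condition is meant to handle; spelling this out early in the paper would make Assumption 1 easier to parse.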
+",6,,ICLR2020 +hcGNUdl-0Jj,1,ZzwDy_wiWv,ZzwDy_wiWv,Official Blind Review #2,"##################################################################################### +Summary: + +This paper proposes a new formulation of knowledge distillation (KD) for model compression. Different from the classic formulation that matches the logits between student and teacher models, this paper suggests to match the output features of the penultimate layers between student and teacher models, based on L2 distance. Two complementary variants are introduced, with one directly minimizing the distance of the original feature vectors and with the other minimizing the distance of the feature vectors projected on the teacher classifiers. The approach is evaluated in a variety of scenarios, such as different network architectures, teacher-student capacities, datasets, and domains, and compared with state-of-the-art results. + +#################################################################################### +Pros: + +The proposed knowledge distillation formulation is simple. The paper is well written. Experimental evaluations clearly demonstrate the effect by directly matching the output features of the penultimate layers. + +#################################################################################### +Cons: + +- While the proposed approach is interesting, its novelty seems limited, with formulations being special cases of existing methods. In particular, as the authors also mentioned, the proposed L_FM loss is a simplified FitNet loss in Romero et al., which focuses only on matching the final representation without considering the intermediate representations. The proposed L_SR loss is similar to the standard KD formulation in Hinton et al., with the only difference that the pre-trained teacher’s classifier is used for both teacher and student models. + +- From the result tables, the performance improvement of the proposed approach is marginal, which is on par with existing works. + +- As the authors mentioned, the L_SR loss is inspired by that only using the L_FM loss ignores the inter-channel dependencies of the feature representations h^S and h^T. So L_SR-CE is introduced. However, empirically, L_SR-CE is worse than L_SR-L2 which directly minimizes the distance between projected features. It seems that the empirical formulation is inconsistent with the motivation. + +- In Eq 8, I assume that a separate classifier is also trained for the student model using the L_CE loss. How is the performance if we completely remove the student classifier? That is, the student model only consists of a feature extractor, and during inference time, we directly insert the pre-trained teacher classifier on top of the student model. + +- It would be interesting to show the results when h^T and h^S have different dimensionality. + +- In the ablation studies, the authors investigated “where should be losses be applied”. How is the classifier generated for the intermediate layers? + +- How does the performance change with respect to different settings of the hyper-parameters alpha and beta? + +- The notation is inconsistent throughout the paper. For example, h^T and h^S are used in the first half of the method section, while h_T and h_S are used later. 
+ +#################################################################################### Updated: + +The authors' rebuttal addressed my concerns and I lean toward acceptance.",6,4.0,ICLR2021 +B1ePQyfk9S,3,BylWglrYPH,BylWglrYPH,Official Blind Review #1,"This paper focuses on modelling invariances or symmetry between various components for solving tasks via convolutions and weight sharing. +The proposed tasks are toyish in nature although they do give insights into importance of modeling symmetry for better generalization. The first task is a symbol substitution which considers a permutation in source symbols and maps them to either ""ABA"" or ""ABB"" categories i.e binary classification. While this task does require generalizability, it is surprising that the mlp and recurrent net baselines are so much inferior (basically random) to the convolution baseline. While this shows efficacy of modeling symmetry, I'd be curious about performance graphs as the training data increases in size. +The second task is an artificially created task inspired from the SCAN dataset. The task is to translate a verb-number pair into number repetitions of the verb. The encoder decoder network uses convolution in the recurrences to capture the notion of generalizability. The input and output space is very small (10 verbs and 10 numbers) but shows superiority of convolution and weight sharing over other baselines. Curiously, the recurrent baseline seems to perform better than 0% accuracy (if still poorly) on the original SCAN task which is much harder than the proposed task in this paper. Maybe, the number of examples (1000) is too small recurrent networks but this makes me a little surprised. More details about the architecture and training procedure for baselines would be helpful to ensure that the comparison is fair across baselines. +The final task is CFG modeling where convolutions are used to model the forget gate of an LSTM which seems to endow the network with PDA like properties and the convolutions are more effective than baselines at modeling this. + +Apart from the concerns related to the results mentioned above, my major concern is that the tasks considered are too simple and at least one complicated large-scale task would have strengthened the paper. +Also, for tasks 2 and 3, the motivation behind using convolutions is not as clean as in task 1. So more analysis and insights into model performance, the weights learned, ablation studies etc. would have helped in understanding how the convolutions are modeling the symmetry. This should be informative and tractable because of simplicity of the tasks involved. + +Finally, as mentioned above, I still cannot intuitively understand why convolutions in the forget architecture would learn about symmetry related to structured repetition produced by a CFG. Hence, more analysis or a better motivation would have helped. ",3,,ICLR2020 +Rb0isLk9cVD,4,gLWj29369lW,gLWj29369lW,Clear exposition and hypothesis but underwhelming empirical validation,"This paper aims to establish a theoretical basis for geometric properties of knowledge graph relations and embedded entities by comparing knowledge graph embeddings with word embeddings. Using the insight that the semantic properties of PMI-based word embeddings manifest as linear geometric relationships, they view and compare the relationship embeddings derived from different knowledge graph embedding schemes in this way. 
+The analysis claims to show that when the KG architecture conforms to the presented relation types and conditions (divided into similarity, relatedness, and context shift types), it has better performance of link prediction for that embedding scheme. + +The empirical evaluation focuses on comparison of 4 embedding schemes that have linear transformation score functions (additive, multiplicative, both) on WN18RR and NELL-995 relations for link prediction on several examples of the relation types. + +Overall, the paper is well-motivated, cites relevant literature in the theory behind word embeddings, and is generally clearly written. It has a useful proposal for the types and conditions of the three relation types and clear hypothesis for the performance of knowledge graph relation transformations that have certain properties. + +However, the empirical evaluation does not seem to completely support the claims. TransE is obviously lower performing across relations, but MureI seems quite close in most cases to models that involve multiplacative relatedness, so it’s not obvious to me that MureI performs worse. Further, the summery suggests that DistMult is preferable for type R, but MuRe appears to do equally well or better on most cases, thus it’s not clear under what circumstances (what dataset dependent factor) would point to not choosing MuRe. +I would expect to see a starker contrast between the performance of the different models per claim type to support the dataset dependent statement. Perhaps another experimental setting, like comparison on non-linear transformation, or other examples, would help support that claim. ",6,3.0,ICLR2021 +BkeTXrG6nm,3,HkzSQhCcK7,HkzSQhCcK7,Ok paper with a reasonable -- though somewhat obvious -- approach to generative modeling of sequence data,"This paper presents a generative sequence model based on the dilated CNN +popularized in models such as WaveNet. Inference is done via a hierarchical +variational approach based on the Variational Autoencoder (VAE). While VAE +approach has previously been applied to sequence modeling (I believe the +earliest being the VRNN of Chung et al (2015)), the innovation where is the +integration of a causal, dilated CNN in place of the more typical recurrent +neural network. + +The potential advantages of the use of the CNN in place of +RNN is (1) faster training (through exploitation of parallel computing across +time-steps), and (2) potentially (arguably) better model performance. This +second point is argued from the empirical results shown in the +literature. The disadvantage of the CNN approach presented here is that +these models still need to generate one sample at a time and since they are +typically much deeper than the RNNs, sample generation can be quite a bit +slower. + +Novelty / Impact: This paper takes an existing model architecture (the +causal, dilated CNN) and applies it in the context of a variational +approach to sequence modeling. It's not clear to me that there are any +significant challenges that the authors overcame in reaching the proposed +method. That said, it certainly useful for the community to know how the +model performs. + +Writing: Overall the writing is fairly good though I felt that the model +description could be made more clear by some streamlining -- with a single +pass through the generative model, inference model and learning. + +Experiments: The experiments demonstrate some evidence of the superiority +of this model structure over existing causal, RNN-based models. 
One point +that can be drawn from the results is that a dense architecture that uses multiple levels of the +latent variable hierarchy directly to compute the data likelihood is +quite effective. This observation doesn't really bear on the central message +of the paper regarding the use of causal, dilated CNNs. + +The evidence lower-bound of the STCN-dense model on MNIST is so good (low) +that it is rather suspicious. There are many ways to get a deceptively good +result in this task, and I wonder if all due care what taken. In +particular, was the binarization of the MNIST training samples fixed in +advance (as is standard) or were they re-binarized throughout training? + +Detailed comments: +- The authors state ""In contrast to related architectures (e.g. (Gulrajani et +al, 2016; Sonderby et al. 2016)), the latent variables at the upper layers +capture information at long-range time scales"" I believe that this is +incorrect in that the model proposed in at least Gulrajani et al also + +- It also seems that there is an error in Figure 1 (left). I don't think +there should be an arrow between z^{2}_{t,q} and z^{1}_{t,p}. The presence +of this link implies that the prior at time t would depend -- through +higher layers -- on the observation at t. This would no longer be a prior +at that point. By extension you would also have a chain of dependencies +from future observations to past observations. It seems like this issue is +isolated to this figure as the equations and the model descriptions are +consistent with an interpretation of the model without this arrow (and +including an arrow between z^{2}_{t,p} and z^{1}_{t,p}. + +- The term ""kla"" appears in table 1, but it seems that it is otherwise not +defined. I think this is the same term and meaning that appears in Goyal et +al. (2017), but it should obviously be defined here. +",6,5.0,ICLR2019 +SJgzxEO5hQ,3,rkxd2oR9Y7,rkxd2oR9Y7,How to make sgd with full matrix pre-conditioning scalable?,"adaptive versions of sgd are commonly used in machine learning. adagrad, adadelta are both popular adaptive variations of sgd. These algorithms can be seen as preconditioned versions of gradient descent where the preconditioner applied is a matrix of second-order moments of the gradients. However, because this matrix turns out to be a pxp matrix where p is the number of parameters in the model, maintaining and performing linear algebra with this pxp matrix is computationally intensive. In this paper, the authors show how to maintain and update this pxp matrix by storing only smaller matrices of size pxr and rxr, and performing 1. an SVD of a small matrix of size rxr 2. matrix-vector multiplication between a pxr matrix and rx1 vector. Given that rxr is a small constant sized matrix and that matrix-vector multiplication can be efficiently computed on GPUs, this matrix adapted SGD can be made scalable. The authors also discuss how to adapt the proposed algorithm with Adam style updates that incorporate momentum. Experiments are shown on various architectures (CNN, RNN) and comparisons are made against SGD, ADAM. + +General comments: THe appendix has some good discussion and it would be great if some of that discussion was moved to the main paper. + +Pros: Shows how to make full matrix preconditioning efficient, via the use of clever linear algebra, and GPU computations. +Shows improvements on LSTM tasks, and is comparable with SGD, matching accuracy with time. 
+ +Cons: While doing this leads to better convergence, each update is still very expensive compared to standard SGD, and for instance on vision tasks the algorithm needs to run for almost double the time to get similar accuracies as an SGD, adam solver. This means that it is not apriori clear if using this solver instead of standard SGD, ADAM is any good. It might be possible that if one performs few steps of GGT optimizer in the initial stages and then switches to SGD/ADAM in the later stages, then some of the computational concerns that arise are eliminated. Have the authors tried out such techniques?",6,3.0,ICLR2019 +H1GtVkvEg,2,SJk01vogl,SJk01vogl,Review,"This paper considers different methods of producing adversarial examples for generative models such as VAE and VAEGAN. Specifically, three methods are considered: classification-based adversaries which uses a classifier on top of the hidden code, VAE loss which directly uses the VAE loss and the ""latent attack"" which finds adversarial perturbation in the input so as to match the latent representation of a target input. + +I think the problem that this paper considers is potentially useful and interesting to the community. To the best of my knowledge this is the first paper that considers adversarial examples for generative models. As I pointed out in my pre-review comments, there is also a concurrent work of ""Adversarial Images for Variational Autoencoders"" that essentially proposes the same ""latent attack"" idea of this paper with both L2 distance and KL divergence. + +Novelty/originality: I didn't find the ideas of this paper very original. All the proposed three attacks are well-known and standard methods that here are applied to a new problem and this paper does not develop *novel* algorithms for attacking specifically *generative models*. However I still find it interesting to see how these standard methods compare in this new problem domain. + +The clarity and presentation of the paper is very unsatisfying. The first version of the paper proposes the ""classification-based adversaries"" and reports only negative results. In the second set of revisions, the core idea of the paper changes and almost an entirely new paper with a new co-author is submitted and the idea of ""latent attack"" is proposed which works much better than the ""classification-based adversaries"". However, the authors try to keep around the materials of the first version, which results in a 13 page long paper, with different claims and unrelated set of experiments. ""in our attempts to be thorough, we have had a hard time keeping the length down"" is not a valid excuse. + +In short, the paper is investigating an interesting problem and apply and compare standard adversarial methods to this domain, but the novelty and the presentation of the paper is limited.",5,4.0,ICLR2017 +rJxOvPC63m,2,H1e8wsCqYX,H1e8wsCqYX,interesting idea but the significance of the experimental results is unclear and the motivation should better match the evaluation ,"The paper proposes to use a regularization which preserves nearest-neighbor smoothness from layer to layer. The approach is based on controlling the extent to which examples from different classes are separated from one layer to the next, in deep neural networks. The criterion computes the smoothness of the label vectors (one-hot encodings of class labels) along the nearest-neighbor graph constructed from the euclidian distances on a given layer's activations. 
From an algorithmic perspective, the regularization is applied by considering distances graphs on minibatches. Experiments on CIFAR-10 show that the method improves the robustness of the neural networks to different types of perturbations (perturbations of the input, aka adversarial examples, and quantization of the network weights/dropout0. + +The main contribution of the article is to apply concepts of graph regularization to the robustness of neural networks. The experimental evaluation is solid but the significance is unclear (error bars have rather large intersections), and there is a single dataset. + +While the overall concept of graph regularization is appealing, the exact relationship between the proposed regularization and robustness to adversarial examples is unclear. There does not seem to be any proof that adersarial examples are supposed to be classified better by keeping the smoothness of class indicators similar from layer to layer. Section 3.4 seem to motivate the use of the smoothness from the perspective of preventing overfitting. However, I'm not sure how adversarial examples and the other forms of perturbations considered in the experiments (e.g., weight quantization) are related to overfitting. + +strengths: +- practical proposal to use graph regularization for neural network regularization +- the proposal to construct graphs based on the current batch makes sense from an algorithmic point of view + + +cons: experimental results are a bit weak -- the most significant results seem to be obtained for ""implementation robustness"", but it is unclear why the proposed approach should be particularly good for this setting since the theoretical motivation is to prevent overfitting. The results vs Parseval regularization and the indications that the metohd works well with Parseval regularization is a plus, but the differences on adversarial examples are tiny. + +other questions/comments: +- how much is lost by constructing subgraphs on minibatches only? +- are there experiments (e.g., on smaller datasets) that would show that the proposed method indeed regularizes and prevents overfitting as motivated in Section 3.4? + + +",5,3.0,ICLR2019 +sfdV4h2xpzl,4,#NAME?,#NAME?,"The paper proposes a new problem setting for learning object detectors using weak supervision as well as a deep learning solution. However, the presentation and its relation to previous works should be improved."," +This paper defines cross-supervised object detection which learns a detector from both image-level and instance-level annotations. It proposes a unified framework along with a spatial correlation module for the task. The spatial correlation module is used for transfer mapping information from base categories to novel categories. It conducts experiments on the PASCAL VOC dataset and COCO dataset, demonstrating the effectiveness. + +Pros: +(1) The proposed spatial correlation module is a novel and effective transfer module. +(2) The ablation studies are relatively complete. + +Cons: +(1) The structure of the proposed spatial correlation module should be described in more detail. What is the meaning of “replacing the backbone and feature pyramid network with five max-pooling layers.” in the heatmap detection part? +(2) In Table 1, are the experimental settings of those competitors such as MSD-VGG16, MSD-Ens, and Weight Transfer et al. exactly the same as those used in this paper? +(3) In Table 2, it seems like the method taking the non-VOC as the base classes while Hu et al. 
(2018) use the non-VOC as those classes without mask annotation. Can you tell me the reasons for this choice? +(4) Some similar problem settings are defined in [a] and [b]. The paper fails to compare the problem settings and justify the usefulness of the proposed problem setting in real application scenarios. + +Overall evaluation: The major contribution of this paper comes from the spatial correlation module. However, I still have some doubts about the structure of this module. Since this task is first presented, I want to make sure that the comparison is as fair as possible. + +[a] Weakly- and Semi-Supervised Fast Region-Based CNN for Object Detection. Journal of Computer Science and Technology (JCST) 34(6): 1269–1278 Nov. 2019. +[b] LSTD: A Low-Shot Transfer Detector for Object Detection, AAAI 2018",6,5.0,ICLR2021 +wDxSVmaHS_d,4,RwQZd8znR10,RwQZd8znR10,Promising direction but unpolished work,"*SUMMARY* + +The paper presents a method for efficient task identification to improve adaptation in a meta RL setting. The approach is based on learning an exploration policy to quickly discriminate the task at hand, so that to leverage a task-specific policy for exploitation. To do so, it employs an intrinsic reward proportional to the information gain (or prediction error) over both the transition and reward models. Finally, the algorithm is evaluated over a set of continuous control domains with sparse rewards. + +*EVALUATION* + +The idea of focusing the exploration on identifying the current task, instead of collecting good samples for policy improvements, is quite interesting, albeit not completely novel. Unfortunately, the lacking presentation does not help assessing the real quality of the contributions and the robustness of the experimental analysis, and it make me leaning towards a negative evaluation. However, I believe that a partial overhaul of the paper structure, along with a sharper experimental analysis, would make for a nice contribution to the meta RL community. + +*DETAILED COMMENTS* + +C1) As noted by the authors, the idea of maximizing information gain over the transition model in a single-environment setting has been presented in VIME (Houthooft et al. 2016), while Zhou et al. (2018) employs information gain to fast identification of the transition model in multi-environment setting. Thus, the main contribution of this work seems to account also for the reward model in the information gain computation. However, it is not clear to me how this can lead to a significant improvement, especially in sparse-reward settings, where rewards are zero almost everywhere. + +C2) An information theoretic approach to leverage the reward structure in a multi-goal scenario has been previously proposed in (Goyal et al. InfoBot: Transfer and exploration via the information bottleneck. ICLR 2019). Though the setting is not equivalent, I am wondering how their concept of *decision state* relates to the idea of fast identification of the reward model. + +C3) The paper propose two alternatives reward function, one that essentially normalizes the prediction error with the predictability of the transition (IG), and the other that solely consider the prediction error (PE). While the experimental analysis shows a slight edge for the latter, I would say that this might be negligible with respect to the ability of the former to deal with noisy environments (as explained in Section 4.4). + +*QUESTIONS* + +- Could the authors address the comments above in their response? 
+- The last line of Section 3 says that the proposed intrinsic reward does not vanish over time: Can the authors explain why? +- The experimental setup describes a very short adaptation phase (hundreds of samples) but the subsequent plots are functions of millions of samples. Could the authors clarify the experimental setup, especially the number of samples employed in the meta-training and meta-testing phases? + + +*ADDITIONAL FEEDBACK* + +- I would suggest to revise the structure of the paper by focusing on the core ideas first, and to the implementation details later. +- I did not find the notation crystal clear, especially regarding the role of the context c, and the indiscriminate use of the words trajectory, episode, experience. To me it would be better to refer as sample every (s,a,r,s) tuple and trajectory every sequence of samples from the initial state to the end of an episode. +- In my opinion the experimental section could be sharper. I would suggest to focus on a smaller set of domains (such as one with changing dynamics, one with changing rewards and one with both for the Mujoco set, in addition to the two Meta-World) and a smaller set of comparisons, and to seek for a clearer illustrations of the benefits of MetaCURE with respect to other methods. +- I believe the ablation study to be central in the evaluation of MetaCURE. Especially, I would have preferred to see a convincing assessment of the benefit of computing the information gain over both the reward and transition models, with respect to accounting just for the former or the latter. +- I would not claim the performance of Meta-World Push (Table 1) to be significantly higher than PEARL, since the confidence intervals are overlapping. Averaging over more trials might help. +- I think the main text relies too much on the Appendix: It is fine to have proofs and derivations out of the paper, but I would not directly comment in the experimental section figures that only appear in the Appendix. + +############# + +AFTER RESPONSE + +I would like to thank the authors for the detailed response. I encourage them to keep working on MetaCURE: With some improvements I believe it will provide a valuable contribution to the meta-RL field.",4,3.0,ICLR2021 +BJYbdoXVl,3,S1X7nhsxl,S1X7nhsxl,,"This paper is well written, and well presented. This method is using denoise autoencoder to learn an implicit probability distribution helps reduce training difficulty, which is neat. In my view, joint training with an auto-encoder is providing extra auxiliary gradient information to improve generator. Providing auxiliary information may be a methodology to improve GAN. + +Extra comment: +Please add more discussion with EBGAN in next version. +",7,4.0,ICLR2017 +HJeTtyRPtr,1,rJx0Q6EFPB,rJx0Q6EFPB,Official Blind Review #3,"The authors propose TinyBERT, a smaller version of BERT that is trained with knowledge distillation. The authors evaluate on the GLUE benchmark. + +Overall, I find the direction of this work exciting and making these large models smaller for practical use is an important research area. The authors provide various ablation experiments that provide insight into their method. The main contribution is experiments comparing various existing distillation methods to different parts of the model (embeddings, layers, prediction layer), so is not particularly novel in contributing new techniques for distillation. 
That being said, there is importance in contributing these results as they are very useful for others working in the area and on making smaller models. But I would expect the authors to be much more detailed in their experimental description and make it clear in the paper that the comparative baselines are fair and well tuned. + +Comments: + +1. Can the authors please add details for how the model has been trained, such as the datasets used, the number of update steps, the batch size, etc. as well as the finetuning parameters that were cross validated for GLUE? It is difficult to tell in the current setting if the models are comparable to the baselines. The current paper doesn't seem like it could be reproduced. It is particularly important to detail how the finetuning was done, as this is very important for the smaller datasets in GLUE. + +2. Is the learning of the distilled model only done on the training dataset, or there is data augmentation beyond the training set? What is the effect without data augmentation? + +3. Unfortunately, the performance drop on the GLUE benchmark as shown in Table 2 is fairly large. The authors compare to BERT Small and DistilBERT and I like the baselines, but the claim that the model achieves comparable performance to BERT Base is not true. + +4. Was the BERT Small model tuned, or the same learning parameters from BERT Base were used? + +5. Can the authors clarify the inference time of BERT Small? The speed improvement of TinyBERT should be the same as BERT Small based on parameter size. + +6. The authors experiment with distilling the embedding layer to reduce the number of parameters, why not reduce the parameter size by reducing the vocabulary size? Existing approaches to BERT training use BPE with ~30k vocabulary size or RoBERTa with ~50k vocabulary size, but large gains could be applied here by reducing the size or using softmax reduction techniques that were popular on full vocabulary language modeling datasets like wikitext-103 or billion word. + +7. Can the authors please clarify the construction of Table 2? Are those results on the test set (e.g. evaluated on the official GLUE benchmark), or on the dev set? Where are the DistilBERT numbers on the test set coming from, as it is not reported in their paper? ",3,,ICLR2020 +0DZtJZeDaDY,2,Dmpi13JiqcX,Dmpi13JiqcX,Light-weight approach to untangle language model representations,"This paper proposes a masking strategy to identify subnetworks within language models responsible for predicting different text features. This approach requires no fine-tuning of model parameters and still achieves better results compared to previous approaches. Their experimental results on the movie domain show some level of disentanglement is achieved between sentiment and genre. Disentanglement capabilities of their model between sentence semantics and structure, are also tested on four tasks. + +Pros: +- Paper is well-written and the idea is explained well. +- Experiment results are convincing and support the claims. +- Achieving comparable results to SOTA without the need to train or finetune models is interesting especially from a computational point of view. + +Cons: +- I wish the authors performed their first experiment on more domains: books, music, etc. and consider more than two labels. +From current results, it's hard to confidently conclude that this approach is generalizable. 
+- Judging based on Figure 4 results I'm not convinced that the proposed approach does better than the *finetuned* (which I believe has a trained classifier on top of BERT) approach especially for Semantic tasks. Perhaps a discussion/ error analysis would be appropriate given better results on Syntax tasks. +- Also a discussion on the results for masking weights vs. masking hidden units is missing. If I'm not mistaken, mathematically, hidden unit masking is a subset of weight masking, where masking an item in hidden activation is equivalent to masking an entire column in the weight matrix? + + +Comments: +- Although the idea of masking model parameters to achieve untanglment is new, there has been [previous work](https://www.aclweb.org/anthology/P18-1069.pdf) on using dropout to identify sub-parts of the network that contribute more/ less to model predictions framed as a confidence modeling task. Authors may consider adding it to related work. +- Another missed citation under related work is [HUBERT](https://arxiv.org/pdf/1910.12647.pdf) which examines untanglement of semantics and structure across a wide range of NLP tasks. + + +Minor typos: +- ""we *measure* evaluate them on four tasks ..."" on page 7 +- ""Technically, in the Pruned + Masked Weights method, *the* refining the masks ..."" ",6,4.0,ICLR2021 +XwDQqDVhDLk,1,g21u6nlbPzn,g21u6nlbPzn,"This paper proposes a novel framework called $\text{VA-RED}^2$ to reduce spatial and temporal features to be computed for video understanding, which can reduce FLOPs when inferencing the video but remains the performance. This paper is well-written and conducts extensive experiments to validate the performance of proposed approach.","This paper proposes a novel framework called $\text{VA-RED}^2$ to reduce spatial and temporal features to be computed for video understanding, which can reduce FLOPs when inferencing the video but remains the performance. + +The authors have done extensive experiments on video action recognition tasks and spatio-temporal action localization task in the area of video understanding. For the video action recognition task, experiments are carried out using Mini-Kinetics-200, Kinetics-400, and Moments-In-time datasets. For the action localization task, J-HMDB-21 dataset is used. Results show that this framework is promising, which reduces the computation but main the performance. + + +Question: +1. X3D-M in the original paper achieved at top-1 74.6 for Kinetics-400 dataset (but in table 4 reports clip-1: 61.8, video-1: 67.9) and FLOPs is 4.73 (6.20 reported in the paper). Can authors explain why there is a difference here? + + +Minor: +1. Reduce ratio fact for channel-wise dynamic convolution $r=\frac{1}{2}^{p_c}$ such as in Fig.2 on page 4 and equation 7, equation 8 in supplementary materials on page 13. I think it would make more sense representing it as $(\frac{1}{2})^{p_c}$. + +",6,1.0,ICLR2021 +B6o2fKHitDD,1,rcQdycl0zyk,rcQdycl0zyk,Solid contribution toward making hypercomplex operations more flexible,"The authors focus on the area of using hypercomplex multiplications (multiplications involving numbers with multiple imaginary components) in deep learning models. Past work in this area has been promising but has been limited to certain dimensions for which there are predefined multiplication operations. 
The novel contribution of this work is to parameterize the hypercomplex multiplication operations, enabling the model to discover new operations rather than relying on the small number of existing operations and the small number of dimensions for which such operations exist. The authors find that their approach can substantially reduce the number of parameters without reducing performance (and in some cases even improving performance). + +Strengths: + +1. The proposed method makes a promising approach from the literature more flexible, helping to pave the way for making this approach more broadly useful. + +2. The authors illustrate this flexibility by showing how their approach can be effective for two different architectures (LSTMs and Transformers), making the general point that it can be applied to any architecture that uses feedforward components. They also apply it to multiple tasks, again illustrating the flexibility. + +3. As mentioned above, the approach can substantially increase a model’s parameter count without affecting performance. Relatedly, it can also improve inference speed. + +4. The paper is generally thorough and clear. + +Weaknesses: + +1. The specific contribution of this paper is the parameterization of the multiplication operation, but the evidence that this parameterization is helpful is mild, as there are only a few cases where the proposed model noticeably outperforms the Quaternion model. Thus, the evidence presented does not make a strong case for the necessity of this parameterization. + +2. Much of the argument hinges on the reduced parameter count, but there was not any mention of exactly how many parameters each model had (at least, not that I saw - I did not check the appendix). I think the paper could be substantially strengthened by adding a “Parameter count” column to each table. + +3. There is no clear intuition offered for why this approach might be expected to be effective. Offering such an intuition is certainly not necessary (since results alone are enough), but the paper would be more satisfying if there were such an intuition present. + +Overall, I am rating this as a 7, because I find it to be a solid paper but worry that its contribution on top of the existing work that has studied hypercomplex operations may be too small and may not have enough evidence for its usefulness. +",8,3.0,ICLR2021 +BJeSgyAaKB,2,B1xGxgSYvH,B1xGxgSYvH,Official Blind Review #3,"Summary +------- +This paper presents a revisit of existing theoretical frameworks in unsupervised domain adaptation in the context of learning invariant representation. They propose a novel bound that involves trainable terms taking into account some compression information and a novel interpretation of adaptability. The authors mention also contribution showing that weighting representations can be a way to improve the analysis. + +Evaluation +----- +The ideas are novel and the result brings novel and interesting light on the difficult problem of unsupervised domain adaptation. +However, the practical interest in terms of applicability of the proposed framework is not fully demonstrated, the properties of the proposed analysis have to be studied more in details and some parts better justified. The experimental evaluation brings some interesting behavior but is somewhat limited. The weighting aspect of the contribution is not supported by any experiment. + +Other comments +------------ + +-I am a but puzzled by the use of the term ""compression"". 
This is maybe subjective, but in the context of learning representation, I would have interpreted it as a way to sparsify the representation, and thus compression could then be measured with respect to a given norm (L2?) or another criterion (Kolmogoroff, ...). + +In the paper, the notion of compression is related to a reduction of the hypothesis space after application of a transformation \phi, so I am wondering if using ""hypothesis space reduction"" would not be more appropriate. +In this case, however, there are maybe links with structural risk minimization that could be investigated here. +A side remark: there is no particular restriction on the space of transformations, we wonder if it would be useful to indicate if all the possible transformations are included as subspaces of a given latent space. Since, to be very general, one can imagine the existence of an unbounded number of transformations that correspond to an increase of the input dimension. For transformations leading to different representations of different dimensions, the way the deduced hypothesis can be compared should also be indicated (for defining properly the inclusion H(\phi_1)\subset H(\phi_2). + +On the other hand, the authors seem to need the use of norms over transformations as illustrated in the definition of H_0^\eta in the experimental section. So I suggest that the analysis could be revisited by directly incorporating (representation) norms in the theoretical framework and in particular for defining more properly H_0. + +-One weakness of the theoretical framework is for me the lack of definition of H_0 in Section 3. We just know that it is included between two classes of hypothesis of interest, but there is no clear characterisation of H_0 which makes the analysis fuzzy: we have a bound that involves an object without any clear definition and it is for me difficult to really interpret the bound. Trying to define H_0 with some restrictions related to the norm of the transformations, as evoked before, could be a way to address this point (and actually the way the experiments are done tend to confirm this point). + +-Another weak point is the lack of qualitative analyse of the bound in Inequality 3 (the same applies for Inequality 5). I would have appreciated if the authors could provide an analysis similar to the one of (Mansour et al., COLT 2009) - it is cited in the paper - when they compared their result to the one of (Ben-David et al., 2007). For example, what happens when source is equal to the target, when is the bound significantly loose, significantly tight, different from other existing results, ... + +In particular, if we compare the bound with the one of Ben-David et al. (we can also consider the one of Mansour et al.), there is two additional term, one is weighted by a factor 2, another one involved a supremum and one can think that this bound is rather loose and does not provide any insightful information and said differently it could not give a strong framework for practical considerations. +I may understand that when the bound is tight we could deduce that the compression term is low, but finding cases leading to a tight interesting bound does not seem obvious. + +-The experimental evaluation presents some expected behavior in the context of the bound, but I miss a real study trying to make use of the proposed framework to do adaptation in practice with comparisons to other strategies. +Additionally, having additional studies with other models and tasks will probably reinforce the analysis. 
+ +-At the beginning of Section 3.2, the authors mention that they restrict their analysis to the square loss, however I think the analysis is true for larger class of losses with more general properties. In the experimental evaluation, the cross entropy is used, so I think that the experimental evaluation should also be consistent with the theoretical analysis by considering the square loss. + + +-Paragraph below Definition 5 is unclear: the notion of L2 norm has not been introduced in this context, so the message of the authors is a bit unclear. + +-I do not find the notation \gamma(\phi,H) appropriate, I woud rather suggest to use \gamma(H\cdot \phi) + +-The biblioggrgaphy can be improved by adding the right conferences/journals where the papers have been published in addition to the ArXiv reference. +",3,,ICLR2020 +SkeHR3J0FS,2,S1gnxaVFDB,S1gnxaVFDB,Official Blind Review #2,"This paper presents a classification model focused on interpretability. The model, Explaining model Decision through Unsupervised Concepts Extraction (EDUCE), is applied to a text classification task, while the authors argue in the appendix that this is also applicable to a wider problem, such as image classification. + +The model is composed of three parts: the first part is detecting salient spans of text relevant to the text classification problem, the second part assigning each salient span a concept label, and the third part which does the classification task based on the binary concept feature label. The models’ loss is composed of two parts: (A) minimizing the cross-entropy of text classification loss and (B) minimizing the cross-entropy of concept classification loss. For (A), as the first and second part of the model introduce discrete choices, they use a RL with Monte-Carlo approximation of gradients. + +The system is evaluated under two measure: 1) classification accuracy and 2) concept accuracy. They define the concept accuracy as follows: after training, they train a classifier that takes output (in the form of ) of the model from the test portion of the data. They split this output into train and test, and report the test accuracy. This aims to show how consistent is the labeling of the salient spans for different methods: if the concept label set correctly merged together semantically similar spans, this “concept accuracy” would be higher. This is a new metric they are proposing. While it is interesting, I would like to see *some* studies on how this correlates with human’s judgements on how interpretable the model is. The paper is introducing a new measure *and* new model, and it’s hard to be persuaded the model is doing well based on this new measure, when there is little ground to know what this measure really measures. + +Overall, I’m not impressed with the models’ performances. The aspect rationale annotated beer sentiment dataset, presented by Lei et al (2016), has provided one of few opportunities to evaluate interpretability / rationale model quantitatively. The paper evaluates on this measure, which is included in the appendix, and the results are pretty disappointing compared to the existing models such as Lei et al’s initial baseline or Bastings et al. While the paper argues this method isn’t necessarily designed for this task unlike the other methods, I’m not sure this is necessarily the case. Bastings et al could be applied to other tasks that model is evaluated on, such as DBPedia and AGNews classification. 
The difference comes down to how easy it is to interpret the methods, as these other rationale-based text processing methods would make use of captured words, while EDUCE would make use of detected “concept” clusters. Currently, the only real baselines are the ablations of its own model. + +Table 3 is quite interesting; different “concepts” capture different aspects fairly well. + +Not having a concept loss actually helps the classification accuracy. Would the concepts learned without the concept loss be qualitatively very different? This goes back to my original point that their new measure of ""concept accuracy"" is vague. + +Other comments and Q: +- Figure (3): the visualization is a bit confusing because it is unclear whether each span is a set of spans or a single span. Also, I would recommend making the figures colorblind friendly, if possible. +Q: what kind of classifier was used for the “concept accuracy” evaluation metric? I don’t think it’s mentioned. +Q: why are you sampling a test set for the DBPedia experiments? Is it for efficiency reasons? +Q: how sensitive is the model’s performance to the hyperparameters, especially the number of concepts? +Q: the current baseline classifier is a simple BiLSTM one, which definitely performs a lot worse than recent pre-trained LMs such as BERT. Would it be easy to use this method on top of richer representations such as pretrained LM outputs? +Q: how would this connect to the saliency map literature in computer vision? I guess these would be mostly “a posteriori” explanations? Discussion would be helpful. +",3,,ICLR2020
The analysis is also well motivated as to why cold fusion outperforms deep fusion. + +Cons: +(1) I have some questions about the baseline. Why is the decoder a single layer while the LM has 2 layers? I suspect the LM may add something to it. For my own seq2seq models, a 2-layer decoder is always better than one. Also, what is the HMM/DNN/CTC baseline? Since they use an internal dataset, it's hard to know how good the seq2seq numbers are. The authors also didn't compare with a rescoring method. + +(2) It would be more interesting to test it on more standard speech corpora, for example, SWB (conversational) and LibriSpeech (read speech). Then it's easier to reproduce and measure the quality of the model. + +(3) This paper only reports results on speech recognition. It would be more interesting to test it on more areas, e.g. machine translation. + +Missing citation: In (https://arxiv.org/pdf/1706.02737.pdf) section 3.3, they also pre-trained an RNN-LM on a more standard speech corpus. Also, there is a need to compare with this type of shallow fusion. + +Updates: + +https://arxiv.org/pdf/1712.01769.pdf (Google's End2End system) uses a 2-layer LSTM decoder. +https://arxiv.org/abs/1612.02695, https://arxiv.org/abs/1707.07413 and https://arxiv.org/abs/1506.07503 are small tasks. +The Battenberg et al. paper (https://arxiv.org/abs/1707.07413) uses Seq2Seq as a baseline and didn't show any combined results of different #decoder layers vs. different LM integration methods. My point is how a stronger decoder affects the results with different LM integration methods. In the paper, it is still only compared with deep fusion with a one-layer decoder. + +Also, why is shallow fusion only compared with the CTC model? I suspect a deep decoder + shallow fusion could already provide good results. Or is the gain additive? + +Thanks a lot for adding LibriSpeech results. But why use the Wav2Letter paper (instead of referring to a peer-reviewed paper)? The Wav2Letter paper didn't compare with any baseline on LibriSpeech (probably because LibriSpeech isn't a common dataset, but at least the Kaldi baseline is there). + +In short, I still think this is a good paper but still slightly below the acceptance threshold.",5,5.0,ICLR2018
+ +Details: +- below Eq 3: consider calling beta not a learning rate, because it is more a step-size since nothing is really learned in the inner loop +- related work: machine learning methods to perform symbolic regression directly, such as ""Learning Equations for Extrapolation and Control"" +Sahoo et al, ICML 2018, might be good to add +- Fig 3: font size is far too small, lines are too thin +- Fig 3: consider using a log-scale for the MSE plots + +---- +Post rebuttal update: +I read the response and commented on it. The authors clarified my questions and updated the paper accordingly. So I think my score of 7 is supported.",7,4.0,ICLR2021 +H1g5Um_92m,2,SklXvs0qt7,SklXvs0qt7,Handling unbalanced target distributions when conditioning on goal in RL,"This paper addresses a problem that arises in ""universal"" value-function approximation (that is, reinforcement-learning when a current goal is included as part of the input); when doing experience replay, the experience buffer might have much more representation of some goals than others, and it's important to keep the training appropriately balanced over goals. + +So, the idea is to a kind of importance weighting of the trajectory memory, by doing a density estimation on the goal distribution represented in the memory and then sample them for training in a way that is inversely related to their densities. This method results in a moderate improvement in the effectiveness of DDPG, compared to the previous method for hindsight experience replay. + +The idea is intuitively sensible, but I believe this paper falls short of being ready for publication for three major reasons. + +First, the mechanism provided has no mathematical justification--it seems fairly arbitrary. Even if it's not possible to prove something about this strategy, it would be useful to just state a desirable property that the sampling mechanism should have and then argue informally that this mechanism has that property. As it is, it's just one point in a large space of possible mechanisms. + +I have a substantial concern that this method might end up assigning a high likelihood of resampling trajectories where something unusual happened, not because of the agent's actions, but because of the world having made a very unusual stochastic transition. If that's true, then this distribution would be very bad for training a value function, which is supposed to involve an expectation over ""nature""'s choices in the MDP. + +Second, the experiments are (as I understand it, but I may be wrong) in deterministic domains, which definitely does not constitute a general test of a proposed RL method. +- I'm not sure we can conclude much from the results on fetchSlide (and it would make sense not to use the last set of parameters but the best one encountered during training) +- What implementation of the other algorithms did you use? + +Third, the writing in the paper has some significant lapses in clarity. I was a substantial way through the paper before understanding exactly what the set-up was; in particular, exactly what ""state"" meant was not clear. I would suggest saying something like s = ((x^g, x^c), g) where s is a state from the perspective of value iteration, (x^g, x^c) is a state of the system, which is a vector of values divided into two sub-vectors, x^g is the part of the system state that involves the state variables that are specified in the goal, x^c (for 'context') is the rest of the system state, and g is the goal. The dimensions of x^g and g should line up. 
+- This sentence was particularly troublesome: ""Each state s_t also includes the state of the achieved goal, meaning the goal state is a subset of the normal state. Here, we overwrite the notation s_t as the achieved goal state, i.e., the state of the object."" +- Also, it's important to say what the goal actually is, since it doesn't make sense for it to be a point in a continuous space. (You do say this later, but it would be helpful to the reader to say it earlier.) +",4,4.0,ICLR2019 +BJlfwPbitH,1,H1lma24tPB,H1lma24tPB,Official Blind Review #2,"Review of “Principled Weight Initialization for Hypernetworks” + +There has been a lot of existing work on neural network initialization, and much of this work has made large impact in making deep learning models easier to train in practice. There has also been a line of work on indirect encoding of neural works (i.e. HyperNEAT work of Stanley, and more recent Hypernetworks proposed by Ha et al) which showed promising results of training very large networks (in the case of Stanley), or have network weights that can adapt to the training data (in the case of Hypernetworks), and these approaches have been shown to be useful in applications such as meta-learning or few-shot learning (i.e. [1]). However, as far as I know, there hasn't been any work that looks at a principled way of initializing the weights of a weight-generating network, which this work tries to explore. + +Making the observation (and claim) that traditional init methods don't init hypernetworks properly, they propose a few techniques to initialize hypernetworks (""Hyperfan""-family), which are justified in a similar way as original classical init techniques (i.e. preserving variance like in Xavier init), and they demonstrate that their method works well for feed forward networks on MNIST, CIFAR-10 tasks compared to traditional classical init methods, as well for a continual learning task. + +I liked the paper as they identified a problem that hasn't been studied, and proposed a reasonable method to solve it. Their method may be able to make Hypernetworks accessible to many more researchers and practitioners, the way classifical init techniques have made neural net training more accessible. + +There are a few things that could improve the paper (and get an improvement score from me). The authors don't have to do all of these, but just a few suggestions: + +1) The experiments, to my understanding, are all feed forward networks. How about RNNs or LSTMs? + +2) Are there any (interesting) tasks that use Hypernetworks that are not trainable with existing methods, but made trainable using this proposed scheme? + +3) Would this method also work with HyperNEAT [2] or Compressed Network Search [3]? (probably should cite that line of work too). In [3], a research group at IDSIA used DCT compression to compress millions of weights into a few dozen parameters, so would be interesting if the approach will work on similar ""learn-from-pixels"" RL experiments. + +I'm assigning a score of 6 (it's currently like a ""really good"" workshop paper, but a normal conf paper IMO), but I like this paper and would like to see the authors make an attempt to improve it, so I can improve the score to see it get accepted with a higher certainty. + +Good luck! + +[1] i.e. 
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6519722/ https://arxiv.org/pdf/1710.03641.pdf https://arxiv.org/pdf/1703.03400.pdf + +[2] http://eplex.cs.ucf.edu/hyperNEATpage/ + +[3] http://people.idsia.ch/~juergen/compressednetworksearch.html + +*** Revised Score *** + +Nov20: Upon reading the other reviews, and looking at the changes to the paper with the extra citations, I'm improving the score to 8. (For the record, if this was a 1-10 scale, I would have liked my score to be a 7).",8,,ICLR2020 +FkhoPsHrZj,1,Mu2ZxFctAI,Mu2ZxFctAI,Difficult to read,"The authors of this paper introduced a new acquisition function of active learning for optimal Bayesian classifier. The new query strategy is based on mean objective cost of uncertainty, defined as the expected difference between losses of the optimal Bayesian classifier and the optimal classifier. + +I think this paper can benefit from revisions to improve clarity. Unfortunately, in the current state of writing, I found it very difficult to understand what this approach is doing exactly. The lack of clarity makes it hard to appreciate the interestingness of the proposed approach. In the following, I will list some possible improvements and my confusions. + +1, In the abstract, please rewrite the sentence ""To improve convergence ... classification error."" This sentence is probably the most important summary of this work, but it's so long and dense that it's very difficult to parse. I believe people read abstract to get a general idea of what this paper is, not a dense summary of what the technical details are. + +2, The introduction should be called related work. Especially in 2nd-3rd paragraph, the authors tried to pack all competing algorithms in and explain why they don't work well. It is too detailed, I think. I expect more high-level descriptions of why this problem is important, where the field is now, or why the authors think this is an important problem to solve rather than, e.g., solving active learning for regression. + +3, The authors have a tendency of defining a symbol or an abbreviation, and expect the readers to register them in their memory. It would help the clarity significantly if the authors could just repeat in English what \pi (or \phi or C_\theta , M, U etc) is again when they are mentioned. + +4, It's not clear to me why the entire introduction and definition of MOCU is under section 3.1 Analysis of ELR Methods. Perhaps section 3.1 needs to be segmented into more subsections. + +5, I'd recommend that the authors leave the most important theorem in the main paper and move all the less important lemmas and proofs to appendix. Then, the authors can have more space explaining the intuitions behind the proofs and the newly designed weighted MOCU. It is nice to see the convergence analysis, but it also makes me wonder how useful it really is. The theorem is saying, as we get infinite samples, we get the optimal classifier, under a bunch of conditions. It's nice to have, but it almost feels like every algorithm that can loop through all possible inputs can do that. What about the convergence rate that we care more about? Or how many active learning iteration is needed to achieve a certain performance. + +6, While it is unclear how useful the theoretical guarantees are, it is also unclear if the empirical results should enough evidence. 
Only toy datasets were examined, and the performance of the proposed approach is quite similar to other competitors.",5,2.0,ICLR2021 +B1x_np7Ctr,1,SJeY-1BKDS,SJeY-1BKDS,Official Blind Review #1,"This paper explores the recently proposed $\ell^4$-norm maximization approach for solving the sparse dictionary learning (SDL) problem. Unlike other previously proposed methods that recover the dictionary one row/column at a time, for an orthonormal dictionary, the $\ell^4$-norm maximization approach is known to recover the entire dictionary once for all. + +This paper shows that $\ell^4$-norm maximization has close connections with the PCA and ICA problem. Furthermore, focusing on the MSP algorithm for solving the $\ell^4$-norm maximization formulation, the paper highlights the connections of this fixed-point style algorithm with such algorithms for PCA and ICA. Subsequently, the paper studies the behavior of the MSP algorithm in the presence of noise, outliers, and sparse corruption. Unlike PCA, surprisingly, the MSP algorithm is shown to be robust to outliers and sparse corruption. + +Overall, the paper makes a nice effort towards better understanding the relatively new $\ell^4$-norm maximization approach and its connection with other well-understood problems in the literature. Moreover, the paper takes the right step by studying the effect of non-ideal signal measurements on the underlying goal of dictionary learning. That said, the reviewer feels that, in the current form, the results in the paper are not novel enough to warrant an acceptance to ICLR. The connection of the $\ell^4$-norm maximization formulation with ICA have been previously noted in other paper, so this would hardly qualify as a novel contribution. The analysis of the MSP algorithm in the presence of noise, outlier, and sparse corruption is not comprehensive enough. It would have been nice if the authors had provided a non-asymptotic analysis of the MSP algorithm in the presence of non-ideal measurements. Also, it is not clear how interesting the outlier formulation presented in the paper is. Shouldn't one consider outliers that go beyond the Gaussian distribution, ideally arbitrary outliers?",6,,ICLR2020 +XqkxGFgVSeF,4,E_U8Zvx7zrf,E_U8Zvx7zrf,initial review,"The paper considers delay-tolerant and communication-efficient in distributed training and proposes a training framework OLCO3 with local update steps, staleness compensation, and compression compensation. The proposed OLCO3 can be generalized to OLCO3-TC & OLCO3-OC (master-slave case), and OLCO3-VQ (both master-salve and all-reduce case). + +### pros. +* the paper is well-written; the arguments are supported by both empirical and theoretical results. +* the convergence analyses are provided for OLCO3-VQ and OLCO3-TC, for both SGD and momentum SGD. +* in numerical results, both iid and non-iid cases are considered. + +### cons. +* missing compression compensation baseline. the paper considers local update, staleness compensation, and compression compensation, but in the empirical results, only local SGD and some delay tolerance methods are considered. it is suggested to also include the results from the powerSGD and the signSGD, thus the readers can identify the source of quality loss. +* different compression operators are used for different OLCO3 variants. I noticed that in the evaluation part, the paper considers OLCO3-OC with signSGD, OLCO3-VQ with powerSGD, and OLCO3-TC with signSGD. it is encouraged to justify such a design choice. +* unclear practical impact. 
even though it is intuitive to design a training system that has local update steps, staleness compensation, and compression compensation, its practical impact is still unclear to me (due to the trade-off between test accuracy and system performance). it is encouraged to include (e.g. a simulated) results to illustrate the potential trade-off (e.g. time-to-accuracy) on different distributed training scenarios (e.g. differ in latency, bandwidth, local computation capability). +* the ideas like local update steps, staleness compensation, and compression compensation, have been well developed in the distributed machine learning community. though I do acknowledge the efforts of formulating/combining these ideas into a unified framework, the significance (novelty) of the paper might still have some limitations (as the proof seems quite standard to me). I would like to encourage authors to provide comprehensive empirical results to justify the pros and cons of the proposed scheme, as well as some practical guidelines.",5,4.0,ICLR2021 +r1jeC_Kgf,1,SkFEGHx0Z,SkFEGHx0Z,Review,"(Summary) +This paper proposes weighted RBF distance based loss function where embeddings for cluster centroids and data are learned and used for class probabilities (eqn 3). The authors experiment on CUB200-2011, Cars106, Oxford 102 Flowers datasets. + +(Pros) +The citations and related works cover fairly comprehensive and up-to-date literatures on deep metric learning. + +(Cons) +The proposed method is unlikely to scale with respect to the number of classes. ""..our approach is also free to create multiple clusters for each class.."" This makes it unfair to deep metric learning baselines in figures 2 and 3 because DMP baselines has memory footprint constant in the number of classes. In contrast, the proposed method have linear memory footprint in the number of classes. Furthermore, the authors ommit how many centroids are used in each experiments. + +(Assessment) +Marginally below acceptance threshold. The method is unlikely to scale and the important details on how many centroids the authors used in each experiments is omitted.",5,4.0,ICLR2018 +Hy2OnpKeG,2,Hkp3uhxCW,Hkp3uhxCW,Interesting posterior sharpening idea,"This paper proposes an interesting variational posterior approximation for the weights of an RNN. The paper also proposes a scheme for assessing the uncertainty of the predictions of an RNN. + +pros: +--I liked the posterior sharpening idea. It was well motivated from a computational cost perspective hence the use of a hierarchical prior. +--I liked the uncertainty analysis. There are many works on Bayesian neural networks but they never present an analysis of the uncertainty introduced in the weights. These works can benefit from the uncertainty analysis scheme introduced in this paper. +--The experiments were well carried through. + +cons: +--Change the title! the title is too vague. ""Bayesian recurrent neural networks"" already exist and is rather vague for what is being described in this paper. +--There were a lot of unanswered questions: + (1) how does sharpening lead to lower variance? This was a claim in the paper and there was no theoretical justification or an empirical comparison of the gradient variance in the experiment section +(2) how is the level of uncertainty related to performance? It would have been insightful to see effect of \sigma_0 on the performance rather than report the best result. +(3) what was the actual computational cost for the BBB RNN and the baselines? 
+--There were very minor typos and some unclear connotations. For example there is no such thing as a ""variational Bayes model"". + +I am willing to adjust my rating when the questions and remarks above get addressed.",6,4.0,ICLR2018 +H1exCVyg9H,1,H1gWyJBFDr,H1gWyJBFDr,Official Blind Review #1,"This paper proposes BiGraphNet, which proposes to replace the graph convolution and pooling with a single bipartite graph convolution. Its motivation comes from using stride(>1) convolution to replace pooling in CNN. The authors claim that the computation and memory can be reduced with the proposed bipartite graph convolution, because the pooling layers are removed. The authors also conduct experiments about graph skip connection and graph encoder-decoder to show that their method's flexibility. + +Cons: +1. If I understand it correctly, the bipartite graph convolution still needs a cluster algorithm to determine the output graph, which is identical to cluster-based pooling methods like DiffPool. In addition, previous pooling methods like DiffPool, gPool are NOT non-parametric as suggested by Figure 1. Therefore, the advantage of the proposed method is vague. +2. The idea of bipartite graph convolution seems different from that of stride convolution. The connection should be better explained. +3. The experiments of this paper are not very convincing. Comparison with more baselines and ablation study are needed to demonstrate the effectiveness of this method. On graph classification tasks, many other methods (GCN with pooling) are worth comparing with, like DiffPool, SAGPool, gPool, etc. More datasets should be included. In addition, it will be more convincing to do ablation study, e.g. single layer replacement.",3,,ICLR2020 +B1lYZ6Ht37,2,r1My6sR9tX,r1My6sR9tX,Great paper tackling important problem with nice experiments,"In this paper, the task of performing meta-learning based on the unsupervised dataset is considered. The high-level idea is to generate 'pseudo-labels' via clustering of the given dataset using existing unsupervised learning techniques. Then the meta-learning algorithm is trained to easily discriminate between such labels. This paper seems to be tackling an important problem that has not been addressed yet to my knowledge. While the proposed method/contribution is quite simple, it possesses great potential for future applications and deeper exploration. The empirical results look strong and tried to address important aspects of the algorithm. The writing was clear and easy to follow. I especially liked how the authors tried to exploit possible pitfalls of their experimental design. + +Minor comments and questions: +- Although the problem of interest is non-trivial and important, the proposed algorithm can be seen as just a naive combination of clustering and meta-learning. It would have been great to see some clustering algorithm that was specifically designed for this type of problem. Especially, the proposed CACTUs algorithm relies on sampling without replacement from the clustered dataset in order to enforce ""balance"" of the labels among the generated task. This might be leading to suboptimal results since the popularity of each cluster (i.e., how much it represents the whole dataset) is not considered. + +- CACTUs seems to be relying on having random scaling of the k-means algorithm in order to induce diversity on the set of partitions being generated. I am a bit skeptical about the effectiveness of such a method for diversity. 
If this holds, it would be interesting to see the visualization of such a concept. + +- Although only MAML was considered as the meta-learning algorithm, it would have been nice to consider one or more candidates to show that the proposed framework is generalizable. Still, I think the experiment is persuasive enough to expect that the algorithm would work well at practice. + +- Would there be a trivial generalization of the algorithm to semi-supervised learning? + +------- + +I am satisfied with the author's response and changes they made to the text. I still think the paper brings significant contributions to the area, by showing that even generating the pseudo-tasks via unsupervised clustering method allows the meta-learning to happen. ",8,4.0,ICLR2019 +Syy4M8qxf,2,rJa90ceAb,rJa90ceAb,Interesting neural network architecture; experiments can be stronger,"This paper proposes a two-pathway neural network architecture. One pathway is an autoencoder that extracts image features from different layers. The other pathway consists of convolutional layers to solve a supervised task. The kernels of these convolutional layers are generated dynamically based on the autoencoder features of the corresponding layers. Directly mapping the autoencoder features to the convolutional kernels requires a very large matrix multiplication. As a workaround, the proposed method learns a dictionary of base kernels and maps the features to the coefficients on the dictionary. + +The proposed method is an interesting way of combining an unsupervised learning objective and a supervised one. + +While the idea is interesting, the experiments are a bit weak. +For MNIST (Table 1), only epoch 1 and epoch 20 results are reported. However, the results of a converged model (train for more epochs) are more meaningful. +For Cifar-10 (Figure 4b), the final accuracy is less than 90%, which is several percentages lower than the state-of-the-art method. +For MTFL, I am not sure how significant the final results are. It seems a more commonly used recent protocol is to train on MTFL and test on AFLW. +In general, the experiments are under controlled settings and are encouraging. However, promising results for comparing with the state-of-the-art methods are necessary for showing the practical importance of the proposed method. + +A minor point: it is a bit unnatural to call the proposed method “baseline” ... + +If the model is trained in an end-to-end manner. It will be helpful to perform ablative studies on how critical the reconstruction loss is (Note that the two pathway can be possibly trained using a single supervised objective function). + +It will be interesting to see if the proposed model is useful for semi-supervised learning. + +A paper that may be related regarding dynamic filters: +Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction + +Some paper that may be related regarding combine supervised and unsupervised learning: +Stacked What-Where Auto-encoders +Semi-Supervised Learning with Ladder Networks +Augmenting Supervised Neural Networks with Unsupervised Objectives for Large-Scale Image Classification + +",5,4.0,ICLR2018 +HJe2x9RTFr,2,rJe04p4YDB,rJe04p4YDB,Official Blind Review #2,"This paper studies the teacher-student models in semi-supervised learning. Unlike previous methods in which only the student will learn from the teacher, this paper proposes a method to let the teacher learn from the student by reinforcement learning. 
Experimental results demonstrate the proposal’s performance. + +The paper achieves some good empirical results compared to other baselines. However, the proposed method is implemented with many tricks listed on Page 4, and with data augmentation techniques, which may not be used in previous methods. Additionally, the paper is weak in technology. There is no clear explanation of why the proposed method works except a metaphor for sports coaches. I vote for a clear rejection of the paper. + +First, the paper is weak in experiments. It works hard to achieve a good experimental result, through many tricks listed in the “Additional Implementation Details” in Page 4, and through the data augmentation used in Page 5. However, these tricks to improve the performance may not be used in previous methods, as the paper does not run experiments on baselines under the same setting, but use the results reported in Oliver et al. (2018). Additionally, the paper only uses one number of labeled data for each data set, it makes readers doubt that the proposed method only works under this number of labeled data. + +The paper fails to clearly state why we need to let the teacher learn from the student. Actually, I doubt if this is necessary. Given the strong learning capacity of neural networks, the proposed method will easily be overfitting. Assume we have a very weak student network at the beginning, then by training in the way proposed in the paper, the teacher network will have all labeled data classified correct, and all unlabeled data classified into the same labels as the student network. I cannot see from the simple proposal why such overfitting can be avoided. + +The paper is weak in both technology and experiments. It is also poorly written without clearly stating the motivation for this problem. I would vote for a reject for the paper. + +------------------------------------------- +The rebuttal has cleared some of my concerns. However, it is still not clear why the proposed method work and how does it prevents overfitting. The paper also needs more experimental results to confirm its effectiveness. I will increase my score a little bit, but would not vote for an accept this time. But I believe with further revision, the paper may be worth publishing in the future, if the questions in all reviews can be addressed. ",3,,ICLR2020 +rkxViW62KH,2,BJlVeyHFwH,BJlVeyHFwH,Official Blind Review #2,"This paper analyses the numerical invertibility of analytically invertible neural networks (INN). The numerical invertibility depends on the Lipschitz constant of the respective transformation. The paper provides Lipschitz bounds on the components of building blocks for certain INN architectures, which would guarantee numerical stability. Furthermore, this paper shows empirically, that the numerical invertibility can indeed be a problem in practice. + +This work is interesting and could be important to many researchers working with INNs. The worst case analysis and the corresponding table with Lipschitz bounds is useful. +However, I have some concerns regarding the experimental evaluation. +- Experiments in 4.1. nicely show that there exist non-invertible inputs for a GLOW model trained on CIFAR. But I wish the authors also considered other popular INN models and non-image datasets for this set of experiments (showing if this is also an issue in scenarios other than CIFAR/CELEBA + GLOW). 
+- Although the authors spend significant space in the main text and the appendix to motivate the experiments in 4.2, I cannot follow this motivation. For example, “decorrelation is a simpler objective than optimizing outputs z = F(x) to follow a factorized Gaussian as in Normalizing Flows”. Why is this is simpler, and, more importantly, why would this be an argument? Another example is “… this decorrelation objective offers a controlled environment to study which INN components steer the mapping towards stable or unstable solutions, …”. Why is this more controlled? What exactly is controlled here that is not controlled in training a an INN for, e.g., density estimation? 
I am not sure if this set of experiments is any useful for determining whether numerical precision is actually problematic for posterior approximation with normalizing flows, density estimation, etc. +- the experimental sections is somewhat badly structured and makes it difficult to read. It is not clear if this paper is analysis-only or whether the authors propose a remedy. The authors write in the abstract and conclusion that they show how to guarantee invertibility for one of the most common INN architectures. After reading this, I would expect a designated experimental section which shows a fix. I suppose they refer to Additive blocks + Spectral Norm, discussed in 4.2.1. However, that reads more like a post-hoc insight (“it turns out that…” rather than “we show how“). In short, the experiments section could be much better structured. +- The paper would be greatly improved, if the authors would propose how to tackle these numerical problems. I doubt that additive coupling is “one of the most common INN architectures”. It would be nice if the authors would conduct more extensive experiments and propose solutions for other building blocks. +- I expect at least a few experiments that quantify numerical instability with multiple different random seeds (for initialization etc.). + +For these reasons I vote for rejection. +I think it would be advisable to rethink the goals of the experimental evaluation, come up with a better structure, and expand at several places. E.g. (i) expand 4.1 to other architectures and data, (ii) show how this is relevant in practice (e.g. posterior inference with NFs and density estimation) and how it questions published results (currently Sec. 5), and (iii) evaluate proposed solutions. + +",3,,ICLR2020 +ryg2OYFAFS,1,HJxdTxHYvB,HJxdTxHYvB,Official Blind Review #1,"The paper proposes a new way to generate adversarial images that are perturbed based on natural images called Shadow Attach. The generated adversarial images are imperceptible and have a large norm to escape the certification regions. The proposed method incorporates the quantities of total variation of the perturbation, change in the mean of each color channel, and dissimilarity between channels, into the loss function, to make sure the generate adversarial images are smooth and natural. Quantitative studies on CIFAR-10 and ImageNet shows that the new attack method can generate adversarial images that have larger certified radii than natural images. To further improve the paper, it would be great if the authors can address the following questions: + +- In Table 1, for ImageNet, Shadow Attach does not always generate adversarial examples that have on average larger certified radii than the natural parallel, at least for sigma=0.5 and 1.0. Could the authors explain the reason? + +- In Table 2, it is not clear to me what is the point for comparing errors of the natural images (which measures the misclassification rate of a natural image) and that of the adversarial images (which measures successful attacks rate), and why this comparison helps to support the claim that the attack results in a stronger certificates. In my opinion, to support the above claim, shouldn’t the authors provide a similar table as Table 1, directly comparing the certified radii of the natural images and adversarial images? + +- From Figure 9, we see the certificate radii of the natural have at least two peaks. 
Though on average the certificate radii of the adversarial attacks is higher than that of the natural images, it is smaller than the right peak. Could the authors elaborate more of the results? + +- Sim(delta) should be Dissim(delta) which measures the dissimilarity between channels. A smaller dissimilarity suggests a greater similarity between channels. + +- Lambda sim and lambda s are used interchangeably. Please make it consistent. + +- The caption of Table 1 is a little vague. Please clearly state the meaning of the numbers in the table. +",6,,ICLR2020 +rkHX_Bjlf,3,SyqAPeWAZ,SyqAPeWAZ,Official review,"The method proposes a new architecture for solving image super-resolution task. They provide an analysis that connects aims to establish a connection between how CNNs for solving super resolution and solving sparse regularized inverse problems. + +The writing of the paper needs improvement. I was not able to understand the proposed connection, as notation is inconsistent and it is difficult to figure out what the authors are stating. I am willing to reconsider my evaluation if the authors provide clarifications. + +The paper does not refer to recent advances in the problem, which are (as far as I know), the state of the art in the problem in terms of quality of the solutions. This references should be added and the authors should put their work into context. + +1) Arguably, the state of the art in super resolution are techniques that go beyond L2 fitting. Specifically, methods using perceptual losses such as: + +Johnson, J. et al ""Perceptual losses for real-time style transfer and super-resolution."" European Conference on Computer Vision. Springer International Publishing, 2016. + +Ledig, Christian, et al. ""Photo-realistic single image super-resolution using a generative adversarial network."" arXiv preprint arXiv:1609.04802 (2016). + +PSNR is known to not be directly related to image quality, as it favors blurred solutions. This should be discussed. + +2) The overall notation of the paper should be improved. For instance, in (1), g represents the observation (the LR image), whereas later in the text, g is the HR image. + +3) The description of Section 2.1 is quite confusing in my view. In equation (1), y is the signal to be recovered and K is just the downsampling plus blurring. So assuming an L1 regularization in this equation assumes that the signal itself is sparse. Equation (2) changes notation referring y as f. + +4) Equation (2) seems wrong. The term multiplying K^T is not the norm (should be parenthesis). + +5) The first statement of Section 2.2. seems wrong. DL methods do state the super resolution problem as an inverse problem. Instead of using a pre-defined basis function they learn an over-complete dictionary from the data, assuming that natural images can be sparsely represented. Also, this section does not explain how DL is used for super resolution. The cited work by Yang et al learns a two coupled dictionaries (one for LR and HL), such that for a given patch, the same sparse coefficients can reconstruct both HR and LR patches. The authors just state the sparse coding problem. + +6) Equation (10) should not contain the \leq \epsilon. + +7) In the second paragraph of Section 3, the authors mention that the LR image has to be larger than the HR image to prevent border effects. This makes sense. However, with the size of the network (20 layers), the change in size seems to be quite large. Could you please provide the sizes? When measuring PSNR, is this taken into account? 
+ +8) It would be very helpful to include an image explaining the procedure described in the second paragraph of Section 3. + +9) I find the description in Section 3 quite confusing. The authors relate the training of a single filter (or neuron) to equation (7), but they define D, that is not used in all of Section 2.1. And K does not show in any of the analysis given in the last paragraph of page 4. However, D and K seem two different things (it is not just one for the other), see bellow. + +10) I cannot understand the derivation that the authors do in the last paragraph of page 4 (and beginning of page 5). What is phi_l here? K in equation (7) seems to match to D here, but D here is a collection of patches and in (7) is a blurring and downsampling operator. I cannot review this section. I will wait for the author's response clarifications. + +11) The authors describe a change in roles between the representations and atoms in the training and testing phase respectively. I do not understand this. If I understand correctly, the final algorithm, the authors train a CNN mapping LR to HR images. The network is used in the same way at training and testing. + +12) It would be useful to provide more details about the training of the network. Please describe the training set used by Kim et al. Are the two networks trained independently? One could think of fine-tuning them jointly (including the aggregation). + +13) The authors show the advantage of separating networks on a single image, Barbara. It would be good to quantify this better (maybe in terms of PSNR?). This observation might be true only because the training loss, say than the works cited above. Please comment on this. + +14) In figures 3 and 4, the learned filters are those on the top (above the yellow arrow). It is not obvious to me that the reflect the predominant structure in the data. (maybe due to the low resolution). + +15) This work is related to (though clearly different) that of LISTA (Learned ISTA) type of networks, proposed in: + +Gregor, K., & LeCun, Y. (2010). Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on Machine Learning (ICML) + +Which connect the network architecture with the optimization algorithm used for solving the sparse coding problem. Follow up works have used these ideas for solving inverse problems as well. +",4,4.0,ICLR2018 +tZ4mRgfd_Yc,1,w6Vm1Vob0-X,w6Vm1Vob0-X,High complexity and weak experiments ,"This work propose a new GNN architecture to help GNN break its limitation on only working over homophilic networks. The technical is to use introduce graph global attention. + +I think the paper is written okay. The motivation is clear. The solution is reasonable. However, I have following criticisms: +1. This work has limited novelty. Observing that GCN cannot work well over heterophilic networks is not a new idea and observation. Using attention to capture the features from far-away nodes is natural but not novel. I do not think that it is reasonable to argue against other works, e.g. [1] that adopts the above idea by saying they are not expressive enough. Expressiveness sometimes may lead to model overfitting. Actually, ChevNet [2] can also capture far-aways nodes and be expressive enough. Why does it not work well? I guess that it is due to some overfitting issue. Moreover, if I understand it correctly, the limited difference between this work and [3] is most likely the global attention, which has very limited contribution. + +2. 
Although the work claims everywhere to tend to decrease the complexity, when computing the global attention, one still needs to do computation for every pair of nodes, which is of course not scalable for even medium-sized graphs. + +3. The heterophilic networks used for evaluation are very small with only several hundred nodes. Why not try larger ones, say actor, Cham. in [4]? I guess the computational issue comes from the global attention. + +[1] Non-Local Graph Neural Networks. +[2] Convolutional neural networks on graphs with fast localized spectral filtering. +[3] Graph wavelet neural network +[4] Geom-gcn: Geometric graph convolutional networks. + + +---post-discussion update---- +I would like to thank the authors for preparing the rebuttal and attending our discussion. However, I still think the complexity is a concern of this work. I do not think that Eq. (3) can be implemented within the complexity that the authors claimed. Moreover, if the authors use another way to compute the attention scores, that way should be very clearly stated instead of written in a different form. Given the high complexity, I cannot clearly see the advantage of this work in comparison to [1], as the non-local attention has been proposed in [1] already. + +[1] Non-Local Graph Neural Networks.",4,5.0,ICLR2021 +2IabjocZRYY,5,Rq31tXaqXq,Rq31tXaqXq,Official Blind Review #5,"This paper points out the fact that just developing models using deep learning frameworks is not the end of the story when it comes to building end-to-end visual pipelines. The paper introduces VideoFlow which is a framework that aims to improve the development process of streaming pipelines. + +Compared to the widely used TensorFlow, PyTorch, and MXNet, VisualFlow focuses on more coarse-grained blocks like a ""whole network"" itself instead of ""layers."" As such, the work is developed around a rather different units of data compared to Tensors in DNNs. This work also incorporates a GUI that lets the user edit the computation graphs. I believe this work in full fledged form may help the productivity while building visual analysis applications. + +Sadly, there are many shortcomings of the paper. First, the paper literally spends over 60% of the paper to describe the implementation details which does not lead to much intellectual insights. This looks more like a technical report than a research paper. + +Furthermore, the paper lacks on evaluation in many aspects. For example, there is no evaluation about the potential overhead of the framework. I have listed a number of questions below with my comments. + +The paper began with a luring abstract; however, after reading through the paper, the reader is left with only minor insights. Up to this point, the paper seems to be an amalgamation of various libraries and frameworks with a GUI wrapper. While I do believe the paper has some prospect as it does touch upon a real problem that does indeed take up a lot of resources in industry, I believe the paper is yet in a premature state to merit a publication at ICLR. + +Also, I believe the paper would receive a more pertinent evaluation from a systems community like OSDI, SOSP, ATC. This is because these frameworks should not be just about putting things together to make something working, but should also entail a through experimentation of the performance and overhead. + +Questions: +- Could you please provide running examples of applications that have graphs with a complex topology? 
+- How does the resource management behave for different scenarios such as underutilization? Could you show some visualization of that? For example, it would be very interesting to see how Dynamic Batching dynamically improves the utilization and the performance at runtime. +- How can this be used for on-device scenarios, or cases where there are numerous devices? For example how does this compare to NNStreamer [1]? +- If this framework were to be used in cases like Inference-as-a-Service scenario, how would this perform in terms of various QoS metrics? + +[1] NNStreamer, https://nnstreamer.ai",3,4.0,ICLR2021 +4PmdEIolPdO,1,Nc3TJqbcl3,Nc3TJqbcl3,review1,"This paper presents a method to combine zero order optimizer with imitation learning. Several components are essential for good performance, including policy guided initialization, DAgger and adaptive auxiliary loss. + +Overall the paper provides a clear description of the essential components of the algorithms and the result is also quite strong. + +Several comments below: + +(1)The right most figures in Figure 4 is really hard to unpack with so many information and almost similar colors. Honestly I don't understand these figures, it would be nice if the authors make it clearer, either in the caption or make cleaner figures. + +(2) For the first figure in Figure 5, the text said "" it performs +initially better than with adding the samples, but the experts do not improve with the policy in this +case such that the asymptotic performance is lower"". But the figure shows it has the same final performance as Apex, am I missing something here? Also more distinguished color coding can make the figures easier to digest. + +(3) Maybe related to the previous question, what does the learning curve shows? Is it performance of the policy or the performance of the ICEM warm starting with the policy? + +(4) How important is an accurate dynamics? All the experiments are done with perfect dynamics information, which I also believe is deterministic, which is not possible in real world scenario. Some experiments with dynamics noise added or with learned model in the CEM rollouts can make the paper's claim stronger. + +(5) half cheetah is not a difficult task as claimed in the introduction of the paper. The paper itself shoes SAC can already perform very well. Actually this task is a very strange choice compared to other tasks considered. I would recommend something that is more like the others.",6,3.0,ICLR2021 +rylIKYfn2m,1,SJekyhCctQ,SJekyhCctQ,On the adaptive CW attack,"This work introduces a novel defense method ""Neural Fingerprinting"" against adversarial examples. +In the training process, this method embeds a set of characteristic labeled samples so that responses of the model around real data show a specific pattern. The defender can detect if a given query is adversarial or not by checking the pattern at test time. + +Strong point: +The strong point is that the proposed method seems to be appropriate and technically original. The performance is well investigated and compared with several competitors. The organization is good and the idea is clearly stated. + +Weak point: +One question is that why the proposed method can be protective against the adaptive CW attack. In the public discussion, the authors mention that the defense works successfully because the landscape of the fingerprinting loss is non-convex and no gradient method is guaranteed to find a suitable solution. 
If this is correct, did you repeatedly try the gradient-based attack with changing random seeds? By doing so, the attack might work successfully with a certain probability. + +Comments: +The presented method seems to have a certain similarity with digital watermarking of deep neural networks, for example: +https://gzs715.github.io/pubs/WATERMARK_ASIACCS18.pdf +It would be interesting to mention to these methods in the related work section. + +",6,3.0,ICLR2019 +ByxiEPX4l,3,r1X3g2_xl,r1X3g2_xl,"Reads well, missing theoretical differences with past techniques.","The authors propose to apply virtual adversarial training to semi-supervised classification. + +It is quite hard to assess the novelty on the algorithmic side at this stage: there is a huge available literature on semi-supervised learning (especially SVM-related literature, but some work were applied to neural networks too); unfortunately the authors do not mention it, nor relate their approach to it, and stick to the adversarial world. + +In terms of novelty on the adversarial side, the authors propose to add perturbations at the level of words embeddings, rather than the input itself (having in mind applications to NLP). + +Concerning the experimental section, authors focus on text classification methods. Again, comparison with the existing SVM-related literature is important to assess the viability of the proposed approach; for example (Wang et al, 2012) report 8.8% on IMBD with a very simple linear SVM (without transductive setup). + +Overall, the paper reads well and propose a semi-supervised learning algorithm which is shown to work in practice. Theoretical and experimental comparison with past work is missing.",6,4.0,ICLR2017 +BJn-O8HVx,2,S1c2cvqee,S1c2cvqee,,"The paper looks solid and the idea is natural. Results seem promising as well. + +I am mostly concerned about the computational cost of the method. 8-10 days on 10 GPUs for relatively tiny datasets is quite prohibitive for most applications I would ever encounter. + I think the main question is how this approach scales to larger images and also when applied to more exotic and possibly tiny datasets. Can you run an experiment on Caltech-101 for instance? I would be very curious to see if your approach is suitable for the low-data regime and areas where we all do not know right away how a suitable architecture looks like. For Cifar-10/100, MNIST and SVHN, everyone knows very well what a reasonable model initialization looks like. + +If you show proof that you can discover a competitive architecture for something like Caltech-101, I would recommend the paper for publication. + +Minor: +- ResNets should be mentioned in Table ",6,3.0,ICLR2017 +pTAZCT9X2CI,1,XLfdzwNKzch,XLfdzwNKzch,Major concern is about the experimental results,"Summary Of Contributions: This paper proposes to automate the design of auxiliary network and its allocation under decoupled neural network scheme, a design that speeds up network training and potentially boost model accuracy. The approach is validated with leading performance on ResNet and VGGNet under various datasets. + + +Strengths: The idea to search auxiliary network for decoupled neural network is novel, and the proposed method is verified on multiple widely used datasets. + +Weaknesses: + +1. It’s unclear if the proposed differentiable search algorithm works better than DARTs with direction comparison, as the proposed method is basically DARTs with some changes (e.g. 
weight sampling, extra inner optimization steps per outer optimization), that are also applied to typical NAS tasks, a comparison with DARTs is needed to validate how much of the gain is from the proposed method rather than the fact of automating network design itself. + +2. It seems that the decoupled neural network scheme has nearly no gains in terms of Top-1 without the auxiliary heads ensembled as shown in Table 3. The details should be provided on how the ensemble works, are ensembled auxiliary heads kept during the testing stage? If so, Table 3 cannot tell if the gain is from the ensembled aux heads or decoupled neural network scheme. Network trained with Backprop method and identical extra aux head for ensemble might reach the same level of performance. + +3. Experiment with DGL in Table3 (b) is problematic, DGL in its original paper only provides ResNet-152 result with K=2, and it is better than Backprop according to their experiment. Yet the authors set K=4 in Table 3 and got a contrary result. This brought up the suspect of cherry picking of hyperparameters and can only be addressed with stricter comparison. + +4. One of the advantages to adopt a decoupled neural network scheme is training speedup, while the network based on compared methods is instantly available, the proposed method requires extra time to search before training. With this extra cost, how much the training speedup benefit is still left? The search cost should be indicated and the overall time cost should be discussed. + +",7,4.0,ICLR2021 +SyxjXYoUnm,1,BJesDsA9t7,BJesDsA9t7,Need justification of privacy,"The privacy definition employed in this work is problematic. The authors claim that ""Privacy can be quantified by the difficulty of reconstructing raw data via a generative model"". This is not justified sufficiently. Why larger reconstruction error achieves stronger privacy protection? I could not find any formal relationship between reconstruction error and privacy. + +The proposed method is not appropriately compared with the other methods in experiments. In Fig. 3 the author claim that the proposed method dominates the other methods in terms of privacy and utility but this is not correct. At the specific point that the proposed method is evaluated with MNIST and Sound, it achieves better utility and better ""privacy"". However, the Pareto front of the proposed method is concentrated on a specific point. For example, the proposed method does not achieve high ""privacy"" as ""noisy"" does. In this sense, the proposed method is not comparable with ""noisy"". In my understanding, this concentration occurs because the range of \lambda is inappropriately set. This kind of regularization parameter should be exponentially varied so that the privacy-utility Pareto front covers a wide range. + +-- +Minor: +In Eq. 1, the utility is evaluated as the probability Yi=Yi'. What randomness is considered in this probability? +In Eq 2, privacy is defined as maxmin of |Ii - Ii'|. Do you mean privacy guaranteed by the proposed method is different for each data? This should be defined as expectation over T or max over T. + +In page 4. ""The reason we choose this specific architecture is that an exactly reversed mode is intuitively the mode powerful adversarial against the Encoder."" I could not find any justification for this setting. Why ""exactly reversed mode"" can be the most powerful adversary? What is an exactly reversed mode? + +Minimization of Eq. 3 and Eq. 
4 contradict each other and the objective function does not converge obviously. The resulting model would thus be highly affected by the setting of n and k. How can you choose k and n?",3,4.0,ICLR2019 +HJgX78WIcr,3,r1lOgyrKDS,r1lOgyrKDS,Official Blind Review #1,"The paper presents experimental results on the application of the gradient ARSM estimator of Yin et al. (2019) to challenging structured prediction problems (neural program synthesis and image captioning). The authors also propose two variants, ASR-K which is the ARS estimator computed on a random sample of K (among V) labels, as well as a binary tree version in which the V values are encoded as a path in a binary tree of depth O(log(V)), effectively increasing the length of sequences to be predicted but reducing the action space at each tilmestep. + +The paper is self-contained and clear. The main value of the paper is to present good experimental results on challenging tasks; the ARS-K variant, although fairly straightforward, seems to be a reasonable implementation of the ARS(M) estimator. + +My main criticism on the paper is that the exact nature of the contribution is not properly stated. As far as I understand, the main value of the paper is to demonstrate the effectiveness of ASR-K/M on challenging tasks. In a first read however, it seems that the authors claim an algorithmic/theoretical contribution compared to the state-of-the-art. Comparing with the paper by Yin et al. (2019), it seems to me that the technical contribution is rather incremental (the binary tree version is a variant of the hierarchical softmax, and ASR-K seems very straightforward), up to the point that the first set of experiments is actually only about vanilla ARSM. + +other comments: +- what is j in Eq 4? +- RL_beam vs ASRM on neural program synthesis: the authors say that ""RL_beam overfits [...] because of biased gradients"", whereas ""ASRM converges to a local minimum that generalizes better"". I do not see why biased gradients would help fitting the data (compared to unbiased gradients). And as far as I understood, ASRM is about getting a better gradient (hence better optimization, and hence better fitting of the data), so I really do not understand this argument. + +- RL_beam vs ASRM on NPS: I do not see why ASRM cannot fit the data as well as RL_beam. Is there some regularization involved? + +minor: +- ""expected award"" (first line section 3.1) + +------ after author rebuttal + +The authors answered my main concerns, I raises my score to weak accept.",6,,ICLR2020 +djSub53-m6g,1,l3YcqzaPlx0,l3YcqzaPlx0,Lacks discussions with prior works,"Overall, the paper propose an interesting approach for computing node embeddings in a scalable way. +However, the contribution is incremental as the idea of embedding nodes using sequences is not new; moreover, the discussion with prior works is very weak. + +Concretely, important related works are missing. +There is an ICML 2018 paper, ""Anonymous Walk Embeddings"" (https://arxiv.org/pdf/1805.11921.pdf). There are many following up papers as well. The idea of this line of works is to embedding node/graphs by random walk sequences. While the approaches are not exactly the same as Neighbor2Seq, the idea of embedding nodes using sequences is not new. However, none of these works are mentioned in this paper. The authors should compare these papers in the experiments as the baselines. Without such comparison, it is hard to evaluate the performance gain of the proposed Neighbor2Seq. 
+ +Additionally, the discussion with prior works is very weak. +The paper misleads the readers by arguing ""However, they have inherent difficulties when applied on large graphs due to their excessive computation and memory requirements"" in Section 2.1. The paper refers +However, existing approaches can scale to huge graphs. For example in the PinSage paper (https://arxiv.org/pdf/1806.01973.pdf), they can scale their GNN model to 3 billion nodes. Failing to mentioning this point is misleading for the readers, and is exaggerating the motivation of this paper. + +More comments: +1 The exact configuration of the Neighbor2Seq model is not mentioned. Providing an algorithm will greatly help. + +2 I can imagine multiple ways for sampling sequences from a node's neighborhood. However, they are not discussed in the paper. + +3 How does the expressive power of Neighbor2Seq compares with WL test? I assume Neighbor2Seq is theoretically less expressive than WL test. + +4 I guess there will be failure cases of this kind of Neighbor2Seq, when the topology of node neighborhood structure matters for the final performance. For example, Neighbor2Seq probably won't perform well for graph isomorphism tests. I suggest mentioning potential limitations in the paper.",5,4.0,ICLR2021 +Hkg9Vrqlz,3,SJJinbWRZ,SJJinbWRZ,A nice baseline in the epic model-based vs. model-free battle,"This paper presents a simple model-based RL approach, and shows that with a few small tweaks to more ""typical"" model-based procedures, the methods can substantially outperform model-free methods on continuous control tasks. In particular, the authors show that by 1) using an ensemble of models instead of a single models, 2) using TRPO to optimize the policy based upon these models (rather that analytical gradients), and 3) using the model ensemble to validate when to stop policy optimization, then a simple model-based approach actually can outperform model-free methods. + +Overall, I think this is a nice paper, and worth accepting. There is very little actually new here, of course: the actual model-based method is entirely standard except with the additions above (which are also all fairly standard approaches in isolation). But at a higher level, the fact that such simple model-based approaches work better than somewhat complex model free approaches actually is the point of the paper to me. While the general theme of model-based RL outperforming model-free RL is not new (Atkeson and Santamaria (1997) comes to a similar conclusion) its good to see this same pattern demonstrated ""officially"" on modern RL benchmarks, especially since the _completely_ naive strategy of using a single model and more standard policy optimization doesn't perform as well. + +Naturally, there is some question as to whether the work here is novel enough to warrant publication, but I think the overall message of the paper is strong enough to overcome fairly minimal contribution from an algorithmic perspective. I did also have a few general concerns that I think could be discussed with a bit more detail in the paper: +1) The choice of this particular model ensemble to represent uncertainty seems rather ad-how. Why is it sufficient to simply learn N models with different initial weights? It seems that the likely cause for this is that the random initial weights may lead to very different behavior in the unobserved parts of the space (i.e., portions of the state space where we have no samples), and thus. 
But it seems like there are much more principled ways of overcoming this same problem, e.g. by using an actual Bayesian neural net, directly modeling uncertainty in the forward model, or using generative model approaches. There's some discussion of this point in the introduction, but I think a bit more explanation about why the model ensemble is expected to work well for this purpose. +2) Likewise, the fact the TRPO outperforms more standard gradient methods is somewhat surprising. How is the model ensemble being treated during BPTT? In the described TRPO method, the authors use a different model at each time step, sampling uniformly. But it seems like a single model is used for each rollout in the proposed BPTT method? If so, it's not surprising that this approach performs worse. But it seems like one could backprop through the different per-timestep models just as easily, and it would remove one additional source of difference between the two settings.",7,5.0,ICLR2018 +k_9ijVXV3M8,1,YCXrx6rRCXO,YCXrx6rRCXO,"Seems to be a straightforward variant of the prior work, perhaps I'm missing something","The paper studies binary embeddings that preserve Euclidean distances for the case when the vector mass is spread fairly evenly across the coordinates, which is a very common case in practice. + +What the paper essentially does is a standard observation that uniform subsampling of coordinates (in spirit of Ailon -- Chazelle) of such dense vectors gives a Johnson--Lindenstrauss guarantee on the pairwise distances, and then it uses the quantization procedure developed earlier by Huynh and Saab. + +To me the result sounds like a fairly straighforward ramification of the result of Huynh and Saab, but I can see it potentially being accepted, since the studied problem is extremely important. + +I'm happy to revise the score if I got something wrong and the main resul is _not_ a straightforward variant of Huynh--Saab.",6,3.0,ICLR2021 +h7U0Sk-mlM1,4,4jXnFYaDOuD,4jXnFYaDOuD,Elegant unsupervised importance based auto-encoding model - good ideas and some early results - needs more experiments and analysis,"This paper presents a multimodal Autoencoder framework that learns the multimodal latent representations alongwith the importance of regions in each modality’s representation space in an unsupervised fashion. Multimodal fusion algorithms either use complex architecture representations or use disentangling joint representations for improving generative auto-encoding architectures using VAEs, GANs, WAE and some variants of these. This paper presents an elegant importance based model and architecture that takes into account various local and joint loss functions along with alignment factors to represent the Autoencoder model. +Some questions/comments that would make the paper more readable: +- Architecture diagram: Please provide an architecture diagram for the model to help the readers +- How good are the embeddings for downstream applications: Can we use these representations and compare performance with other learnt embedding models? How would the performance be with missing modalities? +- Table 1 - It is surprising to see that weighted Precision is less than unweighted Precision - any thoughts? +- Comparison to other Autoencoders: How does this compare to denoising Autoencoder? How does this compare to Wasserstein autoencoders and the variants such as the multimodal factorization model neural architecture? 
+Figure 2: Any relationship or companion to self-attention weights and word level importances +- Number of network params compared to MVAE model that the authors have compared to? How fast is the model training compared to other similar models? + +",6,3.0,ICLR2021 +Qd3y0k6NgTv,3,7FNqrcPtieT,7FNqrcPtieT,On Data-Augmentation and Consistency-Based Semi-Supervised Learning ,"This paper analyses the consistency-based SSL methods in settings where data lie a manifold of much lower dimension than the input space and obtains tractable results. The paper relates the analysis with Manifold Tangent Classifiers and shows that the quality of the perturbations plays a key role to achieve a promising results in this set of SSL methods. Finally, the paper extends the Hidden Manifold Model by incorporating data-augmentation techniques and proposes a framework to provide a direction for analyzing consistency-based SSL methods. + ++This work might be useful for those who want to work in the theoretical part of SSL. + ++The paper analytically discusses that the type of data augmentation plays a significant role in the performance of the SSL models based on consistency regularization. + +-While many points discussed in the paper are natural to me and well-discussed in the SSL literature (e.g., considering geometry of the data to develop an SSL algorithm or effect of the perturbation quality on the performance), relating and analyzing them with Manifold Tangent Classifiers (MTC) is interesting and new to me. However, I still think that the theoretical part of the work is not strong enough and can be improved. The quality of the paper will be improved if it uses MTC and provides some new results which do not exist in the SSL literature. This is because MTC may not be the only approach to analyze these points. For example, there are many other approaches based on optimal transport (e.g., Wasserstein Distances) that consider the geometry of the data and can be used to analyze the effect of perturbation on the SSL performance. Then, someone may ask what is special in your approach to analysis consistency-based SSL methods in contrast to other tools/techniques? + +-While this paper is mostly an analytical paper, providing more experiments on some claims discussed in the paper (e.g., Mean Teacher method and Π-model approach share the same solutions in the cases where the data-augmentations are small) is necessary and can improve the quality of the paper. Furthermore, authors may use the exact same underlying model, training set-up, and then try different types of data-augmentation methods on several recent and effective consistency-based SSL methods to show the differences on the performance experimentally. This can better contextualize the paper as we will know how much is the difference in terms of accuracy between different consistency-based SSL methods with respect to perturbation, or other data-augmentation methods. + +Generally, I think this paper provides a good direction for understanding the consistency based-SSL methods. +",6,4.0,ICLR2021 +BJgdDYYun7,1,BygGNnCqKQ,BygGNnCqKQ,"Interesting underlying idea, but evaluation insufficient","This paper deals with Architecture Compression, where the authors seem to learn a mapping from a discrete architecture space which includes various 1D convnets. The aim is to learn a continuous latent space, and an encoder and decoder to map both directions between the two architecture spaces. 
Two further regressors are trained to map from the continuous latent space to accuracy, and parameter count. By jointly training all these networks, the authors are now able to compress a given network by mapping it's discrete architecture into the latent space, then performing gradient descent towards higher accuracy and lower parameter count (according to the learned regressors). + +The authors perform experiments on 4 standard datasets, and show that they can in some cases get a 20x reduction in parameters with negligible performance decrease. They show better Cifar10 results than a few baselines - I am not aware whether this is SOTA for that parameter budget, and the authors do not specify. + +Overall I really like the idea in this paper, the latent space is well justified, but I cannot recommend acceptance of the current manuscript. There are many notational issues which I go into below, but the key issue is experiments and reproducability. + +The search space is not clearly defined. Current literature shows that the performance of these methods depends a lot on the search space. The manuscript does make clear that a T-layer CNN is represented as a 5XT tensor, with each column representing layer type, kernel size etc. However the connectivity is not defined at all, which implies that layers are simply sequentially stacked. This seems to preclude even basic architectural advancement like skip connections / ResNet - the authors even mention this in section 3.1, and point to experiments on resnets in section 4.4, but the words ""skip"" and ""resnet"" do not appear anywhere else in the paper. I presume from the emphasis on topological sort that this is possible, but I don't see how. + +If this paper is simply dealing with linear chains of modules, then the mapping to a continuous representation, and accuracy regression etc would still be interesting in principle. However it does mean that essentially all the big architecture advancements post-VGG (ie inception, resnet, densenet...) are impossible to represent in this space. Most of the Architecture Search works cited do have a search space which allows the more recent advances. + +I don't see a big reason why the method could not be extended - taking the 5D per-layer representation and adding a few more dimensions to denote connectivity would seem reasonable. If not, the authors should clearly mention the limitations of their search space. + + +In terms of experiments, Figure 3 is very hard to interpret. The axes labellings are nearly too small to read, but it's also unclear what loss this even is - I presume this is the 'train' loss of L_d + \lambda_1L_a + \lambda_2L_p, but it could also be the 'compress' loss. It also behaves very unusually - the lines all end up lower than where they started, but oscillate around a lot, making me wonder if the curves from a second set of runs would look anything alike. It's not obvious why there's not just a 'normal' monotonic decrease. + +A key point that is not really addressed is how well the continuous latent space actually captures what it should. I am extremely interested to know whether the result of 'compress', ie a new concrete architecture found by gradient descent in the latent space, actually has the number of parameters and the accuracy that the regressors predict. 
This could be added as columns in Table 1 - eg the concrete architecture for Cifar10 gets 20.33x compression and no change in accuracy, but does the regressor for the latents space predict this compression ratio / accuracy as well? If this is the case, then I feel that the latent space is clearly very informative, but it's not obvious here. + +It would also be really useful to see some concrete input / output values in discrete architecture space. Presumably along the way to 20x compression of parameter count, the optimisation passes through a number of progressively smaller discrete architectures - what do these looks like? Is it progressively fewer layers / smaller filters / ??? Given that the discrete architecture encoding appears to have a fixed length of T, it's not even clear how layers would be removed. Figure 1 implies you would fill columns with zeros to delete layers, but I don't see this mentioned elsewhere in the text. + +More minor points: + +Equation numbers would be extremely useful throughout the paper. + +Notation in section 3 is unclear. If theta represents trained parameters, then surely the accuracy on a given dataset would be a deterministic value. Assuming that the distribution P_{\theta}(a | A, D) is used to represent the non-determinism of SGD training, is \theta supposed to represent the initialised values of the weights? + +There are 3 functions denoted by 'g' defined on page 3 and they all refer to completely different things - this is unnecessarily confusing. + +The formula for expected accuracy - surely this should be averaging over N different training / evaluation runs, something like: + +E_{\theta}[a | A, D] \simto \frac{1}{N} \sigma_{i}^N g_{\theta}(A, D, \theta_i) + +The decoder computes a 6xT output instead of a 5xT output - what is this extra row for? + +In the definition of ""ground truth parameter count"" p^* - presumably the standard deviation here is the standard deviation of the l vector? This formulation is a bit surprising, as convolutional layers will generally have few parameters, and final dense layers could have many. Did you consider alternative formulations like simply taking the log of the number of parameters? Having a huber loss with scale 1 for this part of the loss function was also surprising, it would be good to have some justification for this (ie, what range are the p^* values in for typical networks?) + +In algorithm 1 line 4 - here you are subtracting \bar{p} from num_params before dividing by standard deviation, which does not appear in the formulation above. + +In the experiments: +How were the 1500 random architectures generated? I presume by sampling uniformly a lot of 5xT tensors, but this encoding is not clearly defined. x_i is defined as being in the set of integers, does this include negative numbers? What are the upper / lower limits, and is there anything to push towards standard kernel sizes like 3x3, 5x5, etc? These random architectures were then trained five times for 5 epochs - what optimizer / hyperparameters / regularization was used? Similarly, the optimization algorithm used in the outer loop to learn the {en,de}coders/regressors is not specified. + +I would move the lemma and theorem into the appendix - they seem quite unrelated to the overall thrust of the paper. To me, saying that an embedding is not uniquely defined, but can be learnt is not that controversial, and I don't need proofs that some architecture search space has a finite number of entries. 
Surely the fact that the architecture is represented as a 5xT tensor, and practically there are upper limits to kernel size, stride etc beyond which an increase has no effect, already implies a finite space? Either way, this section of the paper did not add much value from my perspective. + + +I want to close by encouraging the authors to resubmit after addressing the above issues, I do believe the underlying idea here is potentially very interesting. +",4,4.0,ICLR2019 +r1gbbP6TKB,1,SJlNnhVYDr,SJlNnhVYDr,Official Blind Review #3,"After responses: + +I read the authors response and decided to stick to my original score mostly because: + +1 - I understand that interpretability is hard to define. I also agree with the authors response. However, this is still not reflected in the paper in any way. I expect a discussion on what is the relevant definition used in the paper and how does it fit to that definition. Currently, it is very confusing to the reader. + +2 - I understand the authors' response that few-shot learning is a different empirical setting. However, authors also agree that settings are some-what relevant. I really do not see any gain by NOT discussing the few-shot learning literature. At the end, a reader is interested in this work if they have limited data. Moreover, other ways to address limited data issue should be discussed. + +----- +The manuscript is proposing a few-shot classification setting in which training set includes only few examples. The main contribution is using prototype embeddings and representing each word as cosine distances to these prototype embeddings. Moreover, the final classification is weighted summation of the per-token decisions followed by a soft-max. Per-token classifiers are obtained with an MLP using the cosine distances as features. When the relevance labels are available, they are used in training to boost gradients. + +PRO(s) +The proposed method is interesting and addressing an important problem. There are many few-shot scenarios and finding good models for them is impactful. + +The results are promising and the proposed method is more interpretable than the existing NLP classifiers. I disagree with the claim that the model is interpretable. However, I appreciate the effort to interpret the model. + +CON(s) +The model is not interpretable because 1) it starts with embeddings and they are not interpretable, 2) model is full of non-linearities and decision boundaries are not possible to find. In other words, it is not possible to answer ""what would make this model predict some other classifier"". + +The authors should discuss the existing few-shot learning mechanisms. Especially, ""Prototypical Networks for Few-shot Learning"" is very relevant. I also think it can be included as a baseline with very minimal modifications. + +The writing is not complete. The authors do not even discuss how the prototypes are learned. I am assuming it is done using full gradient-descent over all parameters. However, this is not clearly discussed. Implementation details should be discussed more clearly. + +SUMMARY +I believe the manuscript is definitely interesting and has a potential. In the mean time, It is not ready for publication. It needs a through review of few-shot learning. Authors should also discuss can any of the few-shot learning methods be included in the experimental study. If the answer is yes, it should be. If the answer is no, it should be explained clearly. 
+ +Although my recommendation is weak-reject, I am happy to bump it up if these issues are addressed.",3,,ICLR2020 +B1lGkaXqnm,2,SkfTIj0cKX,SkfTIj0cKX,"The main idea of the paper was very interesting, but the clarity of the paper needs to be improved significantly","The paper proposed a new framework for session-based recommendation system that can optimize for sparse and delayed signal like purchase. The proposed algorithm with an innovative IRN architecture was intriguing. + +The writing of the paper was not very clear and pretty hard to follow. With this level of clarity, I don’t think it’s easy for other people to reproduce the results in this paper, especially in section 4, where I expect more details about the description of the proposed new architecture. Even though the author has promised to release their implementation upon acceptance, I still think the paper needs a major change to make the proposed algorithm more accessible and easier for reproduce. + +Some examples: +What is L_A3C in “L = L_A3C + L_IRN” in the first paragraph of session 4? It looks like a loss from a previous paper, but it’s kind hard to track what it is exactly. + +“where Tj,τ is the τ-th imagined item, φ(·) is the input encoder shared by π (for joint feature learning), AE is the autoencoder that reconstructs the input feature, and the discounting factor γ is used to mimic Bellman type operations. … Therefore, we use the one-hot transformation as φ(·) and replace AE with the policy π (excluding the final softmax function), and only back-propagate errors of non-zero entries.” +This seems one of the most important components of the proposed algorithm, but I found it’s very hard to understanding what is done here exactly. + +Regardless the sketchy description of the algorithm, the empirical results look good, with comprehensive baseline methods for comparison. It’s interesting to see the comparison between different reward function. Maybe the author can also discuss on the impact of the new imagination module on the training time.",6,2.0,ICLR2019 +Sygf9aE927,1,rkeMHjR9Ym,rkeMHjR9Ym,Good convergence result for non-convex dynamic problem under stable system condition,"The paper studies discrete time dynamical systems with a non-linear state equation. They assume the non-linear function is assumed to be \beta-increasing like leaky ReLU. Under this setting, the authors prove that for the given state equation for stable systems with random gaussian input at each time step, running SGD on a fixed length trajectory gives logarithmic convergence. + +The paper is well-written and proves strong convergence properties. The deterministic result does not seem very novel and uses the idea of one-point strong convexity which has been studied in various prior works. However the bounding of the condition number of the data matrix is interesting and guarantees are near-optimal. The faster convergence for odd activations is a good observation. Overall, I think the paper is good. I do list some concerns: +Questions/concerns: +- The deterministic theorem (Theorem 4.1) seems similar to Theorem 3 in [1] with SGD instead of GD. Also under the distribution being symmetric, it can be derived from [2] with $k=1$. +- Can the ideas be extended to other commonly used activations such as ReLUs/Sigmoids? Sigmoids have exponentially small slope near origin. 
+- The proof seems to rely on the fact that due to the gaussian input added each time step and stable system assumption after a sufficient number of time steps, the input-output pairs will not be highly correlated. So the data is sufficiently uncorrelated taking enough data. What happens if this data at each step is not gaussian? +- In the unstable setting, the solution proposed just samples from different trajectories which by default are independent hence correlation is not an issue, this seems a bit like cheating. +- In RNNs, the motivation of the work, the hidden vectors are not observed, thus this setting seems a bit restrictive. +- If SGD was performed on only one truncated series, do the results still hold? + +Other comments: +- There has been previous work on generalized linear models which work in more general settings like GLMtron [3]. The authors should update prior work on generalized linear models as well as neural networks. +- Typo on Page 2 y_t = h_{t+1} not y_t = h_t. + +[1] Dylan J. Foster, Ayush Sekhari, and Karthik Sridharan. Uniform Convergence of Gradients for Non-Convex Learning and Optimization. NIPS 2018. +[2] Surbhi Goel, Adam Klivans, and Raghu Meka. Learning One Convolutional Layer with Overlapping Patches. ICML 2018. +[3] Sham M. Kakade et al. Efficient learning of generalized linear and single index models with isotonic regression. NIPS 2011. + + +-------------- +I would be maintaining the same score. I agree that the paper has nice convergence results that could possibly be building steps towards the harder problem of unobserved hidden states however, there is more work that could be done for unstable systems and possible extension to ReLU and other activations to take it a notch higher. ",7,3.0,ICLR2019 +r1ehpIDThX,2,r1Nb5i05tX,r1Nb5i05tX,"Confused discussion, lacking experiments, strong reject.","While overall the writing quality of the paper is high, the paper itself is a strong rejection. I believe the analysis of the paper is at points flawed, and the experiments are minimal. + +This work attempts to study the degree to which a layer by layer information bottleneck inspired objective can improve performance, as well as generally attempt to clarify some of the discussion surrounding Shwartz-Ziv & Tishby 2017. Here, the authors study a deterministic neural network, for which the mutual information estimation is difficult (I(X,L)) and error prone. To combat this they use the noise-regularized mutual information estimator (I(X; L+eps)). To actually estimate the mutual information the authors use the MINE estimator of Belghazi (2018). Here they suggest using the neural network itself as a structural element in the form of the discriminator to take advantage of the specific circumstances in this case. Doing this ensured that their estimator diverged in the zero noise limit as expected. From here they show some experimental results of the effect of their objective on an MNIST / CIFAR10 classification task. + +This paper fits into what is an increasingly large discussion in the literature, surrounding Information Bottleneck. The paper itself does a very good job of citing recent relevant work. Technically however I take issue with the framing of previous work in the last paragraph of the ""Deep neural nets"" subsection of Section 2. Technically Achille & Soatto explicitly formed a variational approximation to the posterior over the weights of the neural network and so was not a ""single bottleneck layer"" as stated in the paper. 
More generally at the end of that paragraph it is implied that the single bottleneck layer scheme ""deviates from the original theory"". This is a misleading characterization of the original information bottleneck (Tishby et al 1999) in which there was a single random variable, a representation of the data (Z) satisfying the Markov conditions Z <- X -> Y. I believe the authors instead meant to say that the cited works deviate from the information bottleneck theory of learning suggested in (Shwartz-Ziv & Tishby 2017). In general the paper does a poor job of distinguishing between the Shwartz-Ziv & Tishby paper and the rest, but this is a distinction that should be maintained. The original information bottleneck may and has demonstrated utility regardless of whether the information bottleneck generally can help explain why ordinary deterministic feed forward networks trained with cross entropy and sgd generalize well. + +This also raises one of the main problems with the current work. The title, abstract and especially the conclusion (""This provides, for the first time, strong and direct emperical evidence for the validity of the IB theory of deep learning"") seem to present the paper as somehow offering some clarity and further support for the assertions of the Shwartz-Ziv & Tishby 2017 paper, but that paper hoped to establish that information bottleneck can explain the workings of ordinary networks. Here the authors modify the ordinary cross entropy objective, and so their networks are necessarily not ordinary and so they cannot claim they have helped clarify our understanding of the vast majority of neural networks currently being trained. Again, this is distinct and should be kept distinct from the utility of their proposed objective, itself inspired by the information bottleneck. Here too the paper falls flat. If instead of attempting to comment on networks as they are designed today they aim to proposed a new information bottleneck inspired objective they really ought to directly compare other attempts along those lines (such as the ones they themselves cite Alemi et al. 2018, Kolchinsky et al. 2017, Chalk et al. 2016, Achile & Soatto 2018, Belghazi et al. 2018) but there are no comparative studies. + +The experiments are extremely lacking, not only are any of their cited alternatives compared, they don't compare to what would be an equivalent network to their but where they did utilize the noise at every layer and actually made the network stochastic. Their reported numbers are not very impressive with their top MNIST number at 98.09 and their baseline at 97.73. These numbers are worse than many of the papers they themselves cite. Only a single comparative results for both a limited training set run and the full one are shown, as well as only a single choice of beta. The CIFAR10 numbers are not very good either. There is some discussion of the text suggesting they believe their method acts like an approximate weight decay, but there are no results showing the effect of weight decay just on the baseline classification accuracies they compare against. + +Technically a deterministic function need not have infinite mutual information, if it is non-invertible, i.e. the sign function, or just floating point discretization. + +Their own results in Figure 2 and the main body of the text highlight that the authors believe the true mutual information between the activations of the intermediate layers and the input is infinite. 
If the true mutual information is infinite and the noise regularized estimator is only meant for comparative purposes, why then are the results of the training trajectories interpreted so literally as estimates of the true mutual information? + +Just plugging in the Discriminator for the objective (equation (7)) is flawed. The discriminator, if optimal would learn to approximate the density ratio 1 + log p(x,y)/(p(x) p(y)) . ( see f-GAN, Norowin et al. 2016). How does this justify using the individual elements of the discriminator in the functional form of the IB objective? + +At the bottom of page 6 they rightfully say that mutual information is invariant to reparameterizations, but their noise regularized mutual information estimator is not (by their own reference (Saxe et al 2017). + +The discussion at the center of page 8 is confusing. They claim that Figure 5 (a) is more 'quantized' than (b) and ""has reduced entropy"". I think it should be the other way. More clusters should translate to a higher KL divergence, or higher entropy. If you need only identify which cluster an activation is in, that should require log K nats where K is the number of clusters. (a) shows more clusters and so seems like it should cost more and have a higher entropy not a lower one. + +Despite a recurring focus of the text that this paper applies and information theoretic objective at each layer of the network, and hence is novel, the final sentence of the paper suggests it might not actually be needed and single layer IB objectives can work as well.",2,4.0,ICLR2019 +H1ljKxccnX,3,B1MbDj0ctQ,B1MbDj0ctQ,"Interesting ideas, but more justifications and comparisons necessary","Thank you for the detailed reply and for updating the draft + +The authors have added in a sentence about the SLDS-VAE from Johnson et al and I agree that reproducing their results from the open source code is difficult. I think my concerns about similarities have been sufficiently addressed. + +My main concerns about the paper still stem from the complexity of the inference procedure. Although the inference section is still a bit dense, I think the restructuring helped quite a bit. I am changing my score to a 6 to reflect the authors' efforts to improve the clarity of the paper. The discussion in the comments has been helpful in better understanding the paper but there is still room for improvement in the paper itself. +============= + +Summary: The authors present an SLDS + neural network observation model for the purpose of fitting complex dynamical systems. They introduce an RNN-based inference procedure and evaluate how well this model fits various systems. (I’ll refer to the paper as SLDVBF for the rest of the review.) + +Writing: The paper is well-written and explains its ideas clearly + +Major Comments: +There are many similarities between SLDVBF and the SLDS-VAE model in Johnson et al [1] and I think the authors need to address them, or at least properly compare the models and justify their choices: + +- The first is that the proposed latent SLDS generative models are very similar: both papers connect an SLDS with a neural network observation model. Johnson et al [1] present a slightly simpler SLDS (with no edges from z_t -> s_{t + 1} or s_t -> x_t) whereas LDVBF uses the “augmented SLDS” from Barber et al. It is unclear what exactly z_t -> s_{t + 1} is in the LDVBF model, as there is no stated form for p(s_t | s_{t -1}, z_{t - 1}). 
+ +- When performing inference, Johnson et al use a recognition network that outputs potentials used for Kalman filtering for z_t and then do conjugate message passing for s_t. I see this as a simpler alternative to the inference algorithm proposed in SLDVBF. SLDVBF proposes relaxing the discrete random variables using Concrete distributions and using LSTMs to output potentials used in computing variational posteriors. There are few additional tricks used, such as having these networks output parameters that gate potentials from other sources. The authors state that this strategy allows reconstruction signal to backpropagate through transitions, but Johnson et al accomplish this (in theory) by backpropagating through the message passing fixed-point iteration itself. I think the authors need to better motivate the use of RNNs over the message-passing ideas presented in Johnson et al. + +- Although SLDVBF provides more experiments evaluating the SLDS than Johnson, there is an overlap. Johnson et al successfully simulates dynamics in toy image systems in an image-based ball-bouncing task (in 1d, not 2d). I find that the results from SLDVBF, on their own, are not quite convincing enough to distinguish their methods from those from Johnson et al and a direct comparison is necessary. + +Despite these similarities, I think this paper is a step in the right direction, though it needs to far more to differentiate it from Johnson et al. The paper draws on many ideas from recent literature for inference, and incorporating these ideas is a good start. + +Minor Comments: + +- Structurally, I found it odd that the authors present the inference algorithm before fully defining the generative model. I think it would be clearer if the authors provided a clear description of the model before describing variational approximations and inference strategies. +- The authors do not justify setting $\beta = 0.1$ when training the model. Is there a particular reason you need to downweight the KL term as opposed to annealing? + +[1] Johnson, Matthew, et al. ""Composing graphical models with neural networks for structured representations and fast inference."" Advances in neural information processing systems. 2016.",6,3.0,ICLR2019 +SygIGXDRtB,2,B1lDoJSYDH,B1lDoJSYDH,Official Blind Review #3,"[Summary] + +This paper proposes to learn fluid dynamics by combining the position-based fluids (PBF) framework and continuous convolution. They use dynamics particles to represent the fluids, and static particles to describe the scene boundaries, and employ continuous convolution to learn the interactions between the particles of different kinds. They have demonstrated the effectiveness of the proposed method by comparing it with several state-of-the-art learning-based and physics-based fluid simulators. Their method outperforms the baselines in terms of both accuracy and efficiency. They have also shown that the model can extrapolate to terrains that are more complex than those used in training, and are useful in estimating physical properties like the viscosity of the fluids. + + +[Major comments] + +For now, I slightly lean towards acceptance, as I like the idea of combining PBF and continuous convolution for fluid simulation, and the method seems to have a much better performance than the baselines. The experiments have also convincingly demonstrated the method's generalization ability to terrains of various geometry and fluids of different viscosity. However, I would still like the authors to address my following questions. 
+ +My primary concern about the proposed method is the scope of its applicability. One of the benefits of using learning-based physics engines is that they directly learn from observations while making very few assumptions towards the underlying dynamics, which gives them the potential to handle complex real-world scenarios. The model in this paper, however, heavily relies on the PBF framework that may limit its ability to simulate objects like rigid bodies and other deformable materials. I would be curious to know the authors' views on how to extend their model to environments with not just fluids, but also other objects of various material properties. + + +[More detailed questions] + +Will the method run faster than DFSPH, given that the timestep is much larger than the timestep used by DFSPH, 0.02 ms vs. 0.001 ms? Will the learning-based physics engine have the potential to outperform the physics-based physics engine in terms of efficiency? + +For estimating the viscosity of the fluids, how well does the gradient descent on the learned model perform comparing with black-box optimization, e.g., Bayesian Optimization using the ground truth simulator? + +In the SPNet paper, they have also tried to solve the inverse problem of estimating the viscosity of the fluids. It would be great to include a comparison to see if the proposed method can outperform SPNet in terms of efficiency and accuracy. + +Equation 8 smooth out the effect between particles of different distances. How sensitive is the final performance of the model to the specific smoothing formulation? Is it possible to learn a reweighting function instead of hardcoding? + +In figure 3, the model's rollout is a bit slower than the ground truth. The authors explained the phenomenon using the ""differences in the integration of positions and the much larger timestep."" I do not quite get the point. Could you elaborate more on this? Also, it might be better to include labels for the two columns in figure 3 to make it more clear. + +In the experiment section, the authors claimed that SPNets take ""more than 29 days"" to train. Correct me if I am wrong, but from my understanding, SPNets directly write Position-Based Fluids (PBF) in a differentiable way, where they can extract gradients. Except for the tunable parameters like viscosity, cohesion, etc., I'm not sure if there are any learnable parameters in their model. Could the authors elaborate on what they mean by ""the training time"" of SPNets? + +From the videos, DPI-Nets does not seem to have a good enough performance in the selected environments. I can see why their model performs not as good since they did not use as much of a structure in the model. But from the videos of DPI-Nets, it seems that they perform reasonably well in scenes like dam break or shake a box of fluids. Would you please provide more details on why they are not as good in the scenes in this paper? + +The data was generated using viscosity varying between 0.01 and 0.3. How well can the model do extrapolate generalization? It would be great to show some error plots indicating its extrapolate performance. + +Why there are no average error numbers for SPNets? +",8,,ICLR2020 +QLINRb5DBPX,2,jEYKjPE1xYN,jEYKjPE1xYN,Review,"### Summary of the paper +The paper utilizes a actor-critic framework for 3D molecular design. The central part of the approach is a rotation equivariant network (Comorant). For each atom, Comorant learns a state representation so that it is equivariant under rotation. 
The method is evaluated on 9 different molecules and it outperforms Simm et al.'s method in terms of validity and diversity. + +### Strength and weakness +1. The method adopts a better model architecture (Comorant) to learn the state representation. However, Comorant is developed by Anderson et al. Therefore it cannot be counted as the contribution of this paper. The actor-critic formulation is also standard in RL. The sequential decision process is also similar to Simm et al. The technical innovation is not original enough. + +2. The evaluation protocol is problematic. In table 1, authors only report validity and diversity of generated compounds. The validity is defined as ""successfully parsed by RDKit"". To my knowledge, RDKit validity checking is based on 2D constraints such as valence, aromaticity and kekulization. It does not capture 3D information at all. Are the generated compound stable? What is the RMSD of generated compounds? Simm et al. 2020 reported RMSD to measure the structural stability. Why RMSD metric is missing? This is important because 2D graph generation models already satisfy validity quite well (100% validity for even large molecules). + +3. Motivation is not clear. In the paper, authors state that the choice of focal atom, element and distance have to be *invariant* to rotation. It seems like invariant representation is sufficient. Moreover, the advantage over Simm et al. is not clear in section 4. Why highly symmetric states are problematic for prior work? How does this method solves this issue? I am afraid that the experiment section does not address this, since most test cases are not ""highly symmetric"" to my knowledge. + +### Overall evaluation +I vote for rejection. My major reason is the problematic evaluation protocol. RMSD must be added to evaluate the stability of compounds. Moreover, I would like to see evaluation on ""highly symmetric"" cases. I think it's important since authors state that this is the major limitation of prior work. + +### Question +1. Is the method scalable to large molecules? How is the runtime of your model compared to Simm et al.? + +### Post rebuttal feedback +It's good to see that RMSD experiments are added and the results are better than Simm et al. Therefore, I am raising my score to 6. I also realized that the validity calculation is different from standard graph generation methods. The validity results now look reasonable to me.",6,4.0,ICLR2021 +b-Vzhp9l15w,1,YHdeAO61l6T,YHdeAO61l6T,Review of Auction Learning as a Two-Player Game,"Objective of the paper: +The objective of the paper is to show that auction design can be views as a two-player game; improving on past work, they provide schemes that give better performance and better time for learning. + +Strong Points: +1) The paper seems to offer improvements on recent past work. +2) The paper provides a clear background on auction theory relevant to the problem. +3) The paper delivers on the abstract. +4) Some code is provided. + +Weak Points: +1) It is not clear that the improvement over past work is large. +2) The improvements/framework seem to rest on unproven assumptions, e.g., the use of the loss function in section 3.2.2, and the discussion of ""closeness"" in section 3.3. This makes the paper somewhat unclear in terms of what it is offering -- heuristic approaches that improve (empirically) on the previous work? If so, what are the limitations? Are there situations the heuristic might be troublesome? 
(Can you emphasize more clearly that your results are heuristic in nature, albeit based on theoretical formulations.) + +Overall Rating: I think this is an interesting paper. I think the heuristics proposed offer benefits empirically and are well grounded, but the authors could do a better job clarifying the possible weak points of this heuristic approach compared to previous work. The result is perhaps of interest to a specialized audience (people interested in auctions); it's not clear that more general applications of the techniques are available. + +Questions for Authors: +I was confused by Table 2 where you describe comparing to the ""optimal"" auction that has lower revenue than your auction results in 2 of 3 cases. Is ""optimal"" here just signifying zero-regret? Or is something else going on? + +It is not clear to this reviewer that the auctions chosen are representative in any way -- I assume they've been used in previous works or are otherwise standard? + +Other Feedback: The paper is a bit confusing at times, but I think that is because the authors were forced to keep descriptions short in order to fit within page limits. I think going back and offering a bit more description for a longer version would be useful. Overall though the writing is fine. +",6,3.0,ICLR2021 +ZCHW3Y6k0d,3,OMizHuea_HB,OMizHuea_HB,"Nice idea and decent results, lacks discussion of other existing hard negative mining strategies ","The goal of the paper is audio-visual self-supervised learning using an active sampling technique to mine hard negatives during training. + +Strengths +- The paper is well written and clearly explained. +- The strategy for selecting negatives based on diversity seems well reasoned and experimentally outperforms random sampling and OHEM. I like the study showing empirically more categories covered in active sampling vs random sampling. + +Weaknesses: +- It would be nice to see some discussion of the other self-supervised contrastive works that also focus on the optimal selection of “negatives”: eg. Korbar et al 2018 (where they use a curriculum based on time distance of negatives from the positives), Iscen et al. 2018: https://arxiv.org/abs/1803.11095, Cao et al.2020: https://arxiv.org/abs/2006.14618, Wu et al 2020: https://arxiv.org/abs/2005.13149 (variational extension to InfoNCE with modified strategies for negative sampling). In a similar vein, the comparison to OHEM in the supplementary is quite nice and I believe should be in the main paper. +- Why is the ablation showing the benefits of Active Sampling only on Kinetics-Sound? Would the same trends shown in Table 1 hold on a random sampled proportion of Kinetics of the same size, or even on the whole dataset? +- The audio-visual contrastive method has been proposed in numerous papers before, so the novelty of this paper lies solely in the active sampling technique that increases diversity using K-means clustering +- (Minor) The footnote on page 1 is a bit confusing, it’s hard to see the probability drop from just Fig. 2 since the division needs to be done, probably better to plot the probabilities directly somewhere or remove this. +- (Minor) It would be interesting to also test the audio representations for classification on VGG-Sound (http://www.robots.ox.ac.uk/~vgg/data/vggsound/) ",6,4.0,ICLR2021 +77WkftOwD3,4,ijVgDcvLmZ,ijVgDcvLmZ,Q-functions and Policies,"The paper proposes a Q-factorization method by assuming an energy-based policies model. 
Q-functions are formulated as soft value functions with the energy parameters, and this adoption renders the function factorization more flexible compared to existing ones. The proposed solution applies to continuous-action tasks, a feat left unconquered by some of the existing methods. Authors exhibit that FSV outperforms others in various environments characterized by local optima. + +Strengths: + ++ The formulation of Q-functions as soft functions, despite appearing simple, shows some effectiveness in a number of MARL tasks. + ++ The network architecture is intuitive. + + +Major Concerns: + +- Neither energy-based policies nor soft value functions is an original contribution of this work. True, the authors do not claim so. But the reviewer is left unsure as to what then the primary contribution of the paper would be. + +- The method generalizes IGM to IGO but in doing so, foregoes the simplicity of the IGM condition. The reviewer would then expect to be met with a somewhat strong guarantee, but is instead presented with approximations on \lambda_i. It is not clear from the paper how much insightful value the method has, when its criticism of a previous work (QTRAN) was based on intractability but the FSV method itself still relies on approximations. It would seem as though QTRAN and FSV each chose different paths to approximate different components of an MARL training scheme - the former takes may stronger assumption on the value functions while the latter takes assumptions on the nature of value functions being parametrized by approximated weights. + +- The effectiveness of the proposed method is not yet well-accounted for. Issues are raised, but little explanation (or any attempt thereof) is provided. For example, the reviewer would have very much liked to gain an understanding of the relevance between IGO and its ability to alleviate relative overgeneralization. How does taking on greedy policies (which makes IGO collapse into IGM) make MARL agents more prone to overgeneralize with respect to each other? What kinds of findings would the authors present? What evidence could support those findings? The evaluation, while illustrating great performance gaps, needs a careful redesign so as to construct solid grounds for the soft value function factorization under IGO to be ""explainably"" better than existing works. + +- The paper could be better positioned. The Related Works section could be put to better use to clearly distinguish two very different lines of research: value function factorizing MARL works and maximum entropy principle. + +- There needs to be some justification about multi-head attention being used to ""enable efficient learning"" in Section 3.3. The reviewer is left hanging as to why and how such a choice was made. + + +Minor Concerns: + +* A few parts of the paper were difficult to follow. For example, there is an unfinished sentence in Related Works. In Section 2.1, there is an incomplete clause beginning with ""the reward function [...] shared by all agents"". Under Theorem 1, ""any distributions"" --> ""any distribution"". Also, what is meant by ""correct architecture"" in that same paragraph?",2,4.0,ICLR2021 +BknlT5Bez,2,Skx5txzb0W,Skx5txzb0W,a step in the right direction,"This manuscript raises an important issue regarding the current lack of standardization regarding methods for evaluating and reporting algorithm performance in deep learning research. 
While I believe that raising this issue is important and that the method proposed is a step in the right direction, I have a number of concerns which I will list below. One risk is that if the proposed solution is not adequate or widely agreeable then we may find a proliferation of solutions from which different groups might pick and choose as it suits their results! + +The method of choosing the best model under 'internal' cross-validation to take through to 'external' cross-validation against a second hold-out set should be regarded as one possible stochastic solution to the optimisation problem of hyper-parameter selection. The authors are right to emphasize that this should be considered part of the cost of the technique, but I would not suggest that one specify a 'benchmark' number of trials (n=5) for comparison. Rather I would suggest that this is a decision that needs to be explored and understood by the researchers presenting the method in order to understand the cost/benefit ratio for their algorithm provided by attempting to refine their guess of the optimal hyperparameters. This would then allow for other methods not based on internal cross-validation to be compared on a level footing. + +I think that the fundamental issue of stochasticity of concern for repeatability and generalisability of these performance evaluation exercises is not in the stochastic optimisation search but in the use of a single hold-out sample. Would it not be wise to insist on a mean performance (a mean Boo_n or other) over multiple random partitions of the entire dataset into training and hold-out? I wonder if in theory both the effect of increasing n and the mean hold-out performance could be learnt efficiently with a clever experimental design. + +Finally, I am concerned with the issue of how to compute the suggested Boo_n score. Use of a parameteric Gaussian approximation is a strong assumption, while bootstrap methods for order statistics can be rather noisy. It would be interesting to see a comparison of the results from the parametric and non-parameteric Boo_n versions applied to the test problems. ",6,4.0,ICLR2018 +rygbXAUK9B,3,BkljIlHtvS,BkljIlHtvS,Official Blind Review #3,"This paper presents an experimental study of gradient based meta learning models and most notably MAML. The results suggest that modeling and adaptation are happening on different parts of the network leading to an inefficient use of the model capacity which explains the poor performance of MAML on linear (or small networks) models. To tackle this issue they proposed a kronecker factorization of the meta optimizer. + +The paper is well motivated and well written in terms of clarity in the message and being easy to follow. + +One major issue is that the experimental study is not that comprehensive to support the claim of the paper. Especially, in analyzing the failure case of linear models.For example, one may try small (but nonlinear networks) and compare its performance with larger (possibly overparameterized) ones on at least 2 standard network architectures. But, it doesn't mean that I don't like the paper at its current state. The paper yet has a message and it's delivered clearly. + +I wonder if the overparameterized is just related to depth or overparameterization in width would work too? If not then it might be the ""nonlinearity"" that is doing the work + +In section 3.2 (Figure 2, left) and (Figure2, mid) show that FC follows the pattern of C1-C3. 
Then the authors propose the experiment of perturbing FC (Figure 2, right) to show that FC is actually not similar to C1-C3 and is important to adaptation. However, one can do similar experiments for C1-C3 and claim they are also important to adaptation. It seems that FC and C4 are really different.

For a non-expert reader it's not readily clear how the Kronecker factorization of A leads to Equation 5. An explanation would help. Also, a few sentences or a schematic illustration of the Kronecker product would make the paper self-contained.

There are a few typos in the paper that can be removed after a thorough proofreading. ",6,,ICLR2020
HJgiqXjrYB,1,rylvAA4YDB,rylvAA4YDB,Official Blind Review #3,"This paper proposes a neural network architecture to classify graph structure. A graph is specified using its adjacency matrix, and the authors propose to extract features by identifying templates, implemented as small kernels on sub-matrices of the adjacency matrix. The main problem is how to handle isomorphism: there is no node order in a graph. The authors propose to test against all permutations of the kernel, and choose the permutation with minimal activation. Thus, the network can learn isomorphic features of the graph. This idea is used for binary graph classification on a number of tasks.

Graph classification is an important problem, and I found the proposed solution to be quite elegant. The paper is mostly well written (it could use some proofreading, but the main ideas are explained well). Overall, I liked the idea and tend towards acceptance.

In the experiments, the authors report using different hyperparameters for each data set (e.g., k). I did not understand how these parameters were chosen, since only training and testing sets were reported. I would like the authors to clarify how model selection was performed.

Also, Figure 1 and the details in Section 4 discuss a 1-layer isomorphic NN. The discussion in Section 4.3.2 discusses multi-layer feature extraction. If I understand correctly, this means to apply the graph isomorphic layer + min pooling + softmax several times, but this should be stated explicitly.",6,,ICLR2020
r1geCyJTKB,1,HkxlcnVFwB,HkxlcnVFwB,Official Blind Review #3,"Main contributions:
This paper generalizes the recent state-of-the-art behavior-agnostic off-policy evaluation method DualDice into a more general optimization framework: GenDice. Similar to DualDice, GenDice considers distribution correction over state-action pairs rather than over states as in Liu et al. (2018), which can handle behavior-agnostic settings. The optimization framework (in equation (9)) is novel and neat, and the practical algorithm seems more powerful than the previous DualDice. As a side product, it can also be used to solve the offline PageRank problem.

Clarity:
This paper is well structured and clearly written.

Connection of theory and experiment:
I have a major concern about Theorem 1 and the choice of the regularizer $\lambda$. In the infinite-sample case, the derivation of Theorem 1 is reasonable since both terms are nonnegative. However, in practice there will be an empirical gap for the divergence term, so picking a suitable $\lambda$ seems crucial for the experiments. I think a discussion of $\lambda$ for the average case should be added to the experiment section. Also, compared to Liu et al. (2018), which normalizes the weight of $\tau$ in the average case, which approach is better in practice?

Overall I think this paper is good enough to be accepted by ICLR. 
The optimization framework can also inspire future algorithm using different divergence.",8,,ICLR2020 +dnmeD8RSciS,2,rYt0p0Um9r,rYt0p0Um9r,Comments,"*Summary: + +This paper mainly answers a fundamental question: what is the role of depth in convolutional networks? Specifically, the authors present an empirical analysis of the impact of the depth on the generalization in CNNs. Experiments on CIFAR10 and ImageNet32 demonstrate that the test performance beyond a critical depth. My detailed comments are as follows. + +*Positive points: + +1. This paper is significant to understand deep neural networks and helps to develop new deep learning algorithm. + +2. This paper provides many empirical studies to analyze the effect of increasing depth on test performance. + +*Negative points: + +1. The importance and novelty of the research should be emphasized. Recently, there are some works [1][2][3] study the role of depth in DNN. What is the difference from these works? + +[1] Do Deep Convolutional Nets Really Need to Be Deep and Convolutional? ICLR 2017 +[2] Understanding intermediate layers using linear classifier probes. arXiv, 2016. +[3] Towards Interpreting Deep Neural Networks via Understanding Layer Behaviors. 2020 + +2. This paper analyzes the linear neural networks and demonstrates that increasing depth leads to poor generalization. However, existing works apply non-linear neural networks in real-world case. It would be better to provide analysis on non-linear neural networks. + +3. The authors suggest that practitioners should decrease depth in these settings to obtain better test performance. However, ResNet-101 has better test performance than ResNet-18 in practice. Could you please give more explanations? +",5,3.0,ICLR2021 +BEwnKw1hCtz,3,#NAME?,#NAME?,An interesting exploration that could have benefited from a more precise definition of terms,"This paper sets out to determine something about the form of the bias acquired by a standard meta-learning algorithm, and compare the form of that bias to the inherent bias that humans have. The authors point out, rightly, that meta-learning algorithms have meta-biases and it is important to understand these, from both the scientific and engineering perspectives. It is well written and raises good questions. + +There is a cleverly constructed test domain and a set of well-executed computer and human experiments (I think---I don't really know about how to construct a human experiment.) + +Unfortunately, I can't end up agreeing or disagreeing with the claims made in the paper, or really understands how well they are supported by evidence, because I find that they use terms that don't seem to be sufficiently technically well defined. + +For example: +- what exactly is compositional structure? +- what is statistical structure? +- what is your measure of task complexity? + +How can we tell if what the agent learns is compositional? Is that an externally measurable property of the agent's behavior and the way it generalizes to new environments? Or is it a property of the internal representation? (It is common to have an intuition that ""compositional"" also implies ""compact"" or ""low complexity"" in some sense.) + +It feels like generalization be a way to get more clearly at the presence of a compositional representation: could you train on small grids and have the learned agent generalize to big ones? 
It seems like if a fixed-size representation can generalize to very large instances, then that is more clear evidence of compositionality (but then I'm thinking of compositionality as a property of a representation, not of externally-measurable behavior.) + +I also feel that I don't quite understand the meta-learning training regime. What exactly constituted a ""task"" from the meta-learning perspective? Is it a single board? If so, then the meta-learning problem is to learn the task distribution, in some sense. I was expecting something more ""meta"": that is, to test whether the system is actually meta-learning the *idea* of compositionality, it seems like set-up would be that a task corresponds to a particular grammar with multiple boards drawn from the distribution induced by the grammar; then we'd know that it had meta-learned compositionality if it could learn *new grammars* quickly. + +Smaller points +- I didn't completely understand the production rule (nor the examples) for the loop structure. +- It would help me understand the task set better if there were a slightly more in-depth description of the chains, trees, and loops and described how the grammar generates the compositional tasks in figure 2. Is it not the case that every connected configuration of red tiles could be described as a tree? +- Rather than showing just one number for the final performance, It would be helpful to show learning curves for the RL algorithms so the reader can assess the stability, convergence, etc. Similarly, learning curves for humans would be interesting, but less important since I assume they just look flat. +- ""is develop""",6,3.0,ICLR2021 +a_RlVX1mFxq,4,0h9cYBqucS6,0h9cYBqucS6,A nice idea with insufficient analysis,"This paper proposes an efficiency improvement on the ""secure aggregation"" (called SA in this paper) protocol of Bonawitz et al. + +For context: the SA protocol allows a group of clients holding secret values to compute the sum of their values with the help of a reliable server. The computation is ""private"" in the following sense: any adversary that controls the server and at most $t$ out of the $n$ clients learns nothing except the sum of the honest parties' inputs. + +More formally: SA is a multiparty computation protocol for the ""summation"" ideal functionality, that is secure against a *semi-honest* honest adversary. It has the desirable feature of being resilient to failures on the part of many clients (as long as the server remains online). + +The big downsides are that (a) the protocol does not handle malicious behavior on the part of the server and clients and (b) the communication scales poorly with the number $n$ of clients (each client's communication is $\Theta(n)$, and the server's total communication is $\Theta(n^2)$). + +This paper aims to alleviate problem (b) by giving a version that uses a reduced communication graph. The paper argues that if the graph has appropriate properties then the resulting protocol is secure (though that argument is problematic, see below). The paper shows that a random Erdos-Renyi graph with density about $1/\sqrt{n}$ satisfies the properties, leading to a protocol with communication $\tilde O(\sqrt{n})$ per client. + +I like the approach and idea of the paper, but I don't think the execution is quite there yet—I don't feel the paper, as written, is acceptable for ICLR. The paper's value hinges on three claims: correctness (the protocol should complete even when some players drop out), low communication, and security. 
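As a quick back-of-the-envelope check on the communication claim (my own sketch, not code or constants from the paper; the exact edge probability is an assumption on my part), the expected number of pairwise setups per client under the Erdos-Renyi choice indeed grows like $\sqrt{n}$ rather than $n$:

```python
import numpy as np

n = 10_000                          # number of clients (illustrative choice)
p = 1.0 / np.sqrt(n)                # Erdos-Renyi edge probability ~ 1/sqrt(n)

pairwise_complete = n - 1           # key agreements per client in the complete-graph SA protocol
expected_neighbours = p * (n - 1)   # expected degree under G(n, p), roughly sqrt(n)

print(f'complete graph : {pairwise_complete} pairwise setups per client')
print(f'G(n, 1/sqrt(n)): about {expected_neighbours:.0f} pairwise setups per client')
```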
I did not check the arguments for any of these in detail, but correctness and communication look fine. The main issue is the security argument, which is not properly developed (and not obviously correct). The security model is not clearly articulated, the claims don't clearly describe assumptions on the number of corrupted parties, and so forth. For example, the paper makes claims such as that the protocol prevents membership inference. That can't be true because many (most?) membership attacks use only the final trained model (or even just its outputs)—hiding the intermediate gradients doesn't help. + +This paper is about multiparty computation (to be clear, secure summation is a special case of MPC where the ideal functionality is a sum). It should use the language developed over the past thirty+ years in the crypto and security communities to formulate and prove its claims of security. This would allow for clear and refutable claims, and a clear discussion of the new protocols limitations. In particular, the restriction to semi-honest parties is a huge one, albeit shared with the SA protocol of Bonawitz et al. + + + +* Modeling the exact security properties of protocols with parties that drop out is a bit subtle. See, for example, this paper for discussion: https://eprint.iacr.org/2018/997 + +* The idea of replacing the complete graph with a low-degree expander to save communication has been used elsewhere. Some relevant citations are below (though there are others, too): + +** Fitzi, M., Franklin, M., Garay, J., Vardhan, H.: Towards optimal and efficientperfectly secure message transmission. In: Vadhan, S.P. (ed.) TCC 2007. LNCS,vol. 4392, pp. 311–322. Springer, Heidelberg (2007) + +** Harnik, D., Ishai, Y., Kushilevitz, E.: How many oblivious transfers are neededfor secure multiparty computation? In: Menezes, A. (ed.) CRYPTO 2007. LNCS,vol. 4622, pp. 284–302. Springer, Heidelberg (2007) + +* Finally: This type of paper might be a better fit for a security or crypto venue, where its contributions can be better evalauted and appreciated. It is up to the authors where to submit, of course, and I don't generally take conference scope too strictly, but the paper isn't really about learning representations, and isn't clearly a good fit for the ICLR audience. + +",4,4.0,ICLR2021 +vp78c1ImyzJ,1,_i3ASPp12WS,_i3ASPp12WS,Official Blind Review #2,"[Summary] +Online defenses of adversarial examples is an old topic: Given an input x (potentially adversarially perturbed) at test time, we want to sanitize x to get x', on which the trained classifier $g \circ f$ gives the correct answer. This paper proposes a new architecture for online defenses via self supervision. There are two new things in the proposal: + +1. There is an explicit representation function f, namely the classifier is decomposed into $g \circ f$. And the auxiliary self-supervised component h works on the same representation. This thus creates a Y-shape architecture that is ""syntactically"" similar to the training structure in unsupervised domain adaptation (e.g., domain adversarial neural networks). This architecture for online defense seems new (as far as I know). + +2. The paper leverages an interesting hypothesis that for a common f, a large classification loss happens if and only if a large self-supervision loss happens. 
And this paper provides solid evidence to justify this -- namely in Section 4.1 (auxiliary-aware attacks), it evaluates the defense against an adversary that is aware of h, in order to create adversarial examples that explicitly breaks the hypothesis (i.e. large classification loss but small self-supervision loss). + +3. For the experiments -- the paper trained f, g, and h under Gaussian corruptions, and indeed found that this online purification strategy provides robustness under adversarial perturbations, even for auxiliary-aware attacks, which is interesting. + +[Assessment] +1. My first worry is that the performance of the defense is still much worse than the performance from direct adversarial training (for example, check the MNIST numbers). For example, under PGD, on CNN architecture we can achieve 80%ish accuracy. Note that for MNST, a simple discretization can already achieve almost-perfect accuracy. This is especially the case if we consider auxiliary-aware attacks. + +2. Following (1), what worries me more is that online-purification still needs to be aware of the attack type. Namely if one looks into equation (4), the objective has encoded norm-based attacks within it. This makes the results less interesting. + +3. All in all, my major doubt is what is really the benefit of reduced training complexity if we cannot achieve better robustness, and also the defense still needs to be fully aware of the attack type? For these reasons, I vote for a weak reject. + +[Questions] +1. Why do we need to know the results for FCN (fully connected networks)? + +2. I am not sure the numbers reported for adversarial training match the state of the art reported in the MNIST challenge leaderboard: https://github.com/MadryLab/mnist_challenge. There the SOTA MNIST model always has >88% accuracy (so I am a bit skeptical about DF can bring down the accuracy to 78% for PGD AT). Also, how about applying those attacks for the self-supervision defense? (that's an additional request). Similarly, for CIFAR10, as shown by https://github.com/MadryLab/cifar10_challenge, PGD AT is never under 43%, but in Table 2, the robust accuracy is only 2% under CW attack. This is suspicious. + +[Post rebuttal] + +After more discussion and reading through the revision, I think this is a good paper and will be useful to the community for an instance of test-time defenses.",7,5.0,ICLR2021 +Bkj-7CYef,3,ByQZjx-0-,ByQZjx-0-,Improving Neural Architecture Search by Parameter Sharing,"In this paper, the authors look to improve Neural Architecture Search (NAS), which has been successfully applied to discovering successful neural network architectures, albeit requiring many computational resources. The authors propose a new approach they call Efficient Neural Architecture Search (ENAS), whose key insight is parameter sharing. In NAS, the practitioners have to retrain for every new architecture in the search process, but in ENAS this problem is avoided by sharing parameters and using discrete masks. In both approaches, reinforcement learning is used to learn a policy that maximizes the expected reward of some validation set metric. Since we can encode a neural network as a sequence, the policy can be parameterized as an RNN where every step of the sequence corresponds to an architectural choice. In their experiments, ENAS achieves test set metrics that are almost as good as NAS, yet require significantly less computational resources and time. + +The authors present two ENAS models: one for CNNs, and another for RNNs. 
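To make the parameter-sharing idea concrete, here is my own minimal sketch (not the authors' code; the operation count and tensor shapes are arbitrary assumptions) of how every sampled child architecture can index into a single shared weight bank instead of being trained from scratch:

```python
import numpy as np

rng = np.random.default_rng(0)
# One shared bank of weights: 8 candidate operations, each a 64x64 linear map.
shared_bank = rng.normal(scale=0.1, size=(8, 64, 64))

def child_forward(x, op_choices):
    # op_choices: one operation index per layer, e.g. sampled by the RNN controller.
    for op in op_choices:
        x = np.tanh(x @ shared_bank[op])  # every child reuses slices of the same bank
    return x

x = rng.normal(size=(32, 64))
out = child_forward(x, op_choices=[3, 0, 5])  # one sampled child architecture
print(out.shape)
```

Because the bank is shared, gradients from many sampled children accumulate into the same parameters, which is what removes the need to retrain each candidate architecture from scratch.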
Initially it seems like the controller can choose any of B operations in a fixed number of layers along with choosing to turn on or off ay pair of skip connections. However, in practice we see that the search space for modeling both skip connections and choosing convolutional sizes is too large, so the authors use only one restriction to reduce the size of the state space. This is a limitation, as the model space is not as flexible as one would desire in a discovery task. Moreover, their best results (and those they choose to report in the abstract) are due to fixing 4 parallel branches at every layer combined with a 1 x 1 convolution, and using ENAS to learn the skip connections. Thus, they are essentially learning the skip connections while using a human-selected model. + +ENAS for RNNs is similar: while NAS searches for a new architecture, the authors use a recurrent highway network for each cell and use ENAS to find the skip connections. Thus, it seems like the term Efficient Neural Architecture Search promises too much since in both tasks they are essentially only using the controller to find skip connections. Although finding an appropriate architecture for skip connections is an important task, finding an efficient method to structure RNN cells seems like a significantly more important goal. + +Overall, the paper is well-written, and it brings up an important idea: that parameter sharing is important for discovery tasks so we can avoid re-training for every new architecture in the search process. Moreover, using binary masks to control network path (essentially corresponding to training different models) is a neat idea. It is also impressive how much faster their model performs on tasks without sacrificing much performance. The main limitation is that the best architectures as currently described are less about discovery and more about human input -- finding a more efficient search path would be an important next step.",5,2.0,ICLR2018 +HJxsqDY7cH,3,rklOg6EFwS,rklOg6EFwS,Official Blind Review #2,"The paper improves adversarial training by introducing two modifications to the loss function: (i) a ""boosted"" version of the cross-entropy loss that involves a term similar to a large-margin loss, and (ii) weighting the adversarial loss differently depending on how correctly classified an example is. When put together, these modifications achieve state-of-the-art robustness on CIFAR-10, improving over the previously best robust accuracy by about 3.5%. The authors perform multiple ablation studies and demonstrate that their modified loss function also improves when additional unlabeled data is added (again achieving state-of-the-art robustness). + +I recommend accepting the paper. The modifications for the loss function are well motivated and improve over the state of the art by a non-trivial amount. Moreover, the authors nicely put their loss function in the context of prior work. + +Additional comments: + +- In Table 4, are the ""best"" columns the best checkpoint for the respective column (potentially different checkpoints for different columns) or does ""best"" refer to a single model (for each row)? + +- Is 65.04% (Table 5 b) now the best published robust accuracy on CIFAR-10 (at least to best of the authors' knowledge)? If so, it may be helpful to indicate this to the reader. + +- In Figure 2d, it could be insightful to expand the plot further to see the regime where the performance of MART drops substantially. 
+ +- In Figure 1, the three plots would be easier to compare if the y-axes were the same. + +- From Figure 2, it looks like the gain from the BCE loss is as large as the gain from treating misclassified examples differently. Is this correct? + +- I strongly encourage the authors to release their models in a format that is easy for other researchers to use (e.g., PyTorch model checkpoints). This will make it substantially easier for future work to build on the results in this paper.",8,,ICLR2020 +H1xdjdvhtH,2,B1eyA3VFwS,B1eyA3VFwS,Official Blind Review #2,"This paper proposes to use a differentiable FFT layer to enforce hard constraints for results generated by a CNN. This is demonstrated and evaluated for a 3D turbulence data set (an interesting and challenging problem), and evaluated for a single case. + +While this goal is good by itself, and the domain of applications is a very interesting one, the paper gives the impression of being preliminary, and the claims for the proposed constraints are a bit too generic, in my opinion. + +First, the FFT effectively only yields a somewhat specialized method for projection onto the set of admissible solutions, and is demonstrated only for a single constraint, i.e., to make the flow field solenoidal. The same goal can actually be reached in different ways, e.g., by inferring a vector potential as proposed by Kim et al. 2019 in the ""DeepFluids"" paper. The latter employs a curl formulation, and as such is less general, but probably faster than the FFT based method proposed here. + +In addition, the paper unfortunately contains only a single example. Here, several variants (no constraint, soft constraint, and the proposed method) are evaluated in addition to simpler interpolation methods. Visually, I could not really make out differences in figure 4. The metrics in table 1 look interesting, although it, e.g., didn't get clear to me what the ""KS stats"" mean. The graphs in figure 3 also paint a somewhat varied picture. While some regions seem to be well represented, others are clearly there in the references, but missing in one of the inferred versions. + +I was wondering in general - what is the intuition for divergence-freeness improving the TKE, for example? It's neat to see the metrics improve, but wouldn't one expect that a projection onto divergence free flows rather removes energy from the solutions, and hence maybe yield values that are too low? + +I think the paper could be improved by first evaluating the method for a series of smaller two-dimensional examples, before tackling a full 3D flow. This would simplify comparisons to other methods, and help to illustrate the properties of the method. Ideally, other constraints than enforcing divergence-freeness could be demonstrated to show the generality of employing an FFT projection in the loss function. So currently, I think this paper is not quite ready for a conference such as ICLR. It would be important to demonstrate that the result shown here is not an ""outlier"", but that the improvements are a general trend obtained via the proposed method.",3,,ICLR2020 +qrkAfPFiAa0,4,8Sqhl-nF50,8Sqhl-nF50,"Mathematically elegant paper, but not clear if suitable for ICLR in current form","This paper studies approximation and optimization of linear RNNs for learning linear functions, from the perspective of the memory-properties of the temporal sequence. It shows that linear functionals can be approximated by a linear RNN, with the rate of approximation depending on the long-term memory of the process. 
It also shows that the training dynamics slow down for certain linear functionals with long-term memory. + +Strengths: + +1. The problem being studied in the paper is interesting and well-motivated. Capturing long-term memory is one of the major challenges for sequential models such as RNNs, and the paper makes progress towards understanding this. +2. The functional view of the process is interesting, and seems to shed light on interesting phenomenon regarding memory. The paper also brings in rich tools from functional analysis for analyzing RNNs, which could perhaps be more broadly useful if they can be made accessible enough. + +Weaknesses: + +1. I think the paper needs to be significantly rewritten for the ML audience to extract much out of it. Most of the ML community will not be familiar with the tools and terminology here, including classical results from representation theory such as the Riesz-Markov-Kakutani theorem. Providing more intuition and context for these results will be very helpful. For instance, it would be good to provide some intuition for the \rho(t) function. Once this is introduced, the authors use it interchangeably in place of H_t for describing all their subsequent examples, but it would be better to provide some intuition for the examples directly. The underlying phenomenon are simple and elegant, and I think they can be explained effectively to the ML community. +2. As far as the main message of the paper regarding memory goes, I think it is interesting, but I am not sure if all the machinery is necessary for showing this result? For instance, for a linear functional which does not decay fast enough on the constant input (such as in the conditions of Thm 4.2), would it not be possible to show that it cannot be approximated using a small number of neurons even in the discrete case? The reason being that the process “remembers” inputs over exponentially long windows, and hence you need an exponential number of units to approximate it (at least with linear activations)? Can the authors shed light on the power of the continuous time view and representation theory for showing this? +3. The optimization result seems a bit tailored to a particular functional. I think if the authors could explain more generally why the optimization is getting stuck at a plateau, even at an intuitive level, then that would be useful. I’m also curious about the same question as before, is it not possible to construct a specific worst case function even in the discrete case? + +Overall, I think there are some nice ideas and tools here, but am not sure if the ML community will get a lot out of this currently. + +Some more comments: + +1. Please define/describe the airy function. +2. Typo above Eq (21), We->we. +3. Typo above conclusion, exponentially->exponential. +4. Theorem 4.2, y_i^(k)(t) with the superscript (k) does not appear to be defined? +5. Please clearly define what inputs and outputs are at the beginning of Section, for example x is input, y is output, h is not observed. + +------Updates after response------ + +I thank the authors for the detailed response and the revision. I am still not completely convinced regarding the suitability for ICLR and have similar concerns to reviewer 2, but am not opposed to acceptance. In light of this, I have increased my score to 6.",6,3.0,ICLR2021 +wWtYkekwup,2,eo6U4CAwVmg,eo6U4CAwVmg, ,"In this paper, the authors suggest using the contrastive loss to improve the training of the discriminator and further stabilize the GAN training process. 
More specifically, the proposed method incorporates the self-supervised simCLR contrastive loss on a pair of transformed real images and supervised contrastive loss on the fake ones. The proposed method is evaluated on the image synthesis task on CIFAR10/100 and CelebA-HQ-128 images and over several different GAN models. + +Strengths: +* The idea of using self-supervised learning for improving the training dynamics of the discriminator makes sense and is an interesting exploration area. +* Empirical evaluations show a consistent and significant advantage for the proposed methods and ablation studies verify the contributions of the different proposed components. + +weaknesses: +* The proposed method is rather a careful ensembling of existing components, e.g. simCLR self-supervised or supervised contrastive loss in the right context, rather than a radically novel methodological contribution. +* The proposed method is only tested on relatively low-resolution datasets, namely CIFAR10/100 and CelebA-HQ-128. It would have been interesting to also demonstrate the contributions on the more challenging higher resolution datasets. + +Detailed comments: +* Equation 7: If I am not mistaken, it doesn't seem quite the same as in the referred supervised contrastive loss from Khosla et al.; more specifically, I think the order between log and \sum_{v_{i+}^{(2)} \in V_{i+}^{(2)}} need to be reversed. Please clarify. +* An ablation study I was missing was having the proposed method without the extensive simCLR augmentations and see how it compares to the other methods. +* It would have been interesting to also compare with some other recent and relevant work, e.g. ADA (Karras et al.). +* ""Remark that we use an independent projection header h_f instead of h_r"" => Is this making a significant difference? Are there ablation studies showing this? +* Typo: ""approaches that handles"" +",6,4.0,ICLR2021 +pYAr9vBRnwG,2,IMPA6MndSXU,IMPA6MndSXU,An interesting work but lacking promising results ,"Summary: This paper proposes to learn the categorical semantic features in an unsupervised manner to enhance the ability of image translation at preserving high-level features, and obtains some good results on Semantic-Preserving Unsupervised Domain Translation and Style-Heterogeneous Domain Translation problems. + + +Major issues: +- The proposed method seems to be a combination of current works. The main contribution of this work may be leveraging the unsupervised representation learning for semantic features extraction. + +- The quality of generated images is still not satisfactory with such rapid development of GANs. + +- The experiment and qualitative evaluation are too limited. Only two image translation tasks are conducted for comparison, and little visual results are given. It will be preferred if some common I2I tasks results are given. Only FID is used, adding other metrics, such as LPIPS, NDB, and JSD, will be more convincing. + + +Minor issues +- What are the essential differences between SPUDT and SHDT problem? How does the model solve the two problems according to their differences? + +",4,4.0,ICLR2021 +S1gpyr26tH,3,rJecbgHtDH,rJecbgHtDH,Official Blind Review #2,"This paper introduces a framework for composing tasks by treating tasks as a Boolean algebra. The paper assumes an undiscounted MDP with a 0-1 reward and a fixed absorbing set G, and considers a family of tasks defined by different reward functions. Each task defers only by the value of the reward function at the absorbing set G. 
These restrictions are quite severe but basically describes goal-state reaching sparse reward tasks, which are quite general and valuable to study. The paper then defines a mapping onto a Boolean algebra for these tasks and shows how the mapping also allows re-using optimal Q functions for each task to solve a Boolean composition of these tasks. This is demonstrated on the tabular four-rooms environment and using deep Q learning for a 2D navigation task. + +The writing is relatively clear and the experiments support the claim in the paper that the framework allows learning compositions of skills. Both experiments show that after learning a set of base tasks, the method can solve a task in a zero-shot manner by composing Q functions according to the specified task. This capability seems very useful wherever it can be applied. But I worry that since the setting is so constrained, it is not likely to be widely applicable. The method in the paper likely does not apply to non-sparse, non-goal reaching settings, and prior methods have explored compositionality in that space anyways. + +The coverage of prior work seems complete. One suggestion is to discuss recent goal relabeling work such as Hindsight Experience Replay (Andrychowicz 2017). Kaelbling 1993 is mentioned already, but this line of work has recently shown significant progress in learning to achieve multiple goals at the same time from a different perspective (and also considers sparse rewards). + +However, my main concern with this paper is that it is not clear the language of Boolean algebra leads to significant insights in solving these compositional problems. Take Figure 1, which shows the disjunction and conjunction of tasks. While it is true the average does not lead to the same optimal policy as the conjunction, people use it because learning from the completely sparse reward is often prohibitively difficult. This kind of reasoning is straightforward in the restricted case of MDPs considered in the paper and people can design their reward function directly without considering boolean algebra. The result and proofs about recovering optimal Q functions without extra further training are interesting, but again, seem straightforward in the restricted family of MDPs considered without looking at Boolean algebra. Therefore, I am currently considering the paper borderline. +",3,,ICLR2020 +SkgFH5o3FS,1,BJlNs0VYPB,BJlNs0VYPB,Official Blind Review #1,"This paper attempts an in depth study of the lottery ticket hypothesis. The lottery ticket hypothesis holds that sparse sub-networks exist inside dense large models and that the sparse sub-networks achieve at least as good an accuracy as the underlying large model. These sub-networks are discovered by training and iteratively pruning the dense model. This paper investigates the epoch at which pruning should occur as well as the epoch at which weights should be rewound when retraining. Then, the authors conduct experiments with different pruning strategies (one-shot vs. gradual) in an attempt to find such sparse models (or ""winning tickets"") earlier than they otherwise would have been found. + +The experiments conducted by the authors seem to be very extensive, and I think the paper contains useful data to have for researchers interested in better understanding the lottery ticket hypothesis. However, my main issue is with both the originality and significance of this work. 
This paper gives evidence that winning tickets may be found ""early,"" although their notion of early still involves quite a lot of training. + +Although the paper is interested in addressing the structure of the winning tickets, I really didn't find any of the discussion of structure to give much insight into the lottery ticket hypothesis. Most of the section focuses on analyzing weight magnitude, though I was hoping for something more about the actual structure of the sparse subnetwork -- especially given the title of the paper. Figure 3 is notable, showing that different winning tickets (parameterized by different prune and rewind epochs) can have a large Hamming distance between them. This is very interesting, and I wish the authors had more to say. How is this affected by different initializations? Are these solutions connected on a loss landscape? Is there something invariant about the sparse architecture after symmetries are taken into account? It's not clear to me that Hamming distance alone is enough. + +In conclusion, the paper presents a set of nice experiments, but doesn't really shed too much additional light on the scientific nature of the lottery ticket hypothesis.",3,,ICLR2020 +ryJ6sRYlf,2,Hy8hkYeRb,Hy8hkYeRb,Major and minor concerns,"The paper ""A Deep Predictive Coding Network for Learning Latent Representations"" considers learning of a generative neural network. The network learns unsupervised using a predictive coding setup. A subset of the CIFAR-10 image database (1000 images horses and ships) are used for training. Then images generated using the latent representations inferred on these images, on translated images, and on images of other objects are shown. It is then claimed that the generated images show that the network has learned good latent representations. + +I have some concerns about the paper, maybe most notably about the experimental result and the conclusions drawn from them. The numerical experiments are motivated as a way to ""understand the capacity of the network with regards to modeling the external environment"" (abstract). And it is concluded in the final three sentences of the paper that the presented network ""can infer effective latent representations for images of other objects"" (i.e., of objects that have not been used for training); and further, that ""in this regards, the network is better than most existing algorithms [...]"". + +I expected the numerical experiments to show results instructive about what representations or what abstractions are learned in the different layers of the network using the learning algorithm and objectives suggested. Also some at least quantifiable (if not benchmarked) outcomes should have been presented given the rather strong claims/conclusions in abstract and discussion/conclusion sections. As a matter of fact, all images shown (including those in the appendix) are blurred versions of the original images, except of one single image: Fig. 4 last row, 2nd image (and that is not commented on). If these are the generated images, then some reconstruction is done by the network, fine, but also not unsurprising as the network was told to do so by the used objective function. What precisely do we learn here? I would have expected the presentation of experimental results to facilitate the development of an understanding of the computations going on in the trained network. How can the reader conclude any functioning from these images? 
Using the right objective function, reconstructions can also be obtained using random (not learned) generative fields and relatively basic models. The fact that image reconstruction for shifted images or new images is evidence for a sophisticated latent representations is, to my mind, not at all shown here. What would be a good measure for an ""effective latent representation"" that substantiates the claims made? The reconstruction of unseen images is claimed central but as far as I could see, Figures 2, 3, and 4 are not even referred to in the text, nor is there any objective measure discussed. Studying the relation between predictive coding and deep learning makes sense, but I do not come to the same (strong) conclusions as the author(s) by considering the experimental results - and I do not see evidence for a sophisticated latent representation learned by the network. I am not saying that there is none, but I do not see how the presented experimental results show evidence for this. + +Furthermore, the authors stress that a main distinguishing feature of their approach (top of page 3) is that in their network information flows from latent space to observed space (e.g. in contrast to CNNs). That is a true statement but also one which is true for basically all generative models, e.g., of standard directed graphical models such as wake-sleep approaches (Hinton et al., 1995), deep SBNs and more recent generative models used in GANs (Goodfellow et al, 2014). Any of these references would have made a lot of sense. + +With my evaluation I do not want to be discouraging about the general approach. But I can not at all give a good evaluation given the current experimental results (unless substantial new evidence which make me evaluate these results differently is provided in a discussion). + + +Minor: + +- no legend for Fig. 1 + +-notes -> noted + +have focused + + + + +",3,4.0,ICLR2018 +ae8p5fddUKI,5,6t_dLShIUyZ,6t_dLShIUyZ,Review,"This paper combines a widely used variance reduction technique SVRG with the greedy-GQ. It provides a finite-time analysis of the proposed algorithm in the off-policy and Markovian sampling setting (convergence to the stationary point) and improves the sample complexity from the order $\epsilon^{-3}$ to $\epsilon^{-2}$ comparing with the vanilla greedy GQ. Interestingly, the analysis shows that the biase error caused by the Markovian sampling and the variance error of the stochastic gradient are reduced by the $M$, where M is the batch size of the batch gradient in SVRG. At last, it verifies the theoretical claim by two toy examples. + +pros: +1. It combines the variance reduction trick in optimization community with the two time scale analysis in RL. +2. The analysis is on the off-policy control setting, which in general is much harder than the off-policy evaluation setting. +3. The objective function of MSPBE in control setting is non-convex, which increases the difficulty of the proof. + + +cons: +1. The main contribution of this paper is its theoretical analysis. However the techniques in the proof have already existed in many literatures. It seems that the author just combines them together. For instance, there are many literatures on the convergence analysis of the SVRG in the non-convex setting. The author claims that the 'fine-tuned' Lyapunov function is novel. However such tools in the non-convex SVRG are widely used. It may be true that we need to chose the c_t carefully to cancel some error term but the main framework of the proof is the same. + +2. 
I am not sure whether the variance reduction technique is useful in practice. There are some evidences that SVRG does not work well in the training of deep learning problem. Would faster convergence to the stationary points lead to the better performance (e.g. higher reward )? I do not see anything related to that in the experiment. + +3. The author just tests their algorithm on two toy examples. I hope to see more complicated experiments. Maybe the author can try the neural network in the function approximation beyond the linear function approximation. I know the analysis is just on the linear case, but this experiment would demonstrate the potential applicability of the algorithm. + + +################after rebuttal + +After reading the responds from the author, I keep my score at 5. +",5,3.0,ICLR2021 +HJgX4T6CtH,2,BJxQxeBYwH,BJxQxeBYwH,Official Blind Review #3,"This paper presents a dissection analysis of graph neural networks by decomposing GNNs into two parts: a graph filtering function and a set function. Although this decomposition may not be unique in general, as pointed out in the paper, these two parts can help analyze the impact of each part in the GNN model. Two simplified versions of GNN is then proposed by linearizing the graph filtering function and the set function, denoted as GFN and GLN, respectively. Experimental results on benchmarks datasets for graph classification show that GFN can achieve comparable or even better performance compared to recently proposed GNNs with higher computational efficiency. This demonstrates that the current GNN models may be unnecessarily complicated and overkill on graph classification. These empirical results are pretty interesting to the research community, and can encourage other researchers to reflect on existing fancy GNN models whether it's worth having more complex and more computationally expensive models to achieve similar or even inferior performance. Overall, this paper is well-written and the contribution is clear. I would like to recommend a weak accept for this paper. If the suggestions below can be addressed in author response, I would be willing to increase the score. + + +Suggestions for improvement: + +1) Considering the experimental results in this paper, it is possible that the existing graph classification tasks are not that difficult so that the simplified GNN variant can also achieve comparable or even better performance (easier to learn). This can be conjectured from the consistently better training performance but comparable testing performance of original GNN. Another possibility is that even the original GNN has larger model capacity, it is not able to capture more useful information from the graph structure, even on tasks that are more challenging than graph classification. However, this paper lacks such in-depth discussions; + +2) Besides the graph classification task, it would be better to explore the performance of the simplified GNN on other graph learning tasks, such as node classification, and various downstream tasks using graph neural networks. This can help demystify the question raised in the previous point; 3) The matrix \tilde{A} in Equation 5 is not well explained (described as ""similar to that in Kipf and Welling (2016)""). It would be more clear to directly point out that it is the adjacency matrix, as described later in the paper.",6,,ICLR2020 +rygqdW892X,2,S1lg0jAcYm,S1lg0jAcYm,"REVISED: ARM algorithm is an interesting approach to a limited domain of interest in ML. 
While limited, it may spark new research into augmentation of random variables for variance reduction","Overview.
The authors present an algorithm for lowering the variance of the score-function gradient estimator in the special case of stochastic binary networks. The algorithm, called Augment-REINFORCE-Merge, proceeds by augmenting binary random variables. ARM combines Rao-Blackwellization and common random numbers (equivalent to antithetic sampling in this case, due to symmetry) to produce what the authors claim to be a lower-variance gradient estimator. The approach is somewhat novel. I have not seen other authors attempt to apply REINFORCE in an augmented space and with antithetic samples / common random numbers, and Rao-Blackwellization. This combination of techniques may be a good idea in the case of Bernoulli random variables. However, due to a number of issues discussed below, this claim is not possible to evaluate from the paper.

Issues/Concerns
- I assess the paper in its current form as too far below the acceptable standard in writing and in clarity of presentation, setting aside other conceptual issues which I discuss below. The paper contains many typos and a few run-on sentences that span 5-7 lines. This hinders understanding substantially. A number of key terms are not explained, or only irregularly. Although the paper assumes that readers do not know the mean and variance of a Bernoulli random variable, or the definition of an indicator function, it does not explain what random variable augmentation means. The one sentence that comes close to explaining it seems to have a typo: ""From (5) it becomes clear that the Bernoulli random variable z ∼ Bernoulli(σ(φ)) can be reparameterized by racing two augmented exponential random variables ..."". It is not clear what is meant by ""racing"" here, and I do not find it clear from equation (5) what is going on. Unfortunately, in the abstract, the paper claims that variance reduction is achieved by ""data augmentation,"" which has a very specific meaning in machine learning unrelated to augmented random variables, further obfuscating meaning. Similarly, the term ""merge"" is not explained, despite the subheading 2.3.
- Computational issues are not addressed in the paper. Whether or not this method is useful in practice depends on its computational complexity.
- No effort is made to diagnose the source of the variance reduction, other than in the special case of analytically comparing with the Augment-REINFORCE estimator, which does not appear in any of the experiments.
- No effort is made to empirically characterize the variance of the gradient estimator, unlike Tucker et al. (2017) and Grathwohl et al. (2018).
- The algorithm presented in the appendix appears to only address single-layer stochastic binary networks, which are uninteresting in practice.
- Figure 2 (d), (e), and (f) all show that ARM was stopped early. Given that RELAX and REBAR overfit, this is a little troubling. Overall, these results are not very convincing that ARM is better, particularly in the absence of variance analysis (empirically, or other than w.r.t. the same algorithm without the merge step). All algorithms should be run for the same number of steps, particularly in cases where they may be prone to overfitting.
- I believe Figure 1 contains an error for the REINFORCE estimator. In my own research I have run these experiments myself, with a value of p close to the one used by the authors. 
REBAR and RELAX both reduce to a REINFORCE gradient estimator with a control variate that is differentiably reparametrizable, and so the erratic behaviour of the REINFORCE estimator in this case is likely wrong. +- There is a mysterious sentence on page 6 that refers to ARM adjusting the ""frequencies, amplitudes, and signs of its gradient estimates with larger and more frequent spikes for larger true gradients"" +-The value to the community of another gradient estimator for binary random variables is low, given the plethora of other methods available. Given the questions remaining about this methodology and its experiments, I recommend against publication on this basis also. +- Table 2 compares results that mix widely different architectures against each other, some taken directly from papers, others possibly retrained. This is not a valid comparison to make when evaluating a new gradient estimator, where the model must be fixed. + + +* EDIT: I have re-evaluated the careful and comprehensive response to my concerns by the authors. I thank them for their effort in this. As many of the concerns were related to communication and have been addressed in the most recent draft, I think it is appropriate to move my review upwards. The revisions make this paper quite different from the original, and I am happy to re-evaluate on that basis--this is a peculiarity of the ICLR open review procedure, but I consider it a strength. + +I note that ""data augmentation"" in machine learning appears to have collided with a term in the Bayesian statistics literature, and the authors have provide a number of citations to support this. I strongly recommend ""variable augmentation"" going forward, as that is an accurate description (you are augmenting a random variable, rather than the input data domain). This appears to be one of the growing pains of the field of ML which has distinct and often orthogonal concerns to classical statistics around density approximation and computational issues.* + +",6,4.0,ICLR2019 +6klxucqLfY0,3,BM---bH_RSh,BM---bH_RSh,An interesting study on compression of recommendation models,"This paper studies the compression of recommendation models (RMs). That is new and relatively less studied in the model compression field, but of great practical value. The main unique challenge of RM compression lies in the entanglement of compressing both the network parameters and the feature embedding inputs, and the latter often accounts for more of the computational bottleneck. + +To this end, the authors proposed the UMEC framework, by integrating these two sub-tasks into one unified constrained optimization problem, solved by ADMM. Specifically, they develop a resource-constrained optimization that directly sets the target resource consumption and eases the practical usage. The authors conduct extensive experiments and demonstrate the effectiveness of UMEC by observing its superior performance than other state-of-the-art baseline methods. The paper is very well written, and the notations and technical details are clearly presented. + +Question 1: My major concern is that, although the authors reported many baseline comparisons and ablation studies, all experiments are on only one dataset (i.e., Criteo AI Labs Ad Kaggle), and one task (CTR prediction). It is unclear whether the proposed method can be generally useful or can be scaled up to industry-level large systems. + +Question 2: in Section 4.4 the authors compared with Ginart et al. 
(2019) for input dimension reduction, while they mentioned another prior work Joglekar et al. (2020) using AutoML to search for feature dimensions. Is it possible to also compare with the later one? + +Question 3: in Eqn (1), why only enforcing structured sparsity for the input layer? Wouldn’t it be more natural if also extended to the remaining layers?",7,5.0,ICLR2021 +u4KTaCw8Da,3,VcB4QkSfyO,VcB4QkSfyO,"A stepping stone towards understanding the generalization and robustness properties of Deep Equilibrium models. Some comparison to relevant work is missing, which is a major issue.","**after rebuttal**: The authors have addressed some of my major concerns in an updated version. for this reason I raise my score to a point where I can recommend acceptance. I now add after-rebuttal comments at the end of each item of my original review. + +## Summary +This work obtains upper bounds on the Lipschitz constant of a Monotone Deep Equilibrium Model (monotone DEQs) depending only on their strong monotonicity parameter $m$. This is in contrast to the naive bound for deep neural networks, which degrade with the depth. This also implies that controlling the smoothness of a monotone DEQ is possible just via the single parameter $m$, rather than controlling operator norms of each layer of a DNN. Additionaly, the authors derive generalization bounds for such type of models, based on the deterministic PAC-bayes approach. This generalization bound reveal a dependency on the strong monotonicity parameter $m$, as well as the size of the hidden layer and appears to be the first result of such kind. + +## Pros: +**1. Clarity: The main claims of the paper and their exposition are clearly stated**: The main contributions i.e., the upper bound on Lipschitz constant (Thm1) Upper bound on change after perturbation of the weights (Thm2) and Generalization bound (Thm3) draw from well developed techniques from monotone operator theory and statistical learning theory, using well known concepts which are accessible to researchers with basic knowledge in such areas. + +**2. Significance: It appears that DEQs have many advantages like reduce memory usage as well as good performance, hence it is of major importance to understand their smoothness properties and generalization guarantees, which this work contributes to**. Controlling the smoothness of traditional neural networks seems to suffer from a computation-quality tradeoff where simple bounds on the Lipschitz constant are easy to enforce, but are of dubious quality, while tight estimates are computationally inefficient. This work provides evidence that exerting such control on monotone DEQs is conceptually easier. + +**3. Originality: although the results mostly come from standard techniques, they are useful and novel, to my knowledge**. Previous work has mostly focused on computational aspects and variations, illustrating the feasibility of the approach. + +## Cons: +**1. Originality/Significance: there is a major relevant work that is not cited nor compared to.** There is some work studying bounds on the Lipschitz constant of implicit models, although more of the flavor of Neural ODEs https://arxiv.org/pdf/2004.13135.pdf +this work also appears to focus on the Lipschitz constant w.r.t. the parameters of the networks. 
I have to accept that the results there are somewhat convoluted, but I think this work should be cited and the differences with this work should be clarified.

There is also the following work https://arxiv.org/abs/1908.06315v4 (note that the v4 is recent so it might not classify as prior work given that the deadline was beginning of October, but there is an initial version v1 dating from 2019). This work also studies many aspects of implicit models. In particular it looks like section 4 in v4 deals with Lipschitz constants with respect to the L-infinity norm, which is related to the current work but it is not cited. Again I think this work should be cited and the differences with the submitted work should be clarified. **after rebuttal**: The authors have included such references and a discussion of the main differences (end of section 2), making clear how their approach differs.

**2. Clarity: Some claims are misleading, regarding the Lipschitz constant**: It is claimed that Theorem 1 shows that the Lipschitz constant does not depend on the matrix W. However, the distinction should be made that the value obtained is an upper bound, so in fact it is possible that the minimal Lipschitz constant depends on W, but this particular upper bound does not. This should make the claim more clear. **after rebuttal**: the authors acknowledge this issue; for some reason I don't see that this is fixed in the new version, but it could be fixed with minor rewriting in the final version. They should only write the conclusions for their **derived upper bound on the Lipschitz constant** rather than **THE Lipschitz constant**, which is traditionally understood as being an infimum over the set of possible constants with the Lipschitz property.

**3. Clarity: It looks like Proposition 2 is not used anywhere and does not correspond to any substantial claim and thus should be removed**. Unless I am missing something, I don't understand why Proposition 2 is relevant or how it is used to support the important theorems. After checking, it is not used in the proofs of the theorems. Am I missing something? **after rebuttal**: the authors do not seem to address this. I have now realized that Proposition 2 is used in the proof of Theorem 2, but because that proof is found in the appendix, it seems that including an intermediate result in the main text is a poor stylistic choice. However this is not a major issue.

**4. Significance: I think that the pros of the generalization bound (no explicit dependence on depth) are great, but its weaknesses are downplayed. In particular the generalization bound depends linearly on the width $h$**. Weaknesses should also be acknowledged. In contrast, as far as I know the DNN generalization bounds such as those of Bartlett et al. or Neyshabur et al. depend only logarithmically on the width.

## Other comments:
1. Theorem 3: typo, change m to M (set of size M)
2. In the experiments in Section 5 it is weird to use the ""lower bound"" from Combettes et al. as it is not really a lower bound. I don't know what the motivation for using this is, but it seems really confusing to use something which is not a lower bound and call it a lower bound. A better lower bound could be obtained by sampling points and taking the maximum norm of the gradients, as is done in other papers. Could this be changed/added easily? **after-rebuttal**: the authors have changed the lower bound to a true lower bound.

All in all, I think the missing references/discussion are a major point that has to be addressed. 
Hopefully this can be done in moderate space/time during rebuttal. In that case I would be willing to increase my score because my overall impression is positive.",7,4.0,ICLR2021 +HyldHxls2X,2,S1xBioR5KX,S1xBioR5KX,"In its present form, this paper seems more like engineered modifications of existing pipelines with insufficient validation, rather than a mature research contribution.","This paper presents a method for training neural networks where an efficient sparse/compressed representation is enforced throughout the training process, as opposed to starting with are large model and pruning down to a smaller size. For this purpose a dynamic sparse reparameterization heuristic is proposed and validated using data from MNIST, CIFAR-10, and ImageNet. + +My concerns with this work in its present form are two-fold. First, from a novelty standpoint, the proposed pipeline can largely be viewed as introducing a couple heuristic modifications to the SET procedure from reference (Mocanu, et al., 2018), e.g., substituting an approximate threshold instead of sorting for removing weights, changing how new weights are redistributed, etc. The considerable similarity was pointed out by anonymous commenters and I believe somewhat understated by the submission. Regardless, even if practically effective, these changes seem more like reasonable engineering decisions to improve the speed/performance rather than research contributions that provide any real insights. Moreover, there is no attendant analysis regarding convergence and/or stability of what is otherwise a sequence of iterates untethered to a specific energy function being minimized. + +Of course all of this could potentially be overcome with a compelling series of experiments demonstrating the unequivocal utility of the proposed modifications. But it is here that unfortunately the paper falls well short. Despite its close kinship with SET, there are surprisingly no comparisons presented whatsoever. Likewise only a single footnote mentions comparative results with DeepR (Bellec et al., 2017), which represents another related dynamic reparameterization method. In a follow up response to anonymous public comments, some new tests using CIFAR-10 data are presented, but to me, proper evaluation requires full experimental details/settings and another round of review. + +Moreover, the improvement over SET in these new results, e.g., from a 93.42 to 93.68 accuracy rate at 0.9 sparsity level, seems quite modest. Note that the proposed pipeline has a wide range of tuning hyperparameters (occupying a nearly page-sized Table 3 in the Appendix), and depending on these settings relative to SET, one could easily envision this sort of minor difference evaporating completely. But again, this is why I strongly believe that another round of review with detailed comparisons to SET and DeepR is needed. + +Beyond this, the paper repeatedly mentions significant improvement over ""start-of-the-art sparse compression methods."" But this claim is completely unsupported, because all the tables and figures only report results from a single existing compression baseline, namely, the pruning method from (Zhu and Gupta, 2017) which is ultimately based on (Han et al., 2015). But just in the last year alone there have been countless compression papers published in the top ML and CV conferences, and it is by no means established that the pruning heuristic from (Zhu and Gupta, 2017) is state-of-the-art. 
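To make the kinship with SET concrete for other readers: the prune/regrow step at the heart of this family of methods fits in a few lines. The sketch below is my own schematic of a magnitude-threshold variant — the function name, the single global threshold, and the uniform random regrowth rule are my illustrative assumptions, not the submission's exact procedure (which, as noted above, replaces sorting with an approximate threshold and changes how the regrown weights are redistributed).

```python
import numpy as np

def prune_and_regrow(weights, mask, threshold, rng):
    """One dynamic-sparse step: drop active weights whose magnitude falls
    below a global threshold (instead of sorting, as SET does), then regrow
    the same number of connections at randomly chosen inactive positions."""
    mask = mask.copy()
    to_drop = mask.astype(bool) & (np.abs(weights) < threshold)
    n_drop = int(to_drop.sum())
    mask[to_drop] = 0.0
    weights = weights * mask
    inactive = np.flatnonzero(mask == 0.0)
    revived = rng.choice(inactive, size=min(n_drop, inactive.size), replace=False)
    mask.ravel()[revived] = 1.0  # regrown connections start from weight zero
    return weights, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)) * (rng.random((64, 64)) < 0.1)
m = (w != 0).astype(float)
w, m = prune_and_regrow(w, m, threshold=0.05, rng=rng)
```

Seen at this level of abstraction, the distance to SET is a handful of engineering choices, which is exactly why I think head-to-head comparisons under matched hyperparameter budgets are unavoidable.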
+ +Note also that reported results can be quite deceiving on the surface, because unless the network structure, data augmentation, and other experimental design details are exactly the same, specific numbers cannot be directly transferred across papers. Additionally, numerous published results involve pruning at the activation level rather than specific weights. This definitively sacrifices the overall compression rate/model size to achieve structured pruning that is more naturally advantageous to implementation in practical hardware (e.g., reducing FLOPs, run-time memory, etc.). One quick example is Luo et al., ""ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression,"" ICCV 2017, but there are many many others. + +And as a final critique of the empirical section, why not report the full computational cost of training the proposed model relative to others? For an engineered algorithmic proposal emphasizing training efficiency, this seems like an essential component. + + +In aggregate then, my feeling is that while the proposed pipeline may eventually prove to be practically useful, presently this paper does not contain a sufficient aggregation of novel research contribution and empirical validation. + +Other comments: + +- In Table 2, what is the baseline accuracy with no pruning? + +- Can this method be easily extended to prune entire filters/activations?",4,4.0,ICLR2019 +Sy3YDsf4e,2,r1VdcHcxx,r1VdcHcxx,Review,"This paper extends batch normalization successfully to RNNs where batch normalization has previously failed or done poorly. The experiments and datasets tackled show definitively the improvement that batch norm LSTMs provide over standard LSTMs. They also cover a variety of examples, including character level (PTB and Text8), word level (CNN question-answering task), and pixel level (MNIST and pMNIST). The supplied training curves also quite clearly show the potential improvements in training time which is an important metric for consideration. + +The experiment on pMNIST also solidly shows the advantage of batch norm in the recurrent setting for establishing long term dependencies. I additionally also appreciated the gradient flow insight, specifically the impact of unit variance on tanh derivatives. Showing it not just for batch normalization but additionally the ""toy task"" (Figure 1b) was hugely useful. + +Overall I find this paper a useful additional contribution to the usage of batch normalization and would be necessary information for successfully employing it in a recurrent setting.",8,4.0,ICLR2017 +BytyNwclz,3,HJGv1Z-AW,HJGv1Z-AW,"Explores interesting issues, but needs more quantitative analysis and has limited novelty","This paper presents an analysis of the communication systems that arose when neural network based agents played simple referential games. The set up is that a speaker and a listener engage in a game where both can see a set of possible referents (either represented symbolically in terms of features, or represented as simple images) and the speaker produces a message consisting of a sequence of numbers while the listener has to make the choice of which referent the speaker intends. This is a set up that has been used in a large amount of previous work, and the authors summarize some of this work. The main novelty in this paper is the choice of models to be used by speaker and listener, which are based on LSTMs and convolutional neural networks. 
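For concreteness, the referential-game protocol being analysed boils down to the following loop; this is my own schematic with dummy agents, purely to make the interface explicit (the paper's speaker and listener are the LSTM/CNN models mentioned above, and the message is a sequence of discrete symbols).

```python
import random

def play_round(speaker, listener, referents, vocab_size, max_len):
    """One round: the speaker sees the target referent and emits a discrete
    message; the listener must pick the target out of the candidate set."""
    target = random.randrange(len(referents))
    message = speaker(referents, target)              # list of symbol ids
    assert len(message) <= max_len and all(0 <= s < vocab_size for s in message)
    choice = listener(referents, message)             # index into referents
    return float(choice == target)                    # 1.0 if communication succeeded

# Dummy agents that trivially encode/decode the target index.
speaker = lambda refs, t: [t % 10, t // 10]
listener = lambda refs, msg: msg[0] + 10 * msg[1]
print(play_round(speaker, listener, referents=list(range(30)), vocab_size=10, max_len=2))
```

Any compositionality claim is then a statement about how the mapping from referent attributes to messages factorises, which is why a quantitative measure of it (point 1 below) matters so much.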
The results show that the agents generate effective communication systems, and some analysis is given of the extent to which these communications systems develop compositional properties – a question that is currently being explored in the literature on language creation. + +This is an interesting question, and it is nice to see worker playing modern neural network models to his question and exploring the properties of the solutions of the phone. However, there are also a number of issues with the work. + +1. One of the key question is the extent to which the constructed communication systems demonstrate compositionality. The authors note that there is not a good quantitative measure of this. However, this is been the topic of much research of the literature and language evolution. This work has resulted in some measures that could be applied here, see for example Carr et al. (2016): http://www.research.ed.ac.uk/portal/files/25091325/Carr_et_al_2016_Cognitive_Science.pdf + +2. In general the results occurred be more quantitative. In section 3.3.2 it would be nice to see statistical tests used to evaluate the claims. Minimally I think it is necessary to calculate a null distribution for the statistics that are reported. + +3. As noted above the main novelty of this work is the use of contemporary network models. One of the advantages of this is that it makes it possible to work with more complex data stimuli, such as images. However, unfortunately the image example that is used is still very artificial being based on a small set of synthetically generated images. + +Overall, I see this as an interesting piece of work that may be of interest to researchers exploring questions around language creation and language evolution, but I think the results require more careful analysis and the novelty is relatively limited, at least in the way that the results are presented here.",5,4.0,ICLR2018 +qevyo7b-1p,1,Bpw_O132lWT,Bpw_O132lWT,"Novel perspective, but left with much ambiguous analysis and insufficient empirical justifications","This paper proposes to use power-law dynamics to approximate the state-dependent gradient noise in SGD, and analyses its escaping efficiency compared with previous dynamics. + +Strength: +1. To the best of my knowledge, it is novel to use power-law dynamics to analyze the state-dependent noise in SGD. +2. Still with strong assumptions on covariance structure, the analytical results based on power-dynamics are interesting. For example, it indicates that so-called kappa distribution highly depends on the fluctuations to the curvature over the training data. This is consistent with following work. So I suggest authors provide some discussion with the following work. +Wu et.al 2018. How sgd selects the global minima in over-parameterized learning: A dynamical stability perspective. In Advances in Neural Information Processing Systems (pp. 8279-8288). + +Weakness & Issues +1. The analytical results seem that they strongly depend on the covariance structure assumption, i.e. C(w) is diagonally dominant according to empirical observation. Does it have any theoretical justifications, or even in simplified cases? +2. The delivered PAC generalization bound and the followed analysis are a little ambiguous. Firstly, in current deep learning theory community, the relationship between flatness (even how to define a proper flatness) and generalization is still mysterious and controversial, which depends many factors. 
This work uses one type of flatness measure, the determinant of H, and shows that flatter minima generalize better by only considering the KL term. However, the first term also includes the Hessian and might also affect generalization bound. Thus, the conclusion appears a little problematic. +The authors said that generalization error will decrease w.r.t. kappa’s increase and infinite kappa results in Langevin dynamics. Then the question is what are the difference between the power-law dynamics and Langevin dynamics in term of generalization? +My view on the ambiguous analysis is that the authors attempt to answer extremely challenging questions but left with many questionable concerns. +3. The experiments might not be sufficient. +I don’t think fitting the parameter distribution according to limited empirical observations is an appropriate way to make justifications. At least, from visual observation, there are many other alternatives besides power-law distribution to fit, as Fig 3 shows. +About comparing the escaping efficiency, the result only shows the success rate, and the evidence about the polynomial and exponential difference should be provided. Also, practical networks and datasets should also be considered to provide more strong evidence. + +If the authors can resolve these issues carefully, I would raise the score. + +Typos +“Eq. 4” should be “Eq.3” below equation 3 +",5,5.0,ICLR2021 +rJxxa4IY3m,3,ryxHii09KQ,ryxHii09KQ,a good start,"In my opinion this paper is generally of good quality and clarity, modest originality and significance. + +Strengths: +- The experiments are very thorough. Hyperparameters were honestly optimized. The method does show some modest improvements in the experiments provided by the authors. +- The analysis of the results is quite insightful. + +Weaknesses: +- The experiments are done on CIFAR-10, CIFAR-100 and subsets of CIFAR-100. These were good data sets a few years ago and still are good data sets to test the code and sanity of the idea, but concluding anything strong based on the results obtained with them is not a good idea. +- The authors claim the formalization of the problem to be one of their contributions. It is difficult for me to accept it. The formalization that the authors proposed is basically the definition of curriculum learning. There is no novelty about this. +- The proposed method introduces a lot of complexity for very small gains. While these results are scientifically interesting, I don't expect it to be of practical use. +- The results in Figure 3 are very far from the state of the art. I realize that they were obtained with a simple network, however, showing improvements in this regime is not that convincing. Even the results with the VGG network are very far from the best available models. +- I suggest checking the papers citing Bengio et al. (2009) to find lots of closely related papers. + +In summary, it is not a bad paper, but the experimental results are not sufficient to conclude that much. Experiments with ImageNet or some other large data set would be advisable to increase significance of this work. 
",5,4.0,ICLR2019 +Ske--iHV3m,1,H1f7S3C9YQ,H1f7S3C9YQ,This paper presents a neural network model that detect synonymous entities based on contextual information without supervision.,"Strengths: +- clear explanation of the problem +- clear explanation of the model and its application (pseudocode) +- clear explanation of training and resulting hyperparameters + +Weaknesses: +- weak experimental settings: +-- (a) comparison against 'easy to beat' baselines. The comparison should also include as baselines the very relevant methods listed in the last paragraph of the related work section (Snow et a.l 2005, Sun and Grishman 2010, Liao et al. 2017, Cambria et al. 2018). +-- (b) unclear dataset selection: it is not clear which datasets are collected by the authors and which are pre-existing datasets that have been used in other work too. It is not clear if the datasets that are indeed collected by the authors are publicly available. Furthermore, no justification is given as to why well-known publicly available datasets for this task are not used (such as CoNLL-YAGO (Hoffart et al. 2011), ACE 2004 (NIST, 2004; Ratinov et al. 2011), ACE 2005 (NIST, 2005; Bentivogli et al. 2010), and Wikipedia (Ratinov et al. 2011)). +- the coverage of prior work ignores the relevant work of Gupta et al. 2017 EMNLP. This should also be included as a baseline. +- Section 2 criticises Mikolov et al.'s skip-gram model on the grounds that it introduces noisy entities because it ignores context structure. Yet, the skip-gram model is used in the preprocessing step (Section 3.1). This is contradictory and should be discussed. +- the definition of synonyms as entities that are interchangeable under certain contexts is well known and well understood and does not require a reference. If a reference is given, it should not be a generic Wikipedia URL. +- the first and second bulletpoint of contributions should be merged into one. They refer to the same thing. +- the paper is full of English mistakes. A proficient English speaker should correct them. +",4,4.0,ICLR2019 +B1uXHzDeM,1,ry4SNTe0-,ry4SNTe0-,"Claims to address the instability of the ""Improved GANs"" but does not provide any convincing evidence","* Summary * +The paper addresses the instability of GAN training. More precisely, the authors aim at improving the stability of the semi-supervised version of GANs presented in [1] (IGAN for short) . The paper presents a novel architecture for training adversarial networks in a semi-supervised settings (Algorithm 1). It further presents two theoretical results --- one (Theorem 2.1) showing that the generator's gradient vanish for IGAN, and the second (Theorem 3.1) showing that the proposed algorithm does not suffer this behaviour. Finally, experiments are provided (for MNIST and CIFAR10), which are meant to support empirically the claimed improved stability of the proposed method compared to the previous GAN implementations (including IGAN). + +I need to say the paper is poorly written and not properly polished. Among many other things: + +(1) It refers to non-existent results in other papers. Eq 2 is said to follow [1], meanwhile the objectives are totally different: the current paper seems to use the l2 loss, while Salimans et al. use the cross-entropy; + +(2) Does not introduce notations in statements of theorems ($J_\theta$ in Theorem 2.1?) and provides unreadable proofs in appendix (proof of Theorem 2.1 is a sequence of inequalities involving the undefined notations with no explanations). 
In short, it is very hard to asses whether the proposed theoretical results are valid; + +(3) Does not motivate, discuss, or comment the architecture of the proposed method at all (see Section 3). + +Finally, in the experimental section it is unclear how exactly the authors measure the stability of training. The authors write ""unexpectedly high error rates and poor generate image quality"" (page 5), however, these things sounds very subjective and the authors never introduce a concrete metric. The authors only report ""0 fails"", ""one or two out of 10 runs fail"" etc. Moreover, for CIFAR10 it seems the authors make conclusions based only on 3 independent runs (page 6). + +[1] Salimans et al, Improved Techniques for Training GANs, 2016",3,4.0,ICLR2018 +H1xUCxgEpm,3,ByePUo05K7,ByePUo05K7,This paper addresses an important issue but fails to propose a solution,"Humans leverage shape information to recognize objects. Shape prior information helps human object recognition ability to generalize well to different scenarios. This paper aims to highlight the fact that CNNs will not necessarily learn to recognize objects based on their shape. Authors modified training images by changing a value of a pixel where its location is correlated with object category or by adding noise-like (additive or Salt-and-pepper) masks to training images. Parameters of such noise-like masks are correlated with object category. In other words if one learns noise parameters or location of altered pixel for each object category, they can categorize all images in the training set. This paper shows that CNNs will overfeat to these noise based features and fail to correctly classify images at test time when these noise based features are changed or not added to the test images. + +Dataset bias is a very important factor in designing a dataset (Torralba et al,. 2011). Consider the case where we have a dataset of birds and cats. The task is image classification. All birds' images have the same background which is different than cats' background. As a result the network that is trained on these images will learn to categorize training images based on their background. Because extracting object based features such as shape of a bird and bird's texture is more difficult than extracting background features which is the same for all training images. + +Authors have carefully designed a set of experiments which shows CNNs will overfeat to non-shape features that they added to training images. However, this outcome is not surprising. Similar to dataset design example, if you add a noise pattern correlated with object categories to training images, you are adding a significant bias to your dataset. As a result networks that are trained on this dataset will overfeat to these noise patterns. Because it is easier to extract these noise parameters than to extract object based features which are different for each image due to different viewpoints or illumination and so on. + +This paper would have been a stronger paper if authors had suggested mechanisms or solutions which could have reduced dataset bias or geared CNNs towards extracting shape like features. + +Antonio Torralba and Alexei A. Efros. Unbiased look at dataset bias. 
In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR '11).",4,4.0,ICLR2019 +BkePIn1g5S,3,BkxA5lBFvH,BkxA5lBFvH,Official Blind Review #1," +This paper proposes to adapt RL agents from some set of training environments (which, in the current instantiation, vary in some simple respect) to a new domain. They build on a framework for model-based RL called PETS. + +The approach goes as follows: + +2-step process + * train probabilistic model-based RL agents in a “population of source domains” + * dropped into new environment use “pessimistic exploration policy” + +Then at test time, in order to compute estimates for the rewards for each action the authors use a “particle propagation” technique for unrolling through their dynamics model . + +The action is chosen by looking at the sum of the 0 through kth percentile rewards. +This is a weird choice. Why are they looking at a sum over quantiles vs a quantile itself? + +The claim is that the models from the first stage capture the epistemic uncertainty due to not knowing z. +However, the authors give a too scant a treatment of what these uncertainty estimates really mean. +For example, they appear to only be valid with respect to an assumed distribution over z. +The paper’s experiments however focus in large part on what happens when the model is evaluated +on values of z that were outside the support of the distribution over training domains. +In this case, any benefit appears to be ill explained by the underlying motivation. + + +The next step here is to finetune the model as data is collected on the new domain. + +Authors propose heuristics for this finetuning that include +1. Drawing experiences from the past experiences (under different domains) and +2. “keeping the model close to the original model”, via some sort of regularization presumably. + +>>> why isn’t the exact nature of how they “keep the model near the original model explained in the text? + perhaps the authors mean that 1. and 2. are one and the same (1 as means to achieve 2) + if this is the case, then the exposition should be improved to make this more clear. + + +Some important details appear to be missing. For example, how many distinct source domains are seen during pretraining? Do they set z different z for every single episode of pretraining? Some language here is unclear, for example what precisely does an “iteration” mean in the context of the experiments? + +The choice to report “average maximum reward” seems strange if what the authors care about is avoiding risk. Can they explain/justify this choice or if not, present a much more comprehensive set of experimental results? + +The figures tracking catastrophic failures vs performance resembles those in +“Combating Reinforcement Learning's Sisyphean Curse with Intrinsic Fear” https://arxiv.org/abs/1611.01211 +This raises some question about why they don’t if concerned with “catastrophic events” model them more explicitly. +Else, if the return accurately captures all desiderata, why to we need to count the failures? + +In short this is a simpzle empirical paper that makes use of heuristic uncertainty estimates, +including in settings when the estimates have no validity. The writing is reasonably clear +and the ideas are straightforward (which is perfectly fine!). A few of the decisions are unnatural, +a few are ad hoc, and a few details are missing. 
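To make one of those objections concrete: summing the 0th-through-kth percentile of the imagined returns (as I read the action-scoring rule) is close to, but not the same as, either a single low quantile or a CVaR-style tail mean, and the latter two at least have standard interpretations. A quick sketch of the three candidates, in my own notation rather than the paper's:

```python
import numpy as np

def sum_of_quantiles(returns, k):
    """Score as I read it: sum of the 0th..k-th percentile values."""
    return sum(np.percentile(returns, q) for q in range(0, k + 1))

def single_quantile(returns, k):
    return np.percentile(returns, k)

def cvar(returns, k):
    """Mean of the worst k% of imagined returns (tail average)."""
    cutoff = np.percentile(returns, k)
    return returns[returns <= cutoff].mean()

imagined = np.random.default_rng(1).normal(loc=1.0, scale=2.0, size=500)
print(sum_of_quantiles(imagined, 5), single_quantile(imagined, 5), cvar(imagined, 5))
```

If the authors care about catastrophic outcomes, I would expect one of the last two, or an explicit argument for the first.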
Overall my sense is that this paper +has some good qualitities, including the clarity of much of the exposition, +but it’s still below the mark to be an impactful ICLR paper. + +==========UPDATE================= +I read the rebuttal and am glad that the authors took time to read my review and engage with the criticism as well as try to make some small improvements to the paper, especially exploring the impact on the number of training environments on the results (in the original paper the number of environments available at train time was unlimited). The answers to some of the other questions were less convincing. E.g. the seemingly incoherent objective of summing over the quantiles falls flat. Why should we care more about being a ""strict generalization"" of some previous algorithm built upon than of having a coherent objective? Overall, I don't think the paper makes it over the bar to accept but I hope the authors continue to improve upon the work and get it into shape where it could be accepted at another strong conference.",3,,ICLR2020 +SyglfqEchm,2,rkl4M3R5K7,rkl4M3R5K7,Well-motivated approach to an interesting problem,"This paper is concerned with the problem of finding adversarial examples for an ensemble of classifiers. This is formulated as the task of finding noise vectors that can be added to a set of examples in such a way that, for each example, the best ensemble element performs as badly as possible (i.e. it’s a maximin problem). + +This is formulated as a two-player game (Equation 1), in which the above description has been relaxed slightly: Equation 1 seeks a *distribution* over noise vectors, instead of only one. This linearizes the game, so that we can seek a mixed Nash equilibrium. Given access to a best response oracle, Algorithm 1 results in such a mixed Nash equilibrium. This is pretty standard stuff (see e.g. “Robust Optimization for Non-Convex Objectives” in NIPS’17, or “A Reductions Approach to Fair Classification” in ICML’18), but the application of this approach to this problem is novel and interesting. + +In Section 2.1, the authors seek to show that they can get provable guarantees for *linear* classifiers, provided that there exists a “pure strategy Nash equilibrium”, which is a set of noise vectors for which *every* classifier misclassifies *every* example. These conditions seem to me to be so strong that I’m not sure that this section is really pulling its weight. + +On the subject of Section 2.1, the authors might consider whether an analysis based on “Two-Player Games for Efficient Non-Convex Constrained Optimization” (on arXiv) could be used here: convert Equation 1 into a constrained optimization problem by adding a slack variable, then reformulate it as a non-zero-sum game, in which one player uses the zero-one loss, and the other uses e.g. the hinge loss. + +While Algorithm 1 makes an unrealistic oracle assumptions, and I didn’t find Section 2.1 fully satisfying, I think that overall the theoretical portion of the paper is sufficiently convincing that one should be surprised if their experiments don’t show good performance (which they do--extremely good performance, in fact). Overall, this is an interesting and important problem, and a well-motivated approach that seems to work well in practice. 
I think Section 2.1 is a bit weak, but this is a relatively minor issue.",6,4.0,ICLR2019 +BJ9J8G_ez,2,r1kjEuHpZ,r1kjEuHpZ,Interesting idea but insufficient explanations and experimental results,"The paper studies a regularization method to promote sparsity and reduce the overlap among the supports of the weight vectors in the learned representations. The motivation of using this regularization is to enhance the interpretability of the learned representation and avoid overfitting of complex models. + +To reduce the overlap among the supports of the weight vectors, an existing method (Xie et al, 2017b) encouraging orthogonality is adopted to let the Gram matrix of the weight vectors to be close to the identity matrix (so that each weight vector is with unit norm and any pair of vectors are approximately orthogonal). + +Neural network and sparse coding are considered as two case studies. The alternating algorithm for solving the regularized sparse coding formulation is common and less attracted. I think the major point is to see how much benefit that the regularization can afford for learning deep neural networks. To avoid overfitting, some off-the-shelf methods, e.g., dropout which can be viewed as a kind of regularization, are commonly used for deep neural networks. Are there any connections between the adopted regularization terms and the existing methods? Will these less overlapped parameters control the activation of different neurons? I think these are some straightforward questions while there are not much explanations on those aspects. + +For training neural networks, a simple sub-gradient method is used because of the non-smoothness of the regularization terms. When training with large neural networks, will the sub-gradient method affect the efficiency a lot compared without using the regularizer? For example, in the image classification problem with ResNet. + +It is better not to use dropout in the experiments (language modeling and image classification), because one of the motivation of using the proposed regularizer is to avoid overfitting while dropout does the same work and may affect the evaluation of effectiveness of the regularization. +",4,4.0,ICLR2018 +SJl2KhcNtB,1,BJxSI1SKDH,BJxSI1SKDH,Official Blind Review #2,"This paper addresses the problem of translating into morphologically-rich languages, which suffers from the problem of sparse vocabularies and high numbers of rare and unseen words. In particular, current approaches such as subwords lack explicit notions of morphology and are obtained independently of the translation objective, while operating at the character-level renders the learning of long-distance dependencies more difficult. + +More concretely, this paper models the generation of target words in a stochastic, hierarchical process, where the morphological features are modelled as latent variables. At each target word, the model samples a vector representing its lemma, followed by a k-dimensional latent inflectional features (e.g. nominative or accusative). To induce sparsity in the inflectional features, the paper uses a ""stretch-and-rectify"" distribution (Louizos et al., 2018) using the Kumaraswamy distribution. The paper further applies a sparsity-inducing regulariser to encourage the inflectional features to take discrete values of ""0"" or ""1"". Parameter estimation is done by optimising a lower-bound on the marginal log-likelihood. 
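Since the stretch-and-rectify construction is what carries the sparsity story, a minimal sketch of how a single inflectional feature is sampled may help readers. This is my reconstruction from the Louizos et al. recipe with a Kumaraswamy base; the stretch interval and shape parameters below are illustrative assumptions, not the paper's exact parameterisation.

```python
import numpy as np

def sample_stretch_and_rectify(a, b, rng, lo=-0.1, hi=1.1):
    """Draw s ~ Kumaraswamy(a, b) on (0, 1), stretch it to (lo, hi), then
    rectify into [0, 1]; the point masses created at exactly 0 and 1 are
    what give the inflectional feature its sparse, almost-binary flavour."""
    u = rng.uniform(1e-6, 1 - 1e-6)
    s = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)   # inverse-CDF sample
    return float(np.clip(lo + (hi - lo) * s, 0.0, 1.0))

rng = np.random.default_rng(0)
feats = [sample_stretch_and_rectify(a=0.5, b=0.5, rng=rng) for _ in range(8)]
print(feats)  # typically a mix of exact 0s/1s and fractional values
```

The clipping is what lets the k-dimensional feature vector behave like a sparse, nearly discrete morphology code while remaining reparameterisable.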
Experiments on translations into three morphologically-rich languages, English-{Arabic, Czech, Turkish}, indicate fairly small but consistent and statistically significant gains in BLEU. Further analysis indicates improved perplexity upper bound on rare words compared to subword-based baselines, along with somewhat interpretable latent inflectional features. + +Overall, this paper proposes an elegant solution to an important problem, and yields statistically significant BLEU improvements over the baselines. I have listed some pros and cons below, with a recommendation of ""Weak Accept"". I would be willing to raise my scores assuming my concerns are sufficiently addressed. + +Pros: +1. The proposed approach is easy-to-follow and explained clearly, and the paper is overall well-written. + +2. Experiments are done on translation into three morphologically-rich languages with both concatenative and non-concatenative morphology, and results in small but mostly consistent BLEU improvements over the best subword, hierarchical, and character baselines for each language. + +3. The proposed approach is general and can potentially be applied on top of other model architectures (e.g. Transformers) and other language generation problems, such as language modelling, summarisation, etc, although the experiments in the paper focus on NMT with a GRU-based architecture. + +Cons: +1. It is still unclear how much computational overhead is introduced by the approach (in terms of both training and prediction times), and how scalable the approach is when applied to language pairs that have much bigger training sets. In Tables 4 and 5, it seems that the largest dataset is English-Turkish multi-domain (434K sentence pairs), which is fairly small compared to other datasets. + +2. Regarding the feature variations result in section 4.4.4, it is hard to draw any conclusions about the latent inflectional features just based on varying the inflectional features at one position. I would suggest running the same experiment for multiple words at different positions, and see whether the same set of inflection features always results in similar inflections. This would better convince the reader that the latent inflectional features really are capturing useful morphological information, and that the target word generation process is appropriately controlled by the latent variable. + +3. Since some of the experiments did not use the same standard splits (e.g. the multi-domain Turkish experiments), it would be nice to report how well external models (e.g. OpenNMT) would do in the authors' dataset split, to make sure that the reported numbers here are at least comparable to that. This would help ensure the credibility of the findings. + +4. The effect of data size experiments (Section 4.4.2) for the character model seems somewhat counter-intuitive. Why would making the training set bigger (by incorporating data from a more diverse domain) make the character model worse? The explanation that ""[the character model]'s capacity cannot cope with the increased amount of sparsity"" does not seem satisfactory. Why does adding more data result in increased sparsity? Furthermore, one can simply use more hidden units or deeper layers to mitigate this problem in the character-based model. 
Assuming the authors are controlling for model capacity, then the character model's hidden state can be made bigger (since the vocabulary size is lower for character models, thus the embedding layer and the softmax layer are by definition smaller than the subword-based model). + +Minor suggestions: +1. Section 3.5 is not very clear; adding some figures or more explanation there would help understand how prediction is done. + +2. It would be interesting to explore other potential metric for measuring morphological generalisation. For instance, at evaluation time, does the model predict words that it has never seen before (e.g. predicting ""plays"", even though the model has only seen ""play"" at training time) at a higher frequency than character/subword/hierarchical baselines? If yes, this would provide more evidence that the latent inflectional features are properly capturing morphological information.",6,,ICLR2020 +Y_9vXodaaPo,2,dN_iVr6iNuU,dN_iVr6iNuU,Review,"This paper proposes methods to induce diversity in the networks of ensemble-based Q-Learning methods. This is achieved my maximizing a variety of measures of inequality based on the L2 parameter norms of individual networks in an ensemble. This is motivated by the benefit of having diversity in the learned features, which itself is motivated by observations on the CKA of some ensembleDQN networks. + +Strengths: +- The high-level motivation of this work is sound. Diversity within an ensemble is undeniably desirable +- The proposed methods do improve performance on interesting benchmarks + +Possible improvements/Weaknesses: +- There are missing steps in the chain from motivation to method, but these steps can easily be done and verified by measuring the appropriate quantities systematically. +- It's not clear exactly what the method is doing, it would be good to actually measure correlations between feature similarities and performance (which the authors claim exists without measuring) +- It's not clear if the method is truly benefiting from feature diversity, or if the regularization induces some totally different effect, simple baselines and using a variety of feature similarity metrics could address this + +I gave this paper a score of 5; while the method is somewhat interesting, the authors need to **quantitatively** show that the regularizations they propose have the effects they claim they have. Otherwise, this is just a ""my number is bigger than your number""-paper, which isn't very valuable for our scientific understanding of deep RL. + + +Detailed comments: +- ""The first deep RL algorithm, DQN"", that's not very historically accurate. Martin Riedmiller published Neural Fitted Q Iteration in 2005, Gerald Tesauro published TD-Gammon in 1995, and I don't even know if they're really the ""first"". Even if we're stricter about the notion of ""deep"", in 2010 Lange & Riedmiller successfully trained deep autoencoders to pretrain features used for NFQ, in 2012 Hess & al used RBMs and MLPs to model policies. Perhaps ""landmark"" or ""remarkable"" would be more accurate. +- The authors emit the hypothesis that ""representation similarity between neural networks in an ensemble-based Q-learning technique correlates to poor performance. To empirically verify our hypothesis..."", but that hypothesis is never verified, showing qualitative plots of two networks with high CKA is very different than making a clear measurement of correlation. 
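Concretely, the measurement I have in mind is cheap: linear CKA between two activation matrices is a few lines (the standard formula, not code from the paper), so computing it for every pair of ensemble members across seeds is not a burden.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activation matrices X (n, d1) and Y (n, d2),
    where the n rows correspond to the same inputs fed to both networks."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    return (np.linalg.norm(Y.T @ X, "fro") ** 2
            / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

rng = np.random.default_rng(0)
acts = rng.normal(size=(256, 64))
q, _ = np.linalg.qr(rng.normal(size=(64, 64)))        # an orthogonal transform
print(linear_cka(acts, acts @ q))                      # ~1.0: CKA is invariant to it
print(linear_cka(acts, rng.normal(size=(256, 64))))    # small for unrelated features
```

Averaging this over ensemble pairs and seeds, and plotting it against return, is all the claimed correlation needs.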
For example, a figure with CKA on the x axis and performance on the y axis for repeated runs and a clear correlation line and r^2 value would be a quantitative statement testing the hypothesis. +- ""that random initialization of neural networks enforces diversity is a misconception"". It's known that different similarity metrics like CCA or maximum matching tend to find that two networks trained with different initializations _do_ converge to different features, while CKA finds they do _not_. It might be valuable to test these metrics in the RL setting as well, I suspect the conclusions/motivation of this paper might differ somewhat. +- Figure 3 is quite misleading, in the bottom row the individual networks are trained to have different norms, thus it is extremely likely that the activations of the last layer are also going to have different norms. This is the first thing the t-SNE is going to pick on, and in that sense Figure 3 most likely isn't showing that different features have been learned, and it's also likely that if the networks learned similar features with different scales t-SNE wouldn't be able to show it. +- The argument the authors make about feature diversity is based on CKA, but the methods are based on L2 norms of parameter vectors. + 1. The link between the two is not made (and not obvious to me) + 2. Deep ReLU networks are known to be invariant under several reparameterizations, including some rescalings (see Dinh et al, https://arxiv.org/abs/1703.04933). It's not clear to me that two networks that have different L2 norms have significantly different features. It's in fact quite easy to take a trained network and then finetune it only with L2 or L1 without significant loss in performance (presumably the same features are thus kept, just scaled; this is what pruning, quantization, and DNN compression are based on). I thus don't see how diversity in features follows from diversity in L2 norms. + 3. A crucial missing baseline seems to be to train an ensemble of networks with a varying L2 regularization loss on each network of the ensemble. Another interesting baseline would be to vary the learning rate of each network, since having smaller or larger parameters may in the end only affect the effective learning rate of a particular network. +- The results of 5.1 and Fig 2 are interesting, but + 1. it would be interesting to see the average performance of diversity-regularized methods, not just the best ones + 2. it seems like a downside that no single diversity method is systematically the best, but perhaps it reflects something about the environments used, and could be interesting to investigate. +- The conclusion claims that ""high representation similarity between the Q-functions in ensemble[s] [..] leads to a decline in learning performance"". The paper only seems to have 4 data points to that effect, Figure 1. This is not a very strong claim. +- The conclusion claims that the proposed method can ""maximize the diversity in the representation space"", but this is not measured, only hypothesized. The claims that this paper really makes are: + - CKA can measure diversity, and ensembles with high CKA perform poorly (this is suggested by Figure 1 and 8, but not measured precisely), thus low diversity is bad + - Having variety in L2 norms in an EnsembleDQN setup is good + One big missing step is to verify that diversity in L2 norms induces diversity as measured by CKA. 
Another missing step as I mention earlier is that to show that high diversity (as measured by CKA) induces good performance. +- Appendix F should be much larger, and contain a plots of all 5 diversity measures, as well as, ideally, individual or group statistics of the L2 norms these regularizations are acting upon.",5,4.0,ICLR2021 +HyemRBD8cB,2,Byx55pVKDB,Byx55pVKDB,Official Blind Review #5,"Summary + +This paper showed that out-of-distribution and adversarial samples can be detected effectively if we utilize logits (without softmax activations). Based on this observation, the authors proposed 2-logit based detectors and showed that they outperform the detectors utilizing softmax activations using MNIST and CIFAR-10 datasets. + +I’d like to recommend ""reject"" due to the following + +The main observation (removing softmax activation can be useful for detecting abnormal samples) is a bit interesting (but not surprising) but there is no theoretical analysis for this. It would be better if the authors can provide the reason why softmax activation hinders the novelty detection. + +The logit-based detectors proposed in the paper are simple variants of existing methods. Because of that, it is hard to say that technical contributions are very significant. + +Questions + +For evaluation, could the authors compare the performance with feature-based methods like Mahalanobis [1] and LID [2]? + +I would be appreciated if the author can evaluate their hypothesis using various datasets like CIFAR-100, SVHN, and ImageNet. + +[1] Lee, K., Lee, K., Lee, H. and Shin, J., 2018. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems (pp. 7167-7177). + +[2] Ma, X., Li, B., Wang, Y., Erfani, S.M., Wijewickrema, S., Schoenebeck, G., Song, D., Houle, M.E. and Bailey, J., 2018. Characterizing adversarial subspaces using local intrinsic dimensionality. arXiv preprint arXiv:1801.02613.",1,,ICLR2020 +S1x9j79pKS,1,B1xGxgSYvH,B1xGxgSYvH,Official Blind Review #2,"This paper introduces the compression risk in domain-invariant representations. Learning domain-invariant representations leads to larger compression risks and potentially worse adaptability. To this end, the authors presents gamma(H) to measure the compression risk. Learning weighted representations to control source error, domain discrepancy, and compression simultaneously leads to a better tradeoff between invariance and compression, which is verified by experimental results. + +The paper presents an in-depth analysis of compression and invariance, which provides some insight. However, I have several concerns: +* In Section 4, the authors propose a regularization to ensure h belongs to H_0. How is the regularization chosen? How does it perform on other datasets? Experimental results only on digit datasets are not convincing. +* In Section 5, the authors introduce weighted representations to alleviate the curse of invariance. However, they do not provide experiments to validate their improvement. +* The organization of this manuscript is poor and difficult to follow. Starting from Section 3, the authors use several definitions to introduce their main theorem. However, these definitions are somewhat misleading. I cannot get the point until the end of Section 3. Besides, the notations are confusing, so I have to go back to the previous sections in case of misunderstanding. 
+ + + +",3,,ICLR2020 +AsQVaEgEL_,2,fylclEqgvgd,fylclEqgvgd,Review for TRANSFORMER PROTEIN LANGUAGE MODELS ARE UNSUPERVISED STRUCTURE LEARNERS,"In this paper, the authors show that transformer protein language models can learn protein contacts from the unsupervised language modelling objectives. They also show that the residue-residue contacts can be extracted by sparse logistic regression to learn coefficients on the attention heads. One of the advantages of using transformers models is that they do not require an alignment step nor the use of specialized bioinformatics tools (which are computationally expensive). When compared to a method based on multiple sequence alignment, the transformers models can obtain a similar or higher precision. + +Contributions of this paper are: +- showing that the attention maps built in Transformer-based protein languages learn protein contacts, and when extracted, they perform competitively for protein contact prediction; +- a method for extracting attention maps from Transformer models; +- a comparison between a recent protein transformer protein language model (which does dot require sequence alignment), and a pseudo-likelihood-based optimization method that uses multiple sequence alignment; +- an analysis of how much the supervised learning (logistic regression) contributes to the results. + +The paper covers a relevant topic and it is easy to read. + +However, I have a number of concerns. The main contribution of the paper is that attention maps built in Transformer-based protein languages learn protein contacts and can be used for protein contact prediction. However, this was reported before in Rives et al.(2019) (doi: 10.1101/622803). Also, several methods have been developed for this problem, but are not included in the comparisons. Finally, the provided implementation details are not sufficient to reproduce the results of the paper. +I detail some of these concerns below, together with questions/suggestions for improvements: + +1) I would recommend comparing transformers to other methods besides Gremlin, or justify why other methods were not included. This review can be helpful: + +(Adhikari B, Cheng J., 2016.. doi: 10.1007/978-1-4939-3572-7_24) + +Also, more recent methods that were published after the review are: + +(Badri Adhikari, 2020. https://doi.org/10.1093/bioinformatics/btz593) + +(Luttrell et al., 2019. https://doi.org/10.1186/s12859-019-2627-6) + +(Gao et al.,2019. https://doi.org/10.1038/s41598-019-40314-1) + +(Ji S et al., 2019. https://doi.org/10.1371/journal.pone.0205214) + +2) On page 7, the authors state that ""We find that the logistic regression probabilities are reasonably well calibrated estimators of true contact probability and can be used directly as a measure of the model's confidence (Figure 10a)"". However, from the plot in Figure 10a, it is not totally clear that the probabilities are well calibrated. Could the authors add more justifications of why they consider it well calibrated? Could they also show a comparison of the calibration of the other transformer models, perhaps using MSE as a calibration metric? + +3) To understand the occurence of false positives, the authors analyze the Manhattan distance between the predicted contact and the true contact, which is between 1 and 4 for most false positives. They also show an example of a homodimer, for which predictions were far from the true contacts, and explain that the model is picking up inter-chain interactions. 
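The statistic I am after is straightforward to extract from the predicted and true contact maps; something along the following lines (a sketch on made-up arrays, not the authors' evaluation code) would already separate near-misses from genuinely wrong predictions.

```python
import numpy as np

def min_manhattan_to_true(pred_contacts, true_contacts):
    """For every predicted contact (i, j) not in the true map, return its
    Manhattan distance to the nearest true contact."""
    true_idx = np.argwhere(true_contacts)
    dists = []
    for i, j in np.argwhere(pred_contacts & ~true_contacts):
        dists.append(int(np.abs(true_idx - np.array([i, j])).sum(axis=1).min()))
    return np.array(dists)

rng = np.random.default_rng(0)
L = 60
true_map = rng.random((L, L)) < 0.03
pred_map = rng.random((L, L)) < 0.03
d = min_manhattan_to_true(pred_map, true_map)
print((d <= 4).mean(), (d > 4).mean())   # near-misses vs. genuinely wrong predictions
```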
Could the authors report how many predictions have a Manhattan distance larger than 4? Is this one example representative of the group of false positives far from the true contact? Maybe the authors could analyse whether this happens in most of the cases. + +4) While ESM-1 is open-source and publicly available, this is not the case for ESM-1b. In section A.5, the authors provide implementation details as differences between ESM-1 and ESM-1b, stating “Compared to ESM-1, the main changes in ESM-1b are: higher learning rate; dropout after word embedding; learned positional embeddings; final layer norm before the output; and tied input/output word embeddings. The weights of all ESM models throughout the training process were provided to us by the authors.”. In my opinion, this is not enough to reproduce the results in this paper. To make it reproducible, the authors need to provide a detailed enough description of the differences to make the reader able to implement ESM-1b, or provide the weights and hyperparameters required to reproduce their results. +",7,5.0,ICLR2021 +VhGLunpaCVS,2,Tt1s9Oi1kCS,Tt1s9Oi1kCS,Recommendation to accept ,"This paper covers an interesting topic of continual learning of the stream of data. One limitation of the existing classification algorithms is their close-set assumption. In close-set methods, a predefined set of classes are considered and a model is trained on the available data from these classes, based on the assumption that test data will be driven from a similar distribution as the training data. However, most of the real-world problems are open-set problems. Open-set models should be able to learn continuously in an online manner with minimum or zero supervision. In other words, they should be able to learn new classes or update the existing classes based on the received new data on-the-fly, without forgetting the previously learned knowledge. +Pros: +In this paper, the authors provide an incremental learning approach that prevents catastrophic forgetting. +Their approach can work on both balanced and unbalanced data. +cons: +The authors need to improve the presentation of the manuscript by providing more explanation. It could be confusing for readers who are not familiar with the topic. +It would be very helpful to publically share the code. +I highly recommend adding the confusion matrix or F1 score in addition to the accuracies. +",7,5.0,ICLR2021 +BkgV-wG5FH,1,H1eWGREFvB,H1eWGREFvB,Official Blind Review #1,"The papers described how to use the repulsive term used with standard SVGD within MCMC/SGLD. Briefly, the paper proposes to use a (damped version of) the SVG repulsive term between the current position of a SGLD trajectory and the empirical distribution defined by the trajectory. + +The approach is interesting and natural. Unfortunately, I do not think that the experiments are convincing. + +(1) Since the authors are advertising the Bayesian framework, the choices of metrics such as test RMSE or test LL are not adapted +(2) in the UCI dataset examples, it is indeed extremely difficult to explain the enhanced performances? Is it because of multimodality? Better exploration of a mode? Comparison to a single NN? Comparison with an ensemble of NN? +(3) I would have been much more convinced by a set of well-chosen and controlled experiments. The 2-dimensional examples are far too low-dimensional to be convincing. Higher-dimensional Gaussian? Higher-dimensional mixtures? Non-linear tractable problems in higher dimensions? 
Influence of the tuning parameters (RBF parameter? alpha? step-size in SGLD)? Computational issues? Subsampling-effect? etc... + +The proposed method is interesting and has a lot of potential. I would like to suggest the authors to spend more times on careful and controlled numerical experiments (Bayesian NN are not very good for this purpose) -- with convincing numerics (which would give more reasons to delve into the proofs) the method can be very promising. +",3,,ICLR2020 +X2-7JhX1Cqc,3,y4-e1K23GLC,y4-e1K23GLC,A good paper in good shape,"**Summary**: +In this article, the authors investigated the fundamental trade-off between the size of a neural network and its robustness (measured by its Lipschitz constant), in the setting of a single-hidden-layer network with $k$ neurons and (approximately) Gaussian data, by proposing two conjectures, Conjecture 1 and 2, on the (lower and upper bound of the) network Lipschitz constant in perfectly fitting a given data set of size $n$ and data dimension $d$. Some weaker versions of the two proposed conjectures were proven, in Section 4 and 3, respectively. Empirical evidence for the proposed conjectures was shown in Section 5. + +**Strong points**: This paper proposed a clear and promising mathematical conjecture to investigate the robustness in neural network models, and provided many examples and explanations on why such conjecture is reasonable. The paper is in general well written: the simple examples in Sec 3.1 and 3.2 make a clear sense of the proposed theory, and solid technical contribution is provided in e.g., the proof of Theorem 2. + +**Weak points**: This paper is already in pretty good shape. + +**Recommendation**: This is a good paper that made solid contributions to the theoretical understanding of the fundamental trade-off between model size and robustness. I recommend it for publication at ICLR.",7,3.0,ICLR2021 +SkGbiIKxz,1,Syx6bz-Ab,Syx6bz-Ab,This is a decent work but contains certain obvious drawbacks,"This paper presents a new approach to support the conversion from natural language to database queries. + +One of the major contributions of the work is the introduction of a new real-world benchmark dataset based on questions over Wikipedia. The scale of the data set is significantly larger than any existing ones. However, from the technical perspective, the reviewer feels this work has limited novelty and does not advance the research frontier by much. The detailed comments are listed below. + +1) Limitation of the dataset: While the authors claim this is a general approach to support seq2sql, their dataset only covers simple queries in form of aggregate-where-select structure. Therefore, their proposed approach is actually an advanced version of template filling, which considers the expression/predicate for one of the three operators at a time, e.g., (Giordani and Moschitti, 2012). + +2) Limitation of generalization: Since the design of the algorithms is purely based on their own WikiSQL dataset, the reviewer doubts if their approach could be generalized to handle more complicated SQL queries, e.g., (Li and Jagadish, 2014). The high complexity of real-world SQL stems from the challenges on the appropriate connections between tables with primary/foreign keys and recursive/nested queries. + +3) Comparisons to existing approaches: Since it is a template-based approach in nature, the author should shrink the problem scope in their abstract/introduction and compare against existing template approaches. 
While there are tons of semantic parsing works, which grow exponentially fast in last two years, these works are actually handling more general problems than this submission does. It thus makes sense when the performance of semantic parsing approaches on a constrained domain, such as WikiSQL, is not comparable to the proposal in this submission. However, that only proves their method is fully optimized for their own template. + +As a conclusion, the reviewer believes the problem scope they solve is much smaller than their claim, which makes the submission slightly below the bar of ICLR. The authors must carefully consider how their proposed approach could be generalized to handle wider workload beyond their own WikiSQL dataset. + +PS, After reading the comments on OpenReview, the reviewer feels recent studies, e.g., (Guu et al., ACL 2017), (Mou et al, ICML 2017) and (Yin et al., IJCAI 2016), deserve more discussions in the submission because they are strongly relevant and published on peer-reviewed conferences.",5,5.0,ICLR2018 +HygXKK8RKB,3,HylxE1HKwS,HylxE1HKwS,Official Blind Review #1,"In this manuscript, authors propose an OFA NAS framework. They train a supernet first and then finetune the elastic version of the large network. After training, the sub-networks derived from the supernet can be applied for different scenarios directly without retraining. The motivation is clear and interesting. My concerns are as follows. +1. When sampling sub-networks, a prediction model is applied to predict the accuracy of networks. It is interesting to show the accuracy of the prediction model itself and how it will influence the final selection. +2. The results compared in Table 2 are outdated. Authors should at least add the result of MobileNetV3.",6,,ICLR2020 +B1W_nBOEl,3,HkNEuToge,HkNEuToge,review,"The paper introduces an efficient variant of sparse coding and uses it as a building block in CNNs for image classification. The coding method incorporates both the input signal reconstruction objective as well as top down information from a class label. The proposed block is evaluated against the recently proposed CReLU activation block. + +Positives: +The proposed method seems technically sound, and it introduces a new way to efficiently train a CNN layer-wise by combining reconstruction and discriminative objectives. + +Negatives: +The performance gain (in terms of classification accuracy) over the previous state-of-the-art is not clear. Using only one dataset (CIFAR-10), the proposed method performs slightly better than the CRelu baseline, but the improvement is quite small (0.5% in the test set). + +The paper can be strengthened if the authors can demonstrate that the proposed method can be generally applicable to various CNN architectures and datasets with clear and consistent performance gains over strong CNN baselines. Without such results, the practical significance of this work seems unclear. + +",5,4.0,ICLR2017 +PaPsLydRIdG,1,ku4sJKvnbwV,ku4sJKvnbwV,Review of LatCo,"Summary: The paper studies the problem of planning in domains with sparse rewards where observations are in the form of images. It focuses on solving this problem using model-based RL with emphasis on better trajectory optimization. The proposed solution uses latent models to extract latent representations of the planning problem that is optimized using the Levenberg-Marquardt algorithm (over a horizon). 
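To fix ideas about what collocation in a latent space with Levenberg-Marquardt amounts to, here is a toy version on a known 2-D linear latent model; this is entirely my own construction for illustration (the paper learns the latent dynamics and handles rewards and constraints in its own way).

```python
import numpy as np
from scipy.optimize import least_squares

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # toy latent dynamics z' = Az + Bu
B = np.array([[0.0], [0.1]])
z0, z_goal, T = np.zeros(2), np.array([1.0, 0.0]), 15

def residuals(x):
    z = x[: 2 * T].reshape(T, 2)          # decision variables: latent states ...
    u = x[2 * T :].reshape(T, 1)          # ... and actions over the horizon
    prev = np.vstack([z0, z[:-1]])
    defect = (z - (prev @ A.T + u @ B.T)).ravel()   # dynamics "defects"
    goal = 10.0 * (z[-1] - z_goal)                  # terminal objective term
    effort = 0.1 * u.ravel()                        # small action penalty
    return np.concatenate([defect, goal, effort])

sol = least_squares(residuals, np.zeros(3 * T), method="lm")  # Levenberg-Marquardt
print(sol.cost, sol.x[2 * T :][:3])
```

The relevant point is only that latent states and actions are joint decision variables and the dynamics enter as residuals, which is what makes a second-order least-squares solver applicable at all, in contrast to shooting-style CEM.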
The experimental results show improvements over a) zeroth-order CEM optimization, b) PlaNet (Hafner et al., 2019) and c) gradient-based method that optimizes the objective in Eq. 1. + +Strengths: + +i) The motivation, organization and the overall writing of the paper are clear. + +ii) The tested experimental domains are good representatives of the realistic planning setting identified in the paper. + +Weaknesses: + +i) Discussion of literature on planning in latent spaces [1,2,3,4,5] is left out and should be included. Namely, [1,2] performs (classical) planning from images, and [3,4,5] perform planning with learned neural models. Here, space can be saved by removing Figure 4 since all of its subfigures look identical given their (visual) quality. + +ii) Have you tried solving Eq. 2. directly similar to [4]? It seems more appropriate baseline compared to c) (i.e., as labeled above). + +iii) How do you reason about the length of the horizon T? For example [1,2] use heuristic search. + +iv) There does not seem to be any presentation of hyperparameter selection/optimization, runtime results or quality of solutions. Table 1 is too high-level to provide any meaningful insight into understanding how each method compares. Similarly, Figure 5 is very hard to read and not clear what each axis represents. Overall, I would say this is the weakest part of the paper. + +References: + +[1] Classical Planning in Deep Latent Space: Bridging the Subsymbolic-Symbolic Boundary, Asai and Fukunaga AAAI-18. + +[2] Learning Neural-Symbolic Descriptive Planning Models via Cube-Space Priors: The Voyage Home (to STRIPS), Asai and Muise IJCAI-20. + +[3] Nonlinear Hybrid Planning with Deep Net Learned Transition Models and Mixed-Integer Linear Programming, Say et al., IJCAI-17. + +[4] Scalable Planning with Deep Neural Network Learned Transition Models, Wu et al. JAIR. + +[5] Optimal Control Via Neural Networks: A Convex Approach, Chen et al., ICLR 2019. + +** Post Rebuttal ** + +To best of my understanding, the authors have addressed all my questions and suggestions with the appropriate revision of their paper. Specifically, the necessary discussion of hyperparameter selection is added and presentation of the runtime&solution quality results (i.e., raised in point iv)) have been improved with the inclusion of important details, additional discussion of related work is added (i.e., raised in point i)) and questions are addressed (i.e., raised in point ii) and iii)). As such, I have updated my rating accordingly.",7,4.0,ICLR2021 +HJzVc2sxf,3,SyJS-OgR-,SyJS-OgR-,"Review of ""Multi-level Residual Networks from Dynamical Systems View"""," + +This paper proposes a new method to train residual networks in which one starts by training shallow ResNets, doubling the depth and warm starting from the previous smaller model in a certain way, and iterating. The authors relate this idea to a recent dynamical systems view of ResNets in which residual blocks are viewed as taking steps in an Euler discretization of a certain differential equation. This interpretation plays a role in the proposed training method by informing how the “step sizes” in the Euler discretization should change when doubling the depth of the network. The punchline of the paper is that the authors are able to achieve similar performance as “full ResNet training” but with significantly reduced training time. 
+ +Overall, the proposed method is novel — even though this idea of going from shallow to deep is natural for residual networks, tying the idea to the dynamical systems perspective is elegant. Moreover the paper is clearly written. Experimental results are decent — there are clear speedups to be had based on the authors' experiments. However it is unclear if these gains in training speed are significant enough for people to flock to using this (more complicated) method of training. + +I only have a few small questions/comments: +* A more naive way to do multi-level training would be to again iteratively double the depth, but perhaps not halve the step size. This might be a good baseline to compare against to demonstrate the value of the dynamical systems viewpoint. +* One thing I’m unclear on is how convergence was assessed… my understanding is that the training proceeds for a fixed number of epochs (?) - but shouldn’t this also depend on the depth in some way? +* Would the speedups be more dramatic for a larger dataset like Imagenet? +* Finally, not being very familiar with multigrid methods from the numerical methods literature — I would have liked to hear about whether there are deeper connections to these methods. + + +",7,4.0,ICLR2018 +Gddl40eQTz,4,_XYzwxPIQu6,_XYzwxPIQu6,Excellent paper proposing a novel regularization technique to ReLU-based RNNs for dynamical system identification ,"The paper explores a very important question in dynamical system identification of how to make recurrent neural networks (RNNs) learn both long-term and short-term dependencies without the gradient vanishing or exploding limitation. They suggest using piece-wise linear RNNs (PLRNNs) with a novel regularization technique. + +The paper is well written and is very thorough with the necessary theoretical foundation, numerical experiments and analysis. + +I think the theory and results of this paper are significant and will be relevant to further our understanding of RNNs and system identification. + +Major points: + +1) L2 weight regularization can be easily applied to any of the RNN models used in the experiments. While other weight initialization schemes were compared to the paper's proposed model (rPLRNN), none of the other RNN models had similar regularization. This will shed some light on whether it is indeed the proposed regularization that matters or the full proposed model with PLRNN and a mix of regularized and non-regularized units. + +2) It is not clear to me how one can choose the correct ratio of regularized vs unregularized units in the model. While the amount of regularization clearly helps in reducing training error as shown in Figure 3, increasing the ratio of regularized units in Figure S3C did not help the error past 0.1 and then larger values resulted in large increases such that the error at ratio 1 is equivalent to the error at ratio 0. Perhaps this observation is specific to the addition problem, but I feel that a discussion of the effect of this ratio on performance should be included for clarity. Additionally, the ratio of regularized units with best performance could potentially be different for different regularization amounts. + +Minor Point: +g is not defined in equation 2",8,4.0,ICLR2021 +H1Tt4YLVg,3,r1VdcHcxx,r1VdcHcxx,Batch normalisation brought to LSTM,"The paper shows that BN, which does not work out of the box for RNNs, can be used with LSTM when the operator is applied to the hidden-to-hidden and the input-to-hidden contribution separately. 
Experiments are conducted to show that it leads to improved generalisation error and faster convergence. + +The paper is well written and the idea well presented. + +i) The data sets and consequently the statistical assumptions used are limited (e.g. no continuous data, only autoregressive generative modelling). +ii) The hyper parameters are nearly constant over the experiments. It is ruled out that they have not been picked in favor of one of the methods. E.g. just judging from the text, a different learning rate could have lead to equally fast convergence for vanilla LSTM. + +Concluding, the experiments are flawed and do not sufficiently support the claim. An exhaustive search of the hyper parameter space could rule that out. ",7,4.0,ICLR2017 +HklxPZIMcS,1,HygcdeBFvr,HygcdeBFvr,Official Blind Review #1,"In this paper, authors explore the problem of generating singing voice, in the waveform domain. There exists commercial products which can generate high fidelity sounds when conditioned on a score and or lyrics. This paper proposes three different pipelines which can generate singing voices without necessitating to condition on lyrics or score. + +Overall, I think that they do a good job in generating vocal like sounds, but to me it's not entirely clear whether the proposed way of generating melody waveforms is an overkill or not. There is a good amount of literature on generating MIDI representations. One can simply generate MIDI (conditioned or unconditioned), and then give the result to a vocaloid like software. I am voting for a weak rejection as there is no comparison with any baseline. If you can provide a comparison with a MIDI based generation baseline, I can reconsider my decision. Or, explain to me why training on raw waveforms like you do is more preferable. I think in the waveform domain may even be undesirable to work with, as you said you needed to do source separation, before you can even use the training data. This problem does not exist in MIDI for instance. +",3,,ICLR2020 +TvReA43uQrQ,1,3Wp8HM2CNdR,3Wp8HM2CNdR,"An interesting attempt to remove negative examples by whitening, however experiments are not too satisfying","The paper proposes to first do representation ""whitening"", so that the representations are scattered in the space and not collapsing to a single data point; then compute distance metric on top of that (e.g. Euclidean, cosine similarity). A nice thing about explicit scattering is that it does not require large numbers of negative examples to pull the features apart. Experiments are done on several toy datasets like CIFAR. + ++ The paper is a nice, alternative attempt to remove negative examples in contrastive learning. Indeed, large number of negative examples is annoying and this direction is both exciting and significant. ++ The approach proposed in this paper intuitively makes sense. There are several works already explaining that contrastive learning is essentially doing some kind of scattering in the space. ++ The proposed approach seems pretty simple. The whitening code is only a dozen lines in PyTorch. I haven't run the code to verify the results though. + +I think the experiments are not too satisfying though. + +- The comparison is less fair between BYOL and multi-crop version of W-MSE. In SwAV, it shows that with multiple crops, the performance can be boosted quite a bit. I haven't tried on BYOL but I believe it could also be helping there. So the most apple-to-apple comparison between BYOL and W-MSE is the 2-crop version (d=2). 
In this case, BYOL is outperforming in most entries. Though it can be viewed as concurrent work (I think W-MSE is actually even earlier than BYOL), but the experiment session in the paper is not clear about this. +- Overall running experiments on these toy datasets are less satisfying, not only because it lacks comparison to other major approaches (like MoCo on ImageNet), but also because the signal we get from smaller datasets may not transfer well to more real-world images. +- I would like to see a comparison in terms of timing -- maybe BYOL (because of its momentum encoder and it needs 4 forward pass of the network to compute a single pair of losses) is running much smaller in training. W-MSE can run much faster because it only needs 4 (or even 1?) forward pass. This is a potential advantage that W-MSE has, but it is not clear from the paper. + +Other than experiments, I am also not too satisfied with the writing. The paper mentions another paper when talking about the key method (how to do whitening, e.g. which is back propagated, which is not) when describing the central technique of the paper. I would like to see the paper more self-contained in the next version. ",7,4.0,ICLR2021 +HkgVsbHch7,2,ryza73R9tQ,ryza73R9tQ,"the claimed ""new direction"" has been explored before.","The major issue in this paper is that the ""new direction"" in this paper has been explored before [1]. Therefore the introduction needs to be rewritten with arguing the difference between existing methods. + +The proposed method highly relies on the percentage of implicitly aligned data. I suggest the author do more experiments on different data set with a significant difference in this ""percentage"". Otherwise, we have no idea about the performance's sensitivity to the different datasets. + +More detailed explanations are needed. For example, what do you mean by ""p(w) as the estimated frequency""? Why do we need to remove the first principal components? + +Section 3.2 title is "" aligning topic distribution"" but actually it is doing word distribution alignment. + +Do you do normalization for P(w^Y;d_i^X,\theta) in eq.6 which is defined on the entire vocab's distribution? + +I think the measurement of the alignment accuracy and more experiments with different settings of \alpha and \beta are needed. + +Citation needed for ""Second, many previous works suggest that the word distribution ..."" + +[1] Munteanu et al, ""Improving Machine Translation Performance by Exploiting Non-Parallel Corpora"", 2006",5,5.0,ICLR2019 +ByxET2cDsr,3,HJlyLgrFvB,HJlyLgrFvB,Official Blind Review #4,"This paper introduces two networks that are trained to predict DDS. While one is trained with perfect information, the other one (ISSN) with imperfect information. +The ISSN is then used to compute posterior probability distribution (based on a history leading to the current state). The ideas is that such posterior distribution should perform better compared to uniform distribution when used in determinization process. + +I like the idea / motivation of the paper, but the authors could do a better job of explaining the motivation to people less familiar with techniques based on the determinization framework. +I also like the baselines that they chose to compare against - but the resulting comparison is far from perfect (see Issues section). + +Minor issues: + - Please do a careful language check - the grammar is wrong in many places (most notably plural/singular nouns). 
+While this does not hurt the semantics, it makes it sometimes cumbersome to read. + + - Since this work is mostly about using non-uniform distribution during the determinization process, I think it's worthwhile to also mention [Whitehouse, Daniel. Monte carlo tree search for games with hidden information and uncertainty. Diss. University of York, 2014.] as reference point. + +Issues: + - My biggest issue is the experimental and evaluation section. The reported improvements seem small, but most importantly - it is impossible to asses the relevance of the results. + There are no confidence intervals or variance reported. Given the seemingly small improvements, this could easily be noise? + + - While I am not certain, I assume that your numbers in Table 2 come from the 'test' split of the data - one would guess you used that split to stop the training (select the best model)? + If that is the case, I don't think you can use the same split during the evaluation (even though you evaluate differently) - the reported numbers will be biased. + +Improvement Suggestions + - Please see my issues with the evaluation. + - You say you will release the data and code - that is great, do it! + - Your figures are way too large for what they do. I think you should make them much more compact and use the resulting space to improve and expand the experimental section. Please add lot more details about the evaluation. + +Summary: +I think the paper is looking into an interesting problem and is going in the right direction, but the experimental section is at this point no good enough to suggest an acceptance.",1,,ICLR2020 +SJgN1K_0FS,2,H1lTUCVYvH,H1lTUCVYvH,Official Blind Review #2,"---- Paper summary ---- +This paper proposes a curriculum learning approach for classification. The proposed curriculum consists of two phases: +(1) a “label introduction phase”, in which the model is able to see and learn to classify only subset of labels (the model still trains on the samples belonging to the “unseen” classes, but their label is now set to a default class). The subset of seen labels is expanded incrementally, until the entire label set is observed. +(2) an “adaptive compensation phase”, where the model trains on all labels, but the targets for each class are replaced from 1-hot vectors to smooth version. This only applies to the classes on which the model has made mistakes in a previous training round. +This method is tested on 3 image classification datasets, and a single neural network architecture is tested per dataset (either ResNet18 or ResNet34). + +---- Overall opinion ---- +While the ideas introduced in this paper may have merit, I believe the experimental evidence is quite limited. Based on the results shown in the paper, I am not convinced that this approach is better than the baselines it compares to. Moreover, since the claim is that this approach is a general curriculum learning method, I find the setting it was tested on very limited (3 datasets, 1 model per dataset), especially since there are no theoretical results to complement the empirical evidence. Finally, the method introduces several parameters (b, m, E, T, eta) that are treated is a somewhat hand-wavy manner, without a proper analysis on the effect of such parameters and how one should set them. Details on this, and other major issues, can be found below. For these reasons, I believe the paper in its current form is not yet ready for publication. + +---- Major issues ---- +1. 
The paper simply mentions that the unseen labels are set to a default label, which Figure 1 implies (and is not otherwise clarified) is one of the labels in the dataset. I am not sure intuitively why it makes sense to force the model in the beginning to map a sample’s inputs to another label from the dataset, which in the end it has to learn is wrong. Doesn’t it make more sense to map the unseen labels to a new, fictional label? If not, then how do you decide which of the M labels to choose as the fake label? + +2. The method introduces several hyperparameters, such as b (number of visible labels in the beginning), m (number of labels to reveal at each step), E (number of epochs in each incremental phase), T, epsilon. The specific numbers used in the experiments are reported, but it is not clear how these were chosen, and how one would choose them for a new dataset or model. + +3. In Table 1, the LILAC results are bolded with a caption saying that bold they means “the best case scenario”, and the main text also claims that “, LILAC is the only one to consistently increase classification accuracy and decrease the standard deviation across all datasets compared to batch learning”. However, from Table 1 it seems LILAC has neither the highest accuracy (label smoothing has overall higher accuracy), neither the lowest std. It may seem that the authors arbitrarily decided what is the best accuracy/std tradeoff which makes their method seem the best. Please define clearly the criteria for establishing the best method, and explain in what setting this criteria is a valid choice. + +4. Aside from the issue mentioned above, the differences in accuracy or std in Table 1 seem minor (e.g., within 0.10% on Cifar10, and within 1% in the others). Please provide more evidence that these differences are significant. + +5. How was the consistency in table 1 decided? Please provide the specific metric. + +6. Regarding the choice of models and datasets, the method was only tested on image datasets. This is could be enough as a contribution, but in this case the introduction and abstract should not claim a generality that has not been tested. Similarly, the paper only considers 1 model per dataset. Does this work for other models too? + +7. From Table 3, it looks like the LILAC w/o AC is actually worse than the baseline. In this case, what is the benefit of having the label introduction phase? Why not just have the AC component alone? If there is a reason, please include the results for AC alone in Table 3. + +8. I find the comparison in Figure 2 potentially misleading, because I believe 1 epoch of batch training is not equal in terms of amount of training as 1 full span of the incremental phase. I believe these embeddings should be shown at convergence time. + +9. In Table 4 and accompanying text, the authors conclude that “LILAC is relatively invariant to the order of the label introduction”. However, to me both random and ascending order seem actually random with respect to how this order is used. I advise the authors to try other orderings too, such as sorting them by the error of an initial training of a classifier, or other more difficulty-based orders. + +---- Minor issues ---- +1. In Algorithm 1, there is a missing equation (“Eqn. ??”), also I’m not sure why the first line says “Write here the result”. + +2. “small batch sizes help achieve more generalizable solutions, but do not scale as well to vast computational resources as large mini-batches” → this is a bit confusing. 
How do large mini-batches scale better, and what is the difference between “small batch sizes” and “large mini-batches”? + +---- Questions ---- +Please see the major issues questions above.",1,,ICLR2020 +HZ5ppa4jg50,4,#NAME?,#NAME?,Official Blind Review #4,"In this work, the authors propose to use orthogonal multi-path(OMP) block to improve the adversarial robustness of deep neural networks. They introduce three types of OMP, OMP-a/b/c, based on where the OMP block is located in the neural network. Experimental results demonstrate the effectiveness of their OMP approach. The idea is interesting and the paper is easy to follow. + +However, I have some concerns below: + +1. I feel the contribution of this work is not sufficient as the orthogonal feature learning has been explored in natural training. +2. In addition, the baselines are also not enough for the convincing. From the perspective of network structure, this work needs to compare with related work of network structure in adversarial robustness studies, such as Feature Denoising [1]. From the perspective of model ensembling or diverse feature learning (since OMP is an ensemble strategy), this work needs to compare with works on model ensembling or diverse feature learning [2] +3. The comparison is rather limited. The experiments are only conducted on one dataset, CIFAR-10, and the attacks for evaluation are limited. + +[1] Feature Denoising for Improving Adversarial Robustness. CVPR 2019. +[2] Improving Adversarial Robustness via Promoting Ensemble Diversity. ICML 2019. +",4,5.0,ICLR2021 +r1e0SGTY2Q,1,SJlt6oA9Fm,SJlt6oA9Fm,promising idea but over-complicated method. ,"The main contribution of the paper are a set of new layers for improving the 1x1 convolutions used in the bottleneck of a ResNet block. The main idea is to remove channels of low “importance” and replace them by other ones which are in a similar fashion found to be important. To this end the authors propose the so-called expected channel damage score (ECDS) which is used for channel selection. The authors have shown in their paper that the new layers improve performance mainly on CIFAR, while there’s also an experiment on ImageNet +It looks to me that the proposed method is overly complicated. It is also described in a complicated manner. I don't see clear motivation for re-using the same features. Also I did not understand the usefulness of applying the spatial shifting of the so-called Channel Distributor. It is also not clear whether the proposed technique is applicable to only bottleneck layers. +The results show some improvement but not great and over results that as far as I know are not state-of-the-art (to my knowledge the presented results on CIFAR are not state-of-the-art). The results on ImageNet also show decent but not great improvement. Moreover, the gain in reducing the model parameters is not that great as the R parameters are only a small fraction of the total model parameters. Overall, the paper presents some interesting ideas but the proposed approach seems over-complicated",5,3.0,ICLR2019 +BkgA0-5RqH,2,BklBp6EYvB,BklBp6EYvB,Official Blind Review #1,"This paper introduces a new light-weight framework for multi-task learning. In this method, the combination of extracted image features and task are fed into a top-down network which is responsible for generating a task-specific weight matrix. The weights are next convolved with the input image as an input to the task-agnostic bottom-up network that generates the labels. 
+ +The idea of the paper interesting. The main shortcoming of the paper in my point of view is that all the numbers are reported as a single number, so they are prone to be changed by using different initial networks or optimizers. Here are some more comments: + +1) One limitation of the result section is that all the numbers are reported as static numbers. I am interested to see the training curves, either in using the wall-clock time or iteration in the x-axis and testing accuracy in y. + +2) Sections 3.1 and 3.2 as the main parts are not well-written. The shapes of the tensors are vague. What is the y,x in the parentheses? What does ch stand for? (defined?) I think that this part of the paper requires significant improvement. + +3) One valid question is how the proposed method is scalable. For example, can a model trained for 3 tasks used for 4 tasks? How hard is adding a new task? Also, worths comparing it with the learning from scratch. + +4) In Section 3, the discussion about the loss function is missing. I believe that the explanation of how to choose a loss function as well as auxiliary losses should be move there. Also, I didn't find the current explanation of BU1 and TD auxiliary loss for Multi-MNIST very clear. + +5) Why the results of your method is better than the single model? This behavior should be justified. My impression is that each task trained independently should outperform any multi-training method. Your results seem counter-intuitive in this respect. + +6) I am not able to make any strong conclusions from Section 4.3.2. It is really hard to tell which connection is better based on a single number. I would suggest providing confidence intervals for making such kind of arguments. For example, you may train from 10 different network initializations and use them to construct more reliable estimates. I also believe that more reliable estimations are required for Table 3. + + +Minor: +* In paragraph 2 of pages 2, you mention ""as illustrated in Figure 2a"". I do not see the attention to a part of the image. Am I missing something? A similar issue exists in the next sentence: I don't see any content-related modulization in Figures 2b and 2c. Please clarify. +* use comma after equations if the equations are not ending the sentences. For example, add a comma after eq (1), (2) and (3). Also on page 4, ""Where $W$"" -> ""where $W$"". +* Page 4, ""Our method address"" -> ""Our model addresses"" +* Where the third column of Table 1 is defined? On page 8. Move it to earlier sections. +* In Table 2b, you have used +x, but the notation for gated modulation is something else in the text. +* Are LL and RU used in Table 1 defined in the text? +* The bold numbers in Figure 2b seem wrong. If you are bolding the large accuracies, be consistent in all tables. +* Font of table 4 can be larger",3,,ICLR2020 +HkqIGZGlG,1,B1suU-bAW,B1suU-bAW,"Good idea, not well evaluated against other methods","This paper produces word embedding tensors where the third order gives covariate information, via venue or author. The model is simple: tensor factorization, where the covariate can be viewed as warping the cosine distance to favor that covariate's more commonly cooccuring vocabulary (e.g. trump on hillary and crooked) + + +There is a nice variety of authors and words, though I question if even with all those books, the corpus is big enough to produce meaningful vectors. 
From my own experience, even if I spend several hours copy-pasting from project gutenberg, it is not enough for even good matrix factorization embeddings, much less tensor embeddings. It is hard to believe that meaningful results are achieved using such a small dataset with random initialization. + +I think table 5 is also a bit strange. If the rank is > 1000 I wonder how meaningful it actually is. For the usual analogies task, you can usually find what you are looking for in the top 5 or less. + +It seems that table 1 is the only evaluation of the proposed method against any other type of method (glove, which is not a tensor-based method). I think this is not sufficient. + +Overall, I believe the idea is nice, and the initial analysis is good, but I think the evaluation, especially against other methods, needs to be stronger. Methods like neelakantan et al's multisense embedding, for example, which the work cites, can be used in some of these evaluations, specifically on those where covariate information clearly contributes (like contextual tasks). The addition of one or two tables with either a standard task against reported results or created tasks against downloadable contextual / tensor embeddings would be enough for me to change my vote. ",5,3.0,ICLR2018 +Hklv2HpdhX,2,rJxF73R9tX,rJxF73R9tX,Re: Abstention classifiers,"This manuscript introduces deep abstaining classifiers (DAC) which modifies the multiclass cross-entropy loss with an abstention loss, which is then applied to perturbed image classification tasks. The authors report improved classification performance at a number of tasks. + +Quality ++ The formulation, while simple, appears justified, and the authors provide guidance on setting/auto-tuning the hyperparameter. ++ Several different settings were used to demonstrate their modification. +- There are no comparisons against other rejection/abstention classifiers or approaches. Post-learning calibration and abstaining on scores that represent uncertainty are mentioned and it would strengthen the argument of the paper since this is probably the most straightforward altnerative approach, i.e., learn a NN, calibrate predictions, have it abstain where uncertain. +- The comparison against the baseline NN should also include the performance of the baseline NN on the samples where DAC chose not to abstain, so that accuracies between NN and DAC are comparable. E.g. in Table 1, (74.81, coverage 1.000) and (80.09, coverage 0.895) have accuracies based on different test sets (partially overlapping). +- The last set of experiments adds smudging to the out-of-set (open set) classification tasks. It is somewhat unclear why smudging needs to be combined with this task. + +Clarity +- The paper could be better organized with additional signposting to guide the reader. + +Originality ++ Material is original to my knowledge. + +Significance ++ The method does appear to work reasonably and the authors provide detail in several use cases. +- However, there are no direct comparison against other abstainers and the perturbations are somewhat artificial.",5,3.0,ICLR2019 +rJl3L37nYB,2,rJx8I1rFwr,rJx8I1rFwr,Official Blind Review #3,"This paper proposes a general meta-learning with hallucination framework called PECAN. It is model-agnostic and can be combined with any meta-learning models to consistent boost their few-shot learning performance. + +There are two key points for the proposed model. 
On the one hand, the authors introduce a novel precision-inducing loss which encourages the hallucinator to generate examples so that a classifier trained on them makes predictions similar to the one trained on a large amount of real examples. On the other hand, the authors introduce a collaborative objective for the hallucinator as early supervision, which directly facilitates the generation process and improves the cooperation between the hallucinator and the learner. + +On the whole, the paper is well-written, and the proposed idea is novel and interesting. + +I have some following major concerns about the paper: +(1) In Figure 2, the authors first sample the training set S^*_{train}, which contains n^* examples for each of the m classes, and then they randomly sample n examples per class, and obtain a subset S_{train}. Why not generate the S_{train} directly and then measure your precision-inducing loss over the real set S_{train} and S^G_{train}? I hope the authors explain it in their paper. +(2) For Function 2 in the paper, why compute the cosine distance on the probability vectors that are obtained by removing the logit for ground-truth label in original probability distributions? Could we compute the distance on the probability vectors that contains the logit for ground-truth label? I hope the authors explain it in detail. +(3) As far as I know, there are some latest work on few-shot learning in 2019, especially the work “Few-shot Learning via Saliency-guided Hallucination of Samples” and “Edge-Labeling Graph Neural Network for Few-shot Learning”. I hope the authors can compare with these two methods to further demonstrate the effectiveness of the proposed model.",6,,ICLR2020 +SylKDMA2FB,1,SkgWeJrYwr,SkgWeJrYwr,Official Blind Review #1,"In this paper, the authors present an iterative approach for feature selection which selects features based both on the relevance and redundancy of each feature. The relevance of each feature is determined using a mild variant of the Feature Quality Index; essentially, the relevance is computed as the loss in model performance when setting each feature value to the mean and measuring the change in performance. Similarly, the redundancy of each feature is determined by comparing the reconstruction loss of an autoencoder when setting the feature value to its mean for all training samples. These two values are combined to give a single score for each feature at each iteration. The feature with the worst value is removed. A limited set of experiments suggests the proposed approach mildly outperforms other efficient feature selection methods. + +Major comments + +The paper does not include relevant, recent work on using autoencoders for feature selection, such as [Han et al., ICASSP 2018; Balın et al., ICML 2019], among others. Thus, it is difficult to discern how this paper either theoretically or empirically advances the state of the art. + +I found the proposed approach to efficient feature selection reasonable. However, there is no theoretical justification for the approach. Thus, I would expect a thorough empirical analysis. Only a few limited experiments on toy datasets (and one slightly more challenging one) are given. + +The paper is not well-written. For example, it seems as though the proposed approach is not applicable to datasets with categorical features. It is not obvious (and, presumably, would need to be shown empirically) if the mode could be used to replace categorical values analogously to how the mean is used for real-valued features. 
Alternatively, one could imagine one-hot encoding the categorical variables and grouping them in some manner similar to that used for the RadioML pairs (since the one-hot values are obviously highly correlated). However, the authors do not address these issues. + +Similarly, the entire discussion in Section 3 seems to assume the ranker model will be some sort of neural network. However, as far as I can tell, the ranker model is treated as a black box, so it could easily be some random forest model, etc. If there are some implicit assumptions that the ranker model is a neural network, this should be made explicit; if not, the discussion should be revised (and, of course, non-neural models should be used in the experiments). + +The approach seems to heavily depend on the ability of the autoencoder to reconstruct the input; however, it is unclear how the structure/capacity of the autoencoder affects the performance of the algorithm. For example, the authors propose a relatively simple structure, presumably to maintain computational efficiency. It would be interesting to explore more deeply how autoencoders with more capacity impact the results. + +It is unclear why the autoencoder is retrained at each step compared to just setting the removed feature values to the respective means, as is done with the ranker model. + +Clearly, the relevance and redundancy scores could be weighted unequally when selecting the feature to remove. It would be interesting to explore how different combinations affect the results. + +It seems that the experiments only consider backward feature selection approaches. Including forward feature selection approaches would add useful context for how the proposed approach compares to other strategies. + +Minor comments + +The cross-validation scheme used is not clear. While the authors mention that three runs are used to estimate performance variance, they do not describe if this is 3-fold cross validation, some Monte Carlo cross validation, or if the same splits are used all three times and just the random seeds are different. + +While methods like RFE have significantly higher computational cost than the methods considered here, it would be helpful to include it for at least one of the datasets to provide context on how much the less costly methods “lose”. + +What is the overlap in the selected features? both among the different methods and among the different folds for the same method. + +How were the hyperparameters for the various models chosen? + +Typos, etc. + +The references are not consistently formatted. + +The Section 2 headers all have an unnecessary “0” in them (e.g., “2.0.1”). + +Table 1 should include the standard deviations. +",1,,ICLR2020 +BylleIdbaX,3,rJlg1n05YX,rJlg1n05YX,Interesting topic but the paper is not well explained,"This paper addressed an interesting problem of reducing the kernel to achieve CNN models, which is important and attracts lots of research work. However, the methods don't have very good justifications. +For example, in Section 3.1, the authors mentioned that ""Specifically, in normal CNNs it is quite common to have multiple stages/blocks which contain repeated patterns such as layers or structures."" It is still unclear why it is better to replace these so-called repeated patterns. 
+The defined ""information field"" is not clearly explained and the benefits are also not demonstrated.",5,3.0,ICLR2019 +HJxKcbi-5B,3,HJg_ECEKDr,HJg_ECEKDr,Official Blind Review #2,"This paper proposes a meta-learning algorithm Generative Teaching Networks (GTN) to generate fake training data for models to learn more accurate models. In the inner loop, a generator produces training data and the learner takes gradient steps on this data. In the outer loop, the parameters of the generator are updated by evaluating the learner on real data and differentiating through the gradient steps of the inner loop. The main claim is this method is shown to give improvements in performance on supervised learning for MNIST and CIFAR10. They also suggest weight normalization patches up instability issues with meta-learning and evaluate this in the supervised learning setting, and curriculum learning for GTNs. + +To me, the main claim is very surprising and counter-intuitive - it is not clear where the extra juice is coming from, as the algorithm does not assume any extra information. The actual results I believe do not bear out this claim because the actual results on MNIST and CIFAR10 are significantly below state of the art. On MNIST, GTN achieves about 98% accuracy and the baseline “Real Data” achieves <97% accuracy, while the state of the art is about 99.7% and well-tuned convnets without any pre-processing or fancy extras achieve about 99% according to Yann LeCunn’s website. The disparity on CIFAR seems to be less egregious but the state of the art stands at 99% while the best GTN model (without cutout) achieves about 96.2% which matches good convnets and is slightly worse than neural architecture search according to https://paperswithcode.com/sota/image-classification-on-cifar-10. + +This does not negate the potential of GTNs which I feel are an interesting approach, but I believe the paper should be more straightforward with the presentation of these results. The current results basically show that GTNs improve the performance of learners with bad hyper-parameters. On problems that are not as well-studied as MNIST or CIFAR10 this could still be very valuable (as we do not know what performance is good or bad in advance). Based on the results, GTN does seem to be a significant step forward in synthetic data generation for learning compared to prior work (Zhang 2018, Luo 2018). + +The paper proposes two other contributions: using weight normalization for meta-learning and curriculum learning for GTNs. Weight normalization is shown to stabilize GTNs on MNIST. I think the paper oversteps in the relevant method section, hypothesizing it may stabilize meta-learning more broadly. The paper should present a wider set of experiments to make this claim convincing. But the point for GTNs on MNIST nevertheless stands. For curriculum learning: the description of the method is done across section 2 and section 3.2 and does not really describe it completely. How exactly are the samples chosen in GTN - All Shuffled? How does GTN - Full Curriculum and Shuffled Batch parametrize the order of the samples so that it can be learned? I suggest that this information is all included as a subsection in the method (section 2). The results seem to show the learned curriculum is superior to no curriculum. 
+ +At a high level it would be very surprising to me if the way forward for better discriminative models was to learn good generative models and use them again for training discriminative models, simply because discriminative models have proved thus far significantly easier to train. If this work does eventually show this result, it would be a very interesting result. At the moment, I believe it does not, but I would be happy to change my mind if the authors provide convincing evidence. Alternatively, I feel that the paper could be a valuable contribution to the community if the writing is toned down to focus on the contributions, presents the results comparing to well-tuned hyperparameters and not over-claim. + +More comments: + +What is the outer loop loss function? Is it assumed to be the same as the inner one (but using real data instead of training data)? I think this should be made explicit in the method section. + +There are some additional experiments in other settings such as RL and unsupervised learning. Both seem like quite interesting directions but seem like preliminary experiments that don’t work convincingly yet. The RL experiment shows that using GTN does not change performance much. There is a claim about optimizing randomly initialized networks at each step, but the baseline which uses randomly initialized networks at each step with A2C is missing. The GAN experiments shows the GAN loss makes GTN realistic (as expected) but there are no quantitative results on mode collapse. (Another interesting experiment would be to show how adding a GAN loss for generating data affects the test performance of the method.) Perhaps it would benefit the paper to narrow in on supervised learning? Given that these final experiments are not polished, the claim in the abstract that the method is “a general approach that is applicable to supervised, unsupervised, and reinforcement learning” seems to be over-claiming. I understand it can be applicable but the paper has not really done the work to show this outside the supervised learning setting. + +Minor comments: +Pg. 4: comperable -> comparable",3,,ICLR2020 +Hyl6l37i3Q,3,r1lgm3C5t7,r1lgm3C5t7,Review TLDR: Ok good fit for ICLR maybe even better for QIP ,"Authors give a method to perform a full quantum problem of classifying unknown mixed quantum states. This is an important topic but the paper is ok and I think the test case is a bit lacking. + +The theory is sound and the math is good. The only question I have is how does this hold on a real quantum computer such as IBMQ/rigetti quantum computing etc.. or even under a noisy simulator + +Although the paper is sounds and it is a good idea, the presentation is a bit lacking. There are several typos and formatting problems, such as excess spaces and some sort of hex code (9b8d) in the abstract which I am guessing is left over from the NIPS template. +Two other things is that usually in double blind review one should not leave the emails with affiliation and one should anonymize the Acknowledgements as well.",5,3.0,ICLR2019 +qB6WaDHpv6V,1,sAX7Z7uIJ_Y,sAX7Z7uIJ_Y,"Interesting idea, but paper and experiments need revision ","** Summary: +This work addresses the context of semantic segmentation where a single input image could be associated with multiple valid labels, as a result of natural ambiguities. 
Starting from a pretrained deterministic segmentation network F, this work proposes to use an additional conditional generative model G, named as *refinement network*, to generate multiple segmentation predictions; the model G is conditioned on the segmentation probabilistic output of F and the input image. G is trained with adversarial loss and the proposed *calibration loss*, essentially the KL-divergence between the probabilistic output of F and the sample average of G. At runtime, the unified pipeline of F and G can produce multiple segmentation predictions. On one toy example and two real benchmarks, the proposed method show improvements over addressed baselines, in terms of generalized energy distance (GED) and Hungarian-matched IoU (HM-IoU). + +** Strengths: +- The idea of using conditional GANs to produce multiple predictions is interesting. +- The proposed framework and learning scheme are simple. I think it's easy to reimplement and reproduce results. + +** Weakness: +- Going through the paper I had trouble understanding how the refinement network G can guarantee to produce calibrated probabilities of segmentation modes. The calibration network F, to my understanding, is a deterministic segmentation model trained in a conventional fashion using only the cross-entropy loss. I believe numerous works proved that a model trained that way will end up with over-confident predictions, which are uncalibrated (actually shown in Figure 5). It actually seems misleading to name F with calibration. + +- Outputs of pretrained F is then used to regularize the training of the cGAN G via the KL-divergence ""calibration loss"" (which is more like a reconstruction loss to me). Can the authors explain how the refinement network, trained to match sample average with uncalibrated probabilistic targets, can successfully produce calibrated probabilistic outcomes? Also, I would love to see results with calibration metrics like NLL and ECE. + +- On the Cityscapes experiments, the segmentation network B is finetuned with or without class-flipping labels? It's quite confusing when sometimes F is a full segmentation network as in Sec 3, Sec 4 and Sec 5.2.1, sometimes F is an ad-hoc network like in 5.2.2. Also F's architecture is detailed in the beginning of 5.2 as SegNet, but only used in 5.2.1. + +- Can the authors please provide experimental evidence of how the cross-entropy loss and adversarial loss are not well aligned in the presence of noisy data? + +- Minor typos: + + In Table 2, shouldn't the GED of Kohl et al be 0.206? + + It may look obvious but should the notations like H,W,C,K be introduced? I thought C is the number of classes at first. + +** Preliminary evaluation: this work targets an interesting task of stochastic semantic segmentation. The architecture design and learning scheme seems reasonable to me. The major problem is the lack of evidence to support the claim on output calibration. In terms of writing, I find the paper hard to follow with lots of confusions. Due to those limitations, I give an initial rating of 5. + +-- Post-rebuttal ----------------------------------------------------------------------------------------------- + +Given the improvement of the last revision, I increase my rating to 6. The revised version has been very much improved, especially in the abstract and introduction Sections. 
Still I think it's important to additionally have one or two sentences to make very clear on the meaning of calibration, as to not confuse readers.",6,3.0,ICLR2021 +pi5i_d-iEKD,3,O1pkU_4yWEt,O1pkU_4yWEt,Distantly supervised end-to-end medical entity extraction review," +This paper presents a model that performs Medical Entity Extraction in a way that 1) allows to generalise better certain types of entities, resulting in a better performance and 2) can achieve human-like quality despite relying on a distantly supervised training. + +PRO's: +- State of the art is compelling +- The authors are clear in stating the contributions of their work, and how the model has been implemented. +- The topic of labelling medical documents is highly relevant in these covid times: being able to automatically process medical documents is a challenge that could help doctors and health experts to deal with the huge amount of information that is being publish daily. +- Include experiments on Russian, good for a domain where most of the research is focusing on English. + +However, imho the contributions stated are not significant enough for the paper to be accepted for publication. In particular: +1) The model architecture does not bring any novelty. (It is basically BERT+Linear layer). The pretrained model helps improving the labelling task, but this has been already observed in many other NLP domains and papers. +2) Distant supervision is attractive, as we don't rely on expensive labelled data, but the performance analysis is limited to a few (most-frequent) classes. In a domain like health, the long tail of entities can be crucial so this techniques might not be suitable for some use cases. +3) It is too risky to claim human-like performance, when only one expert annotator has been involved. + +In addition to this, it would be nice if the authors can address the following comments: +- In related work you claim that ""Instead of using a contexualized representation of each token to classify it as an entity we use the representation of the whole text to extract all entities present in the text at once"". Can you clarify how this is achieved? +- In addition to the number of classes in train versus test datasets, could you elaborate a bit more on other differences, e.g. average length of the documents, presence of some structured data within the documents (e.g. template)... +- Is the fact that you used Russian datasets making your challenge harder? E.g would you be able to better leverage UMLS if you run similar experiments in medical records in English? +- ""we decided to keep only terms that appear at least 10 times in the test dataset reducing the total number of entities to 4434"" -> Ideally, you should not rely on any clues or signals from the test dataset, so you don't bias your model toward specific categories present in the test data. I would suggest to repeat experiment without this step and report what happens with this less frequent classes too. +- Section 4 is to scarce, more details about the model and the training should be included, for example: how did you pretrained RuBERT on electronic health records? E.g. how much data was available for this domain, for how many epochs you trained... Also, How do you think this is affecting the end2end performance of your model? +- As stated earlier, one annotator is not enough for a proper human evauation. I suggest to be more exhaustive in this study and bring in some more experts to annotate the text, so we can calculate agreement score. 
+- Please include more details when you refer to model's mistakes. e.g. ""Large number of model errors in case of ’Coughing’ are due to a single term synonym that was almost completely absent from train dataset and present in test dataset."" -> Indicate the term +- Table 4: How can you quantify the improvement achieved by these entity extraction generalizations? The distant supervision is not able to capture those advanced use cases when labelling entites so, do you have ideas of how many of those you might be ignoring during your evaluation? +- 150.000 training examples to achieve human-quality: Are you sure this is good enough? What happens when the number of available classes increase?",4,4.0,ICLR2021 +zjW8gy6u_Rj,2,sCZbhBvqQaU,sCZbhBvqQaU,A well written and motivated paper that training agents and adversary alternatively,"The paper is very well written and the considered problem of training an adversary along with the agent is very interesting. Within the proposed concept, the parameterized adversary can be trained by viewing the agent as a part of the environment, so it avoids to access the parameters of the agent policy. From the perspective of the agent, with an unknown adversary, the MDP becomes a POMDP with uncertainty hidden in the adversary, and hence the fact of using LSTM policy is much better for the agent is reasonable. The entire problem is wisely formulated. The experimental settings are well designed and results support the positiveness of the proposed framework. + +I only have one comment that the SA-MDP can also be understood from another perspective. That is, SA-MDP is actually an asymmetric competitive multi-agent problem, and the alternative training of agent and adversary can be viewed as an instance of self-play. Also, the optimality of SA-MDP for either the agent or the adversary can be explained through multi-agent RL or game theory. It would be interesting if the authors could take a look into such a direction.",7,5.0,ICLR2021 +H1x72FypKB,1,H1efEp4Yvr,H1efEp4Yvr,Official Blind Review #1,"This paper considers reinforcement learning for discrete choice models with unobserved heterogeneity, which is useful for analyzing dynamic Economic behavior. Random choice-specific shocks in reward is accommodated, which are only observed by the agent but not recorded in the data. Existing optimization approaches rely on finding a functional fixed point, which is computationally expensive. The main contribution of the paper lies in formulating discrete choice models into an MDP, and showing that the value function is concave with respect to the policy (represented by conditional choice probability). So policy gradient algorithm can provably converge to the global optimal. Conditions on the parameters for global concavity are identified and rates of convergences are established. Finally, significant advantages in computation were demonstrated on the data from Rust (1987), compared with “nested fixed point” algorithms that is commonly used in Econometrics. + +This paper is well written. The most important and novel result is the concavity of the value function with respect to the policy. My major concerns are: + +1. How restrictive the assumptions are in Definition 1.4? In particular, R_min is defined from Assumption 2.2 as “the immediate reward … is bounded between [R_min, R_max]”. So if we set R_min to negative infinity, the right-hand side of Eq 6 will be infinity, and so the condition is always met. Is this really true? At least, does the experiment satisfy Definition 1.4? 
+ +2. The experiment is on a relatively small problem. Solving value/policy iteration with 2571 states and 2 actions is really not so hard, and many efficient algorithms exist other than value/policy iteration. For example, a variant of policy iteration where policy evaluation is not solved exactly, but instead approximated by applying a small number of Bellman iterations. Or directly optimize the Bellman residual by, e.g., LBFGS, which also guarantees global optimality and is often very fast. See http://www.leemon.com/papers/1995b.pdf . An empirical comparison is necessary.",3,,ICLR2020 +rkq9KFDlz,3,BkPrDFgR-,BkPrDFgR-,no original contribution,"The paper compares some recently proposed method for validation of properties +of piece-wise linear neural networks and claims to propose a novel method for +the same. Unfortunately, the proposed ""branch and bound method"" does not explain +how to implement the ""bound"" part (""compute lower bound"") -- and has been used +several times in the same application, incl.: + +Ruediger Ehlers. Planet. https://github.com/progirep/planet, +Chih-Hong Cheng, Georg Nuhrenberg, and Harald Ruess. Maximum resilience of artificial neural networks. Automated Technology for Verification and Analysis +Alessio Lomuscio and Lalit Maganti. An approach to reachability analysis for feed-forward relu neural networks. arXiv:1706.07351 + +Specifically, the authors say: ""In our experiments, we use the result of +minimising the variable corresponding to the output of the network, subject +to the constraints of the linear approximation introduced by Ehlers (2017a)"" +which sounds a bit like using linear programming relaxations, which is what +the approaches using branch and bound cited above use. If that is the case, +the paper does not have any original contribution. If that is not the case, +the authors may have some contribution to make, but have not made it in this +paper, as it does not explain the lower bound computation other than the one +based on LPs. + +Generally, I find a jarring mis-fit between the motivation (deep learning +for driving, presumably involving millions or billions of parameters) and +the actual reach of the methods proposed (hundreds of parameters). +This reach is NOT inherent in integer programming, per se. Modern solvers +routinely solve instances with tens of millions of non-zeros in the constraint +matrix, but require a strong relaxation. The authors may hence consider +improving the LP relaxation, noting that the big-M constraint are notorious +for producing weak relaxations.",3,5.0,ICLR2018 +HkgePooptS,1,HkxjqxBYDB,HkxjqxBYDB,Official Blind Review #3,"The paper proposes to combine DQN with a nonparametric estimate of the optimal Q function based on the graph of all observed transitions in the buffer. Specifically, they use the nonparametric estimate as a regularizer in the DQN loss. They show that this regularizer facilitates learning, and compare to other nonparametric approaches. I found the paper easy to read. The ideas are intuitive and seem to work. + +It would be great to have more experiments providing insight into when the associative memory estimate works and when it doesn't. Since at the end of the day both DQN and the non-parametric estimate use the same data, there's no fundamental reason why the later should contain more information. Is it possible that more aggressive training of DQN would eliminate the need for the nonparametric estimate? 
Why would I expect the nonparametric estimate based on random projections to generalize better to new states than DQN? What would be the performance of DQN with only the random projections as inputs? I believe including experiments probing in this direction would make the paper better. + +------------------------------------------------------------------------------------------------------------------- +Thanks for your response and the additional experiments. I still find the paper interesting and hence keeping my score as is.",6,,ICLR2020 +0a51bbW8kl-,1,2d34y5bRWxB,2d34y5bRWxB,No contribution?,"The paper presents a study on regularization methods for the feedforward fully connected neural networks. +The study is formulated as hyper-parameter optimization task, heavily using Auto-Pytorch library. The paper claims as contribution (sorry for copy-pasting): + +1. We demonstrate the empirical accuracy gains of regularization cocktails in a systematic +manner via a large-scale experimental study; +2. We challenge the status-quo practices of designing universal dataset-agnostic regularizers, +by showing that an optimal regularization cocktail is highly dataset-dependent; +3. We demonstrate that regularization cocktails achieve a higher gain on smaller datasets; +4. As an overarching contribution, this paper provides previously-lacking in-depth empiri- +cal evidence to better understand the importance of combining different mechanisms for +regularization, one of the most fundamental concepts in machine learning. + +*** + +I am highly sceptical on the paper usefulness for the community. +In general terms, the benchmark/empirical study type of paper typically can have one (or more) of the following contributions: + +a) New knowledge, which was obtained as a result of a study. E.g. surprising results, practical recommendations, so on. For example, A Metric Learning Reality Check by Musgrave et. al (ECCV 2020) revealed surprising knowledge about metric learning methods. + +b) The methodology of such study, which was not used before. E.g. Visual Object Tracking Challenge, which become the benchmark for the tracking methods since 2013. +c) The software and/or dataset, which were developed for the study. E.g. OpenAI Gym. + + +(a) is not the case IMO, because all the recommendations are known to the practicioners, e.g. check the any Kaggle winning solution +https://www.kaggle.com/sudalairajkumar/winning-solutions-of-kaggle-competitions + +(b) I see no novelty in using hyper-parameter optimization for the study. The paper agrees with me on this (see Related Work, ""Positioning in the realm of AutoML"" + +(c) Neither software, nor dataset is proposed -- the paper uses existing ones. + +**** + +Now I will go over contributions. + + +1. It is well known that the regularization/augmentation/... need to be tuned to archieve the best results. One can publish a CVPR paper about such good combination, e.g. He et.al. (CVPR2019), +Bag of Tricks for Image Classification with Convolutional Neural Networks + +2. I don't see the support for that claim in the paper. Yes, the specific combination of regularization techniques, which performs the best on the given dataset is, perhaps, unique. But the techniques are applicable broadly, which is supported by the paper (Fig. 1), e.g. DropOut, MixUp, BatchNorm, are pretty universal. + +3. It is also obvious, that the less data you have, more regularization and design bias are needed for better results, see OpenAI Image GPT, or more ViT paper vs. ConvNets. + +4. 
See (1)
+
+
+Overall, if the paper spent some space on particularly interesting regularization combinations/interplay of components/analysis, it might be quite useful for researchers. For now it seems as if a lot of experiments were done, but no analysis is really performed.
+
+E.g. the abstract says: ""there is no systematic study as to how different regularizers should be combined into the best ""cocktail"""".
+
+But I don't see the answer to how they should be combined.
+
+****
+
+Small things, not contributing to the score:
+
+- AutoPyTorch is cited twice, as arXiv and as CoRR
+
+- While skip-connections and BN can be seen as ""regularization"", I would rather call them ""architecture"". Anyway, that is just a matter of naming.
+
+
+### After rebuttal update.
+
+The paper has been significantly refocused and now sells itself as a way of making deep neural networks competitive versus gradient boosting methods, which are dominating the tabular heterogeneous tasks.
+
+I was also convinced by the authors' response on paper novelty, technical contribution and (after the re-focusing) potential usefulness to the community.
+Thus, I am raising my rating to weak accept.
+
+### Comments on authors' response (as after Nov 24 I cannot post messages visible to authors)
+
+> 1. However, we would be thankful if you might share any prior work (paper or published practice) where the authors automatically searched for the optimal combination of regularizers for deep learning models among a large set of regularizers, as presented in this study.
+
+I am surprised by the results of the googling, but have to admit that the authors are technically right and I was wrong. While it seems obvious to me that regularization (specifically, dropout and L2 weight decay) are hyperparameters of deep network training, somehow papers and guides online mostly consider architectural things + learning rate + (sometimes) dropout rate as the hyperparameters to optimize.
+
+https://arxiv.org/pdf/2006.12703.pdf
+https://nanonets.com/blog/hyperparameter-optimization/
+
+Anyway, I lift my objections on novelty.
+
+> 3. ""Neither software, nor dataset is proposed, the paper uses existing ones."" : We engineered a source code that selects the application of 13 regularizers to a neural network, which required extensive programming efforts and several additions to the AutoPytorch library, as mentioned in Section 4.1.
+
+OK, I agree.
+
+
+> 4. ""It is well known that the regularization/augmentation/ methods need to be tuned ... He et.al. (CVPR2019)"" : The suggested reference is a collection of refinements (many of which are actually not regularization techniques), which have been suggested by the deep learning community for maximizing the generalization on Imagenet. That paper only summarizes a collection of some practices, however, it does not present a method that searches for the best combination of a large set of practices
+
+No, we are discussing different things. I gave He et al. as an example that the community is well aware of the fact that regularization and augmentation (and other things) have to be highly tuned. I agree that He et al. and the current paper use completely different methods for solving the problem (manual tinkering vs. auto-search). What I disagree with is the claim that the community was not aware of the importance of regularization tuning before this paper.
+
+>5. E.g. DropOut was present in only 35% of the dataset cocktails, hence was not selected in the cocktails of the 65% of the datasets. 
+
+And the experiments used a fixed-size network. Of course, if the network is not wide enough for the task, dropout might not be needed. It is also quite a strong statement that ""there is no universal regularization"", given that L2 weight decay and dropout are widely used in such different domains as image, text, and speech processing, RL, and so on.
+****
+
+I would like to point out the TabNet paper https://arxiv.org/pdf/1908.07442.pdf, which claimed ""beating GB methods for the tabular data"". I appreciate the fact that, unlike TabNet, RegCocktails uses a standard deep MLP and not an attention model, yet one needs to add that reference.
+",6,4.0,ICLR2021
+DhNzkTAPqhx,3,6M4c3WegNtX,6M4c3WegNtX,Simple and interesting method but the important experiment is missing,"The paper suggests a new approach to the construction of ensembles of deep neural networks (DNN). Unlike previous methods, which usually deal with multiple DNNs of the same structure, the authors propose to form an ensemble of networks with different architectures. The main claim is that using diverse architectures increases diversity and hence the quality of predictions. 
To find the best architectures they use methodology inspired by neural architecture search (NAS) in particular random search and regularized evolution. The method for neural ensemble search (NES) is algorithmically simple although computationally hard. On several experiments the authors show NES outperforms standard deep ensembles formed from networks with same (even optimal) structure both in terms of test NLL and in terms of uncertainty estimation under domain shift. + +Pros. +Nice idea +Simple algorithm + + +Cons. +My main point for the criticism is the lack of experiment which I find to be crucially important namely the comparison aganist deep ensemble of DNNs with same architecture to which ForwardSelect procedure has been applied. Train P DNNs with same architecture then perform ForwardSelect routine to take the best K of them and compare your method with such deep ensemble. Currently the authors only compare their method with deep ensembles to which no special selection procedure was applied. This causes bias and it is not clear whether the improvement in NES is due to the usage of different architectures or due to the selection procedure which encourages diversity in resulting ensemble. + +P.S. Please correct me if I misunderstood the last point. I have read the corresponding part twice and found no evidence that you're using ForwardSelection when analysing the performance of ensembles of DNNs with same architecture. + +====UPDATE=== +My concerns were partly addressed in author's response so I have raised my score to 5.",5,4.0,ICLR2021 +4V0sQCTAPXT,3,QfEssgaXpm,QfEssgaXpm,Finite sample stability criterion for control non-linear systems," +### Summary + +The paper describes a finite sample approximation for estimating stability of +control policy. They propose to use stability in mean squared setting with +finite sample errors on estimating Lyapunov function. The experimental results +are thin and based only on balancing a cart-pole. + +### Pros + +1. Take aways from the paper are intuitively reasonable. Any sampling based + method would suffer from low samples M or high variance due to lower T in + sequence based estimates. + +2. Authors in related works highlight the importance of missing research in + stability analysis in recent advances in RL. THe points covered in related + work are valid and relevant + +### Cons/Questions + +1. I disagree that Lyapunov analysis is always impossible. Finding a Lyapunov + candidate function is very challenging task, but it doesn't not mean there + isn't a function. Impossible --> implies there is no such function. +2. One typically addresses the infinite horizon requirements with discounting. + One can obtain an error bound for horizon truncation see [Section III Ref 1.] +3. From the cart pole results which a fairly linear system around the vertical + stabilising point, the numbers for M and T look very large. Authors have not + mentioned sampling rate for the cart-pole system, to judge what exactly the + number T implies. + +### Language + +1. The paper in general would benefit from a language review. This does not + factor in the review decision, however, authors may prefer to ease readers + burden by getting the work reviewed for language alone. + + +### Ref + +1. D. Ernst, M. Glavic, F. Capitanescu and L. Wehenkel, ""Reinforcement Learning + Versus Model Predictive Control: A Comparison on a Power System Problem,"" in + IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), + vol. 39, no. 2, pp. 
517-529, April 2009, doi: 10.1109/TSMCB.2008.2007630. +",5,3.0,ICLR2021 +B1loxGgTYB,2,rkly70EKDH,rkly70EKDH,Official Blind Review #1," +This paper proves that one can design a (shallow)neural network that with a mild amount of overparametrization (e.g. the number of datapoints n is roughly less than d^2 in Theorem 1), a second-order method can reach a global minimum. In general, I think this is an interesting direction. However, I have some doubts regarding the comparison to prior work (especially regarding the results on the two-layer network), as well as some technical details that need some clarification. Regarding the quality of the writing, it’s in general ok but there are lots of grammatical mistakes, the authors should pay more attention to this. I’m not giving a high score for now but I will reconsider my review once I hear back from the authors. + +Comparison to Oymak & Soltanolkotabi 2019: my understanding is that this paper prove convergence to a global minimum for a neural network where the number of parameters is only twice the number of datapoints so aren’t your results “worse” in that sense? The text in the related work seem to say the opposite, so I’m rather confused by your statement, please clarify. + +Landscape: I think this discussion is largely missing in the paper but another way to prove the same result would be to focus on showing that the loss surface is “well-behaved”. In fact the paper by Soltanolkotabi et al (see reference below) proves this for a similar network with a quadratic activation function and arbitrary data. Essentially, they show that such network obeys the following two properties: +There are no spurious local minima, i.e. all local minima are global. +All saddle points have a direction of strictly negative curvature +My understanding is that they prove this for a parametrization regime where n=d^2, isn’t this the same regime as proved in your results? +Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. + +Three-layer neural net +1) This network uses activation function of the form x^p. You wrote “for some constant p”, is there any specific lower bound on p? I’m also surprised that one would allow large values of p as such functions have a saddle at x=0 with a large region with low gradient magnitudes around it. +2) Limitation quadratic activation function. You say “To address this problem, the three-layer neural net in this section uses the first-layer as a random mapping of the input“. How is this helping with changing the activation function? +3) You have to fix part of the weights of the network, this seems to be a limitation of the analysis that should be more clearly highlighted and better contrasted to what has been done in prior work. +4) I think it would be interesting for the reader to focus more on the three layer network (instead of the two-layer one whose analysis is rather simple) and provide a more detailed proof sketch. + +Paper organization +The most interesting result of the paper is the one about the three-layer network but the entire analysis is relegated to the appendix. I feel it would be worth trying to provide a rough proof sketch in the main paper to highlight the difficulty of the analysis. + +Proof Lemma 2 +Alternatively to the current proof, couldn’t you differentiate the second term (2/n \sum_j (\sum …)^2 ) w.r.t. z and show it is zero at the argmax_z? Isn’t this what your result says? 
+ +Experiments +Consider repeating the experiments a few times and showing the average. + +Minor: Low loss vs perfect fitting: I couldn’t find any discussion about this but the authors seem to assume that a zero loss directly implies that the training data is fit perfectly. In the case n=d, there exists a function that yields a perfect fit of the data and therefore a larger network should be able to represent this function. Perhaps it would be worth writing this down. + +More minor comments +- Figure experiments: please use a log scale +- Formula top of page 16: should be z_k on the left and right of the term inside the bracket +- missing citation top of page 24: “e.g. in” +",3,,ICLR2020 +B1g0bJk5h7,2,Byx93sC9tm,Byx93sC9tm,Paper contains only little novelty and the experiments are not sufficiently thorough,"The paper shows that Bayesian neural networks, trained with Dropout MC (Gal et al.) struggle to fully capture the posterior distribution of the weights. +This leads to over-confident predictions which is problematic particularly in an active learning scenario. +To prevent this behavior, the paper proposes to combine multiple Bayesian neural networks, independently trained with Dropout MC, to an ensemble. +The proposed method achieves better uncertainty estimates than a single Bayesian neural networks model and improves upon the baseline in an active learning setting for image classification. + + +The paper addresses active deep learning which is certainly an interesting research direction since in practice, labeled data is notoriously scarce. + +However, the paper contains only little novelty and does not provide sufficiently new scientific insights. +It is well known from the literature that combining multiply neural networks to an ensemble leads to better performance and uncertainty estimates. +For instance, Lakshminarayanan et al.[1] showed that Dropout MC can produce overconfident wrong prediction and, by simply averaging prediction over multiple models, one achieves better performance and confidence scores. Also, Huand et al. [2] showed that by taking different snapshots of the same network at different timesteps performance improves. +It would also be great if the paper could related to other existing work that uses Bayesian neural networks in an active learning setting such as Bayesian optimization [3, 4] or Bandits[5]. + + +Another weakness of the paper is that the empirical evaluation is not sufficiently rigorous: + +1) Besides an comparison to the work by Lakshminarayanan et. al, I would also like to have seen a comparison to other existing Bayesian neural network approaches such as stochastic gradient Markov-Chain Monte-Carlo methods. + + 2) To provide a better understanding of the paper, it would also be interesting to see how sensitive it is with respect to the ensemble size M. + + 3) Furthermore, for the experiments only one neural network architecture was considered and it remains an open question, how the presented results translate to other architectures. The same holds for the type of data, since the paper only shows results for image classification benchmarks. + + 4) Figure 3: Are the results averaged over multiple independent runs? If so, how many runs did you perform and could you also report confidence intervals? Since all methods are close to each other, it is hard to estimate how significant the difference is. 
+ + + + +[1] Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles +Balaji Lakshminarayanan, Alexander Pritzel, Charles Blundel +NIPS 2017 + +[2] Gao Huang and Yixuan Li and Geoff Pleiss and Zhuang Liu and John E. Hopcroft and Kilian Q. Weinberger + Snapshot Ensembles: Train 1, get {M} for free} + ICLR 2017 + +[3] Bayesian Optimization with Robust Bayesian Neural Networks + J. Springenberg and A. Klein and S.Falkner and F. Hutter + NIPS 2016 + +[4] J. Snoek and O. Rippel and K. Swersky and R. Kiros and N. Satish and N. Sundaram and M. Patwary and Prabhat and R. Adams + Scalable Bayesian Optimization Using Deep Neural Networks + ICML 2015 + +[5] Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling + Carlos Riquelme, George Tucker, Jasper Snoek + ICLR 2018",4,4.0,ICLR2019 +sRe4AiG3c6l,4,muppfCkU9H1,muppfCkU9H1,Important problem but the proposed solution may be not powerful than existing models ,"Summary: + + Conventional Graph Neural Networks (GNNs) learn node representations that encode information from multiple hops away by iteratively aggregating information through their immediate neighbors. Self-Attention modules have been adopted to GNNs to selectively aggregate information coming through the immediate neighbors at different propagation stages. However, current self-attention mechanisms are limited to only attend over the nodes' immediate neighbors and not directly over their neighbors that are multiple hops away. Here in this work, the authors intend to address this issue and propose a means to obtain attention scores over indirectly connected neighbors. + + The message passing paradigm is commonly adopted in GNNs because directly computing the higher powers of an adjacency matrix is not scalable. The same scalability concern is present for this work, which tries to obtain attention scores for indirect neighbors directly. Thus in order to solve this issue, the authors propose to diffuse the learned attention scores from their 1-hop neighbors to neighbors that are multiple hops away, thereby providing a means to directly obtain attention scores over indirect neighbors that are reachable from the nodes. + +—— +Pros: + + The paper is well written. + The paper provides experimental results for both homogenous and multi-relational graphs. + +—— +Concerns: + (i) Proposed methodology being more powerful than GAT is arguable: + + When the attention scores for indirectly connected neighbors are still computed based on the immediate neighbors' attention scores, it is not convincing enough to be argued as more powerful than GAT, which learns attention scores over contextualized immediate neighbors. Also, the approximate realization of the model described in Eqn: 5 follows a message-passing style to propagate attention scores. Suppose it is to be argued that standard message-passing-based diffusion is not powerful enough to get a good immediate neighbor representation that encodes neighbors' information from far away. In that case, it is not immediately clear how a similar diffusion, when used for propagating attention scores from immediate neighbors to neighbors multiple hops away, will be more powerful. + +(ii) Experimental results are not conclusive: + + (a) Effect of Layer-Norm and FeedForward + One of the important ablation model that is missing is MAGNA without feed-forward and layer-Norm components. Currently, it is not clear how much of an improvement is achieved because of these standard two components. 
+ (b) Disentangling the effect of page-rank from the attention diffusion + Since Diffusion-GCN is also based on Page-Rank based propagation, it would be helpful to compare with the Diffusion-GCN model with these two components appended to them, along with residual connection if already not present. This would help us clarify how much of the gain in performance depends on the page-rank-based propagation compared to the attention propagation. The teleport probability of Diffusion-GCN should also be similarly experimented with and the analysis should be compared with the plots in Figure 3. + + (c) Comparable or not significant gains achieved in Node classification tasks. + It is amendable that the authors have reported results for both single-relational and multi-relational graphs. However, the node classification results are not significantly better than GAT or Diffusion GCN on the reported smaller datasets (Table 1) with a single train/val/test split. And on OGB Arxiv dataset, GAT and Diffusion GCN numbers are not reported. Hence, it would be helpful to analyze additional datasets. + Ignoring the benefits of LayerNorm that the MAGNA can leverage, comparing its No-Feed-Fwd version with Diffusion-GCN, which is also based on a page-rank formulation, MAGNA gain ~1% improvement on Cora and Pubmed dataset whereas it falls behind by ~1% in Citeseer. + + (d) Disentangling the effect of Multi-scale diffusion from attention diffusion + Since MAGNA uses a multi-scale diffusion at each layer, a comparison with a similar non-attentive multi-scale diffusion model like Lanczosnet that is also referred in the paper would be helpful to disentangle and understand the importance of the attention mechanism. + + (e) KG Completion: Missing baselines and model variations + + - Missing comparison with Self-attention (GAT) based knowledge graph embedding model, KBGAT. + Nathani et al., Learning Attention-based Embeddings for Relation Prediction in Knowledge Graphs, ACL 2019 + - Additionally, it would be helpful to have similar model ablation studies of MAGNA model as in Table 1 for KG Completion. + + (f) Depth Analysis: + Diffusion-GCN comparison missing. The performance stabilization over GAT might be arising because of the restart probability. MAGNA only has weights associated with 3 layers, unlike with GAT, which has weights for every propagation step. + + +—— +Questions during rebuttal: + + - Kindly clarify concern (i) + - Check experimental concerns above for additional ablation and baseline variants that is required to disentangle and appreciate the usefulness of the primary contribution, the attention diffusion component. + - Comparison with the KB-GAT model that is based on GAT for KG completion task, will strengthen the results on KG completion task. + - Comparison with GAT and Diffusion-GCN with LayerNorm and FeedFwd components on multiple train/test/val splits for smaller datasets or for other datasets from OGB will strengthen the results for node classification. + +--- Post-rebuttal +I thank the authors for responding to all the questions and getting back with additional experiment results. + +Major concern: While I understand the motivation and how having attention scores over nodes multiple hops away can be powerful, I'm still not convinced with the approximate realization. It is not clear how diffusing attention defined over 1-hop neighbors is powerful over attention methods defined over immediate neighbors that contain k-hop information aggregated from diffusion. 
+ +Also, the performance drop and overfitting issue with GAT or diffusion-GCN can be combated similarly by sharing weights across GNN layers and also using a higher-order diffusion matrix at each GNN layer. +",5,5.0,ICLR2021 +SJlbynqTYS,2,HklUCCVKDB,HklUCCVKDB,Official Blind Review #3,"** post rebuttal start ** + +After reading reviews and authors' response, I decided not to change my score. +I am happy with the author's response addressing my concerns (mainly about the fairness on the size of the model), so I recommend its acceptance. I believe it is a good addition to the community of continual learning. + +** post rebuttal end ** + + +- Summary: +This paper proposes to use a way to improve continual learning performance by taking ""Bayes-by-backprop"" method. They claim that the uncertainty can naturally be measured by estimating (log of) the standard deviation, and it is indeed useful to judge the importance of each learnable parameter. Experimental results on several benchmarks show that their method outperforms few state-of-the-art methods. + + +- Decision and supporting arguments: +Weak accept. + +1. The proposed method is simple but effective. However, It is still questionable whether \sigma is the best measure of the weight importance. An ablation study with different choices of the importance measure (maybe \mu can also be incorporated as well as \sigma?) would be good to see. + +2. Survey and comparison with memory-based methods are limited. Though memory-based methods require some memory to keep the experience, the proposed method also requires additional memory for \sigma; it essentially doubles the model capacity, assuming that \sigma is solely for measuring the weight importance. In particular, when it comes to large-scale models, memory for storing some important experiences would be small compared to the memory to store the model. +Here are some papers about recently proposed memory-based methods, which are not cited: + +Castro et al. End-to-End Incremental Learning. In ECCV, 2018. +Wu et al. Large Scale Incremental Learning. In CVPR, 2019. +Lee et al. Overcoming Catastrophic Forgetting with Unlabeled Data in the Wild. In ICCV, 2019. + +3. Comparison should include the model capacity as in Table 1(b). Again, compared to the conventional non-Bayesian model, half of the model capacity is used for computing \sigma (uncertainty), I wonder it causes a performance drop when the model capacity is the same over all compared methods. If they used the same model architecture and just doubled the number of learnable parameters for \sigma, then it is obviously unfair. + + +- Comments: +1. Pruning is not beneficial in terms of the performance. I hope to see some quantitative benefits obtained by introducing pruning. In Table 1(b), why doesn't pruning reduce the number of parameters? +",6,,ICLR2020 +r1eDVHYJ9B,3,H1lfwAVFwr,H1lfwAVFwr,Official Blind Review #3,"The authors propose capacity-limited reinforcement learning and apply an actor-critic method (CLAC) in some continuous control domains. The authors claim that CLAC gives improvements in generalization from training to modified test environments, and that it shows high sample efficiency and requires minimal hyper-parameter tuning. + +The introduction started off making me think about this area in a new way, but as the paper continued I started to find some issues. To begin with, I think the motivation in the introduction could be improved. Why would I choose to limit capacity? This is not sufficiently motivated. 
I suspect that the author(s) want to argue that it *should* give better generalization, but this argument is not made very clearly in the introduction. Perhaps this is because it would be difficult to make this argument formally, and so it is merely suggested at? + +Are there connections between this and things like variational intrinsic control (VIC, Gregor et al. 2016) and diversity is all you need (DIAYN, Eysenbach et al., 2019)? These works aim to maximize the mutual information between latent variable policies and states/trajectories, whereas this work is really doing the opposite. I would be interested in understanding the author’s take on how the two are related conceptually. + +Moving to the connections with past work, this paper seriously abuses notation in a way that actually hinders comprehension. Some of the parts that really bothered me, and should be fixed to be correct: + +Mutual information is a function of two random variables, whereas it is repeatedly expressed as a function of the policy. Being explicit about the random variables / distribution here is pretty important. + +In Equation 2 (and subsequent paragraph) the marginal distributions p_a(a) and p_s(s) are not well defined, marginalizing over what, what are these distributions? I might guess that p_s(s) is the steady state distribution under a policy pi, and that p_a(a) is marginalizing over the same distribution, essentially capturing the prior probability of each action under the policy. But these sort of things need to be said explicitly. + +In KL-RL section there is a sentence with “This allows us to define KL-RL to be the case where p_0(a, s) = \pi_0(a_t | s_t).” What does this actually mean? One of these is a joint probability for state and action, and one is an action probability conditional on a state. + +What does \pi_\mu(a_t) \sim \mathcal{D} mean? + +In the block just before Algorithm 1, many of these symbols are never defined. This needs a significant amount of care (by the authors) and right now relies on the reader to simply make a best guess at what the authors probably intend. + +Overall in the first three sections the message I would like the authors to understand is that, in striving for a concise explanation they have significantly overshot. These sections require some significant work to be considered publishable. + +The experiment in section 4.1 is intended to give a clean intuitive understanding of the method, but falls a bit short here. It is clean, but I needed more explanation to really drive the intuition home. I see that CLAC finds a solution more sensitive to the beta distribution, but help me understand why this is the right solution in this particular case. + +I really disagree with the conclusions around the experiments in section 4.2. I do not think these results show that for the CLAC model increasing the mutual information coefficient increases performance on the perturbed environments. First, the obvious, how many seeds and where are the standard deviations? Second, the trend is extremely small and the gap between CLAC and SAC is just as minor. Finally, CLAC has better performance on the training distribution which means that it actually lost *more* performance than SAC when transferring to the testing and extreme testing distributions. + +The results for section 4.3 are just not significant enough to draw any real conclusions. The massive temporal variability makes me very suspicious of those super tight error bands, but even without that question, the gap is just not very large. 
+ +Finally, in section 4.4 we see the first somewhat convincing experimental results. These look reasonable, but even here I have a fairly pointed question: compared with the results in Packer et al (2018) the amount of regression from training to testing is extremely large (whereas they found vanilla algorithms transfer surprisingly well). Can you explain why there is such a big discrepancy between those results and these? But again, this section’s results are in my opinion the most convincing that something interesting is happening here. + +Lastly, in section 8.1 the range of hyper-parameters for the mutual information coefficient is very broad, which really makes it hard to buy the claim of requiring minimal hyper-parameter tuning. + +All in all there is something truly interesting in this work, but in the present state I am unable to recommend acceptance, and the amount of work required along with questions raised lead me to be fairly confident in this assessment. +",1,,ICLR2020 +Hyx7wwpchX,2,BJeY6sR9KX,BJeY6sR9KX,Interesting attempt to bridge the gap between ML and neuroscience,"In this interesting study, the authors propose a score (BrainScore) to (1) compare neural representations of an ANN trained on imagenet with primate neural activity in V4 and IT, and (2) test whether ANN and primate make the same mistakes on image classification. They also create a shallow recurrent neural network (Cornet) that performs well according to their score and also reasonably well on imagenet classification task given its shallow architecture. + +The analyses are rigorous and the idea of such a score as a tool for guiding neuroscientists building models of the visual system is novel and interesting. + +Major drawbacks: + +1. Uncertain contribution to ML: it remains unclear whether architectures guided by the brain score will indeed generalize better to other tasks, as the authors suggest. + +2. Uncertain contribution to neuroscience: it remains unclear whether finding the ANN resembling the real visual system most among a collection of models will inform us about the inner working of the brain. + + +The article would also benefit from the following clarifications: + +3. Are the recurrent connections helping performance of Cornet on imagenet and/or on BrainScore? + +4. Did you find a correlation between the neural predictivity score and behavioral predictivity score across networks tested? If yes, it would be interesting to mention. + +5. When comparing neural predictivity score across models, is a model with more neurons artificially advantaged by the simple fact that there is more likely a linear combination of neurons that map to primate neural activations? Is cross-validation enough to control for this potential bias? + +6. Fig1: what are the gray dots? + +7. “but it also does not make any assumptions about significant differences in the scores, which would be present in ranking. “ +What does this mean? + +8. How does Cornet compare to this other recent work: https://arxiv.org/abs/1807.00053 (June 20 2018) ? + +Conclusion: +This study presents an interesting attempt at bridging the gap between machine learning and neuroscience. Although the impact that this score will have in both ML and Neuroscience fields remains uncertain, the work is sufficiently novel and interesting to be published at ICLR. I am fairly confident in my evaluation as I work at the intersection of deep learning and neuroscience. 
+",7,4.0,ICLR2019 +S1CreA-Ne,2,BJO-BuT1g,BJO-BuT1g,Good paper with important contribution,"The paper introduces an elegant method to train a single feed-forward style transfer network with a large number of styles. This is achieved by a global, style-dependent scale and shift parameter for each feature in the network. Thus image style is encoded in a very condensed subset of the network parameters, with only two parameters per feature map. + +This enables to easily incorporate new styles into an existing network by fine-tuning. At the same time, the quality of the generated stylisations is comparable to existing feed-forward single-style transfer networks. While this also means that the stylisation results in the paper are limited by the quality of current feed-forward methods, the proposed method seems general enough to be combined with future improvements in feed-forward style transfer. + +Finally, the paper shows that having multiple styles encoded in one feature space allows to gradually interpolate between different styles to generate new mixtures of styles. This is comparable to interpolating between the Gram Matrices of different style images in the iterative style transfer algorithm by Gatys et al. and comes with similar limitations: Right now the parameters of the style feature space are hard to interpret and therefore there is little control over the stylisation outcome when moving in that feature space. +Here I see the most potential for improvement of the paper: The parameterisation of style in terms of scale and shift parameters of individual features seems like a promising basis to achieve interpretable style features. It would be a great addition to explore to what extend statements such as “The parameters of neuron N in layer L encodes e.g. the colour or brush-strokes of the styles” can be made. I agree that this is a potentially laborious endeavour, but even just qualitative statements of this kind that are demonstrated with the respective manipulations in the stylisation would be very interesting. + +In conclusion, this is a good paper presenting an elegant and valuable contribution that will have considerable impact on the design of feed-forward stylisation networks. +",8,5.0,ICLR2017 +BklV_RvRtB,2,S1eQuCVFvB,S1eQuCVFvB,Official Blind Review #3,"Inspired by work in ensembling human decisions, the authors propose an ensembling technique called ""Machine Truth Serum"" (based off ""Bayesian Truth Serum""). Instead of using majority vote to ensemble the decisions of several classifiers, this paper follows the ""surprisingly popular"" algorithm; the ensembled decision is the decision whose posterior probability (based on several classifiers) most exceeds a prior probability (given by classifier(s) trained to predict the posterior predictions). It's quite a nice idea to bring this finding from human decision-making to machine learning. If it worked in machine learning, it would be quite surprising, as the surprisingly popular algorithm risks that the ensemble makes a decision against the majority vote, which is usually consider the safe/default option for ensembling. + +Overall, I did not find the experiments (in the current state) to provide compelling enough support for the claim that MTS is a useful approach to ensembling in machine learning. +* Unless I am mistaken, the authors use a more powerful model (an MLP) as the regressor compared to some of the models they ensemble over. 
In practice, people ensemble the most powerful models they have available, so it's unclear if using a regressor with the same capacity as the ensembled classifiers will provide any additional benefit. On a related note, it would be nice to know what is the classification performance of each individual classifier? As well as how often the regressor correctly choose to go with several weaker models rather than the strongest model. In particular, I am concerned that the performance of the ensemble might be less than or equal to the performance of a single MLP classifier (or whatever other model does best). +* ""In this paper, each of the datasets we used has a small size - we chose to focus on the small data regime where the classifiers are likely to make mistakes."" Why not try large data tasks that are challenging for state-of-the-art models? The paper makes a general claim that MTS is a good way to aggregate predictions, so only evaluating on small datasets seems to be a limitation +* As I understand (correct me if I am wrong), the reported results are only on examples with ""high disagreement"" between classifiers. However, for practical use cases, it is useful to know how the overall accuracy compares. One major risk of using the ""surprisingly popular"" algorithm is that the algorithm may cause the ensemble to make many incorrect predictions when the majority is right (but the minority prediction is selected). If you have those numbers, I would be interested to see them added to the paper. + +I am also unsure about if applying the ""surprisingly popular"" algorithm in machine learning makes sense. The algorithm is motivated by the fact that difference agents have different information. However, in the ML setting, various classifiers usually have the same information. It's possible to restrict the information given to each classifier, but that would limit the performance of each individual classifier (and hurt the ensemble). I would be curious if the authors have any thoughts on this point. + +I also have a few questions/concerns about how the approach is implemented: +* Why not train a single model to predict the average prediction of all models and use that model's prediction as the prior? This approach seems simpler but equivalent to the approach currently taken. +* Why not use model distillation (predicting all output logits/probabilities, or an average thereof) rather than just predicting an average of 0/1 predictions? +* For HMTS, why do the regressors for each of L labels need to be separate? It seems more efficient to use a multi-class model (as many model distillation approaches do) +* If DMTS can learn to predict when most classifiers are wrong, why wouldn't the original classifiers themselves learn to predict the answer correctly? It seems to me that the reason the experiments show that DMTS/HMTS work is that some/many of the underlying classifiers are weaker than the model that is used to ensemble the predictions (an assumption that doesn't hold in practice). + + +Overall, I really like the high-level idea, and a better ensembling approach promises to bring empirical gains across many ML tasks. 
However, I have several concerns about the experiments, motivation, and algorithmic decisions which make me hesitant to recommend the paper for acceptance.",3,,ICLR2020 +B1xkJ3EhtB,2,SygEukHYvB,SygEukHYvB,Official Blind Review #3,"The paper modifies existing classifier architectures and training objective, in order to minimize ""conditional entropy bottleneck"" (CEB) objective, in attempts to force the representation to maximize the information bottleneck objective. Consequently, the paper claims that this CEB model improves general test accuracy and robustness against adversarial attacks and common corruptions, compared to the softmax + cross entropy counterpart. This claim is supported by experimental results on CIFAR-10 and ImageNet-C datasets. + +In overall, the manuscript is easy-to-follow with a clear motivation. I found the experimental results are also promising, at least for the improved test accuracy and corruption robustness. Regarding the results about adversarial robustness, however, it was really confusing for me to understand and validate the reported values. I would like to increase my score if the following questions could be addressed: + +- It is not clear whether adversarial training is used or not in CEB models for the adversarial robustness results. If the results were achieved ""without"" adversarial training, these would be somewhat surprising for me. At the same time, however, I would want to see more thorough evaluation than the current, e.g. PGD with more n, random restart, gradient-free attacks, or black-box attacks. +- I wonder if the paper could provide a motivation on why the AutoAug policy is adopted when training robust model. Personally, this is one of the reasons that makes me hard to understand the values presented in the paper. +- Figure 3, right: Does this plot indicates that ""28x10 Det"" is much more robust than ""28x10 Madry""? If so, it feels so awkward for me, and I hope this results could be further justified in advance. +- Figure 3: It is extremely difficult to understand the plots as the whole lines are interleaved on a single grid. I suggests to split the plots based on the main claims the paper want to demonstrate. +- How was the issue of potential over-fitting on rho handled, e.g. using a validation set? +- In general, I slightly feel a lack of justification on why the CEB model improves robustness. In Page 5: ""... Every small model we have trained with Batch Normalization enabled has had substantially worse robustness, ..."" - I think this line could be a critical point, and need further investigations in a manner of justifying the overall claim.",3,,ICLR2020 +hF6gXS_asmk,1,86PW5gch8VZ,86PW5gch8VZ,The paper proposes Dynamic Quantized SGD (DQSGD) algorithm for distributed learning. It shows some convergence properties of this algorithm and provides some experiments to assess its efficiency.,"-It is known that Assumption 3 equation (10) (bounded variance) with Assumption 2 (strong convexity) leads to a contradiction. Thus having these two assumptions together is strong. +There are some recent works that overcome Assumption 3 by a different assumption known as the expected smoothness like in the following work: Gower, R. M., Richtarik, P., and Bach, F. Stochastic quasi-gradient methods: Variance reduction via Jacobian sketching.arxiv:1805.02632, 2018. +The authors may revise their strongly convex results part using this kind of assumption. 
+ +-In proposition 1 equation (13), it is not clear what the authors mean by the probability of a random variable. I checked the proof but I did not understand it either. + +-Some reported results are already known in the literature and the paper gives the impression that these results are new, especially that the authors give the proofs. +Examples of such results: the unbiasedness of the quantization, its bounded variance, and the first part of theorem 1... + +-Page 5, concerning the quadratic example: this is a trivial case and the only case where one can hope the lower bound to match the upper bound. In fact, alpha = beta iff L=mu and from Assumptions 1 & 2 we get that F is quadratic with mu=L which implies H = mu I. + +- Equation (20): for me this one of the main results of the paper. But I did not see its proof anywhere?",2,4.0,ICLR2021 +zvfRbxLUcGq,4,Wj4ODo0uyCF,Wj4ODo0uyCF,Nice work showing how to add language-specific modeling capacity to large multilingual NMT models in a principled manner,"In this work, the authors present a conditional language-specific routing (CLSR) scheme for transformer-based multilingual NMT systems. They introduce a CLSR layer after every transformer encoder and decoder layer; each such layer is made up of hard gating functions conditioned on token representations that will either select a language-specific projection layer or a shared projection layer. Further, a budget is imposed on the language-specific capacity measured by aggregating the number of gates that allow for language-specific computations; this budget constraint forces the network to identify the sub-layers that will benefit most from being language-specific. + +This is nice work. The proposed technique has been described clearly, the idea is intuitive and the experiments are pretty compelling. I have a couple of minor comments/suggestions for the authors. + +* The authors show heat-maps of LSScore distribution in Figure 6 (Appendix B) which suggest that the LS capacity schedule might have little to do with linguistic characteristics. However, this might have to do with the multilingual model being trained on as many as 94 different languages. It seems plausible that linguistic similarities might govern LS capacity scheduling when there are fewer training languages to learn from. To check for this, it might be interesting to redo this experiment with the medium resource and low resource buckets containing 26-28 languages each. + +* There are two (among many other) interesting things that stand out from the results in Tables 1 and 2. (1) From Table 1, the only setting where CLSR* (as well as ""Top-Bottom"" and ""Dedicated"") underperforms compared to the baseline is M2O for low-resource languages. It seems like the use of language-specific layers here has a strong adverse effect on performance (-4.56 with CLSR-L) which is largely offset by CLSR*. Some more insights based on the individual BLEU scores for each test language in the ""Low"" bin and whether there were certain languages that were largely responsible for the drop in performance would be interesting to the reader. (2) From M2O in Table 2, the win ratios of Top-Bottom are much lower when compared with Dedicated and CLSR* (61.54 vs. 84.62 vs. 84.62; 30.77 vs. 84.62 vs. 100). 
Could the authors share their thoughts on why this drop might be appearing?",7,4.0,ICLR2021 +BJl7SNponX,2,B1ethsR9Ym,B1ethsR9Ym,Modify social attributes on face images results here in low-quality images,"The paper is about changing the attributes of a face image to let it look more aggressive, trustworthy etc. by means of a standalone autoencoder (named ModifAE). The approach is weak starting from the construction of the training set. Since continue social attributes on face images does not exist yet, CelebA dataset is judged by Song et al. (2017) with continuous face ratings and use the predicted ratings to train ModifAE. This obviously introduces a bias driven by the source regression model. The hourglass model is clearly explained. The experiments are not communicating: the to qualitative examples are not the best showcase for the attributes into play (attractive, emotional), and requires to severely magnify the pdf to spot something. This obviously show the Achille’s heel of these works, i.e., working with miniature images. Figure 5, personally, is about who among modifAE and stargan does less bad, since the resulting images are of low quality (the last row speaks loud about that) +Quantitative results are really speaking user tests, so I will cal it as they are, user tests. They work only on two attributes, and show a reasonable advantage over stargan only for one attribute. +",4,4.0,ICLR2019 +xXs_eQpCwND,3,#NAME?,#NAME?,A different adversarial training approach,"This paper proposes to learn multiple near-orthogonal paths (OMP) in the CNN which could provide better adversarial training performance by using one random path selected from the OMP block, improving the diversity of the adversarial training examples generated. Results show some improvements over regular adversarial training. Interestingly, the improvements are very significant on the VGG networks, while not quite significant for the ResNet variants tested. + +This paper claimed that it creates orthogonal paths, but it's realistically near-orthogonal since they only added a soft constraint on the OMP regularization term, similar algorithms have been proposed in the past: + +[Bansal et al. 2018] Can we gain more from orthogonality regularizations in training deep cnns? + +There have also been quite a few work on learning real orthogonal paths based on Riemannian manifold optimization. Some of these are of similar speed as conventional SGD and Adam. A review paper can be found at: + +[Huang et al. 2020] Normalization Techniques in Training DNNs: Methodology, Analysis and Application. + +Some of those papers should be cited. + +In terms of performance, I feel this work should be compared against other regularization-based adversarial defense methods. A couple examples of that are: + +Qin et al. Adversarial Robustness through Local Linearization. NeuRIPS 2019 +Mao et al. Metric Learning for Adversarial Robustness. NeuRIPS 2019. + +Comparisons against those algorithms would further verify the performance of the proposed approach. + +Besides, there should be some discussions on potentially why the improvements on VGG networks are very significant and not so much on ResNet. + +There is also some recent evidence on the effect of early stopping on adversarial defenses (e.g. Rice et al. Overfitting in adversarially robust deep learning. ICML 2020). It would be nice if the authors could state when did they stop the training of the respective models. + +In terms of ablation, it would be nice to see different inference schemes. e.g. 
whether using a subset of the paths in the OMP block would be beneficial against adversarial examples or not. + +I look forward to seeing the authors rebuttal and comments from other reviewers.",5,3.0,ICLR2021 +rJxHfKlVqB,3,SJlbyCNtPr,SJlbyCNtPr,Official Blind Review #1,"This paper proposes a Locally Linear Q-Learning (LLQL) method for continuous action control. It uses a short-term prediction model and a long-term prediction model to generate actions that achieve short-term and long-term goals simultaneously. The problem the paper seeks to solve is important. However, this paper has several issues. + +First, there seems to be an over-claim of the contribution. The proposed method is more like hybrid of model-based and model-free RL method. Specifically, the short-term prediction model is in fact the linearized dynamic system with system parameters modeled by deep neural networks, while the long-term prediction model is in fact different (state- and action-) value functions. For this reason, it is probably unnecessary to name them as “short-term network” or “long-term network”, since they are simply the system model (or the model-based part) and the value functions (the model-free part). + +Second, the proposed method is not sufficiently evaluated. It is only evaluated on the toy Mountain Car task (and the Crane system in supplementary material). In order to justify the performance of an RL algorithm for continuous action space, it should be at least evaluated on the set of MuJoCo tasks. + +Third, the proposed method is not sufficiently compared with different baselines. In Figures 3-5, the proposed LLQL algorithm is never compared to any baseline method, leaving it open whether it is actually better than earlier methods like DDPG. In Table 1, LLQL is compared to DDPG (a model-free method), and is shown to achieve better performance. However, this seems to be unfair because the proposed method is in fact a model-based RL algorithm. Therefore, it should at least compare to other model-based algorithms (and also other riche set of safe-exploration RL methods). + +Other comments: +• In eqn. (6), \gamma should be \gamma^{i-k}? +• In the paragraph after (8), “Q-learning algorithms (16)…” is referring to a wrong equation (16) for Q-learning. Or probably the authors are not using the correct format to cite the reference. (This seems to happen repeatedly in later part of the paper such as as in the paragraph between (14) and (15).) It confuses the equation number and the reference number. +• More explanation should be given about d(x_k|\theta^d) and h(x_k|\theta^h) after (15). The meaning of them has never to defined before. +",3,,ICLR2020 +L1zo_hA9zcx,3,Y5TgO3J_Glc,Y5TgO3J_Glc,"General method with interesting ideas, but weak baselines and unconvincing evaluation method","### Summary + +This paper proposes an approach for generating sequences that possess high-level structure, and in particular structure that can be expressed using hand-crafted symbolic relations. Given a domain and a set of possible relations to consider, this approach first extracts a sequence representation of the relational constraints for each example, then trains a two-stage generative model that first generates relational constraints and then generates the final output conditioned on the constraints. 
The authors apply their approach to the generation of music and poetry, and show that, compared to simple baselines, the generated outputs are more consistent with human-generated examples (according to other learned models of high- and low-level structure). + +The approach is quite general, and seems like an interesting way of imposing structure on sequences. However, I'm not convinced that their approach ""significantly improves over state-of-the-art approaches"" as the authors claim. The baselines are somewhat limited, and the evaluation metrics that the authors use are difficult to interpret. The authors also do not compare against existing approaches for constrained music and poetry generation, instead comparing only to neural sequence models. Additionally, the generated results still seem to lack global coherence despite having global relational structure. + +### Detailed comments + +The authors present their approach as an extension of ""neurosymbolic generative modeling"", which uses program synthesis to extract structure and then trains one model to generate programs and another model to use the output of that program. In this work, the authors consider a different representation of structure, which associates each element of a sequence with a prototype and a set of relations that determine how the element relates to the prototype. The prototypes in this approach are subcomponents of the full output (for music, they are measures, and for poetry, they are lines), and the relations are hand-engineered (rhyme scheme and meter for poetry, and various music theory concepts for music). This relational graph is still referred to as a ""program"", and the extraction of the graph as ""program synthesis"", but I'm not sure that's the appropriate terminology; the graph isn't executable, it is just a symbolic object. + +The generative process that the authors propose is to first generate constraints (expressed as a sequence), then generate prototypes, and finally generate output conditioned on the constraints and prototypes. Each of these components is trained separately. Given a dataset, a maximal set of constraints is extracted from each example by using an SMT solver. Afterward, a VAE is trained to produce constraints, and a pretrained model is used to generate prototypes. Finally, examples are sampled conditioned on these constraints, either by doing rejection sampling on a pretrained model or by training a new generative model to generate outputs conditioned on constraints (or, in some cases, by doing combinatorial search with heuristics). + +The authors cite a large number of works in the realm of neurosymbolic machine learning, but are not as thorough regarding music and poetry generation, both of which are subfields of their own right. For music generation, there are existing approaches for trying to learn long-term structure, such as Music Transformer [1], hierarchical VAEs [2], and StructureNet [3]. For poetry, structure can be learned using methods like weighted transducers [4], and there are also previous works on using explicit constraints to generate poems [5]. The authors do not compare with any of these methods, and the ones they do compare against aren't necessarily representative of ""state of the art"". + +I'm also not convinced that the metrics the authors use are adequate for judging performance. 
For ""low level"" structure, the authors report the (negative) log-likelihood scores given by a pretrained sequence model, with the assumption that higher log-likelihood means better low-level structure. But it has been shown that log-likelihood of models does NOT directly correspond to what humans find ""good"" (see for instance [6]), and models will in some cases assign higher log-likelihood to data far outside of their training set (e.g. [7]), which means that higher log-likelihood isn't necessarily a good proxy for human-like structure. It would be more standard to report log-likelihood of the human test data under the proposed model, instead of log-likelihood of the proposed model's outputs under some pre-trained model. + +For high-level structure, the authors extract relations using their method, then train a separate classifier to discriminate between human and generated examples based on those extracted relations as features. But this seems to be slightly unfair, since their model is trained directly to maximize this objective, whereas other models do not use this at all. The presentation of these results is also quite confusing (see below). + +In a more qualitative sense, while it is true that the generated poems do have reasonable rhyme and meter, the semantic content of the poems is disjointed and does not have a sense of global coherence. I wonder if the proposed approach is satisfying the hand-engineered relations at the expense of losing the semantic structure that is harder to measure. + +### Questions and suggestions + +Abstract: Should ""generate"" be ""generation""? Also, the sentence ""To train model (i), ... resulting relational constraints"" is a bit unclear. Is that whole sentence about model (i) or is the last part referring to model (ii)? + +In figure 1, what do the colors of the lines represent? It seems that there are both green and purple relational constraints, but these colors aren't explained. + +The notation in sections 2 and 3 is a bit unclear. The set $\mathcal{C}$ is used before being defined, and it isn't clear what the difference is between $c$, $f$, and $\Phi$; are those all just representations of the same thing? The sum in the equation at the bottom of page 3 is over $r \in \mathcal{R}$ but $r$ doesn't appear in the equation, only $c$ does. + +It seems somewhat restrictive to assume that each line/measure has only one prototype, since it's possible for line C to share rhyme with line A but share meter with line B. Have you considered a more general formulation of constraints? + +I was surprised that the prototype subcomponents are represented as part of the constraints but are generated independently from the set of relations. It seems like those prototype subcomponents could just as easily be left out of the constraint sets and instead just be generated as part of the second step. Is there a reason that they need to be part of the constraints? + +The optimization problem in 3.2 seems underdetermined. In particular, if X and Y have some relation between them, it doesn't seem like there's anything to determine which of the two should become the prototype. Is one of those simply chosen arbitrarily? + +In section 4.1, $p_\phi(c|z)$ is referred to as ""a VAE"", which seems slightly inaccurate; technically the VAE includes the prior, encoder, and decoder, not just the decoder. + +In section 5, the description of how A3 is applied to the music domain seems to be missing from the first paragraph. 
+ +The results in table 1 are very difficult to understand, especially for the high-level classification task. Based on the appendix, it seems that ""RF Disc."" is measuring classification accuracy, for which lower values are better (because more humanlike outputs are harder to classify). But ""CGN Disc."" is measuring cross-entropy loss, for which higher values are better. This was very difficult to follow. I think it would be clearer to just report accuracy for both, or, at the very least, describe what each of the columns are measuring. I also don't understand the Human row of this table. Is this comparing one set of human examples to a different set of human examples? For RF, 50% seems like the expected chance accuracy, but distinguishing humans from humans is apparently possible 51% of the time? For GCN, the value 0.69 makes me think this must be measured with a natural logarithm, is that correct? + +The statement ""our approach outperforms BERT as a language model in terms of its own score"" seems unfair due to the reasons described in the previous section. We would not expect a language model to assign the highest likelihood to all of its own predictions, since it is supposed to describe a full distribution of outputs. The likelihood assigned by a language model isn't a measure of goodness, just a measure of how predictable that sequence is by the model. At most, perhaps this indicates that the proposed approach is more predictable than real-world natural language, so BERT is able to predict it better. (Also, is the BERT used to score the models the same as the BERT used to generate? Or is one of them fine-tuned and the other left as-is?) + +### References + +[1] Huang, Cheng-Zhi Anna, et al. ""Music transformer: Generating music with long-term structure."" International Conference on Learning Representations. 2018. + +[2] Roberts, Adam, Jesse Engel, and Douglas Eck. ""Hierarchical variational autoencoders for music."" NIPS Workshop on Machine Learning for Creativity and Design. Vol. 3. 2017. + +[3] Medeot, Gabriele, et al. ""StructureNet: Inducing Structure in Generated Melodies."" ISMIR. 2018. + +[4] Hopkins, Jack, and Douwe Kiela. ""Automatically generating rhythmic verse with neural networks."" Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017. + +[5] Toivanen, Jukka, Matti Järvisalo, and Hannu Toivonen. ""Harnessing constraint programming for poetry composition."" The Fourth International Conference on Computational Creativity. The University of Sydney, 2013. + +[6] Meister, Clara, Tim Vieira, and Ryan Cotterell. ""If beam search is the answer, what was the question?."" arXiv preprint arXiv:2010.02650 (2020). + +[7] Nalisnick, Eric, et al. ""Do deep generative models know what they don't know?."" arXiv preprint arXiv:1810.09136 (2018). + + +--- + +## Post-revision update + +I cannot seem to add a new comment at this time, so I am editing this review instead. The updated paper seems to be an improvement, although some of my original concerns remain. + +Improvements: + +- Thank you for adding the additional baselines, which do give more context as to how this approach relates to prior work. +- The section describing $c$, $f$, and $\Phi$ is clearer in the revision, and is much easier to follow. +- I also appreciate the clarification about nondeterministic choices in Z3. +- The overly-strong claims about outperforming state-of-the-art approaches have been qualified appropriately. 
+- Some of the confusing details regarding the evaluation method have been moved from the appendix to the main text, which makes them much easier to find. + +Remaining high-level concern: + +- I'm still not convinced that the results of this evaluation method are that meaningful. + + NLL under a pretrained model doesn't necessarily imply better low-level structure, for reasons stated in my initial review. + + Ease-of-discrimination of your extracted constraints doesn't necessarily imply better high-level structure, especially since the proposed model is trained on those constraints directly. + +Other remaining issues in the revision: + +- A few comments from my initial review have not been addressed: + + ""To train model (i), ... resulting relational constraints"" remains unclear + + In section 4.1, $p_\phi(c|z)$ is referred to as ""a VAE"" but is just a small part of a VAE + + In section 5, the description of how A3 is applied to the music domain is missing + + Table 1 is still difficult to interpret, in particular regarding the higher-is-better vs lower-is-better columns, and the (new in this revision) bolding of the human data, which I don't quite follow. +- Figure 1's caption now states that rhyme and meter constraints are the reason for the green and purple edges, but this is confusing because there are purple edges between poem lines that don't have meter constraints. +- Regarding semantic content, the authors state in their response that ""The semantic content of the poems (or lack thereof) reflects defects in the underlying deep generative models"". But Section 4.2 approach 1 seems to imply that each of the lines are sampled independently of one another, which seems like a strong limitation that the underlying models do not have. Perhaps the notation is unclear, and the model does get the full input context; if this is the case, I would suggest revising the notation. + +I have raised my score from 5 to 6 in light of the improvements, and with the understanding that my concerns in ""Other remaining issues in the revision"" could be fixed in a final version of the paper. I'm still borderline on accepting the paper, however, due to my concern (shared with Reviewer 3) about how meaningful the evaluation results are and whether they match what humans mean by high-level and low-level structure.",6,4.0,ICLR2021 +SyehoaKh3m,2,S1gARiAcFm,S1gARiAcFm,Novelty and performance evaluation are unclear,"This paper describes a NN method called DyMoN for predicting transition vectors between states with a Markov process, and learning dynamics of stochastic systems. + +Three biological case studies are presented, it is unclear if any new biology was learned from these cases that we could not have learned using other methods, and how accurate they are. The empirical validations are all on nonbiological data and disconnected from the first part of the paper, making the main application/advantage of this method confusing. + +I agree with the computational advantages mentioned in the paper, however, interpretation of the representational aspect is challenging especially in the context of biological systems. Regarding denoising, what are the guarantees that this approach does now remove real biological heterogeneity? Also, a denoising method (MAGIC) was still used to preprocess the data prior to DyMon, there is no discussion about any contradictory assumptions. 
+ +Overall, the main shortcoming of the paper is lack of performance evaluation, comparison to other methods and clarifying advantages or novel results over other methods. The description of the method could also be improved and clarified with presenting an algorithm.",4,5.0,ICLR2019 +hx8YlI3mIxu,3,MbM_gvIB3Y4,MbM_gvIB3Y4,Official Blind Review #2,"Summary of the work: This work studies which mutual-information representation learning objectives (1. forward information, 2. state-only transition information, 3. inverse information) are sufficient for control in terms of representing the optimal policy, in the context of reinforcement learning (RL). As a result, they find a representation that maximizes 1 is sufficient for optimal control under any reward function, but 2 and 3 fails to provide that guarantee in some MDP cases. They provide both proof and interesting counter examples to justify the findings. Besides, they conduct some empirical studies on a video game (i.e. Catcher) and show that the sufficiency of a representation can have a substantial impact on the performance of an RL agent that uses that representation. + +l like the idea of trying to understand the recent popular mutual information objectives in RL. To the best of my knowledge, Q^*-sufficiency analysis for mutual information objectives is novel. The counterexamples in sufficiency analysis are interesting. The paper is well written. + +However, l still have the following concerns: + +Empirically mutual information objectives often play the role of auxiliary losses to improve the sample efficiency of RL. Thus, it is very useful for us to theoretically understand which objective is better in terms of improving sample efficiency. The property of sufficiency (i.e. the ability to represent the optimal policy) is important. However, only this property may not strong enough, because there may exist a trade-off between minimizing information loss and maximizing state space reduction (Li et al. 2006). Can we add one more perspective, such as whose representation is finer (like Definition 2 in Li et al. 2006) to better understand these MI objectives, if possible? Intuitively, a coarser representation results in a larger reduction in the state space, which in turn translates into the efficiency of solving the problem. + +Regarding the experiment results, the authors give some intuitive descriptions to show that state-only transition objective and inverse objective may be insufficient, but forward objective works in the catcher game. To enhance its solidness, l strongly suggest that we may conduct an experiment on the predictability of optimal Q-function by the representations trained by various MI objectives. For example, we can spend a long time to train a good enough policy and treat it as an optimal policy. + + +Regarding the modeling of representation, the representation is modeled as a random variable in this paper for analysis. However, in practice, the representation is always built upon neural networks and thus deterministic. Therefore, l am a little curious if the derived conclusion will still hold for deterministic cases. + +Regarding the proof of Proposition 4, l am sorry that l do not fully understand the derivation from Q(s,a) to Q^* (s,a). Can the authors provide more details on that? + +From the proof of proposition 4 in appendix, I(Z_t,A_t;Z_(t+k)) seems to be maximized for ∀k>0,t>0. l suggest we make that clearer in the definition of those MI objectives (e.g. 
Equation 2), since some practical algorithms are based on the fixed k, not all k. +",5,3.0,ICLR2021 +u4viPbfkYTC,4,XG1Drw7VbLJ,XG1Drw7VbLJ,A benchmak for continual few-shot learning ,"The work proposes Continual Few-Shot learning -- a setting to study tasks (1) with a small labeled dataset, and (2) retain knowledge acquired on a sequence of instances. Additionally, the authors build a compact variant of ImageNet which retains all original 1000 classes but only contains 200 instances of each one (a total of 200K data-points) downscaled to 64 × 64 pixels. + +By evaluating baselines on the proposed benchmark, the authors observe that embedding-based models tend to perform better when ""incoming tasks contain different classes from one another"" and gradient-based methods tend to perform better when the task classes ""form super-classes of randomly combined categories."" + +The overall idea is interesting (few-shot + sequential observations); however, it's not clear should one take home after reading this draft. The conclusions seem intuitive and reasonable, but leave the reader with questions about the main findings of the work. + +I would take issue with the way the word ""continual learning"" is used. In practice, the author(s) use existing datasets where the instances arrive in a sequential manner (streaming observations). However, this is not quite what they motivate: ""Consider a user in a fast-changing environment who must learn from the many scenarios that are encountered."" since, in the described sequential setting all the instances belong to the same underlying distribution (the original dataset), even though they're observed sequentially. + +Note that a realistic temporal observations are much more challenging (e.g., the language of search queries over time because of the change in the functionality of search engines, or the changing distribution of images over time because of various social changes.) + +The overall direction is promising: clearly, we need to move towards more data-efficiency and build better frameworks for measuring the generalization of our models. However, I am not convinced if the presented work is a significant step toward that goal (or, at least, I don't see it). Happy to change my mind, if I am missing anything. +",4,3.0,ICLR2021 +OK53E1yTD0H,2,XJk19XzGq2J,XJk19XzGq2J,Interesting work,"## Review + +### Summary + +The paper proposes an empirical analysis of the dimension of natural images of multiple datasets. +The contributions are: +1. A validation for nat. im. of previously proposed dimension estimation methods (using GAN to control the intrinsic dimension of the generated im.) +2. A confirmation that intrinsic dimension of nat. im. is lower than the dimension of their pixel space +3. That the lower the intrinsic dimension the easier the learning (for neural net) + +### Strengths + +* The paper is well-written and easy to follow. +* The data analysis pipeline is convincing. +* Up to my knowledge there is no such a clear statement about the dimension of natural image + +### Weaknesses + +* The results are not sufficiently discussed. I think the idea that nat. im. are low-d is more controversial than it is presented. It is sometime proposed that image patches (which are more likely to be textures) are low-d (eg Brendel & Bethge ICLR 2019). In relation, to neural net learning, these are known to be biased toward textures (Geirhos et. al. ICLR 2019). So among your contribution 3/ can be true while 2/ is not (but then what would explain your finding ?). 
I mean that in fact 3/ is more due two the low-d of textures than the low-d of nat. im. (which would still be too high).Complementary to this, it is suggested that natural images can be viewed as mixture of textures which belong to different low-d manifold (Vacher & Coen-Cagli, arXiv 1905.10629; Vacher et. al., NeurIPS 2020). + +### Minor comments + +* None + + +",8,4.0,ICLR2021 +Qb1Vl4OB3MZ,2,SXoheAR0Gz,SXoheAR0Gz,An algorithm for quickly approximating partial Fourier transforms,"The paper suggests a method for quickly computing a ""partial Fourier transform"", which basically means that we want only a small range of output frequencies. The main technique is an approximation of so called ""twiddle functions"" (which are basically trigonometric functions, or, exponents of complex units if viewed in the complex plane) using polynomials. The resulting algorithms run in time O(N + M log M) where M is the size of the required frequency range, and N is the input. This should be compared with Cooley and Tukey's FFT, which is O(N log N). In fact, the main idea in the paper uses Cooley and Tukey's decomposition of the expression for Fourier transform. + +The first thing I must note is, that the authors seem not to be aware of existing literature on partial Fourier transforms. For example, Ailon and Liberty show in their paper [1] that whenever you are only interested in M Fourier frequencies then you can do exact FFT in time O(N log M). This is done by pruning Cooley Tukey's FFT. It is not as good as O(N + M log M), but it is EXACT, and it doesn't require the M frequencies to be in a single consecutive range. (In fact, [1] does this for the Hadamard transform, but I am quite certain that it works for the DFT as well because the argument depends on the computational graph only, which is identical for both cases). + +Additionally, Indyk et al have written many papers on fast computation of Fourier when you know a priori that only M frequencies are nonzeros (and the rest are 0). This setting is different from the one in this paper, but it is definitely relevant. + +Finally, I am inclined to say that the subject of this paper is only loosely related to the scope if ICLR. I mean, computing Fourier coefficients is definitely a way to get a useful representation of the data, but I don't know if ICLR is the correct venue for this kind of paper. + +Otherwise, the paper is well written, and I assume it is correct. + +[1] Ailon, Liberty, ""Fast Dimension Reduction Using Rademacher Series on Dual BCH Codes. "" (Discrete and Computational Geometry, 2009) + +",5,3.0,ICLR2021 +HkxJSbehYS,1,rklVOnNtwH,rklVOnNtwH,Official Blind Review #1,"This paper tackles out-of-distribution samples detection via training VAE-like networks. The key idea is to inject learnable Gaussian noise to each layer across the network in the hope that the variance of the noise correlates well with the uncertainty of the input features. The network is trained to minimize the empirical loss subject to noise perturbation. The paper is well written, and the background is introduced clearly. + +As I understand it, the goal of *out-of-distribution sample detection* is to train a deep network that simultaneously generalizes well and also be discriminative to outliers. However, it’s not clear to me why the proposed method server this purpose; empirical results are not convincing either. 
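To be explicit about the mechanism I am reacting to, here is a minimal sketch of the noise injection as I understand it; the toy layer below, its dimensions and its initialization are my own construction and not the authors' architecture.

```python
# Toy layer with learnable Gaussian noise on its activations (my own illustration).
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))     # ordinary layer weights
log_sigma = np.zeros(8)          # learnable per-unit noise scale (trained with the rest)

def noisy_layer(x, train=True):
    h = x @ W
    if train:
        eps = rng.normal(size=h.shape)
        h = h + np.exp(log_sigma) * eps   # injected noise; variance meant to track uncertainty
    return np.maximum(h, 0.0)

x = rng.normal(size=(4, 16))
print(noisy_layer(x).shape)
```

Note that nothing in this toy stops log_sigma from drifting toward minus infinity during training, which previews the degenerate solution I worry about below.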
My major concerns are as follows: + +First of all, from my intuition, it would be much easier to train deterministic networks than their counterparts with randomness. Empirically, researchers also often observe near-zero training loss for large deterministic networks such as Dense-BC trained on simple CIFAR/SVHN datasets. Especially, in this case, the training goal is simply to map higher-dimensional inputs to lower-dimensional classification categories. That being said, one would expect the variances go to zero at convergence to achieve lower empirical loss in the case of no additional diversity (or uncertainty) promotion terms. + +It is not clear to me how to avoid degenerate solutions at convergence +while maintaining good testing performance with the proposed training strategy. +From the empirical results, it also appears that all models reported might not be fully optimized? +The baseline results are significantly worse than those reported in previous work. +Specifically, +in table 1, the testing accuracy of Dense-BC trained on CIFAR-100 is only 71.6. +In table 2, the reported testing accuracy on CIFAR-10 using Dense-BC is 92.4. + +However, the results of DenseNet-BC (k=12, L=100, table 2) reported in the original paper are: +CIFAR10 94.0 (also leave 5K examples as validation set) +CIFAR100 75.9 + +Meanwhile, the reported accuracy of WRN-40-4 trained on CIFAR-10 and CIFAR-100 are 89.6 and 66.0, respectively. However, the corresponding baseline numbers in the original WRN paper are much higher, +CIFAR-10 95.03 +CIFAR-100 77.11 + +Could the authors comment on that? + +References: +Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger. +Densely Connected Convolutional Networks +https://arxiv.org/abs/1608.06993 + +Sergey Zagoruyko, Nikos Komodakis. +Wide Residual Networks. +https://arxiv.org/pdf/1605.07146.pdf +",1,,ICLR2020 +bpi8cR-_Buh,2,04LZCAxMSco,04LZCAxMSco,Very High Quality of Argumentation,"This is a paper in theoretical computer science which considers what +many would consider a very hard problem -- learning a latent simplex lurking 'underneath' a cloud of data. +It is extremely well organized and tightly put together. + +It patiently shows that this problem is lurking behind many different +high-dimensional data analysis challenges, and then carefully lays the groundwork +for discussing a proposed algorithm and its performance characteristics. + +What really surprised me about the paper was that the assumptions +going to make up the 'math result' seemed to be quite stylized to me, +in such a way that many datasets could never really satisfying them +(although perhaps some could). Nevertheless, the authors present results on +real data which seem to suggest that the algorithm really can be used +in empirical representation learning. This really surprises me, +while putting forward positive results based on the authors' own +invented assumptions impresses me not so much. So it would be interesting +to hear from the authors, for example in a talk or in an eventual journal paper, +to what extent the data actually obey the stated assumptions. If they +don'y obey the stated assumptions of this paper, then some less restrictive +assumptions would be the 'sharp conditions'. In that case, what are the true +conditions needed for the algorithm to work? + +Obviously the significance this work will have in the eyes of potential users +depends a lot on the just-discussed Q and A. 
+ +",9,4.0,ICLR2021 +9fXahovi5GL,2,1GTma8HwlYp,1GTma8HwlYp,A simple yet insightful idea is implemented while the experiments might not demonstrate the algorithm's full potential.,"The work studies the auxiliary task selection in deep learning to resolve the burden of selecting relevant tasks for pre-training or the multitask learning. By decomposing the auxiliary updates, one can reweight separately the beneficial and harmful directions so that the net contribution to the update of the primary task is always positive. The efficient implementation is experimented in text classification, image classification, and medical imaging transfer tasks. + +The first contribution is the decomposition algorithm and reweighting of the auxiliary updates. It is a simple idea with a nice insight of treating the primary task and the auxiliary tasks in different manners. The decomposition allows a reweighting on the updates to optimize the primary task as much as possible while keeping the auxiliary tasks providing improvable directions. The second contribution is an efficient mechanism to approximate and calculate the SVD of the Jacobian of the primary task. The mechanism is implemented from an existing randomized approximation method. The third contribution is a set of experiments verifying the proposed method. The experiments include text classification, image classification, and medical imaging transfer tasks. The most salient result is the 99% data efficiency to achieve improvable performance in the medical imaging transfer task. + +Concerns + +Besides the above positive contributions, following are some concerns from the observations: + +1. The relative improvements comparing to the baselines in Table 1 and Table 2 do not seem as much as that in (Gururangan et al. 2020) and (Yu et al., 2020), respectively. + +2. The weights reported in the experiments are 1 or -1 in the experiments. For instance, \eta_aux = (1, 1, -1) is reported in the image classification task. + +The reader would expect much better improvements when given the freedom to reassign the weights on the decomposed directions, especially when the harmful part has a negated weight. Moreover, why are the values chosen in \eta 1 or -1? Would there be a nicer balance between, say, the beneficial and the harmful parts? For instance, would \eta = (1, 0.8, -0.9) be a better choice? It would be crucial that the authors can explain furthermore or support further experiments to confirm whether the potential of this decomposition algorithm is fully demonstrated or not. + + +===================== + +Post Rebuttal + +I have read the authors' response. All my concerns are addressed properly. However, I still doubt that even the corner cases of \eta have a better performance, would there be a systematic way to find the optimal parameters reflecting the true potential of this method. Thus, I will keep my score unchanged. +",6,3.0,ICLR2021 +X5fspbD-lNp,3,04cII6MumYV,04cII6MumYV,This paper proposes a transformer based exploitation of multiple domain-specific backbones to achieve better performance across all the domains at hand.,"The review is brief because of time pressure. However, I have gone through the paper carefully. +Motivation +The paper is well motivated. It is keenly aware of previous work in the field and establishes its advancement of the state of the art clearly. It reviews past work in meta-learning as well as universal representations and transformers. 
I do have a suggestion for improvement, which is to consider the lifelong learning literature where reinforcement learning based methods have been developed for learning tasks over a lifetime. While reinforcement learning is a qualitatively different approach, lifelong learning requires the kind of adaptation to changes in tasks that the authors are addressing in their paper. It might behoove them to look at that literature and make a critical assessment with respect to their work. I don't see this as a weakness of the paper at all. + +Approach +The approach is clearly described and is technically sound. It essentially sets up an optimization across multiple domain specific backbones to solve the multi-task problem. Such an approach has the advantage of modular design although I am curious to know if the authors have any opinions on how to introduce a new backbone into their system without having to retrain the entire system end to end. Or just in general how they would introduce a new domain specific backbone. +The optimization is clearly described and convincing. +Results +The results are convincing. They are at par or better than the state of the art. They are carried out on datasets well accepted by the community. +Quality, Clarity, Originality and Significance +Clarity - The paper is extremely well written. There are typos for example Representation is misspelled (misspelt). Those can be easily removed with a single editing pass. The paper motivates its approach well and describes the approach systematically. The results are presented convincingly and clearly. I would say the clarity of the paper is high. +Quality, Originality and Significance - The idea presented here is certainly novel in its details. The overall idea of using multiple domains to compensate for data-scarcity in certain domains is not new, but realizing that in a mostly better than the state of the art manner is a challenge that the authors address successfully. The overall proposal is a small but good idea that leads to good results. I would therefore say that the paper has good quality, significance and originality.",7,5.0,ICLR2021 +GaUkYV8YaL,3,Z_TwEk_sP34,Z_TwEk_sP34,Some concerns need to be solved,"Summary and contributions +This paper studies the relationship between adversarial transferability and knowledge transferability. By defining two quantities to measure the adversarial transferability, it shows that adversarial transferability measured in this way indicates knowledge transferability both theoretically and empirically. + +Strengths +This paper is the first work to theoretically focus on the correlation of two prevalent phenomena of DNN--adversarial transferability and knowledge transferability. Its backgrounds and theoretical results and proofs are clearly presented. Moreover, the experiments are complete: they include experiments for three types of knowledge transferability whose datasets, transfering methods and results are very clear. + +Weaknesses +First, this paper does not provide the intuition to use squared cosine value rather than the cosine value in the defination of $\tau_1$(although Proposition 2 theoretically shows the relationship between $\tau_1$ and Cross Adversarial Loss). Fig.1 in the paper also regards $\tau_1$ as the cosine value. 
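To spell out the information lost by the square (my own remark, not an argument made in the paper): for every angle $\theta$ between two perturbation directions,
$$\cos^2(\theta) = \cos^2(\pi - \theta),$$
so a pair of models whose adversarial perturbations are perfectly aligned and a pair whose perturbations point in exactly opposite directions are assigned the same value of $\tau_1$.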
For an example of the ill-defined parts of $\tau_1$, given $\bm \delta_{f_1}=\bm \delta_{f_2}=-\bm \delta_{f_3}$, one would claim that the adversarial transferabilities of $f_1$ to $f_2$ and $f_1$ to $f_3$ are same considering their same squared cosine values, which is counterintuitive to make $\tau_1$ an appropriate quantity to measure the similarity between two attacks. + +Second, in the definition of $\tau_2$, the authors provide the linear map $A$ and $\tau_2$ without literally stating their physical meanings. Besides, the demonstration of $\tau_2$ in Fig.1 is misleading since it seems that $\tau_2$ is a vector in Fig.1 which is not true. The authors also do not clarify the reasonability of comparing vectors of different dimensions which is necessary for most settings of this paper. + +Third, this paper does not provide an indicator with both $\tau_1$ and $\tau_2$ for high adversarial transferability. Futhermore, it lacks the theoretical relation between $\tau_2$ and Cross Adversarial Loss to demonstrate its reasonability. + +If these concerns could be solved appropriately, I would consider to raise the score. + +Correctness +I have carefully checked the proofs of all theorems in this paper, and ensure that they are correct. + +Reproducibility +Yes. The setting of their experiments is clear and complete. + +Clarity +The idea, structure and expression of this paper are easy to understand and follow. However, the explanation of Definition 3 is hard to understand. + +Relation to the prior work +Yes. Authors show they have a good understanding of prior work's contributions, especially the three types of knowledge transferability. + +Addition +Pro. Authors use PGD-attack adversarial transferring in Section 5.1 but virtual adversarial transferring in Section 5.2. Is there any difference between them, and if so, what's the difference? +Pro. Authors claim that $g$ is a trainable function in Section 3 while use a specified $g$ instead in Theorem 3. Does a better $g$, such as $g = \arg \min \left\|g \circ f_{S}-y\right\|_{\mathcal{D}}^{2}$, make the bound in Theorem 3 tighter? ",5,4.0,ICLR2021 +OD5vdPJauN_,4,fycxGdpCCmW,fycxGdpCCmW,a good submission empirically improving on top of prior hybrid EBM works,"Summary +- Paper proposes Hybrid Discriminative Generative training of Energy based models (HDGE) which combines supervised and generative modeling by using a contrastive approximation of the energy based loss +- Approach shows this is better than baselines on various tasks like confidence calibration, OOD detection, robustness and classification accuracy + + +Clarity +- Overall well written paper. Figures and tables are informative and supplement the flow. +- Formatting error in figure 4 in appendix + + +Novelty +- Paper proposes a simple but unified view of contrastive training, generative and discriminative modeling - a nice, novel contribution with empirically strong results +- Gets rid of computationally expensive SGLD by using contrastive approximation which was a key limitation of prior energy based modeling work like JEM + + +Significance +- Results are compelling across a wide range of tasks over existing (EBM) baselines including calibration, robustness, OOD detection, generative modeling and classification accuracy + + +Questions/clarifications/comments +- “, Grathwohl et al. 
(2019) show the alternative class-conditional EBM p(y|x) leads to significant improvement in generative modeling while retain compelling classification accuracy” -> Not sure the JEM model is class conditional + +- How alpha = 0.5 (the weighting chosen)? The details are not presented. + +- Error bars are missing for the classification accuracy experiments in Table 1 which makes it hard to verify improvements especially wrt supervised contrastive loss method + +- Detail on how classification accuracy is computed when using generative term in HDGE is missing? Is it a linear classifier on top of learned representations? + +- “Prior work show that fitting a density model on the data and consider examples with low likelihood to be OOD is effective” -> Not completely true see https://arxiv.org/abs/1810.09136 + +- Please share exact details on how p(x) for OOD score is calculated +- Error bars again missing in Table 2 + +- “We find HDGE performs beyond the performance of a strong baseline classifier” - this is a strong statement as only for CIFAR10/Celeb A the gains of HDGE are clear + +- Why was the Winkens et al, 2020 contrastive baseline not used here to compare in Table 2 - https://arxiv.org/abs/2007.05566? +- “HDGE is conceptual simple to implement, scalable, and powerful.” -> conceptually. Also scalability is a somewhat strong claim as the main datasets used here are CIFAR variants. + +- Was HDGE + JEM experiments also performed for OOD detection? + +- Legend in figure 4 should be “HDGE” not “HDSE”? + +Overall good effort with seemingly good improvements over prior efforts on hybrid EBMs over a number of tasks. Main concern is lack of error bars which makes it hard to validate claims in certain cases. +",6,4.0,ICLR2021 +SylwnL-I6m,2,HJMCdsC5tX,HJMCdsC5tX,This paper would not be best fit to ICLR.,"This paper focuses on the extraction of the (multi) periodicities from a signal. The paper describes the conventional method based on the Fourier transformation and/or autocorrelation methods, and proposed method, which first detects a distribution of spectral leakages, and prune the periodicity hints by using a clustering algorithm. The proposed method is also extended to deal with multi-periodicities. The effectiveness of the proposed method is shown with the controlled simulation data and several real data. This paper is well written (note it is over 8 pages though), but it is not learning-based approach, and would not best fit to major ICLR interests. + +Comments: +- The abstract needs to be more self-consistent without referring the citation for a brief explanation. Also it should have more detailed experimental discussions. +- Algorithm 1 needs some refinement (too code-like, although it is understandable). For example, several methods (nextBinValue and append) would be better to be replaced with other (human readable) expressions.",5,2.0,ICLR2019 +820tCAKilvN,3,Cb54AMqHQFP,Cb54AMqHQFP,Re-training matters as much as sparsification,"The authors conducted a comprehensive set of experiments on choices of learning rate schedules for re-training/fine-tuning during iterative or after 1-shot pruning of deep convnets. Empirically, they reported that high learning rate (LR) is particularly helpful in recovering generalization performance of the resultant sparse model. 
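As a concrete reference for the discussion, here is a minimal sketch of the prune-then-retrain loop with a restarted learning-rate schedule, in the spirit of what the paper studies; the linear model, the data and the schedule values below are my own toy choices and are not taken from the paper.

```python
# Toy prune-then-retrain loop: magnitude pruning of a linear model, then retraining
# either with a small constant LR or with a large restarted LR (my own illustration).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 32))
w_true = rng.normal(size=32) * (rng.random(32) < 0.3)    # sparse ground truth
y = X @ w_true + 0.01 * rng.normal(size=256)

mask = np.zeros(32, dtype=bool)                          # nothing pruned yet

def train(w, lr_schedule):
    for lr in lr_schedule:
        grad = X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
        w[mask] = 0.0                                    # keep pruned weights at zero
    return w

w = train(np.zeros(32), [0.05] * 200)                    # dense pre-training
mask = np.abs(w) < np.quantile(np.abs(w), 0.7)           # prune 70% by magnitude
w[mask] = 0.0

for name, sched in [('small constant lr', [0.005] * 200),
                    ('restarted large lr', np.linspace(0.05, 0.005, 200))]:
    w_r = train(w.copy(), sched)
    print(name, 'retrain loss:', round(float(np.mean((X @ w_r - y) ** 2)), 4))
```

Of course a convex toy like this only illustrates the mechanics of the loop, not the generalization effect the paper reports for deep networks.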
The results are purely empirical, well-documented observations from well-designed experiments, which is of practical value in practice of network compression, and the consistent, somewhat surprising observation raises interesting questions. + +Notably, this work has brought to attention an important but often overlooked aspect of network pruning: there exist complex interactions between the dynamics of optimization and sparsification, and as a consequence, it is only fair to compare two sparsification techniques when each of them are put in the _best_ optimization setup, respectively. + +I have a few comments that I wish the authors would address here, discuss in revision or note for future work: + +(1) Why is large LR helpful in recovering the accuracy of sparse nets? There is little information provided in these experimental results to shed light on this question. There has been loss landscape studies of sparse nets during training (such as arxiv:1906.10732, arxiv:1912.05671)--perhaps these could be applied to study the problem. If the high LR's role were to knock the solution out of bad local minima, then does adding noise to gradients or smaller batch size achieve similar effect at the initial phase of re-training? + +(2) Given a fixed re-training flop budget, after a pruning operation on the network, both (a) weight value rewinding (as in the Lottery Ticket Hypothesis training), (b) re-training LR schedule (as in this work) might be potentially helpful. How does weight value rewinding interact with LR? + +(3) For the random pruning results in Sec. 4, do fine-grain unstructured pruning methods present the same results? + +(4) Does the result generalize to transformer models? What about optimizers? Does Adam present a same story as SGDM? + +Page 5, line1 of the 3rd paragraph of Sec. 3.2: typo ""reachs""",8,5.0,ICLR2021 +mw6U9lMb6m5,3,GCXq4UHH7h4,GCXq4UHH7h4,Lack of novelty hinders the acceptance,"This paper proposes to jointly optimize non-uniform subsampling pattern as well as the reconstruction network to perform image compressed sensing. It shows using learned selective sensing can significantly improve the reconstruction accuracy over the (random Gaussian) compressed sensing and uniform subsampling counterparts. + +However, the novelty of the proposed method is very limited. The idea of using continuous interpolation on the discrete indices/grid, i.e. the main technical contribution in this paper, has already been proposed for CS-MRI sensing pattern design [1]. Although it is conducted on the k-space, applying this technique on image pixel space is straightforward and trivial. Besides, the comparison to a closely related work [2] is missing -- the probabilistic subsampling could also be used for the problem this paper tackles. +Finally, since the selective sensing pattern is learned from data, it's more fair to compare it with the learned compressed sensing counterpart, as exploited in the journal version of the ReconNet [3] used in this paper. + +[1] PILOT: Physics-Informed Learned Optimized Trajectories for Accelerated MRI +[2] Deep probabilistic subsampling for task-adaptive compressed sensing +[3] Convolutional Neural Networks for Noniterative Reconstruction of Compressively Sensed Images ",4,5.0,ICLR2021 +H1ezHHmAtB,2,r1lZ7AEKvB,r1lZ7AEKvB,Official Blind Review #2,"The paper elaborates on the expressivity of graph neural networks (GNNs). 
More precisely, the authors show that expressivity of AC-GNNs (aggregate and combine) can only express logical classifiers that can be expressed in graded modal logic. By adding readouts, ACR-GNNs (aggregate, combine and readout) can capture FOC2 which is logical classifiers expressed with 2 variables and counting quantifiers. The second theorem leaves open the question of whether ACR-GNNs can capture logical classifiers beyond FOC2. + +The paper is written nicely, its easy on the eyes, and delegates the proofs to the appendix. I was a bit surprised by the lack of a discussion connecting the choice of the aggregate and combine operations to the representation power of GNNs. One has to delve deep into the proofs to find out if the choice of these operations affects expressivity.",8,,ICLR2020 +rkoNCSV4e,3,SygGlIBcel,SygGlIBcel,Review,"This paper proposes an extension of neural network language (NLM) models to better handle large vocabularies. The main idea is to obtain word embeddings by combining character-level embeddings with a convolutional network. + +The authors compare word embeddings (WE),character embeddings (CE) as well a combined character and word embeddings (CWE). It's quite obvious how CE or CWE embeddings can be used at the input of an NLM, but this is more tricky at the output layer. The authors propose to use NCE to handle this problem. NCE allows to speed-up training, but has no impact on inference during testing: the full softmax output layer must be calculated and normalized (which can be very costly). + +It was not clear to me how the network is used during TESTING with an open-vocabulary. Since the NLM is only used during reranking, the unnormalized probability of the requested word could be obtained at the output. However, when reranking n-best lists with the NLM feature, different sentences are compared and I wonder whether this does work well without proper normalization. + +In addition, the authors provide perplexities in Table 2 and Figures 2 and 3. This needs normalization, but it is not clear to me how this was performed. The authors mention a 250k output vocabulary. I doubt that the softmax was calculated over 250k values. Please explain. + +The model is evaluated by reranking n-best lists of an SMT systems for the IWSLT 2016 EN/CZ task. In the abstract, the authors mention a gain of 0.7 BLEU. I do not agree with this claim. A vanilla word-based NLM, i.e. a well-known model, achieves already a gain of 0.6 BLEU. Therefore, the new model proposed in this paper brings only an additional improvement of 0.1 BLEU. This is not statistically significant. I conjecture that a similar variation could be obtained by just training several models with different initializations, etc. + +Unfortunately, the NLM models which use a character representation at the output do not work well. There are already several works which use some form of character-level representations at the input. + +Could you please discuss the computational complexity during training and inference. + +Minor comments + - Figure 2 and 3 have the caption ""Figure 4"". This is misleading. + - the format of the citations is unusual, eg. + ""While the use of subword units Botha & Blunsom (2014)"" + -> ""While the use of subword units (Botha & Blunsom, 2014)""",3,4.0,ICLR2017 +HJl5Z4Kf67,3,BkxgbhCqtQ,BkxgbhCqtQ,Interesting topic but contributions are not well-motivated,"The authors propose “Stochastic Quantized Activation Distributions” (SQUAD). 
It quantizes the continuous values of a network activation under a finite number of discrete (non-ordinal) values, and is distributed according to a Gumbel-Softmax distribution. While the topic is interesting, the work could improve by making more precise the benefit of (relaxed) discrete random variables. This will also allow the authors to more precisely display in the experiments why this particular approach is more natural than other baselines (e.g., if multimodality is the issue, compare to a mixture model; if correlation is a difficulty, compare to any structured distribution such as a flow). + +Derivation-wise, the method ends up resembling Gumbel-Softmax VAEs but under an information bottleneck (discriminative model) setup rather than under a generative model. Unfortunately, that in and of itself is not original. + +The idea of quantizing a continuous distribution over activations using a multinomial is interesting. However, by ultimately adding Gumbel noise (and requiring a binning procedure), the resulting network ends up looking a lot like continuous values but now constrained under a simplex rather than the real line. Given either the model bias against a true Categorical latent variable, or continuous simplex-valued codes, it seems more natural as a baseline to compare against a mixture of Gaussians. They have a number of hyperparameters that make it difficult to compare without a more rigorous sensitivity analysis (e.g., bin size). + +Given that the number of bins they use is only 11, I’m also unclear on what the matrix factorization approach benefits from. Is this experimented with and without?",5,3.0,ICLR2019 +pn5QGsQNRn_,1,igkmo23BgzB,igkmo23BgzB,Official Blind Review #4,"The authors introduce a log-barrier extension loss term enforcing soft constraints on the range of values to enable fully end-to-end quantization-aware training. + +Strengths of the paper: + +- The paper addresses an important topic, because there are increasing concerns in performing fully end-to-end low precision training to deploy on low-precision hardware. +- The method has a practical goal and could be interesting for practitioners + +Weaknesses of the paper: + +- Lack of positioning with respect to the SOTA quantization-aware training(QAT) and post-training quantization(PTQ) schemes, there are plenty of missing related literatures on both quantization schemes. Some statements in the background and related work could be wrong. For example, QAT also focuses on efficient inference as well as PTQ. The levels of practical applicability of a variety of quantization solutions have been introduced in DFQ(Nagel etal., 2019). Survey on the related work is not sufficient. +- Having a benchmark would be interesting if it will include some SOTA methods and evaluates with them. The comparison targets are mostly out-of-date. It is lack of convincing evaluation results to support the proposed scheme. +- Organizing the whole contents is ok but not good enough for the readers to easily follow and understand. + +Detailed comments: + +(1) The terminology on swamping might not be familiar with the ML community. Explaining the criticality of swamping problem is not good enough in the intro. You should provide how critical the problem is on the low-precision hardware with the other SOTA quantization schemes. For example, the probability of occuring swamping without applying the proposed scheme, etc. + +(2) Evaluation results provided in the paper are just for comparing accuracy. 
Accuracy loss is intrinsic in fully end-to-end low-precision training. The benefits of employing the proposed scheme would be beyond accuracy, say memory or energy-saving constraints for on-device training. Experimental evaluations to support the necessity and merits of the proposed scheme should be provided. + +(3) The quantization range is fixed in the proposed scheme. Is it a merit that the proposed scheme does not need to adjust the range and precision either per-layer or per-channel during training as in other SOTA methods? + +(4) Writing on the constrained optimization formulation is a bit verbose and not properly formulated. + +(5) Inducing the tail bound of distribution to demonstrate that the probability of swamping can be controlled, several assumptions and approximations have been applied for the worst-case upper bound. Are the assumptions reasonable to work in practice? For example, assuming that the weight distribution is Gaussian is too strong to be practical. + +(6) The paper has a conceptual overlap with other quantization approaches and some of the proposed scheme is not entirely novel resulting in a weak contribution. + +(7) In Table 2, the MobilNet has more severe degradation of accuracy than the ResNet on the low-precision(8-bit) setting. Could you explain why this happens? + +Minors: +- Several typos: There was been in p.2, to soft threshold the range of in p.3, theta-i in eq.(3), some more in p.7",3,5.0,ICLR2021 +QzpGk_O_TP4,2,nCY83KxoehA,nCY83KxoehA,"Some nice results, but I'm not fully convinced by the technique","This paper explores a way of learning how to automatically construct a concatenated set of embeddings for structured prediction tasks in NLP. The paper's model takes up to L embeddings concatenated together and feeds them into standard models (BiLSTM-CRFs or the BiLSTM-Biaffine technique of Dozat and Manning) to tackle problems like POS tagging, NER, dependency parsing, and more. Search over embedding concaneations is expressed as a search over binary masks of length L. The controller for this search is parameterized by an independent Bernoulli for each mask position. The paper's approach learns the controller parameters with policy gradient, where the reward function is (a modified version of) the accuracy on the development set for the given task. This modified reward uses all samples throughout training to effectively get a more fine-grained baseline for the current timestep based on prior samples. Notably, the paper uses embeddings that are already fine-tuned for each task, as fine-tuning the concatenated embeddings is hard due to divergent step sizes and steep computational requirements. + +Results show gains over randomly searching the space of binary masks. The overall model outperforms XLM-R in a range of multilingual settings. + +This paper has some nice empirical results and the simplicity of its approach is attractive. But there are two shortcomings of the paper I will discuss. + +MOTIVATION/COMPARISONS + +The authors motivate their technique by drawing parallels to neural architecture search. 
But I actually think what the authors are doing more closely resembles ensembling, system combination, or model stacking, e.g.: +https://www.aclweb.org/anthology/N09-2064.pdf +https://www.aclweb.org/anthology/N18-1201.pdf + +When you take large Transformer models (I have to imagine that the Transformers are contributing more to the performance than GloVe and other static word embeddings -- and Table 11 supports this somewhat) and staple a BiLSTM-CRF on top of them, most of the computation is happening in the (fixed) large Transformer. Most NAS methods I'm familiar with re-learn fundamental aspects of the architecture (e.g., the Evolved Transformer), while fixing most of the architecture and re-learning a last layer or two is more suggestive of system combination or model stacking. + +My main question is: did the authors try comparing to an ensemble or post-hoc combination of the predictions according to different models? Computationally this would be cheaper than what the authors did. It's also much faster to search over 2^L-1 possibilities when checking each possibility just requires decoding the dev set rather than training the BiLSTM-CRF -- actually, this can be done very efficiently if each model's logits are cached. + +There are more sophisticated variants of this like in the papers I linked above where each model has its own weights or additional inputs are used. Intellectually, I think these approaches are related, and they should be discussed and compared to. + +RESULTS + +As for the results, Table 1's gains are small -- they are consistent over random search, but I don't find them all that convincing. There are too many embeddings here for ALL to work well -- my guess would be that a smaller set would yield better performance. + +Tables 2-4 show improvements over existing baselines, XLM-R, and XLNet. This performance is commendable. However, again, I don't know how this compares to ensembling across a few leading approaches (like mBERT and XLM-R for the cross-lingual tasks). + +CONCLUSION + +In the end, I'm not sure how readily this approach will be picked up by others. Because the embeddings aren't themselves fine-tuned as part of the ensemble, it really feels more like a fine-tuned ensemble of existing models rather than true NAS. And the overhead of this approach is significant: it requires running many training runs over large collections of existing pre-trained models to get a small improvement over the current state-of-the-art. This is a possibly useful datapoint to have in the literature, but it feels like the technique isn't quite right to lead to more work in this area. + +MINOR: + +""use BiLSTM-Biaffine model (Dozat & Manning, 2017) for graph-structured outputs"" + +This is a very particular structure, namely a directed minimum spanning tree (MST), though projective trees are also possible using the Eisner algorithm. The paper should specify that it's these, and not arbitrary graphs that are being produced here. + +------------------- + +UPDATE AFTER RESPONSE + +Thanks for the response and the additional experiments. The comparison between ACE and these other techniques is nice to see, although I'll note that both SWAF and voting shouldn't make totally independent predictions in tasks like NER, but should at least respect constraints in the label space (not sure if there were applied or not). + +In the end, my opinion of this paper largely comes down to the practicality of this technique and its likelihood to be adopted more generally. 
This results in a large, complex model, and while I am now convinced that the authors have a better ensembling/combination technique than some others, I think it still falls short of a real ""neural architecture search"" contribution or a really exciting result.",4,4.0,ICLR2021 +RcK-eyPSgMu,2,zg4GtrVQAKo,zg4GtrVQAKo,feasible and sound approach for private information retreival,"Summary: +The paper aims at taking a new approach towards the problem of private information retrieval. The proposed method relies on the interplay of the three parameters: distortion (utility), leakage (privacy) and the download rate/cost. They try to decrease the download cost, by sacrificing some utility (lossy compression through GANs), which is an interesting and seemingly novel take on the problem. Their method needs to solve two optimization problems: the first one assumes a fixed utility and privacy, and minimizes the download rate, the second one is a minimax loss that assumes a fixed rate and trades off privacy for utility. They then propose three practical methods based on this, the first two are not applicable on all cases, the last one, however, is data driven and can be applied to a wider range of problems. + +pros: ++The method offers a nice trade-off between the three dimensions of cost, utility and privacy. ++The proposed method is evaluated both theoretically and experimentally which helps verify its veracity. + +cons: +- This is more of a question: why are there only two datasets, the synthesized one and MNIST? why not bigger datasets? would the proposed methods also work on larger images? I assume one reason would be because information theoretic bounds are much harder to enforce in high dimension cases. Is that the case? If so, how can it be addressed? + +- Also, in Section 5, paragraph third paragraph, there was a sentence which is a bit ambiguous to me: ""However, due to a very small size of images, the overhead (e.g., for storing the Huffman codebook) turned out to be unacceptably high.""I did not completely understand what the problem is. Is it the case that because the images are small, the codebook becomes really large? Would this get worse with larger images? + + +[I am not at all familiar with information retrieval and the work surrounding it, so I am not entirely confident in my review and I might update it based on the review of expert reviewers later on].",6,3.0,ICLR2021 +Bs7lO3v9Fon,3,e-ZdxsIwweR,e-ZdxsIwweR,"An important topic, but the results and the presentation are substandard","The paper suggests two approaches to combine the concepts of robust Markov decision processes (MDPs) with that of constrained MDPs. In the first approach, called R3C, a worst-case setting is used for both the expected total discounted rewards criterion and the constraints on the state-action pairs. The robustness is defined with respect to all possible choices (from an uncertainty set) of transition-probability functions. In the second approach, called RC, only the constraints should be robust against all possible transition probabilities. The paper studies the value functions and the corresponding Bellman operators of these problems and argues that, in both cases, these operators are contractions in the supremum norm. Finally, numerical experiments are presented on RWRL problems, such as the cart-pole and the walker, showing the effect of using the redefined operators. 
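(For concreteness, since the paper never writes it out cleanly, my reading of the R3C update is the usual robust Bellman operator with an inner optimisation over the uncertainty set, $$(T V)(s) \;=\; \max_{a}\; \min_{p \in \mathcal{P}(s,a)} \Big[ r(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, V(s') \Big],$$ with an analogous operator for each constraint-value function in which the inner optimisation is taken in the adversarial direction for the constraints. This is only my reconstruction from the summary above, not a definition given in the paper.)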
+ +The general problem that the paper studies (namely, robust constrained MDPs) is nice and worth investigation, but the offered combination is straightforward, the theoretical results are weak, and the paper is poorly written (see below). Therefore, the current form of the paper is substandard and needs major improvements. + +Several notations and concepts are not specified, starting from the state and action spaces of the MDP. For example, it is not clear whether these spaces are finite, or otherwise, what structure is assumed about them (the minimal assumption that one needs is that they are measurable spaces). The uncertainty set itself is not defined, just the notation is used. The constraint function $C$ is defined as $S \times A \to \mathbb{R}^K$, but then few lines later in the definition of $J_C^{\pi}$ we simply have $\sum_{t=0}^{\infty} \gamma^t c_t \leq \beta$, without actually stating what $c_t$ is. If $c_t$ is defined as $c_t := C(s_t, a_t)$, then $\beta$ should be a vector to make the above inequality meaningful, with the notation that $\leq$ means coordinate-wise less than or equal. However, things like these should be guessed by the reader as the paper lacks proper definitions. + +In Section 2.3.1, about the R3C part, in the second equation after (1), the obtained results are dubious, as ${\bf V}(s')$ should not depend on $s'$, as $s'$ is just a random variable with respect to an expectation is taken (see the definition of the classical Bellman operator). Also function ${\bf V}$ is not defined in the paper, the reader should guess its meaning from the appendix. In section 2.3.1, in equation (2), it is not clear with respect to what probability distribution the expectation is taken in $V(s)$. Is there a special element of the uncertainty set, a ""nominal"" model? + +The structure of the paper is also a bit chaotic. For example, there is an ""Experiments"" and also an ""Experimental Results"" part, both containing results of various experiments. Moreover, Sections 6 and 7 should be subsections of Section 5, etc.",4,4.0,ICLR2021 +MEcjIJKSjwF,2,oSrM_jG_Ng,oSrM_jG_Ng,review 351,"This paper presents a new method for the camera based physiological measurement task. The key idea of this work is to use attention mechanism to learn discriminative features from regions of interest, and use the inverse attention mask to select contextual information to learn noise representation for refinement. Experiments on three datasets show state-of-the-art performance. + +I have several main concerns. + +1. The main idea of using the reverse attention mask to learn noise information is not significant. In fact, using reverse attention to focus on other regions has been studied in other areas (e.g., for saliency detection). The paper applies the idea to a specific task, i.e., the camera based physiological measurement task. + +2. The motivation is not well justified. It is not explained the advantages of using reverse attention mask to learn noise representation. Taking Figure 1 for example, it seems that the attention mask focuses on some key facial regions. While these regions of interest also contain noisy informaiton (e.g., motion and lighting change), the reverse attended regions (the textured background) may not necessarily contain. Why not directly learn an attention map for noise estimatation, or why not directly use the original attention map to learn noise representation? + +3. 
The discussion of how existing methods deal with large motion and noise is not given in the related work section. It is then hard to evaluate the significance of proposed method. + +4. The computation of noise (Eq.1) is a multiplication of the reverse attention mask and the original input. However, I do not understand how this operation can be interpreted as noise estimation, as it seems to be a region selection process. Given that, the denoising model seems to be a refinemet model that considers reverse attended regions and the output of the first model. It is hard for me to understand the second model as a denoising model. + +Minor issues. + +1. The description of Fig.1 can be improved with more details. + +2. It is better to provide more details of how errors are corrected in the MMSE-HR dataset (Last sentence of the first paragraph in page 6). + +3. Section 5 talks about the training details. It would be better to directly use ""Training Details"" instead of ""Experiments"". + +4. In the first paragraph of section 6, four variants are given. However, it is not easy to understand the last two, i.e., how the noise subtraction is performed in frequency or time domain. + +5. Tables 3 -> Table 3 (page 8). + +6. The abstract mentions two datasets but there are actually three datasets in the experiments. +",5,4.0,ICLR2021 +T6gJRTrzKGc,1,GY6-6sTvGaf,GY6-6sTvGaf,The paper proposes a new approach in RL by showing effectiveness of image augmentation in DQN. They compares against benchmark available in DeepMind control suite on several environments. ,"PROS +------------------------------------------- +- Finding that image augmentation helps in learning a good policy is indeed a substantial contributions. +- Other two aspects to regularize Q-values are also helping to learn a better policy. + +CONS +-------------------------------------------- +The paper presents many experiments but there are a few crucial ones which are missing. +For example, +- what is the impact on training time of DrQ as compared conventional DQN? +- How much augmentation is good? etc. + +Question that needs justification: +--------------------------------------------- +- Overall, the paper is well written and explain various things but the algo-1 needs explanation of all parameters. For example, \math{D} (reader has to think that it's a replay buffer), \math{u(T)} is not at all clear. Many notation are not explained even in text and needs a clear explanation for reader from RL and non-RL domain. + +- The primary claim of the paper is that image augmentation improves the performance. Sec5.1 shows significant improvement when image augmentation is used with different methods but it is very strange to see the improvement is just by adding 4 pixels on image boundary. Is there implication when we do more augmentation by increasing the size of random shift? How does the augmented image compare visually to the original image? The figure-6 in appendix shows all the results but which one is the random shift? + +- In figure4, the SAC state is significantly better than the DrQ and no explanation is provided. Why one should not used SAC compared to DrQ? + +- Paper claims that the proposed method can work with any model-free RL algorithm. Any justification or experiments to support the claim? If not, the contribution needs a re-writing. +",7,4.0,ICLR2021 +P6tt2yGZ1jM,4,j0p8ASp9Br,j0p8ASp9Br,"An interesting paper about uncertainty quantification in dynamics learning for control. 
The proposed method EpiOut is sample-free at the inference time and it outperms some baselines. But there are no theoretical justification or insights, and the binary epistemic output is too simplified.","**Pros and the Key Idea** +This paper studies uncertainty quantification (UQ) in model-based learning for control, which is a timely and important research direction. The proposed method (EpiOpt) trains a neural network to predict the epistemic uncertainty directly. The training data for epistemic uncertainty prediction is artificially generated based on a simple nearest neighbor principle. The key ideas are: given the labeled training dataset $X_{tr},Y_{tr}$ (the data size $|X_{tr}|=N_{tr}$), this paper first randomly samples $X_{epi}$ around $X_{tr}$, where $|X_{epi}|=N_{epi}=k\times N_{tr}$. Then this paper labels $x\in X_{epi}$ by $1$ if the minimum distance from $x$ to $X_{tr}$ is far, and by $0$ if the distance is short. Finally, a neural network is trained for this binary classification task. + +Finally, this paper uses this idea in online control: the learned epistemic uncertainty is for adaptive data collection, and the aleatoric uncertainty is for control gain tunning. The advantage of this framework is that it is very simple, and doesn't need sampling or test-time dropout at the inference time. + +**Cons and Suggestions** +(1) Many related work is missing especially for domain shift and adaptive control. As mentioned in this paper, the epistemic uncertainty is mainly from the data distribution shift, but there is no discussion about domain shift in this paper. The main idea of domain shift in ML is to *quantify the ""distance'' between the source and target domains*, which is similar to the epistemic uncertainty prediction $\eta(\cdot)$ in this paper. People also considers domain shift in control and learning (http://proceedings.mlr.press/v120/liu20a.html, https://arxiv.org/abs/2006.13916 and many others). +Also, adaptive control can handles epistemic uncertainty in an online manner as well. It would be great to discuss the difference between adaptive control. +The ensemble method (e.g., https://proceedings.neurips.cc/paper/2017/file/9ef2ed4b7fd2c810847ffa5fa85bce38-Paper.pdf) is also an active method for UQ. + +(2) The key method, *EpiOpt*, is a bit too simple and there are no theoretical justifications or insights, such that it is hard to convince me about the generalirity and robustness of this method. As I mentioned above, the key idea of *EpiOpt* is to train a ""distance function"" $\eta(\cdot)$ to quantify the distance between the source and target data. This idea has been both theoretically and empiracally studied in domain shift and transfer learning. A few questions pop up from this paper: +* Why do you need to sample $X_{epi}$ around $X_{tr}$? Since the goal is to get $\eta(\cdot)$, why don't you consider some analytical solutions such as KDE (kernel density estimation) to derive $\eta(\cdot)$? In other words, you could just use $X_{tr}$ to estimate a density function for the source data, and then evaluating this density function to get $\eta(\cdot)$. I didn't see a clear reason to *train* a neural network to estimation this distance. +* In equation (5), this paper labels $X_{epi}$ either $1$ or $0$. Why isn't it a continuous value from $0$ to $1$? For example, you can rank $d_j$ to get this continuous value. + +(3) The title and abstract of this paper emphasize a lot on *epistemic and aleatoric uncertainty decomposition*. 
However, the key method *EpiOpt* is only about the epistemic uncertainty, and how to deal with the aleatoric uncertainty only appears in the experimental section in equation (9). I highly recommend the authors discuss about these two types in the method section (Section 3) and give a general framework. The current aleatoric uncertainty is more like a gain tunning method, which is not related to *uncertainty decomposition* . + +(4) The experimental section is a bit vague. I highly recoomend the authors present the concrete learning and control problem first, e.g., Which part in the dynamics is learned? How to collect data? How to decompose the epistemic and aleatoric parts? A good example is https://ieeexplore.ieee.org/abstract/document/8794351, where the task is very similar. + +**Code-of-Ethics** +I see no ethic issues in this paper.",5,3.0,ICLR2021 +SkMJBHOez,1,rkONG0xAW,rkONG0xAW,This work suggest how to train a NN in incremental way so for the same performance less memory is needed or for the same memory higher performance can be achieved. ,"The idea of this work is fairly simple. Two main problems exist in end devices for deep learning: power and memory. There have been a series of works showing how to discretisize neural networks. This work, discretisize a NN incrementally. It does so in the following way: First, we train the network with the memory we have. Once we train and achieve a network with best performance under this constraint, we take the sign of each weight (and leave them intact), and use the remaining n-1 bits of each weight in order to add some new connections to the network. Now, we do not change the sign weights, only the new n-1 bits. We continue with this process (recursively) until we don't get any improvement in performance. + +Based on experiments done by the authors, on MNIST, having this procedure gives the same performance with 3-4 times less memory or increase in performance of 1% for the same memory as regular network. + +I like the idea, and I think it is indeed a good idea for IoT and end devices. The main problem with this method that there is undiscussed payment with current hardware architectures. I think there is a problem with optimizing the memory after each stage was trained. Also, current architectures do not support a single bit manipulations, but is much more efficient on large bits registers. So, in theory this might be a good idea, but I think this idea is not out-of-the-box method for implementation. + +Also, as the authors say, more experiments are needed in order to understand the regime in which this method is efficient. To summarize, I like this idea, but more experiments are needed in order to understand this method merits. ",7,4.0,ICLR2018 +gvWCbHwIDt,4,tilovEHA3YS,tilovEHA3YS,Interesting learning approach for a classical statistical problem,"The paper considers the problem of estimating the support of a discrete distribution, when provided access to samples and an oracle that approximately predicts the probability of the observed sample. + +They propose an algorithm based on Chebyshev polynomials  and also show that the proposed algorithm is optimal. They evaluate the algorithm on two public datasets and a synthetic dataset and show that the algorithm performs reasonably well. The results are interesting and I recommend acceptance. 
+ +The main technical contribution is to use the approximate probability of the sample to divide the interval [0, log n /n] into exponential bins and use the best Chebyshev approximation within each interval. +I strongly encourage authors to add technical comparisons between their work and that of Canonne and Rubinfeld 2014 and other relevant papers e.g., +1. Optimal Bounds for Estimating Entropy with PMF Queries +2. Probability–Revealing Samples + +I am also curious to know if similar results hold for unseen species estimation (e.g., https://www.pnas.org/content/113/47/13283.short)? +",7,4.0,ICLR2021 +H1xECZ5_qS,3,SkeAaJrKDS,SkeAaJrKDS,Official Blind Review #4,"This paper proposes an approach, named SAVE, which combines model-free RL (e.g. Q-learning) with model-based search (e.g. MCTS). SAVE includes the value estimates obtained for all actions available in the root node in MCTS in the loss function that is used to train a value function. This is in contrast to closely-related approaches like Expert Iteration (as in AlphaZero etc.), which use the visit counts at the root node as a training signal, but discard the value estimates resulting from the search. + +The paper provides intuitive explanations for two situations in which training signals based on visit counts, and discarding value estimates from search, may be expected to perform poorly in comparison to the new SAVE approach: +1) If a trained Q-function incorrectly recommends an action ""A"", but a search process subsequently corrects for this and deviates from ""A"", no experience for ""A"" will be generated, and the incorrect trained estimates of this action ""A"" will not be corrected. +2) In scenarios with extremely low search budgets and extremely high numbers of poor actions, a search algorithm may be unable to assign any of the visit count budget to high-quality actions, and then only continue recommending the poor actions that (by chance) happened to get visits assigned to them. + +The paper empirically compares the performance of SAVE to that of Q-Learning, UCT, and PUCT (the approach used by AlphaZero), on a variety of environments. This includes some environments specifically constructed to test for the situations described above (with high numbers of poor actions and low search budgets), as well as standard environments (like some Atari games). These experiments demonstrate superior performance for SAVE, in particular in the case of extremely low search budgets. + +I would qualify SAVE as a relatively simple (which is good), incremental but convincing improvement over the state of the art -- at least in the case of situations with extremely low search budgets. I am not sure what to expect of its performance, relative to PUCT-like approaches, when the search budget is increased. For me, an important contribution of the paper is that it explicitly exposes the two situations, or ""failure modes"", of visit-count-based methods, and SAVE provides improved performance in those situations. Even if SAVE doesn't outperform PUCT with higher search budgets (I don't know if it would?), it could still provide useful intuition for future research that might lead to better performance more generally across wider ranges of search budgets. + + +Primary comments / questions: + +1) Some parts of the paper need more precise language. The text above Eq. 5 discusses the loss in Eq. 5, but does not explicitly reference the equation. 
The equation just suddenly appears there in between two blocks of text, without any explicit mention of what it contains. After Eq. 6, the paper states that ""L_Q may be any variant of Q-learning, such as TD(0) or TD(lambda)"". L_Q is a loss function though, whereas Q-learning, TD(0) and TD(lambda) are algorithms, they're not loss functions. I also don't think it's correct to refer to TD(0) and TD(lambda) as ""variants of Q-learning"". Q-learning is one specific instead of an off-policy temporal difference learning algorithm, TD(lambda) is a family of on-policy temporal difference learning algorithms, and TD(0) is a specific instead of the TD(lambda) family. + +2) Why don't the experiments in Figures 2(a-c) include a tabular Q-learner? Since SAVE is, informally, a mix of MCTS and Q-learning, it would be nice to not only compare to MCTS and another MCTS+learning combo, but also standalone Q-learning. + +3) The discussion of Tabular Results in 4.1 mentions that the state-value function in PUCT was learned from Monte-Carlo returns. But I think the value function of SAVE was trained using a mix of the standard Q-learning loss and the new amortization loss proposed in the paper. Wouldn't it be more natural to then train PUCT's value function using Q-learning, rather than Monte-Carlo returns? + +4) Appendix B.2 mentions that UCT was not required to visit all actions before descending down the tree. I take it this means it's allowed to assign a second visit to a child of the root node, even if some other child does not yet have any visits? What Q-value estimate is used by nodes that have 0 visits? Some of the different schemes I'm aware of would involve setting them to 0, setting them optimistically, setting them pessimistically, or setting them to the average value of the parent. All of these result in different behaviours, and these differences can be especially important in the high-branching-factor / low-search-budget situations considered in this paper. + +5) Closely related to the previous point; how does UCT select the action it takes in the ""real"" environment after completing its search? The standard approach would be to maximise the visit count, but when the search budget is low (perhaps even lower than the branching factor), this can perform very poorly. For example, if every single visit in the search budget led to a poor outcome, it might be preferable to select an unvisited action with an optimistically-initialised Q-value. + +6) In 4.2, in the discussion of the Results of Figure 3 (a-c), it is implied that the blue lines depict performance for something that performs search on top of Q-learning? But in the figure it is solely labelled as ""Q-learning""? So is it actually something else, or is the discussion text confusing? + +7) The discussion of Results in 4.3 mentions that, due to using search, SAVE effectively sees 10 times as many transitions as model-free approaches, and that experiments were conducted on this rather complex Marble Run domain where the model-free approaches were given 10 times as many training steps to correct for this difference. Were experiments in the simpler domains also re-run with such a correction? Would SAVE still outperform model-free approaches in the more simple domains if we corrected for the differences in experience that it gets to see? 
+ + +Minor Comments (did not impact my score): +- Second paragraph of Introduction discusses ""100s or 1000s of model evaluations per action during training, and even upwards of a million simulations per action at test time"". Writing ""per action"" could potentially be misunderstood by readers to refer to the number of legal actions in the root state. Maybe something like ""per time step"" would have less potential for confusion? +- When I started reading the paper, I was kind of expecting it was going to involve multi-player (adversarial) domains. I think this was because some of the paper's primary motivations involve perceived shortcomings in the Expert Iteration approaches as described by Anthony et al. (2017) and Silver et al. (2018), which were all evaluated in adversarial two-player games. Maybe it would be good to signal at an early point in the paper to the reader that this paper is going to be evaluated on single-agent domains. +- Figure 2 uses red and green, which is a difficult combination of colours for people with one of the most common variants of colour-blindness. It might be useful to use different colours (see https://usabilla.com/blog/how-to-design-for-color-blindness/ for guidelines, or use the ""colorblind"" palette in seaborn if you use seaborn for plots). +- The error bars in Figure 3 are completely opaque, and overlap a lot. Using transparant, shaded regions could be more easily readable. +- ""... model-free approaches because is a combinatorial ..."" in 4.2 does not read well. +- Appendix A.3 states that actions were sampled from pi = N / sum N in PUCT. It would be good to clarify whether this was only done when training, or also when evaluating.",6,,ICLR2020 +SklebdhWoB,3,B1xSperKvH,B1xSperKvH,Official Blind Review #4,"This paper examines combining two approaches of obtaining a trained spikingneural network (SNN). The first approach of previous work is converting the weights of a trained artificial neural network (ANN) with a given architecture, to the weights and thresholds of a SNN, and the second approach uses a surrogate gradient to train an SNN with backpropagation. The true novelty of the paper seems to be in showing that combining the two approaches sequentially, trains a SNN that requiresfewer timesteps to determine an output which achieves state of the art performance. This is summarized by Table 1. However, it does not mention how many epochs it takes to train an SNN from scratch, nor compare this to the total training time (ANN training + SNN fine-tuning) of their approach. They also claim a novel spike-time based surrogate gradient function (eq. 11), but it is very practicallysimilar to the ones explored in the referenced Wu. et al 2018 (eq. 27 for instance), and these should be properly contrasted showing that this novel surrogate function is actually helpful (the performance/energy efficiency might only come from the hybrid approach). The authors argue for SOTA performance in Table 2, but the comparison to other work doesn’t clearly separate their performance from the otherlisted works; For example the accuracy gain against Lee et al.,2019 only comes from the architecture being VGG16 as opposed to VGG9, as can be seen from comparing with the VGG9 architecture from Table 1, furthermore they take the sameamount of timesteps, which is supposed to be the principle gain of this work. 
+ +Some small suggestions that are independent from the above: + +1.The most similar or relevant version of equation (2) in previous work could be referenced nearby for context. + +2.The last sentence of the first paragraph on p.4 “the outputs from each copy...” is confusing. Are you just meaning to describe BPTT? + +3.Typos: sec7 4th line “neruons”, sec 2.2 “both the credit” (remove “the”) + +--------------- +Following the author response I have upgraded my rating.",6,,ICLR2020 +QeHKiPg6_mD,1,_MxHo0GHsH6,_MxHo0GHsH6,"The paper shows promising results, but needs further polishing","Summary: + This paper performs a joint optimisation for DNN models, making the NAS scheme is aware of both the quantisation and architectural search spaces. The paper presented a large range of comparisons to different quantisation strategies and ran a lot of experiments to support their claims. However, the writing quality of this paper is worrying. Also, I am a little worried about the novelty of this paper. + +Strength: +1. There are a lot of experiments with the proposed method, showing a great empirical value for researchers in this field. I consider results shown in Figure 2 and Table 2 very supportive evidence of the effectiveness of the proposed method. +2. It is nice to see a large scale study (1.5K architectures) on some common properties of network architectures and their interactions with quantisation. +3. To my knowledge, this paper does present a state-of-the-art number for low-precision ImageNet classification. + +Weakness: +1. The writing quality of this paper is worrying. This is not simply to do with the use of language, but also on the clarity of some matters. I strongly recommend the authors to have a serious polish of their paper, since they do present valuable results and STOA numbers. +2. To me, the novelty of this paper is limited, it seems like an extension to Once-for-all, and the authors also cited this work. The teacher-student technique is also a published idea. The authors claim this is the first piece of work of NAS without re-training. However, they are iteratively reducing the bit-width K, which implies a large training cost and is somehow equivalent to re-training. The method in the paper looks like a combination of a number of well-known techniques, which might limit the novelty claim in this paper. However, I have to say I am not very troubled with combining a bunch of existing techniques if it show new STOA that is outperforming by a significant margin. This weakness is only minor to me. + +My suggestions & confusions: +1. It seems like you can boost the performance of quantised networks from a) jointly search for architectures and quantisation and b) teacher-student alike quantisation training with inherited weights. Could you test these two parts in isolation and quantify the contributions of each technique? +2. Why you quantise activations to unsigned numbers (Page 4)? Don’t you consider activations like leakyrelu in your activation search space? Or you do not search activations at all? +‘... NAS methods suffer from more unreliable order preserving’, Who are you comparing to in this case? Is it more unreliable compared to RL based NAS? +3. What is your Flops reported in Table 2? Flops means floating point operations, do you mean bitops? or you somehow scaled flops with respect to bitwidths? +4. ‘we focus on the efficient models under one fixed low bit-width quantization strategy’ Do you mean the network is uni-precision? So no layer-wise mixed-precision is allowed? +5. 
I spotted a number of misused languages, and will strongly recommend authors to check mistakes like: +a) Ambiguity: + i) ‘with high floating-point performance’: do you mean floating-point models? Or you mean customised floating-point models? Describe floating-point as high is very misleading. +ii) ‘quantize the network with retraining’: I guess I understand what you mean, but you might say “retrain the quantised models” to be less ambiguous. + +b) Grammar: +i). ‘different bit-width’ -> ‘different bit-widths’ +ii) ���quantization supernet’ -> ‘quantized supernet’ and so on. + +c) Do not assume readers have prior knowledge: +i) ‘we use sandwich rules’ -> ‘we use the sandwich rule’ and maybe you should consider explain what it is. + +I cannot present all the mistakes here, these are just examples, I would iterate again that I would strongly recommend you to polish the paper since I do like the results you are presenting and think if the code is open-sourced, they will benefit the community. + + + + + +",4,4.0,ICLR2021 +FZ70y38XOH_,4,TYXs_y84xRj,TYXs_y84xRj,Intuitive design with reasonable performance gain,"# Summary +In this paper, the authors propose a simple but effective keypoint-based anchor-free object detection system. The main idea is to replace the Cartesian coordinate with the polar coordinate, compared to the closest related work, FCOS. According to the extensive empirical results, the proposed system achieves a better trade-off between speed and accuracy. + +The overall writing looks good to me. The storyline is consistent and well-motivated. The authors provide enough detail to shed light on the design choices for the state-of-the-art anchor-free detector. The figures and tables are also quite informative. For example, I particularly love the figure 1 because it helps the readers catch up with the most recent progress on keypoint-based object detection frontier. It could be better if the authors could describe more details in the caption. By the way, the black color of rho and theta should be changed into a lighter one. + +# Questions + +1. scale-sensitive vs. scale-invariant +In this paper, the author mentioned the scale-related terminology many times. I wonder if the authors could explain more detail about why PolarNet is scale-invariant? From my perspective, the offset regression is scale-sensitive because the target numbers are strongly related to the actual object size. Even though the proposed method utilizes the corner points to localize objects, the bounding box offset part is still scale-sensitive, right? + +2. center-based method + polar coordinate +I wonder if the authors could try to put the polar coordinate offset regression into a center-based anchor-free detector. For example, CenterNet[1] regresses the bounding box offset in the Cartesian coordinate system. What if we change the distance encoding to polar coordinate? In section 4.4, the authors claim, ""Specifically, compared with other center-based methods such as CenterNet or FSFA, our method not only extracts features from the central region, but also encodes features from the whole bounding boxes."" I wonder how much performance gain could be seen if we only change the coordinate system. It could be a helpful ablation study to support this claim. + +# Reference +[1] Zhou, Xingyi, Dequan Wang, and Philipp Krähenbühl. ""Objects as points."" arXiv preprint arXiv:1904.07850 (2019). + +---- Post-rebuttal comments---- + +The rebuttal and the paper revision address my concerns. 
I keep my original rating.",6,4.0,ICLR2021 +rJo_WJGNg,2,r1xUYDYgg,r1xUYDYgg,"Javascript wrapper for DNN code allows training within a web browser, even using GPU. ","While it is interesting that this can be done, and it will be useful for some, it does seem like the audience is not really the mainstream ICLR audience, who will not be afraid to use a conventional ML toolkit. +There is no new algorithm here, nor is there any UI/meta-design improvement to make it easier for non-experts to design and train neural network systems. + +I think there will be relatively little interest at ICLR in such a paper that doesn't really advance the state of the art. +I have no significant objection to the presentation or methodology of the paper. +",4,2.0,ICLR2017 +C8JLEBwGU8-,2,w_haMPbUgWb,w_haMPbUgWb,interesting idea but many questions not addressed,"This paper proposes a rewriter-evaluator framework for multi-pass decoding, where translation hypothesis at last iteration is refined by the rewriter and further evaluated by the evaluator to decide whether to terminate the iterative decoding process. Compared to previous studies, the evaluator offers this framework the capability of flexibly controlling the termination. The authors also propose a prioritized gradient decent algorithm that biases the training process to those low-quality translation samples. Experiments on NIST Zh-En and WMT15 En-De demonstrate that the proposed model significantly outperforms several strong baselines. + +Pros: + += The idea behind the rewriter-evaluator framework is easy to follow. + += The proposed method achieves significant performance improvement against several multi-pass baselines on both Zh-En and En-De translation tasks. + += The authors demonstrate that the proposed training algorithm has similar training time to the vanilla baseline, i.e. no training time loss (Table 3). + +Cons: + += Some model details are missing, and the NIST Zh-En training data is not publicly available so it’s hard to exactly replicate the experiments. + += Although the framework enables flexible termination, the evaluator requires a threshold that has a large impact on translation quality and must be carefully tuned (Table 2). + += Baselines and model optimization should be further improved to fully convincing readers. + +My detailed comments are as follows: + +1. The authors claim that their model can better handle the termination. One important experiment is to testify how many iterations the model uses for translating one sentence, and what factors could affect the iteration number, such as sentence length (would long inputs require more passes?). Particularly, it would be great to have an experiment to compare the termination difference between the proposed model and the Adaptive Multi-pass Decoder (Geng et al., 2018), and show evidence how the proposed one outperforms Geng et al., 2018 (i.e. RL-based model). + +2. The authors didn’t provide full model details about how to combine source encoder outputs (x) and target encoder outputs (z) in the decoder part, i.e. more details about Ea. (4) are required. + +3. Compared to other baselines, the authors adopt the copy mechanism. Could you please provide an ablation study to justify its impact, such as retraining a model using the rewriter-evaluator framework without copying? + +4. For Transformer, the authors concatenate the source input and target translation but disables cross-attention over them. Could you please give some explanation behind this practice? 
What if we allow the cross-attention here? + +5. Using non-public dataset, like NIST Zh-En, is not suggested in my opinion. Other researchers might not be able to replicate this experiment at all. Running experiments with (C)WMT Zh-En would be a better alternative. Besides, using tokenized BLEU with ‘multi-bleu.perl’ is also not suggested nowadays. Use sacrebleu and report the signature, instead. + +6. The results, in Table 1, for WMT15 En-De on newstest2015 are not convincing. Based on my own experience, a standard Transformer-base model can already achieve a tokenized BLEU score of ~29. I think this weak baseline comes from the fact that the authors train Transformer with batch size of 80 and optimize model parameters with RMSProp, as shown in experimental settings, Section 5.1. The authors should update their experimental training. + +7. If I understand correctly, Algorithm 1 requires online decoding process, that is, performing greedy or beam search decoding during training to get real-time estimation for q and r. In my experience, the decoding is very time-consuming. However, results in Table 3 show that there is almost no training time difference! Could you please show some theoretical explanation about this? How did you perform decoding, or get z^k from z^k-1 in practice? + +If the authors can address all my concerns, I’d like to update my scores. +",4,4.0,ICLR2021 +iCmjTiYUTzu,3,GtiDFD1pxpz,GtiDFD1pxpz,A nice work with theoretical novelty although experimental study can be further expanded,"This work proposes a novel machine learning architecture (or M-layer) that is in the form matrix exponential. This architecture can effectively model the interaction of feature components and learn multivariate polynomial functions and periodic functions. The architecture of the M-layer is well described. Its universal approximation capability is explained and proved in the Appendix. Other properties related to periodicity, connection to Lie groups, the interpretation from the perspective of dynamical systems, and the robustness bounds are discussed. Experimental study is conducted on toy datasets and three image classification datasets to demonstrate the properties of the proposed M-layer. + +Overall, this is a nice piece of work. The proposed M-layer is novel, and the properties shown by this kind of architecture are interesting, especially the capability in modelling periodic functions and extrapolating data. Theoretical discussion is generally clear although some places are a bit hard to follow. Experimental study supports the claims. + +Meanwhile, this work could be further improved at the following points: + +1. The experimental study does not combine the M-layer with the convolution layers, and it only tests the M-layer on relatively simple image recognition datasets. It will be interesting to know whether the M-layer can be combined with convolutional layers and trained in an end-to-end manner. This is important to check the potential of the M-layer for the applications of visual, audio, and text data analysis, for which end-to-end training has proven to be more effective. This paper could test this setting on some larger-scale benchmark such as ImageNet if possible. + +2. In the experimental study, it will be interesting to see the comparison between the M-layer and Gaussian RBF kernel based SVM (RBF is mentioned in Related Work) on these toy datasets and image benchmarks. RBF-SVM can be regarded as a neural network with one hidden layer. + +3. 
This work indicates that the M-layer can better consider the cross-terms of feature components. This could be related to the recent work on second (or higher)-order visual representation, in which the correlations of the channels in a convolutional feature map are extracted and used as feature representation for the subsequent fully connected and softmax layers. This paper can discuss the potential connections with this line of research ([R1]-[R4]) + +[R1] Tsung-Yu Lin; Aruni RoyChowdhury; Subhransu Maji, Bilinear CNN Models for Fine-Grained Visual Recognition, 2015 IEEE International Conference on Computer Vision (ICCV). +[R2] Peihua Li; Jiangtao Xie; Qilong Wang; Wangmeng Zuo, Is Second-Order Information Helpful for Large-Scale Visual Recognition? 2017 IEEE International Conference on Computer Vision (ICCV) +[R3] Melih Engin, Lei Wang, Luping Zhou, and Xinwang Liu, DeepKSPD: Learning Kernel-Matrix-Based SPD Representation For Fine-Grained Image Recognition, European Conference on Computer Vision ECCV 2018 +[R4] Piotr Koniusz; Hongguang Zhang; Fatih Porikli, A Deeper Look at Power Normalizations, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition + +4. The capability of ""extrapolate learning"" of this M-layer is very interesting. This work uses the results in Figures 2 and 3 to demonstrate it. In addition to the double spiral data and the periodic data, is this ""extrapolate learning"" capability applicable to other more general data or patterns? Please comment. + +--- Thank the authors for the detailed response. After reading the response and the comments of peer reviewers, the rating is altered as follows. ",5,3.0,ICLR2021 +w1QOCSeoQ1t,1,cu7IUiOhujH,cu7IUiOhujH,"Clear goal, clear presentation, somewhat limited novelty / impact","The paper proposes a new training objective for fine-tuning pre-trained models: a weighted sum of the classical cross-entropy (CE) and a new supervised contrastive learning term (SCP). The latter uses the (negated) softmax over the embedding distances (i.e. dot products) between a training instance and all other instances in the batch with the same label. In contrast to the more traditional self-supervised contrastive learning (where positive pairs are obtained by applying transformations to the original data instance), there is no data augmentation; two examples with the same label constitute a positive pair. + +Experiments on the GLUE benchmark compare the baseline (RoBERTa-Large with CE loss) against the proposed objective (RoBERTa-Large with CE+SCP loss). There are 4 sets of experiments: +1) When training on the full datasets, results are quite modest (+0.4 increase in accuracy on average over 6 GLUE tasks). +2) In the few-shot setting, CE+SCP does meaningfully better than the baseline (for instance, when fine-tuning on only 20 data points, CE+SCP improves accuracy by more than 10%); these gains decrease as the dataset size increases. +3) When the datasets are noisy (effect obtained via back-translation), CE+SCP shines again (for instance, when the degree of corruption is very high, MNLI accuracy goes from ~47% up to ~53%). +4) Finally, the authors look at domain shift; they fine-tune a model on SST-2, then apply few-shot learning on other sentiment classification datasets. This set of experiments has quite high error margins, so I didn't find it as convincing as 2) and 3). + +Here are some questions/suggestions for the authors regarding their experiments: + +a) ""In all our experiments, [...] 
we sample half of the original validation set of GLUE benchmark and use it as our test set, and sample ~500 examples for our validation set from the original validation set [...]"" -- Evaluating the models on a *subset* of the validation set makes it harder to compare it against other papers that fine-tune RoBERTa-Large. I think that, at least for Table 2, it would be useful for posterity if you could either i) get the true test scores from the GLUE server, or ii) use part of the training set for validation, and then test on the full dev set, which is more standard practice. + +b) ""We run each experiment with 10 different seeds, and pick the top model out of 10 seeds based on +validation accuracy and report its corresponding test accuracy"" -- I am assuming this statement describes how evaluation numbers are reported for a fixed set of hyperparameters. Why do you choose to pick the *top* model as opposed to reporting the *average* accuracy across the 10 runs? + +c) ""we observe that our proposed method does not lead to improvement on MNLI [...]. We believe +this is due to the fact that number of positive example pairs are quite sparse [...] with batch size 16 [...]. We show evidence for this hypothesis in our ablation studies that we show in Table 3"" -- Then why doesn't Table 3 include MNLI? Am I missing something? + +d) This method excels in the few-shot setting, at least compared to the CE baseline. So I think it would be a lot more impactful to focus on this particular use case and convince the reader that CE+SCP is better than some other standard few-shot learning baselines (e.g. meta-learning objectives). I do appreciate that the current message of the paper is crystal-clear (adding a SCP term to the loss leads to better fine-tuning), but I also think that the results in Table 2 are too weak for this somewhat general statement. There is quite a bit of real-estate in the paper that could be re-allocated to something more substantive (e.g. Table 1). + +Strengths: +- The presentation of the paper is extremely clean, and the goal is clear. +- In the few-shot learning scenario, CE+SCP performs meaningfully better than the CE baseline. + +Weaknesses: +- The main weakness is related to my suggestion d) above. I believe marketing CE+SCP as a general fine-tuning solution with somewhat underwhelming results in Table 2 is a missed opportunity to lead with potentially strong results on few-shot learning. I'm calling the results ""underwhelming"" because there is evidence that a thorough hyperparameter sweep can boost fine-tuning accuracy on GLUE by quite a bit. For instance, Dodge et al. [1] show that fine-tuning BERT carefully can increase SST-2 accuracy by ~2% without any changes in the pre-trained model or fine-tuning objective. + +[1] Dodge et al., Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping",6,4.0,ICLR2021 +Skx4CYgj3X,2,S1lvm305YQ,S1lvm305YQ,"Timbre can be tranferred pretty well using a constant-Q transform for features, followed by a CycleGAN to do the transfer, followed by a Wavenet to resynthesize it to audio.","Main Idea: The authors use multiple techniques/tools to enable neural timbre transfer (converting music from one instrument to another, ex: violin to flute) without paired training examples. The authors are inspired by the success of CycleGANs for image style transfer, and by the success of Wavenet for generating realistic audio waveforms. 
Even without the CycleGAN, the use of CQT->WaveNet for time stretching and pitch shifting of a single piece is an interesting and valuable contribution. + +Methodology: Figure 1 captures the overall timbre-conversion methodology concisely. In general the details of the methodology look sound. The lengthy appendices offer additional implementation details, but without access to a source code repository, it is hard to say if the results are perfectly reproducible. + +Experiment and Results: Measuring the quality of generated audio is challenging. To do so, subjective listening tests are conducted on Amazon mechanical turk, but without a comparison to a baseline system except for another performance of the target piece. Note that there are few published timbre-transfer methods (see Similar Work). + +One issue with the AMT survey is that the total number of workers is not reported, and as such the significance of the results can be questioned. + +Significance: In my mind, the paper offers validation of the three techniques used. CycleGANs, originally designed for images, are shown to work for style transfers on audio spectrograms. Wavenet's claim to be a generic technique for audio generation is tested and validated for this domain (CQT spectrogram to audio). That CQT outperforms STFT on musical data seems to be a well established result already, but this offers further validation. + +This paper also offers practical advice for adapting the techniques/tools (Wavenet, CycleGAN, CQT) to the timbre-transfer task. + + +Similar Work: + +I have only found 2 papers dedicated to timbre transfer in the field of Learning Representations. + +Bitton, Adrien, Philippe Esling, and Axel Chemla-Romeu-Santos. ""Modulated Variational auto-Encoders for many-to-many musical timbre transfer."" arXiv preprint arXiv:1810.00222 (2018). + +which was published on sept 29th 2018, so less than 30 days ago, which is fine according to the reviewer guidelines. + + +Verma, Prateek, and Julius O. Smith. ""Neural style transfer for audio spectograms."" arXiv preprint arXiv:1801.01589 (2018). + +which is a short 2 page exploratory paper. + + +It could be useful to cite: + +Shuqi Dai, Zheng Zhang, Gus G. Xia. ""Music Style Transfer: A Position Paper."" arXiv preprint arXiv:1803.06841 (2018) + + +Writing Quality + +Overall the paper is written well with clear sentences. + +Certain key information would be useful to move from the appendices to the main body of the paper. This includes the number of AMT workers, the size of the CQT frame/hop over which they are summarized, and the set of instruments that are being used in the experiments. + + +Some minor nitpicks: + +section 6.3, sentence 2 needs to be reworked. ('After moving on to real world data, we noticed that real world data is harder to learn because compared to MIDI data it’s more irregular and more noisy, thus makes it a more challenging task.') + +section 3.2 sub-section 'Reverse Generation', sentence 1 uses the word 'attacks' for the first time. Please explain this for those not familiar. + +section 3.1, sentence 3 has a typo, 'Thanks' is wrongly capitalized. + +table 1 (and other tables in appendix), 'Percentage' (top left) does not add anything to the table. + +",7,4.0,ICLR2019 +ByxDwhP0n7,2,r1glehC5tQ,r1glehC5tQ,Not enough depth,"Defensive Distinction (DD) is an interesting model for detecting adversarial examples. However, it leaves some key aspects of defense and distinction out. 
Firstly, one can argue that if you know the adversaries of your model you can simply regularize the model for them. Even if regularization doesn't work fully, the DD model still suffers since it can have its own adversarial examples. From distinction perspective, it would be hard to believe that every single adversarial example will be detected, at least not without some solid theoretical background. It seems that and natural examples are being thrown at the DD model without an elegant approach. + +I have the following concerns about the visualization and understanding of what DD does, which I believe should have been the focus of this paper. It was not immediately clear, what the message of the paper is or the claimed message was too weak: detecting adversarial examples using a classifier. It was not immediately clear why this is a good idea (since an adversarial example can be an adversary of both original network and DD) or what the DD learns. + +Furthermore, from experimental perspective, it is not sufficient to just perform experiments on one dataset, specially if the claim is big. You should consider running your model on multiple datasets and reporting what each DD learned. Furthermore, you should establish better comparison and back your claims with proper references. Some claims were too strong to believe without reference. + +I do look forward to seeing more about the visualization and intriguing properties which may arise from continuation of your studies. In the current state, I vote to reject until a more clear demonstration of your work comes out. +",4,5.0,ICLR2019 +ZERu1NP4Ti6,2,EZ8aZaCt9k,EZ8aZaCt9k,Review,"Summary + +This paper studies the optimization landscape of the training loss of deep neural networks. For a general setup, the paper shows that if the network width is greater than $2m(n+1)^l$, then any parameter value has a path to a global minimum on which the training loss does not increase. Here, $m$ is the output dimension, $n$ is the number of training examples, and $l$ is the number of hidden layers. + +Strength + +The result presented in the paper holds for a surprisingly wide range of setups. The theorem holds for any convex loss function, any arbitrary activation, arbitrary output dimension and depth. The theorem holds for unconstrained optimization setup, and also some specific type of constrained setup (Eq (3)). + +The paper is well-written and it delivers its key messages fairly well. The main text provides a good proof sketch that is easy to follow. After a quick perusal of the proof I am convinced that the proof is correct. Overall, I enjoyed reading this paper. + +Weakness + +There are two main weaknesses of this paper: one is the huge width requirement that exponentially grows with depth, the other is too many missing (directly relevant) citations and the unconventional usage of the term “spurious local minima”. + +Let me start with the second one. There is a large body of literature that refers to non-optimal local minima as spurious/bad local minima, and investigates existence/nonexistence of such local minima in the context of neural networks [1, 2, 3, 4, and many more]. In fact, for deep networks with piecewise linear activations, no matter how wide the network is, it is known that there are non-optimal local minima in the training loss in the unconstrained case [3, 4]. The paper misses the entire body of literature and defines spurious local minima in a different way, which can confuse the readers. 
The title already sounds like a contradiction to [3, 4]! + +More importantly, there are also other missing citations that are even more directly relevant, namely, the ones studying nonincreasing paths to global minima [5, 6, 7]. This set of results typically require that one of the hidden layers is wider than the number of data points $n$ and that the subsequent layers after the wide hidden layer only get narrower. Although these results hold for more specific setups than this paper, the difference in the width requirement is huge: $n$ vs $O(n^l)$. + +In light of these existing results, I fear that the width requirement in this paper is too big, and it grows exponentially with the depth $l$. Although I liked the construction illustrated in this paper, this significant weakness of the main result makes me hesitate to recommend acceptance. + +One minor weakness I also wanted to point out was that, “nonincreasing path to global minimum” alone is not enough, in the sense that this property does not rule out the existence of a local minimum in a locally constant region of the loss landscape (think of the case where all the ReLUs are turned off in a ReLU network). Although sufficient width can ensure the existence of a nonincreasing path to a global minimum from this locally constant region, there is no way that a gradient-based local search algorithm can escape this region. + +Overall Assessment + +Although I liked this paper as I read it, I believe the weaknesses of the main result are too significant. I also think that the paper could use a rewriting to contextualize the results relative to the missing citations. I lean slightly towards rejection at this time. + + +[1] Kenji Kawaguchi. Deep Learning without Poor Local Minima, 2016 + +[2] Itay Safran, Ohad Shamir. Spurious Local Minima are Common in Two-Layer ReLU Neural Networks, 2018 + +[3] Chulhee Yun, Suvrit Sra, Ali Jadbabaie. Small nonlinearities in activation functions create bad local minima in neural networks, 2019 + +[4] Fengxiang He, Bohan Wang, Dacheng Tao. Piecewise linear activations substantially shape the loss surfaces of neural networks, 2020 + +[5] Benjamin D. Haeffele, Rene Vidal. Global optimality in neural network training, 2017 + +[6] Quynh Nguyen. On Connected Sublevel Sets in Deep Learning, 2019 + +[7] Henning Petzka, Cristian Sminchisescu. Non-attracting Regions of Local Minima in Deep and Wide Neural Networks, 2018 +",5,4.0,ICLR2021 +sCemgA99YN1,2,ct8_a9h1M,ct8_a9h1M,Small but consistent improvement in experimental results.,"Summary: This paper proposes a way of estimating data-dependent dropout probabilities. This is done in each layer using a small auxiliary neural network which takes the data (on which dropout is going to be applied) as input and outputs dropout probabilities, which are sampled and multiplied into the data. + +Pros: +- The proposed model consistently leads to improvements in accuracy and uncertainty estimation over standard dropout for multiple network types and tasks (including CNNs and Transformers). +- The experimental results are thorough and include relevant p-values and error bars. +- The method is sound and well explained. + +Cons: +- The gains in accuracy and uncertainty estimation for most tasks are small. +- A concern with learning the dropout probabilities on the training set is that the optimal value of dropout is zero, since that would remove regularization, thereby minimizing the training loss. It is not clear how this is avoided in the proposed approach. 
One thing that could prevent the dropout rates from going to zero is the prior ((\ \eta \\). However, this is also learned. So it would be good to explain what prevents the dropout rates from becoming zero. +- Conflating of 2 different effects : There are at least 2 aspects of the proposed model that could be beneficial : (1) Increase in model capacity from having a multiplicative gating interaction in the network (in expectation, the states are gated by \\(\sigma(\alpha^l)\\)), and (2) a decrease in model capacity (regularization) due to dropout noise. An ablation analysis can help tease apart the benefits coming from these two sources. This could be done, for example, by removing stochastic sampling (so that only the gating aspect remains) and optionally adding a regular dropout layer on the gated data. The increase in model capacity due to gating (aspect (1)) could partly explain why dropout rates do not become zero. + +Overall, the model leads to consistent gains in performance with relatively extra low computational cost which makes this a good contribution. However, the significance is limited because the results are only moderately better. + +Post rebuttal +The authors addressed the concerns around having a deterministic gating only baseline. I will increase my score to 7",7,4.0,ICLR2021 +HkevBVyAh7,3,HJei-2RcK7,HJei-2RcK7,"This paper proposes an intereting method for graph dataset. However, some points need to be verified.","This paper proposes a graph transformer method to learn features from the data with a graph structure. Actually it is the extension of Transformer network to the graph data. Although it is not very novel, yet it is interesting. The experimental result has confirmed the author's claim. + +I have some concerns as follows: +1. For the sequence input, this paper proposes to use the positional encoding as the standard Transformer network. However, for graphs, edges have encoded the relative position information. Is it necessary to incorporate this positional encoding? It's encouraged to conduct some experiments to verify it. + +2. It is well known that graph neural networks usually have large memory overhead. How about this model? I found that the dataset used in this paper is not large. Can you conduct some experiments on large-scale datasets and show the memory overhead? + +",6,5.0,ICLR2019 +eBW7rUs-I-,1,b-7nwWHFtw,b-7nwWHFtw,Review of Privacy-preserving Learning via Deep Net Pruning,"Overview: This paper aims to establish a theoretical connection between differential privacy and magnitude based pruning. The authors show theoretically that outputs of pruned single layer neural networks have some similarity to outputs of the same network with differential privacy added. The paper then empirically demonstrates that model inversion attacks are harder against magnitude pruned neural networks. + +Reason for score: +I tend to vote reject. Overall, I think this paper contains several correct and interesting pieces that are put together the wrong way. The empirical observation that pruned networks are harder to invert is very supported by experiments in this paper; the main theorem 4.2 is (I assume) correctly derived with some new theoretical tools that may be of independent interest. Yet, I am unconvinced about the relation between pruning and differential privacy claimed by this paper due to the concerns that listed below. + +Pros: +Paper is well written and mostly concise in presentation. 
The experiments are convincing in supporting the claim that pruned networks are harder to invert. I have glanced through the supplementary material and the derivations seem correct. + +Cons: +1. Statement about the definition of differential privacy is inconsistent: In definition 3.1, the authors define differential privacy with respect to pairs of real vectors x and y with l1 norm less than 1. This definition is incorrect as differential privacy is defined on pairs of databases (set/multiset of vectors or strings), and the neighbouring relationship between pairs of databases should be those which differ by at most one entry. In contrast, the authors appear to use a different definition in appendix D, where x and y now differ by “one entry”, which I presume to mean in one dimension of the vector. This would be analogous to protecting the value of one pixel in a database of a single image, which is a rather useless notion of privacy. + +2. Assumption on weights: the main contribution of this paper, which is theorem 4.2, assumes that the network weights are iid drawn from a normal distribution. This assumption seems far fetched since trained neural network weights tend to be strongly correlated rather than independent. I am also not aware of any empirical work who states that magnitudes of trained neural network weights are well approximated by a gaussian. Hence, I am not convinced that the result drawn in theorem 4.2 has real applicability to understanding the relation between DP and pruning (if any such relation exists at all). + +3. Conclusion drawn from theorem 4.2: while h(x) satisfies authors’ definition of DP (I discussed by why this definition is problematic in comment 1) and is close to g(x), this should be interpreted as neither adding noise or applying pruning significantly deviates from f(x) in probability. The authors instead use 4.2 to draw similarities between adding noise and applying pruning, which I think is not well supported by their findings. + +4. Relationship between Pruning and DP: In DP, the mechanism is required to be stochastic in order to maintain privacy given repeated usage. In contrast, unless the original network is re-trained for each input during inference, the output is deterministic. The idea that we can somehow achieve DP (with respect to inputs during inference) by passing the input through a pruned network is thus unrealistic. + +Questions: +If we replaced \bar{A}x with another matrix B that has upper and lower bounded spectral norm, wouldn’t it be possible to achieve the similar results to claim 4.3 and 4.4? Namely, let e=Lap^m * Bx, then f(x)+e is “private” and ||e - Bx|| is epsilon small? + +[Post rebuttal] The rebuttal has not address any of my main concerns, so my rating stays.",4,4.0,ICLR2021 +BJxNftyy6X,3,BJgYl205tQ,BJgYl205tQ,Concerns about clarity and scalability of the metric,"The paper proposes a new metric to evaluate GANs. The metric, Cross Local Intrinsic Dimensionality (CrossLID) is estimated by comparing distributions of nearest neighbor distances between samples from the real data distribution and the generator. Concretely, it proposes using the inverse of the average of the negative log of the ratios of the distances of the K nearest neighbors to the maximum distance within the neighborhood. 
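To make sure I am parsing that description correctly, here is my own small numpy paraphrase of the estimator as I understand it from the sentence above (the function names, the choice of k, and the use of Euclidean distance in the raw input space are my assumptions, not details taken from the paper):

```python
import numpy as np

def mle_lid(knn_dists):
    """LID estimate from the sorted k-NN distances r_1 <= ... <= r_k of one sample:
    the inverse of the average of -log(r_i / r_k)."""
    r = np.asarray(knn_dists, dtype=np.float64)
    ratios = np.clip(r / r[-1], 1e-12, 1.0)   # guard against zero distances
    return -1.0 / np.mean(np.log(ratios))

def cross_lid(real, generated, k=20):
    """Average LID of real samples when their k nearest neighbours are drawn
    from the generated set (the 'cross' part of CrossLID)."""
    lids = []
    for x in real:                              # real: (N, D), generated: (M, D)
        d = np.sort(np.linalg.norm(generated - x, axis=1))[:k]
        lids.append(mle_lid(d))
    return float(np.mean(lids))
```

Writing it out this way is also what motivates my later question about how much this construction buys over simply using the 1-NN distance.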
+ +The paper introduces LID as the metric to be used within the introduction, but for readers unfamiliar with it, the series of snippets “model of distance distributions” and “assesses the number of latent variables” and “discriminability of a distance measure in the vicinity of x” are abstract and lack concrete connections/motivations for the problem (sample based comparison of two high-dimensional data distributions) the paper is addressing. + +After an effective overview of relevant literature on GAN metrics, LID is briefly described and motivated in various ways. This is primarily a discussion of various high-level properties of the metric which for readers unfamiliar with the metric is difficult to concretely tie into the problem at hand. After this, the actual estimator of LID used from the literature (Amsaleg 2018) is introduced. Given that this estimator is the core of the paper, it seems a bit terse that the reader is left with primarily references to back up the use of this estimator and connect it to the abstract discussion of LID thus far. + +Figure 1 is a good quick overview of some of the behaviors of the metric but it is not clear why the MLE estimator of LID should be preferred (or perform any differently) on this toy example from a simple average of 1-NN distances. The same is also appears to be true for the motivating example in Figure 8 as well. + +To summarize a bit, I found that the paper did not do the best job motivating and connecting the proposed metric to the task at hand and describing in an accessible fashion its potentially desirable properties. + +The experimental section performs a variety of comparisons between CrossLID, Inception Score and FID. The general finding of the broader literature that Inception Score has some undesirable properties is confirmed here as well. A potentially strong result showing where CrossLID performs well at inter-class mode dropping, Figure 4, is unfortunately confounded with sample size as it tests FID in a setting using 100x lower than the recommended amount of samples. + +The analysis in this section is primarily in the form of interpretation of visual graphs of the behavior of the metrics as a quantity is changed over different datasets. I have some concerns that design decisions around these graphs (normalizing scales, subtracting baseline values) could substantially change conclusions. + +An oversampling algorithm based on CrossLID is also introduced which results in small improvements over a baseline DCGAN/WGAN and improves stability of a DCGAN when normalization is removed. A very similar oversampling approach could be tried with FID but is not - potentially leaving out a result demonstrating the effectiveness of CrossLID. + +The paper also proposes computing CrossLID in the feature space of a discriminator to make the metric less reliant on an external model. While this is an interesting thing to showcase - FID can also be computed in an arbitrary feature space and the authors do not clarify or investigate whether FID performs similarly. + +These two extensions, addressing mode collapse via oversampling and using the feature space of a discriminator are interesting proposals in the paper, but the authors do not do a thorough investigation of how CrossLID performs to FID here. + +Several experiments get into some unclear value judgements over what the behavior of an ideal metric should be. 
The authors of FID argue the opposite position of this paper that the metric should be sensitive to low-level changes in addition to high-level semantic content. It is unclear to me as the reader which side to take in this debate. + +I have some final concerns over the fact that the metric is not tested on open problems that GANs still struggle with. Current SOTA GANs can already generate convincing high-fidelity samples on MNIST, SVHN, and CIFAR10. Exclusively testing a new metric for the future of GAN evaluation on the problems of the past does not sit well with me. + +Some questions: +* Could the authors comment on run time comparisons of the metric with FID/IS? +* How much benefit is there from something like CrossLID compared to the simplest case of distance to 1-NN in feature space? More generally an analysis of how the benefits of CrossLID as you increase neighborhood size would help illuminate the behavior of the metric. +* For Table 2, what are the FID scores and how do they correlate with CrossLID and Inception Score? + +Pros: ++ Code is available! ++ The metric appears to be more robust than FID in small sample size settings. ++ A variety of comparisons are made to several other metrics on three canonical datasets. ++ The paper has two additional contributions in addition to the metric. Addressing mode collapse via adaptive oversampling and utilizing the features of the discriminator to compute the metric in. +Cons: +- No error bars / confidence intervals are provided to show how sensitive the various metrics tested are to sample noise. +- Authors test FID outside of recommended situations (very low #of samples (500) in Figure 4) without noting this is the case. The stated purpose of Figure 4 is to evaluate inter-class mode dropping yet this result is confounded by the extremely low N (100x lower than the recommended N for FID). +- It is unclear whether metric continues to be reliable for more complex/varied image distributions such as Imagenet (see main text for more discussion) +- Many of the proposed benefits of the model (mode specific dropping and not requiring an external model) can also be performed for FID but the paper does not note this or provide comparisons. +",4,3.0,ICLR2019 +B1VlMxjlG,3,ByqFhGZCW,ByqFhGZCW,"Well-written, but experiments could be more thorough. ","The authors describe a mechanism for defending against adversarial learning attacks on classifiers. They first consider the dynamics generated by the following procedure. They begin by training a classifier, generating attack samples using FGSM, then hardening the classifier by retraining with adversarial samples, generating new attack samples for the retrained classifier, and repeating. + +They next observe that since FGSM is given by a simple perturbation of the sample point by the gradient of the loss, that the fixed point of the above dynamics can be optimized for directly using gradient descent. They call this approach Sens FGSM, and evaluate it empirically against the various iterates of the above approach. + +They then generalize this approach to an arbitrary attacker strategy given by some parameter vector (e.g. a neural net for generating adversarial samples). In this case, the attacker and defender are playing a minimax game, and the authors propose finding the minimax (or maximin) parameters using an algorithm which alternates between maximization and minimization gradient steps. They conclude with empirical observations about the performance of this algorithm. 
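As a small aside on the invertible-downsampling point above: the reason such an operator cannot discard information is easy to check numerically. The snippet below is my own sketch of a pixel-shuffle-style reordering and its exact inverse (the shapes, names and factor r are mine, not taken from the paper):

```python
import numpy as np

def space_to_depth(x, r=2):
    """Invertible downsampling: (C, H, W) -> (C*r*r, H//r, W//r) by reordering pixels."""
    c, h, w = x.shape
    x = x.reshape(c, h // r, r, w // r, r)
    return x.transpose(0, 2, 4, 1, 3).reshape(c * r * r, h // r, w // r)

def depth_to_space(y, r=2):
    """Exact inverse of space_to_depth."""
    c, h, w = y.shape[0] // (r * r), y.shape[1], y.shape[2]
    y = y.reshape(c, r, r, h, w)
    return y.transpose(0, 3, 1, 4, 2).reshape(c, h * r, w * r)

x = np.random.randn(3, 8, 8)
assert np.allclose(depth_to_space(space_to_depth(x)), x)  # a pure permutation, nothing is lost
```

This is in contrast to pooling, where the round trip is impossible, and it is what makes the reported results plausible to me despite being surprising.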
+ +The paper is well-written and easy to follow. However, I found the empirical results to be a little underwhelming. Sens-FGSM outperforms the adversarial training defenses tuned for the “wrong” iteration, but it does not appear to perform particularly well with error rates well above 20%. How does it stack up against other defense approaches (e.g. https://arxiv.org/pdf/1705.09064.pdf)? Furthermore, what is the significance of FGSM-curr (FGSM-81) for Sens-FGSM? It is my understanding that Sens-FGSM is not trained to a particular iteration of the “cat-and-mouse” game. Why, then, does Sens-FGSM provide a consistently better defense against FGSM-81? With regards to the second part of the paper, using gradient methods to solve a minimax problem is not especially novel (i.e. Goodfellow et al.), thus I would liked to see more thorough experiments here as well. For example, it’s unlikely that the defender would ever know the attack network utilized by an attacker. How robust is the defense against samples generated by a different attack network? The authors seem to address this in section 5 by stating that the minimax solution is not meaningful for other network classes. However, this is a bit unsatisfying. Any defense can be *evaluated* against samples generated by any attacker strategy. Is it the case that the defenses fall flat against samples generated by different architectures? + + +Minor Comments: +Section 3.1, First Line. ”f(ul(g(x),y))” appears to be a mistake.",5,4.0,ICLR2018 +Hkl3RCOAnQ,3,HkgSk2A9Y7,HkgSk2A9Y7,"Good balance of theory and practice, lackluster experiment results","Authors propose using gossip algorithms as a general method of computing approximate average over a set of workers approximately. Gossip algorithm approach is to perform linear iterations to compute consensus, they adapt this to practical setting by sending only to 1 or 2 neighbors at a time, and rotating the neighbors. + +Experiments are reasonably comprehensive -- they compare against AllReduce on ImageNet which is a well-tuned implementation, and D-PSGD. + +Their algorithm seems to trade-off latency for accuracy -- for large number of nodes, AllReduce requires large number of sequential communication steps, whereas their algorithm requires a single communication step regardless of number of nodes. Their ""time per iteration"" result support this, at 32 nodes they require less time per iteration than all-reduce. However, I don't understand why time per iteration grows with number of nodes, I expect it to be constant for their algorithm. + +The improvements seem to be quite modest which may have to do with AllReduce being very well optimized. In fact, their experimental results speak against using their algorithm in practice -- the relevant Figure is 2a and their algorithm seems to be worse than AllReduce. + +Suggestions: +- I didn't see motivation for particular choice of mixing matrix they used -- directed exponential graph. This seems to be more complicated than using fully-connected graph, why is it better? +- From experiment section, it seems that switching to this algorithm is a net loss. Can you provide some analysis when this algorithm is preferrable +- Time per iteration increases with number of nodes? Why? Appendix A.3 suggests that only a 2-nodes are receiving at any step regardless of world-size",6,3.0,ICLR2019 +gEeeZ3w8lbm,1,xYGNO86OWDH,xYGNO86OWDH,An exploratory analysis of contextualized embedding geometry,"This paper analyzes the geometry of several contextualized embeddings. 
The authors show that global anisotropy is caused by strong clustering of word vectors, and that vectors of different word types are isotropically distributed within the cluster. + +**Strengths** +- This work is a nice-to-have extension of [Ethayarajh (2019)](https://www.aclweb.org/anthology/D19-1006.pdf) that dives deeper into the geometric properties of contextualized vectors. +- The research question is clearly stated (Why doesn't anisotropy hurt performance?) and clearly answered (There's no anisotropy locally). +- The 3D visualizations provide a better geometric intuition than the flat visualizations that are common in this kind of papers. + +**Issues** +- I don't think that good performance _contradicts_ anisotropy. For example, we already know that the classical static embeddings are also anisotropic [(Mimno and Thompson, 2017)](https://www.aclweb.org/anthology/D17-1308.pdf), and this means that good performance (as measured by downstream tasks) may co-exist with anisotropy. So please consider rewording the beginning of Section 1.2. For example, instead of ""There is an apparent contradiction."" consider ""It is not clear why ..."" +- How _representative_ is one random sample from $\Phi(t_i)$ for measuring $S_\text{inter}$ in formula (1). You gave an example in the Introduction when the same word type (bank) can have totally different meanings depending on context, and thus (I believe) the corresponding $\phi_1(\text{bank})$ and $\phi_2(\text{bank})$ may be totally different. Why not taking more samples for polysemous words? +- Why do you use different distance metrics (Euclidean vs cosine) for estimating LIDs of contextualized vs static embeddings (Table 3)? +- ""For GPT2, we had hoped to find that some types are associated with one cluster and other types are associated with the other cluster, but that is not verified in our experiments"" -- I think you should look at contexts rather than types (since you're dealing with the contextualized embeddings). It would be interesting to see whether you have the same type in both clusters, and then to look at its contexts. I bet that the contexts will differ. + +**Minor issues** +- ""We find a low-dimensional manifold in GPT/GPT2 embeddings, but not in BERT/DistilBERT embeddings."" -- but your LIDs are low for BERT/D-BERT layers as well! Why can't you claim the low-dimensionality for BERT/D-BERT embeddigs? +- I doubt that PTB with 10K vocabulary size gives ""good"" coverage in 2020. You may simply state that this a widely-used dataset. +- Wiki2 (Merity et al., 2016) is usually referred to as _WikiText-2_. +- Please consider rephrasing ""experiments"" -> ""analysis"", as you are not conducting _controlled experiments_, but rather performing exploratory analysis of the embeddings.",7,4.0,ICLR2021 +BygM1K4c37,1,BkgzniCqY7,BkgzniCqY7,"the paper has good technical qualities, but motivation for the research is not explained","The paper proposes a method to find adversarial examples in which the changes are localized to small regions of the image. A group-sparsity objective is introduced for this purpose and it is combined with an l_p objective that was used in prior work to define proximity to the original example. ADMM is applied to maximize the defined objective. It is shown that adversarial examples in which all changes are concentrated in just few regions can be found with the proposed method. + +The paper is clearly written and results are convincing. But what I am not sure I understand is what is the purpose of this research. 
Among the 4 contributions listed in the end of the intro only the last one, Interpretability, seems to have a potential in terms on the impact. Yet am not quite sure how “obtained group-sparse adversarial patterns better shed light on the mechanisms of adversarial perturbations”. I think the mechanisms of adversarial perturbations remain as unclear as they were before this paper. + +I am not ready to recommend acceptance of this paper, because I think the due effort to explain the motivation for research and its potential impacts has not been done in this case. + +UPD: the discussion and the edits with the authors convinced me that I may have been a bit too strict. I have changed my score from 5 to 6. +",6,2.0,ICLR2019 +15cjBR4BLHY,4,#NAME?,#NAME?,A solid contribution,"The paper presents some novel contributions regarding recurrent neural networks. +Building on the work of Chang et al. (2019),  the authors provide a global convergence result for the hidden representation of a family of recurrent neural networks using standard techniques from the Lyapunov analysis of dynamical systems. +The requirements of the theorem are met  (within the limits of discretization) by their proposed algorithmic scheme. +Numerical evaluation on a variety of benchmarks shows that the proposed algorithm yields systematic improvement over other RNN approaches. +For all of the above reasons, I recommend the acceptance of the paper. + +Some concerns to be addressed: +- The connection between stability and trainability or refer to Chang et al. (2019) if their analysis applies here. +- specify the functions \sigma_min and \sigma_max used in Theorem 1 +- specify the meaning of the one-arg function f(h^*) as opposed to the 2-arg f(h,t) appearing in Definition 1.  +",8,3.0,ICLR2021 +iZiAM4GPYof,4,KJNcAkY8tY4,KJNcAkY8tY4,This paper studies the effects of width and depth on neural network representation.,"In this paper,the author studies the effects of width and depth on neural network representation. + + +In this paper,the author studies the effects of width and depth on neural network representation. This paper conducts lots of experiments on CIFAR-10, CIFAR-100 and ImageNet with different network architectures and apply the CKA to measure the similarity between representation of each layer. As a result, they find a characteristic block structure in the hidden representations of larger capacity models which is also dependent to the size of dataset. This work has the following advantages: +1、 Well-arranged and detailed experiments which strongly support the final conclusion. +2、 Exquisite figures that well displays the experiment results. +3、 The concept of “block structure” in ResNet is novel. And all of the experiments and analysis illustrate there really exists blocks with similar representation in overoptimization models. And this phenomenon can guide researchers to design networks well. +However, there are some disadvantages or doubts in my opinion: +1、 This paper lacks of further explanations about the CKA or HSIC tools. I can’t fully understand how the similarities between representations of each layer are measured. +2、I wonder if the block structure arises dependent to the residual blocks. I want to see more experiments with other network architectures. +3、I think the analysis of effects of width and depth on neural network representation can well guide researchers to design networks. So,I expect to see an modified network architecture or a method to balance the network size and accuracy . 
However, this paper is just about theoretical analysis based on experiment phenomenon. Thus, I think this paper is lack of some innovation. + +4. Some previous related works is better to be appreciated: + +C. L. Philip Chen et al Broad Learning System: An Effective and Efficient Incremental Learning System Without the Need for Deep Architecture,IEEE Transactions on Neural Networks and Learning Systems,2017 +C.L.P. Chen, Z. Liu, and S. Feng, Universal approximation capability of broad learning system and its structural variations, IEEE transactions on neural networks and learning systems 30 (4), pp. 1191-1204. +",6,3.0,ICLR2021 +g4Jir8CZZ6a,2,6zaTwpNSsQ2,6zaTwpNSsQ2,"Simple extension of block floating point, interesting contribution on the hardware aspects","The authors proposes block-minifloat (BM), a floating-point format for DNN training. BM is a fairly simple extension to block floating-point (BFP), which was proposed in (Drumond 2018 and Yang 2019). In BFP, a block of integer mantissas share a single exponent. In BM, a block of narrow floats share a single exponent bias. The shared exponent bias helps to shorten the exponent field on each individual float element. This is a good contribution, though a bit trivial. + +Where the paper make a strong contribution is in the hardware implementation of BM, something which neither Drumond or Yang really got into. The authors propose to use a Kulisch accumulator for minifloat dot products, which basically works by converting the floats to integers with a shift, and accumulating with a very wide register. Kulisch accumulators are normally far too wide to be practical (see Eq. 4), but they've been proposed for posit computation (https://engineering.fb.com/2018/11/08/ai-research/floating-point-math/). This seem like a great idea here since BM can reduce the exponent length to only 2 or 3 bits. + +The authors also did a good job evaluating the area and power of the BM hardware circuit. They built a 4x4 systolic array multiplier using BM units in RTL, and synthesized to place and route. The results show that BM6 can be 4x smaller and use 2.3x less power than FP8 while achieving to comparable accuracy. This is a pretty impressive result, and the hardware evaluation methodology is more stringent than most quantization papers at NeurIPS/ICML/ICLR. The only **minor** issue I have here is the area/power numbers are reported for BM8/BM6, but the exact config is not specified. E.g. is BM8 referring to (2e, 5m)? + +The accuracy comparison is pretty standard, with CIFAR and ImageNet results using mostly ResNet-18. The authors' simulation framework slows training by 5x, so this is as much as I would expect. One **major** issue is that Tables 1 and 3 shows that for training to succeed, the forward and backwards BM formats must be different. Table 3 has three separate BM formats for each row. Implementing them all in hardware could incur significant overhead, which the paper doesn't discuss. The authors mention that the HFP8 paper does the same - but that paper defends this practice by showing that their two formats (which only differ by 1 e/m bit) can be supported by a single FP unit with minimal overhead. This paper uses (2e,5m), (4e,3m) and (6e,9m) in the same experiment labeled ""BM8"", which seems both misleading and unjustified. Note that SWALP and S2FP8 (and bfloat16/float16 papers) would use the same format in forwards and backwards pass and avoid this overhead. 
+ +A few other insights: (1) subnormal floats are important and can't just be flushed to zero; (2) a square BM block size of 48x48 seems to work fine. + +Minor issues: + - The methodology for hardware area seem solid (Appendix 4), but there isn't much detail on power. Was power obtained through modeling or using an RTL simulator? What kind of test vectors were used? + - The area/power numbers are given for ""BM8"", but what's the precise format? I assumed it was (2, 5). + - The introduction of log-BM seems very sudden, and they're only used for VGG-16? Did regular BM5 not work? I'm not sure what to take away from the comparison in Table 2. + - Equation 6 was a bit confusing for me. It would be helpful to explain briefly how each term was derived. + - Training in BM requires you to update the exponent biases in each step (?), which requires computing the dynamic range of each $N \times N$ block. I believe this is probably negligible, but it should be discussed as an additional overhead. + +EDIT: the authors have clarified that the hardware area results take into account the need to support multiple formats, which addressed my biggest issue with the paper. I have raised my score to a 7 (accept).",7,5.0,ICLR2021 +-6rkWfXRaw5,1,9D_Ovq4Mgho,9D_Ovq4Mgho,The proposed idea is not novel,"The paper describes a knowledge transfer technique based on training a student network using annotation creating by a teacher network. This is actually not a summary of the method but the method itself. Most the the rest of the paper is devote do describe experiment details. + +The idea is well known in machine learning community see e.g. Distilling the Knowledge in a Neural Network by Hinton. where is used to transfer knowledge from a huge network to a small network. Hence, there is not much novelty in the paper. + +Although the method is very simple it is difficult the follow the experimental results. It is written in a very unclear way. +Do you use step 3 in the experiments? + +what is your conclusion regarding parameter fine tuning vs. your approach? + +Over all the paper is more suitable for a medical imaging conference than fro a general deep learning conference. ",3,4.0,ICLR2021 +BJxDhRyCtS,2,rJeU_1SFvr,rJeU_1SFvr,Official Blind Review #3,"Summary: + +LOGAN optimizes the sampled latent generative vector z in conjunction with the generator and discriminator. By exploiting second order updates, z is optimized to allow for better training and performance of the generator and discriminator. + +Pros: ++ A relatively efficient way of exploiting second order dynamics in GAN training via latent space optimization. ++ A good set of experiments demonstrating the superior performance of the proposed method on both large and small scale models and datasets. + +Cons: +- Lack of code + +Comments: +All in all, this appears to be a solid contribution to the GAN literature, addressing some of the limitations of CS-GAN [1]. The lack of open source code accompanying the paper (in this day and age) does it a serious disservice. I have already tried and failed to replicate the cifar10 results. There appears to be some nuance in implementation that would probably clear up if the authors release their code along with the paper. 
+ +[1] - http://proceedings.mlr.press/v97/wu19d.html",6,,ICLR2020 +N4ZRLWLW-Ld,1,Y9McSeEaqUh,Y9McSeEaqUh,A promising novel solution,"The authors discuss how a classifier’s performance over the initial class sample can be used to extrapolate its expected accuracy on a larger, unobserved set of classes by mean of the dual of the ROC function, swapping the roles of classes and samples. Grounded on such function, the authors develop a novel ANN approach learning to estimate the accuracy of classifiers on arbitrarily large sets of classes. Effectiveness of the approach is demonstrated on a suite of benchmark datasets, both synthetic and real-world. + +The manuscript is well written and understandable also by a non-specialist audience; the reference list is up-to-date and the introduction properly details the motivations for tackling the problem. The underlying math is sound, and the proposed solution is smart, but the experimental section is not convincing, and hardly supporting the authors’ claims. Overall, I would vote for a weak accept. + +Pros: +- The proposed solution is grounded and interesting, and the results shown are encouraging; +- When optimized/improved, CleaneX may have a relevant impact on the multi class classification theory, + +Cons: +- Classifiers are compared on the basis of RMSE, although the original problem is multi class classification; I would strongly suggest a more classification-oriented measure such as multiclass MCC. +- CleaneX is compared only to regression and KDE - what about adding also very widespread algorithms such as RandomForest? +- Performance gain w.r.t. regression is quite limited, especially on real-world datasets: would CleaneX benefit from adding dropout layers or more refined activation functions?",6,4.0,ICLR2021 +HkxP0bceM,3,HJsjkMb0Z,HJsjkMb0Z,Official review," + +The paper is well written and easy to follow. The main contribution is to propose a variant of the RevNet architecture that has a built in pseudo-inverse, allowing for easy inversion. The results are very surprising in my view: the proposed architecture is nearly invertible and is able to achieve similar performance as highly competitive variants: ResNets and RevNets. + +The main contribution is to use linear and invertible operators (pixel shuffle) for performing downsampling, instead of non-invertible variants like spatial pooling. While the change is small, conceptually is very important. + +Could you please comment on the training time? Although this is not the point of the paper, it would be very informative to include learning curves. Maybe discarding information is not essential for learning (which is surprising), but the cost of not doing so is payed in learning time. Stating this trade-off would be informative. If I understand correctly, the training runs for about 150 epochs, which is maybe double of what the baseline ResNet would require? + +The authors evaluate in Section 4.2 the show samples obtained by the pseudo inverse and study the properties of the representations learned by the model. I find this section really interesting. Further analysis will make the paper stronger. + +Are the images used for the interpolation train or test images? + +I assume that the network evaluated with the Basel Faces dataset, is the same one trained on Imagenet, is that the case? 
+ +In particular, it would be interesting (not required) to evaluate if the learned representation is able to linearize a variety of geometric image transformations in a controlled setting as done in: + +Hénaff, O,, and Simoncelli, E. ""Geodesics of learned representations."" arXiv preprint arXiv:1511.06394 (2015). + +Could you please clarify, what do you mean with fine tuning the last layer with dropout? + +The authors should cite the work on learning invertible functions with tractable Jacobian determinant (and exact and tractable log-likelihood evaluation) for generative modeling. Clearly the goals are different, but nevertheless very related. Specifically, + +Dinh, L. et al ""NICE: Non-linear independent components estimation."" arXiv preprint arXiv:1410.8516 (2014). + + +Dinh, L. et al ""Density estimation using Real NVP."" arXiv preprint arXiv:1605.08803 (2016). + +The authors mention that the forward pass of the network does not seem to suffer from significant instabilities. It would be very good to empirically evaluate this claim. +",8,4.0,ICLR2018 +H1gaST1RYB,3,BygdyxHFDS,BygdyxHFDS,Official Blind Review #2,"This paper proposes to meta-learn a curiosity module via neural architecture search. The curiosity module, which outputs a meta-reward derived from the agent’s history of transitions, is optimized via black box search in order to optimize the agent’s lifetime reward over a (very) long horizon. The agent in contrast is trained to maximize the episodic meta-reward and acts greedily wrt. this intrinsic reward function. Optimization of the curiosity module takes the form of an epsilon-greedy search, guided by a nearest-neighbor regressor which learns to predict the performance of a given curiosity program based on hand-crafted program features. The program space itself composes standard building blocks such as neural networks, non-differentiable memory modules, nearest neighbor regresses, losses, etc. The method is evaluated by learning a curiosity module on the MiniGrid environment (with the true reward being linked to discovering new states in the environment) and evaluating it on Lunar Lander and Acrobot. A reward combination module (which combines intrinsic and extrinsic rewards) is further evaluated on continuous-control tasks (Ant, Hopper) after having been meta-trained on Lunar Lander. The resulting agents are shown to match the performance of some recent published work based on curiosity and outperforms simple baselines. + +This is an interesting, clear and well written paper which covers an important area of research, namely how to find tractable solutions to the exploration-exploitation trade-off. In particular, I appreciated that the method was clearly positioned with respect to recent work on neural architecture search, meta-learning approaches to curiosity as well as forthcoming about the method’s limitations (outlining many hand-designed curiosity objectives which fall outside of their search space). There are also some interesting results in the appendix which show the efficacy of their predictive approach to program performance. + +My main reservation is with respect to the empirical validation. Very few existing approaches to meta-learning curiosity scale to long temporal horizons and “extreme” transfer (where meta-training and validation environments are completely different). As such, there is very little in the way of baselines. 
The paper would greatly benefit from scaled down experiments, which would allow them to compare their architecture search approach to recent approaches [R1, R2], black-box optimization methods in the family of evolution strategies (ES, NES, CMA-ES), Thompson Sampling [R3] or even bandits tasks for which Bayes-optimal policies are tractable (Gittins indices). These may very well represent optimistic baselines but would help better interpret the pros and cons of using neural architecture search for meta-learning reward functions versus other existing methods. Conversely, the paper claims to “search over algorithms which [...] generalize more broadly and to consider the effect of exploration on up to 10^5, 10^6 timesteps” but at the same time does not attempt to show this was required in achieving the reported result. Pushing e.g. RL2 or Learning to RL baselines to their limits would help make this claim. + +Along the same line, it is regrettable that the authors chose not to employ or adapt an off-the-shelf architecture search algorithm such as NAS [R4] or DARTS [R5]. I believe the main point of the paper is to validate the use of program search for meta-learning curiosity, and not the details of the proposed search procedure (which shares many components with recent architecture search / black-box optimization algorithms). Using a state-of-the-art architecture search algorithm would have made this point more readily. + +Another important point I would like to see discussed in the rebuttal, is the potential for cherry-picking result. How were the “lunar lander” and “acrobot” environments (same question for “ant” and “hopper”) selected? From my understanding, it is cheap to evaluate learnt curiosity programs on downstream / validation tasks. A more comprehensive evaluation across environments from the OpenAI gym would help dispel this doubt. Another important note: top-16 results reported in Figure 4 and Table 1 are biased estimates of generalization performance (as they serve to pick the optimal pre-trained curiosity program). Could the authors provide some estimate of test performance, by e.g. evaluating the performance of the top-1 program (on say lunar lander) on a held-out test environment? Alternatively, could you comment on the degree of overlap between the top 16 programs for acrobot vs lunar lander? Thanks in advance. + +[R1] Learning to reinforcement learn. JX Wang et al.. +[R2] RL2: Fast Reinforcement Learning via Slow Reinforcement Learning. Yan Duan et al. +[R3] Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search. Guez et al. +[R4] Neural Architecture Search with Reinforcement Learning. Barrett et al. +[R5] DARTS: Differentiable Architecture Search. Liu et al. +",6,,ICLR2020 +SyeNxz5NaX,3,SJfZKiC5FX,SJfZKiC5FX,"An improvement to the SoA of the domain, more explanations and analysis welcomed","The paper proposes a restoration method based on deep reinforcement learning. It is the idea of trainable unfolding that motivates the use of Reinforcement learning, the restoration unit is a SoA U-Net. + +Remarks + +* The author seems to make strong assumptions on the nature of the noise and made no attempt to understand the nature of the learning beyond a limited set of qualitative example and PSNR. + +* Even if the experimental protocol has been taken from prior work, it would have been appreciated to make it explicit in the paper, especially as ICLR is not a conference of image processing. Indeed, It would have made the paper more self-sufficient. 
+ +* Second 2 describing the method is particularly hard to understand and would require more details. + +* In the experimental section, the authors claim that ""These results indicate that the restoration unit has the potential to generalize on unseen degradation levels when trained with good policies"". It would have been important to mention that such generalization capability seems to occur for the given noise type used in the experiments. I didn't see any explicit attempt to variate the shape of the noise to evaluate the generalization capability of the model. + +In conclusion, the paper proposes an interesting method of image denoising through state of the art deep learning model and reinforcement learning algorithm. The main difference with the SoA on the domain is the use of a diffusion dynamics. IMHO, the paper would need more analysis and details on the mentioned section. +",6,3.0,ICLR2019 +b26Ui1Epo6S,4,ioXEbG_Sf-a,ioXEbG_Sf-a,Good paper with a few points not clearly described,"This paper is on an experience replay approach, as applied to deep RL methods, that uses a density ratio between on-policy and off-policy experiences as the prioritization weights. The objective is to find appropriate bias-variance trade-offs for importance sampling from the replay buffer. In particular, there's the bias issue from replay experiences of other policies, and the variance issue from the recent on-policy experiences. + +It's not entirely clear to the reviewer about the necessity of maintaining two replay buffers (slow and fast). Instead of maintaining two replay buffers, it seems one can simply retain the standard single-buffer strategy, and evaluate how likely the experience is with respect to the current policy. Then the likelihood can be used as the weight for prioritized experience replay. This simple strategy also takes the bias-variance trade-offs coming from on-policy and off-policy experiences. The reviewer is curious about the advantage of the developed approach that uses two buffers over what's described above. + +In the paper, the slow buffer is considered for maintaining off-policy experience, and the fast buffer is for on-policy experience. Accordingly, the sizes of those two buffers are supposed to be very important parameters. For instance, if the two buffers have similar sizes, then the developed approach are expected to function like standard deep RL (on policy or off policy depending on the buffer size). However, Figure 2 (b) shows that the performance is not sensitive to the size of the fast (on-policy) buffer. The result is counter-intuitive, though it's clear that the results are supposed to show the insensitivity to such parameters. + +Experiments were conducted by combining the experience replay approach (called LFIW, likelihood-free importance weighting) with three existing deep actor-critic methods, and then comparing the combinations with their originals. The results look good, and demonstrate the effectiveness of the developed approach. It's unclear why the results were presented in tables instead of curves (minor point), which can be potentially more better for readability. ",6,4.0,ICLR2021 +Hyly32Fqhm,3,S1fcnoR9K7,S1fcnoR9K7,"Interesting idea, does not seem to work consistently, limited theoretical explanation"," +This work proposes an optimization method called All Learning Rate +At Once (Alrao) for hyper-parameter tuning in neural networks. 
+Instead of using a fixed learning rate, Alrao assigns the learning +rate for each neuron by randomly sampling from a log-uniform +distribution while training neural networks. The neurons with +proper learning rate will be well trained, which makes the whole +network eventually converge. The proposed method achieves +performance close to the SGD with well-tuned learning rate on the +experiments of image classification and text prediction. + + +#Pros: + +-- The use of randomly sampled learning rate for deep learning +models is novel and easy to implement. It can become a good +approximation of using SGD with the optimal learning rate. + +-- The paper is well-written and easy to follow. The proposed +method is illustrated in a clear way. + +-- The experiments are solid, and the performance on three +different architectures are shown for comparison. According to the +experiments, the proposed method is not sensitive to the +hyper-parameter \eta_{min} and \eta_{max}. + +#Cons: + +-- The authors have not given any theoretical convergence analysis +on the proposed method. + +-- Out of all four experiments, the proposed method only +outperforms Adam once, which does not look like strong support. + +-- Alrao achieves good performance with SGD, but not with Adam. +Also, there are no experimental results on Alrao with other +optimization methods. + +#Detailed comments: + +(1) I understand that Alrao will be more efficient compared to +applying SGD with different learning rate, but will it be more +efficient compared to Adam? No clear clarification or experimental +results have been shown in the paper. + +(2) The units with proper learning rate could learn well and +construct good subnetworks. I am wondering if the units with ""bad"" +(too small or too large) learning rate might give a bad influence +on the convergence or performance of the whole network. + +(3) The experimental setting is not clear, such as, how the input +normalized, how data augmentation is used in the training phase, +and what are the depth, width and other settings for all three +architectures. + +(4) The explanation on the influence of using random learning rate +in the final layer is not clear to me. + +(5) Several small comments regarding writing: + (a) Is the final classifier layer denoted as $C_{\theta^c}$ or $C_{\theta^{cl}}$ in the third paragraph of ""Definitions and notations""? + (b) In algorithm 1, what is the stop criteria for the do while? The ""Convergence ?"" in the while condition is confusing. + (c) Is the learning curve in Figure 2 from one run or is it the average of all runs? Are the results consistent for each run? How about the learning curves for VGG19 and LSTM, do they have similar learning curves with the two architectures in Figure 2? + (d) For Figure 3, it will be easier to compare the performance on the training and test set, if the color bars for the two figures share the same range. +",6,4.0,ICLR2019 +B99a7Ug_XLw,1,2AL06y9cDE-,2AL06y9cDE-,"This paper argues to proposed a closed-loop control in the robustness training to achieve good results on different types of attack. To me, this statement is ambitious to be called closed-loop control, but the overall structure is meaningful and interesting.","Strength: +1. This paper first introduced layer-wised projection from the poisoned data to the clean data. +2. The results show improvement of the robustness over the baseline on different types of attacks. + +Weakness: +1. The statement of the closed-loop control is a little bit ambitious. 
The overall methods are the layer-wise projection from the poisoned data to the clean data manifold. Normally, in the closed-loop, we will use the control signal $u$ to control the original data instead of the next layer data. For example, the final balance stage should be $u=g(x+f(x+u))=0$. So closed-loop should be at least multi-steps within one layer. For different layers, the closed-loop control will have different control signals, since the dimension/ distribution between layers is much different. So this method is only a one-step layer-wise projection. The $x$ between layers cannot be viewed as the same sample to be controlled. I would recommend the author to change the statement from closed-loop into the layer-wise projection for a better suit. Also, this method is still a feed-forward network, not a ""loop"" control. I do think this method is interested, just a little ambitious. This could be a useful extension of the resnet-based network, since the control $u$ can be viewed as a complicated version of residuals. +2. The experiment is weak for only comparing with one baseline. Also, can the author provide which baseline model that the author is comparing with? I cannot find it in the text. I would appreciate it. +3. It's unclear to me about the training of $\mathcal{E}(x)$, will this requires extra data to train? What's the running speed of this ""closed-loop"" method compared with others? + +Some tiny comments: +1. A trivial comparison would be to train an autoencoder for each layer of $x$ and only use the decoded results to pass through the network. This in principle learns the data manifold and provides the projection to this manifold. +2. Table 2, the dataset name is not aligned in the center. +3. Table 3 should be more self-explainable. It's a little confusing in the current form. +4. Have the author tried multi-steps in a single layer or constraint $x_t$ to be in the same space? + + + +----- post rebuttal ----- + +The authors addressed most of my concerns and the revision is better than before. +I would like to increase my score and would recommend an acceptance.",7,4.0,ICLR2021 +SJlM3Xs49r,2,Hyx5qhEYvH,Hyx5qhEYvH,Official Blind Review #1,"#### +A. Summarize what the paper claims to do/contribute. Be positive and generous. +#### +The paper translates the Leaky Integrate and Fire model of neural computation via spike trains into a discrete-time RNN core similar to LSTM. The architecture would be readily amenable to the modern deep learning toolkit if not for the non-differentiability of the hard decision to spike or not. The hard decision is made by thresholding. The paper adopts a simple approximation of backpropagating a ""gradient"" of 1.0 through the operation if the threshold is within a neighbourhood [thresh - a, thresh + a], and otherwise 0.0, so the system can be trained by backpropagation. + +The architecture is tested on a few ""neuromorphic"" video classification datasets including MNIST-DVS and CIFAR-DVS. Experiments are also run on a text summarization task. + +#### +B. Clearly state your decision (accept or reject) with one or two key reasons for this choice. +#### + +The reviewer thinks the paper should be rejected in its current state. + +The proposed architecture is a straightforward change to a standard LSTM core. Thus it should be compared head-to-head to LSTM on standard datasets for these models (e.g. 
classic synthetic tasks, language modeling, speech recognition, machine translation, etc) with everything else held constant (hidden size, learning rate, sequence length, etc etc). + +It also doesn't really carry over any of the benefits of Spiking Neural Nets even though it is inspired by Leaky Integrate and Fire because it operates in discrete time like a normal RNN, just with an extra binary output produced by spiking. It's unclear that a spiking inductive bias is actually useful, even though event-driven computation could in theory allow much less computation, the proposed method does not have that property. + +So the paper doesn't really provide evidence to back up their claim that the proposed model combines the complimentary advantages of Deep Learning and Spiking Neural Nets. + +#### +C. Provide supporting arguments for the reasons for the decision. + +While the proposed method is in-spirit inspired by the leaky integrate and fire model, it is operated/trained in discrete time which does not allow it to achieve the benefits of continuous time integrate-and-fire models which allow for less computation and time-discretization-invariance. + +The conversion of the spiking model to the deep learning framework is rather crude, as the differentiable approximation to the non-differentiable threshold operation is biased and not well-motivated either empirically, intuitively, or theoretically (i.e. there are no comparisons to alternative choices). + +There are new techniques for marrying continuous-time models and deep learning which seem more promising to investigate to this end (e.g. Neural ODE). + +So in summary, the method doesn't have the computational benefits of a biologically plausible spiking algorithms and is not well-tested against competing deep learning methods, making it hard to verify the motivation of pushing toward a performant yet biologically plausible algorithm. +#### + +#### +D. Provide additional feedback with the aim to improve the paper. Make it clear that these points are here to help, and not necessarily part of your decision assessment. +#### +There are many grammatical and word-choice mistakes which make the paper hard to read. + +Mainly, from a practical perspective, the paper would be much-improved by showing what benefit the spiking inductive bias confers over a standard LSTM on standard tasks in the deep learning community. + +The method/landscape should be developed and studied in further detail until claims can be made about combining the strengths of spiking and deep-learning models.",1,,ICLR2020 +IEIp0CXjtV,4,i3Ui1Csrqpm,i3Ui1Csrqpm,A simple yet impactful optimization approach using a suite of pruning methods. ,"The main goal of this paper is to introduce a simple methodology for optimizing transformer based models for efficiency and effectiveness. + +The paper introduces two main ideas: + +1)A top-down strategy for pruning components of a transformer model: Given a specific focus, say speed, the strategy is to consider pruning large coarse-grained components first followed by smaller finer-grained components. The pruning decision is made based on a “significance analysis” -- a component is considered significant for pruning if it from the model does not result in a substantial increase in the model’s loss (as decided by a pruning threshold). + +2) Pruning and approximating techniques for different components: For example feed-forward networks are pruned by removing weights in groups (determined via a hyperparameter). 
For approximating self-attention a sign-matching technique for deciding which top K keys to use for computing Query x Key dot products. + +The main strengths of this work are as follows: + +1) The techniques do not require training networks from scratch and can be applied directly during fine-tuning. + +2) The techniques are simple and should apply widely to most transformer-based models. + +3) The empirical results support the claim that the technique can yield significant speed-up and memory-reductions while maintaining accuracy and even provide improvements in accuracy if that is the pruning goal. They show that technique is orthogonal to other models explicitly designed for speed and memory footprint (Q8BERT, DistillBERT) and can provide further improvements in both efficiency and effectiveness. + +4) This is a practical and useful approach that should be widely applicable along with many useful insights about optimizing transformer-based systems. + +I appreciate that the experimental results are reported with averages across multiple runs! + +I don’t see any major weaknesses in the paper. Here are some areas that can be improved: + +1) The description of the pruning strategies was hard to follow and needed to be tightened up. Possibly adding equations and some pseudo-code to the description should help. + +2) I am curious to know what components get pruned cross the different models that were optimized. I wonder if there are systematic differences between original and distilled models and between auto-regressive (GPT) and auto-encoding style models. + +3) Also some level of ablation analysis on the strategies used will be helpful. For example if the elements were not ordered based on the granularity would the results be any different? Since this is an iterative strategy the order should play an important role in selection and utility of the subsequent pruning steps. Same goes for the set of pruning strategies. A related question would be what gives the biggest gains. + +4) What is the impact on the fine-tuning time? The baseline only requires one fine-tuning pass. Does this method require multiple fine-tuning passes? Or can the loss thresholds be computed on a smaller subset of the target data? This may be a good future work to look into for tasks where the training data is relatively large, where one cannot afford to exhaustively search through all the pruning strategies. +",6,4.0,ICLR2021 +tBjdNZSl8n,2,rvosiWfMoMR,rvosiWfMoMR,Promising directions but the study needs to be extended,"In the paper, the authors adapt CycleGAN, a well-known model for unpaired image-to-image translation, to automatic music arrangement by treating MFCCs extracted from audio recordings as images. Also, the authors propose a novel evaluation metric, which learns how to rate generated audio from the ratings of (some) music experts. The authors make use of two large-scale datasets to train and evaluate the model on two scenarios, namely 1) generating drum accompaniment a given bass line, 2) generating arrangement given a voice line. They report promising results on the first task; however, the model is not as successful on the second (more challenging) task. + +The problem is challenging, and meaningful solutions may bring innovative and creative solutions to music production. The literature is well-covered, with a few missing citations (see below). The approach is built upon existing work, and the experiments are conducted on two relevant, public datasets. 
On the other hand, the experimental code is not shared, and the dataset section lacks a few details to reproduce the findings easily. + +Below are the shortcomings of the paper: + +1. While adapting past music generation work for arrangement generation is not trivial, the authors could have still used variants of CycleGAN and other unpaired image-to-image translation models for comparison. +2. The sources are primarily limited to bass, drums, and vocals. I do not think the narrow scope is an issue on a paper focusing on an unexplored subject. On the contrary, the experiments could have more variety, e.g. drums2bass, bass&vocals2drums, and other combinations, so that we could examine which settings bring interesting and/or challenging outcomes in arrangement generation. +4. The evaluation and discussion could have more depth, e.g. inter-annotator agreement, the effect of source separation in the generated audio (separation errors, audible artifacts, ...) + +The paper is novel in its application and brings promising results. However, the authors should extend the experiments, compare relevant models against each other, and discuss the results more in detail. Therefore, I would strongly encourage the authors to build upon their existing work and re-submit the revised paper to ICLR or another conference such as ISMIR. + +Specific comments +================= + +- As mentioned above, the authors should have added more ""experimental settings."" At least they should have included ""generation of a bass line given the drums"" (reverse of bass2drums) because 1) it would have allowed the readers to contrast the performance with bass2drums, 2) the task would be closer to the real-world use case (drums are typically the first to be recorded in a session followed by bass). + +- The method works on music strictly with drums, bass and vocals, which is not mentioned until Section 3.4. This limitation/condition should be specified clearly and earlier in the Introduction and/or in Section 3.1. + +- ""Nevertheless, only raw audio representation can produce, at least in the long run, appealing results in view of music production for artistic and commercial purpose."" + + Even if we restrict ourselves to popular music, this argument is too ambitious if not misleading. Many artists (performers, composers, conductors, etc.) are not only well fledged but - by profession - required to appreciate music by reading sheet music. Countless programmable interfaces and software, which make use of symbolic/hybrid music representations but do not generate raw audio directly, have been used extensively as part of music performances and production in a both artistic and commercial setting. While audio - without any doubt - is the essence of music, we can never disregard other representations. + +- Citing the two papers below could improve the literature review: + + >Hawthorne, Stasyuk, Roberts, Simon, Huang, Dieleman, Elsen, Engel and Eck, ""Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset"", International Conference on Learning Representations, 2019. => similar to the authors' design decision, this paper uses a cheaper intermediate representation (music scores) for efficiency + + >Donahue et al. LakhNES: Improving multi-instrumental music generation with cross-domain pre-training => the paper involves mapping (""arranging"") the instrumentation in MIDI files to NES sound channels. + +- Please cite `FMA` and `MusDB18` datasets following the instructions in the respective online sources. 
+ +- Section 3.1. ""While showing nice properties,"" + + The authors only mention that Demucs solve audio source separation (for the data the authors use) and the algorithm is time equivariant. However, the text reads like the authors would like to state other properties as well. If there are others, they should be stated explicitly. + +- Section 3.2. + + The authors should mention and cite the library they have used to extract MFCCs. + +- Section 4.1 ""we chose to select only pop music and its sub-genres for a total of approximately 10,000 songs"" + + It would be beneficial to share IDs of the songs in the subset for reproducibility purposes. Also, the authors do not state whether they use the untrimmed or trimmed versions of the tracks in the FMA dataset, which is a crucial detail for model training as well as experimental reproducibility. + +- The authors should state: + + 1. number of songs used from the MusDB18 dataset (i.e. have they used both the train and test splits?) + 2. Total duration and number of samples in training, test and fine-tuning + +- In the test set, instead, we chose only a few samples for each song due to the relative uniformity of its content: in other word, we expect our model to perform in similar ways on different parts of the same song. + + I find this assumption a bit unrealistic. In what sense, is the content uniform across the song? Is it uniformity in mixing, structure, arrangement, melody, tempo, or rhythm? Even if the authors use trimmed audio excerpts for training/testing, these characteristics can vary substantially within seconds (even if they use trimmed tracks). + + The authors should clearly state how they define content uniformity, provide a more informed argument around this assumption and experimentally show that the assumption holds for the test set. + +- Section 4.2: ""the result is somehow subjective thus different people may end up giving different or biased ratings based on their personal taste"" + + The authors portrait subjectivity as unfavourable. However, - as a human construct - there are no objective, universal criteria for appreciating music. Likewise, the evaluation metric, which the authors are proposing, is based on the subjective responses from music experts. I think the justification needs rephrasing. + +- Section 4.3: In the paper, the authors do not state the cultural background or the genre(s) of the focus of the music experts. The inter-agreement between the experts are not presented either. Due to lack of information and the small number of subjects, it is difficult to assess whether the (trained) evaluation metric has positive/negative/desired biases based on the experience, knowledge, personal taste etc. of the experts. Therefore, the claim about the proposed ""metric correlating with human judgment"" is a bit weak. + +- What is the distribution of scores for bass and voice? + +- How much do the artifacts (due to imperfections in source separation) affect the judgements? + +Minor comments +============== + +- Introduction, Paragraph 1: ""allow artists and producers to easily manipulate recordings and create high quality songs directly from home."" + + The phrasing somewhat disregards the music studios. 
+ +- Page 2, top row: ""given a musical sample encoded in a two-dimensional time-frequency representation (known as Mel-spectrogram)"" + + It reads like all two-dimensional time-frequency representations are called ""Mel-spectrogram""s, instead of the authors using Mel-spectrograms, which is one type of two-dimensional time-frequency representations. + +- The text should explain the relevance of the selected experimental settings to the music production: e.g. drums and bass are usually the first ""sessions"" to be recorded; a demo typically consists of the melodic prototype/idea with minimal accompaniment, which is later arranged by many collaborators... + +- ""Figure 1 shows a Mel-spectrogram example, a visual representation of a spectrum, where the x axis represents time, the y axis represents the Mel bins of frequencies and the third gray tone axis represents the intensity of the sound measured in decibel (Briot et al., 2020)."" + + I do not understand what the authors mean by ""third gray tone axis."" Is it because the MFCCs are treated as a single channel image, hence ""gray""? If yes, it is better to state that the ""MFCCs are treated as a single channel image"" without resorting to image processing jargon. + +- ""Mel-frequency cepstral coefficients are the dominant features used in speech recognition, as well as in some music modeling tasks (Logan & Robinson, 2001)"" + + It may be better to introduce this sentence earlier in the paragraph. + +- Section 3.4: ""On the one hand, ... On the other hand"" + + It might be easier to read if the setting is enumerated for readability. + +- Section 4.1: ""To train and test our model We decide"" + + Lowercase ""We"" -> ""we"" + +- MusDB18 URL is broken + +- Section 4.3: ""Time: a rating from 1 to 10 of whether the produced drums and arrangements are on time the the bass and voice lines"" + + Double ""the the"" -> ""with the"" + ",4,3.0,ICLR2021 +B1gBElBTtr,2,BklmtJBKDB,BklmtJBKDB,Official Blind Review #1,"The work proposes a method to improve conditional VAE with a learnable prior distribution using normalizing flow. The authors also design two regularization methods for the CF-VAE to improve training stability and avoid posterior collapse. The paper is clearly motivated and easy to follow. Experiment results on MNIST, Stanford Drone and HighD datasets show the proposed that the model achieves better results than previous state-of-the-art models by significant margins. + +However, the reviewer has the following comments on improving the paper: + +The motivation of the conditional normalizing flow design could be made more clear. The posterior regularization originates from the problem that the log Jacobian term encourages contraction of the base distribution. The log Jacobian term would be zero and would not encourage the contraction of the base distribution if the normalizing flow was volume-preserving, like NICE (http://proceedings.mlr.press/v37/rezende15.pdf, https://arxiv.org/pdf/1410.8516.pdf), which could be to convert into a conditional normalizing flow. On the MNIST results, the CF-VAE model with the proposed conditional normalizing flow even has worse performance than the affine flow model without the regularization. Therefore, clarifying the motivation behind this design choice is important. + +The work claims the two regularization methods are used to avoid a low-entropy prior and posterior collapse. But the claims are not fully substantiated in the experimental results. 
It would be better if the paper explicitly compared the CF-VAE models with and without the regularizations in terms of the entropy of the prior distribution and the KL divergence.",6,,ICLR2020
rkl524DAKH,3,HyxTJxrtvr,HyxTJxrtvr,Official Blind Review #1,"This paper presents a method for learning a spatio-temporal embedding for video instance segmentation. With the spatio-temporal embedding loss, it is claimed to generate temporally consistent video instance segmentation. The authors show that the proposed method performs nicely on the tracking and segmentation tasks, even when there are occlusions.

Overall, this paper is well written. Section 3 clearly explains the loss functions. The main idea is not very complex, but it generally makes sense. The authors mention that scenes are assumed to be mostly rigid, and that appearance change is mostly due to camera motion. I would like to see more argument about this, as there are cases where this is obviously not true; for instance, humans change pose significantly. If we limit the discussion to some narrow domain, such as self-driving, this might be more valid, but we would want to see some discussion of the validity of this assumption.

Some modules are not fully explained. For example, what is the background mask network? Which model was used, and how was it trained?

In the experiments, the proposed method shows nice scores on MOTSA and sMOTSA, but on all other metrics it is on the worse side. The authors are encouraged to discuss the other metrics and the corresponding experimental results as well. Other than these points, the experiments were well designed and conducted.",6,,ICLR2020
Vyzy7fCTQe,4,Vfs_2RnOD0H,Vfs_2RnOD0H,Interesting work,"Summary: This paper proposes a simple yet effective greedy algorithm with a new heuristic for checkpointing deep learning models, so that large models can be trained under restricted GPU memory budgets. The proposed method operates in an online setting and does not need static analysis of the computation graph, so it can be used for both static and dynamic models. In a restricted setting of a linear forward network with equal space and time cost for each node, the author proves that the proposed method reaches the same bound on tensor operations and memory budget as previous static checkpointing methods. The author also establishes a theorem relating the number of tensor operations of the proposed dynamic method to that of an optimal static checkpointing algorithm. In the experiments, the author compared the proposed method with static techniques, including the optimal Checkmate tool of Jain et al. (2020), showing that the proposed method gives competitive performance without prior static model analysis. The author also compared the proposed heuristic with prior art on several static and dynamic models. Finally, the author described a prototype PyTorch implementation of the proposed method.

Pros:
1. Although it applies to a limited setting, the theoretical analysis of the tensor-operation and memory-budget bounds of the proposed method, as well as of the relationship between the proposed method and an optimal static checkpointing algorithm, is novel and interesting. The experiments also show the competitiveness of the proposed method compared to static methods.
2. The author does a great job explaining the idea, concepts, procedures and experiments.

Cons:
1. The author compares the proposed heuristic with others under the same greedy algorithm, but there seems to be no full comparison to other dynamic checkpointing approaches (e.g. 
Peng et al. (2020) ). Although the experiments in the paper show competitive results with static models, the main use cases of the proposed method might still be dynamic models, since in typical static-model use cases the time overhead of static analysis is negligible compared to the actual model training time.
2. As the proposed heuristic bears some similarity to the one used in Peng et al. (2020), it would be more convincing to also have an ablation study that replaces the heuristic used in Peng et al. (2020) with the proposed one.

References:

[1] Jain, Paras, et al. ""Checkmate: Breaking the memory wall with optimal tensor rematerialization."" Proceedings of Machine Learning and Systems 2 (2020): 497-511.

[2] Peng, Xuan, et al. ""Capuchin: Tensor-based GPU Memory Management for Deep Learning."" Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 2020.
",6,3.0,ICLR2021
YzK1EGqm6t,1,dYeAHXnpWJ4,dYeAHXnpWJ4,Needs to be reworked to have a clear message and support evidence,"This paper investigates the utility of interpreting deep classification models using the gradients of the output logits w.r.t. the inputs, a common practice that is also potentially misleading.

The hypothesis stated for the paper is ""input-gradients are highly structured because this implicit density model is aligned with the 'ground truth' class-conditional data distribution?""

For the observation in section 2 (also, maybe number blocks like that, similar to lemmas and hypotheses, to make them easier to refer to): if g is 0 then this is trivially true and not saying anything. I think I know what you want to say, but this formalism is not adding any clarity. I suggest adding some constraints to g, or simply calling it a variable that takes on specific values. It is not clear why it needs to be conditioned on x either.

3.3 Stabilized score matching: It seems multiple published methods did not help you prevent collapse of the Hessian trace, but your heuristic did. Is this a common trick or a novel contribution? It would be nice to indicate it one way or another, if it solves a real problem that previous score-matching approaches fail to solve. Furthermore, it would be important to know the sensitivity of your approach to the choice of this hyperparameter.

The importance of section 4.1 is not clear to me… it appears that the authors believe activity maximization is a biased explainability measure and therefore should not be used if one accepts their framework. Their intention for this paragraph should be more explicit.

Section 4.2 draws a tight parallel between the pixel perturbation test and their density ratio test, demonstrating that the pixel perturbation test captures the sensitivity of the implicit density model and not the discriminative model. They therefore suggest this test always be done while tracking changes in classification output, which is a nice takeaway.

In section 4.3, the authors draw a parallel between their score-matching objective and adversarial training, although they state that "score matching imposes a stronger constraint". I am not an expert on these topics, but I think these kinds of more speculative observations should be moved to the discussion in general.

In section 5, the authors introduce their experimental setup. They used a baseline ResNet-18 model with 78.01% accuracy and compared it with their regularized model, which only achieved 72.20% accuracy (a 5.8% drop). 
The authors weight the entire regularization term with a single lambda = 1e-3. This raises a few important questions.
First, the strengths of the stability regularizer and the score matching term should likely be decoupled to achieve maximum classification performance, which is crucial in practice. The difficulty of tuning these hyperparameters is also extremely important for understanding the utility of the authors' proposed approach. It would be good to see the results of a proper hyperparameter search over the weighting of the score-matching and stability regularization terms independently. Ideally, the observed performance drop can be reduced or eliminated.

Second, presumably, if the score matching loss itself hurts classification performance, then the intuitions built early in the paper cannot be correct: if the aligned density functions p_theta(X) and p_data(X) cannot arise from logits that produce the optimal classifier, then the saliency maps produced by this method cannot be used to diagnose model correctness (as the practitioners who utilize saliency maps would hope). Since the score matching loss is only accurate up to some constant, perhaps this is the source of the issue, but we cannot conclude one way or the other from the data provided.

Furthermore, few details are given for the training setup: how long was each model trained for, was early stopping employed, were multiple seeds evaluated, are we convinced that all models converged? These details are important for doing the relevant comparisons, and many of the later results are hard to interpret in the absence of hyperparameter tuning or these experimental details.

Anti-score matching is a great baseline experiment, but the use of the threshold seems arbitrary. This is doubly true because the lambda of the regularization is smaller (1e-4). Why this discrepancy? It makes it harder to compare the results. How does this model perform as that threshold is varied? If one does a hyperparameter search with a fixed threshold, one would find the best-performing anti-score-matched model, which would potentially be easier to interpret. Crucially, this approach seems to actually outperform the score-matching model and underperform the baseline, implying that either the lower lambda or the anti-score constraint improved the performance of the classifier, but we cannot know which.

A similar comment can be made about the gradient-norm model regarding hyperparameter searches.

The Density Ratio experiments are excellent, but the y-axis is hard to read.

For the Sample Quality experiments: why would the gradient-norm model outperform score matching? I am not convinced by the speculation given in the paper. It might be the case that the score-matched models had not converged, which again highlights the importance of improving, or at least fully describing, the experimental setup, as mentioned earlier.

Finally, the Gradient Visualization experiments are very unclear. Intuitively, it might be the case that a small portion of the image is enough to explain the class of the data, which is what saliency maps are typically used for. The gradient-norm approach and your score matching approach appear to perform almost identically, and it isn't clear how much better they perform in a practical sense. It would be nice to have a more convincing demonstration of examples where the model obviously classifies an image using the appropriate information under your method but not under the baseline. 
As it stands, your results appear to be largely due to the fact that the gradients are smoother when using score matching or gradient-norm regularization. + +In summary, I really like the approach and theory presented in the paper, and tackles an important issue with broad relevance to the field, but the experimental results as they stand are not sufficient to convince me that this approach works in practice. + +Typo Section 4: ""show how these can interpreted from"" +",5,4.0,ICLR2021 +h1NTfrC4XAB,5,TBIzh9b5eaz,TBIzh9b5eaz,Official Blind Review,"The submission investigates the risk averse objective in offline RL. Usually, the parametric uncertainty is the main source of worries in offline RL, and dealing with the stochastic uncertainty on top of it, in order to account for risk aversion, is very challenging. The authors expose clearly their method and algorithm, even though we sometimes would have liked a bit more argumentation on why this and why not that. Since no theoretical analysis is provided, the only validation is empirical. It is rather complete regarding both the settings and the domains. The results are quite impressive, in particular in the offline setting. For all these reasons, I recommend to accept the submission. + +Please answer/address these comments in the rebuttal/final version: +* 1- It is unclear whether d^\beta is defined as the empirical distribution in the batch, or the true distribution. I believe that it would have been helpful in general to formalize more the distributions definitions. +* 2- U(.) is not introduced. +* 3- First sentence of 3.2: It is frequent in offline RL to use stochastic policies in order to leverage the risk taken in the face of parametric uncertainty. It is therefore rather odd to state here that only deterministic policies are considered. Even more when we notice in 3.3 that the actual policy is stochastic, since b is sampled from the stochastic behavioral policy. +* 4- Please indicate the performance of the behavioural policy (average and cvar) in the experiments.",7,3.0,ICLR2021 +r1xtybOgir,3,rylmoxrFDH,rylmoxrFDH,Official Blind Review #5,"The paper provides an in-depth exploration of stochastic binary networks, continuous surrogates, and their training dynamics with some potentially actionable insights on how to initialize weights for best performance. This topic is relevant and the paper would have more impact if its structure, presentation and formalism could be improved. Overall it lacks clarity in the presentation of the results, the assumptions made are not always clearly stated and the split between previous work and original derivations should be improved. + +In particular in section 2.1, the author should state what exactly the mean-field approximation is and at which step it is required (e.g. independence is assumed to apply the CLT but this is not clearly stated). Section 3 should also clearly state the assumptions made. That section just follows the “background” section where different works treating different cases are mentioned and it is important to restate here which cases this paper specifically considers. Aside from making assumptions clearer, it would be helpful to highlight the specific contributions of the paper so we can easily see the distinctions between straightforward adaptations of previous work and new contributions. + +Specific questions: + +It might be worth double checking the equation between eq. (2) and eq. (3) , the boundary case (l=0) does not make sense to me, in particular what is S^0 ?. 
+ +What does the term hat{p}(x^l) mean in the left hand side of eq.(3)? + +In eq. (7) (8) why use the definition symbol := ? + +At the beginning of section 3.1, please indicate what “matcal(M)” precisely refers to. Using the term P(mathcal(M) = M_ij) does not make much sense if the intent is to use a continuous distribution for the means. + +Just after eq. (9), please explain what Xi_{c*} means. + +Small typo: Eq. (10) is introduced as “can be read from the vector equation 31”, what is eq. (31)? + +In section 5.2, why reducing the training set size to 25% of MNIST? +",6,,ICLR2020 +H1kAEtYlz,1,rJ5C67-C-,rJ5C67-C-,Computing node embeddings and hypernode embeddings for hypergraphs,"The paper studies different methods for defining hypergraph embeddings, i.e. defining vectorial representations of the set of hyperedges of a given hypergraph. It should be noted that the framework does not allow to compute a vectorial representation of a set of nodes not already given as an hyperedge. A set of methods is presented : the first one is based on an auto-encoder technique ; the second one is based on tensor decomposition ; the third one derives from sentence embedding methods. The fourth one extends over node embedding techniques and the last one use spectral methods. The two first methods use plainly the set structure of hyperedges. Experimental results are provided on semi-supervised regression tasks. They show very similar performance for all methods and variants. Also run-times are compared and the results are expected. In conclusion, the paper gives an overview of methods for computing hypernode embeddings. This is interesting in its own. Nevertheless, as the target problem on hypergraphs is left unspecified, it is difficult to infer conclusions from the study. Therefore, I am not convinced that the paper should be published in ICLR'18. + +* typos +* Recent surveys on graph embeddings have been published in 2017 and should be cited as ""A comprehensive survey of graph embedding ..."" by Cai et al +* Preliminaries. The occurrence number R(g_i) are not modeled in the hypergraphs. A graph N_a is defined but not used in the paper. +* Section 3.1. the procedure for sampling hyperedges in the lattice shoud be given. At least, you should explain how it is made efficient when the number of nodes is large. +* Section 3.2. The method seems to be restricted to cases where the cardinality of hyperedges can take a small number of values. This is discussed in Section 3.6 but the discussion is not convincing enough. +* Section 3.3 The term Sen2vec is not common knowledge +* Section 3.3 The length of the sentences depends on the number of permutations of $k$ elements. How can you deal with large k ? +* Section 3.4 and Section 3.5. The methods proposed in these two sections should be related with previous works on hypergraph kernels. I.e. there should be mentions on the clique expansion and star expansion of hypergraphs. This leads to the question why graph embeddings methods on these expansions have not be considered in the paper. +* Section 4.1. Only hyperedeges of cardinality in [2,6] are considered. This seems a rather strong limitation and this hypothesis does not seem pertinent in many applications. +* Section 4. For online multi-player games, hypernode embeddings only allow to evaluate existing teams, i.e. already existing as hyperedges in the input hypergraph. One of the most important problem for multi-player games is team making where team evaluation should be made for all possible teams. +* Section 5. 
Seems redundant with the Introduction.",5,3.0,ICLR2018 +9gqOLUXduKo,1,Lvb2BKqL49a,Lvb2BKqL49a,Novel mutual information lower bound to regularized the drifting problem,"### EDIT: + + I thank the authors for their detailed response. I also appreciate the effort that's been put into refining the draft. Unfortunately, I'm still not very happy with the motivation of attacking the drifting phenomenon on MINE. The main reason for removing the drifting effect is for moving average of history outputs. However, there are various ways for tackling this ( as pointed out in my original reviews, like using a non-drifted mutual information estimator with moving average or plugin some robust density estimators). Yes, I agree with the author the drifting phenomenon is not the only problem that the proposed method solves. But actually, the stability of other MI estimators also allows them to avoids having exploding network outputs. + + Also, in practice, people don't usually run moving average on MI for representation learning. I encourage the author to explore the importance of moving average of MI estimators further. + +R3 suggests the author take some non-parametric estimators as baselines. But I think it's fine to only compare to some parametric(variational) methods on high dimension setting, where most non-parametric estimators fail. Nonetheless, it's always good to have additional experiments compared to some non-parametric methods in low dimension settings. + +Overall, I lean toward rejection given current concerns. + +Summary + +The paper introduces a generalized version of the mutual information neural estimation (MINE), termed regularized MINE (ReMINE). Some interesting experimental results are firstly provided on a synthetic dataset: the constant term in the statistical network is drifting after MI estimate converges. Also, the optimization of MINE will result in the bimodal distribution of the outputs, in which the statistical network has very distinct values for joint and non-joint samples. The paper presents a theoretical explanation for the drifting phenomenon. In light of this, the author's approach is to add a regularized term to prevent the drifting phenomenon. They impose a $L_2$ regularization on the logsumexp term to enforce the network to find a single solution. Further, the authors make use of the historical estimation for better performance. Empirically, the proposed regularized term works well along with the original MINE estimator and ReMINE has better performance in the continuous domain + +Contributions + +i) Proposal of a novel regularized MINE objective for solving the drifting phenomenon. The new objective successfully finds a single solution and exhibits a lower variance. + +ii) Provide interesting insights, such as drifting phenomenon and the instability due to small batch size, out of experiments. + +iii) Experimental validation of the proposed method for solving the drifting problem. Achieve better performance for mutual information estimation in the continuous domain. + +Issues: + +i) The drifting phenomenon of MINE is a feature but not a bug, since the drifting term has no effect on the final MINE estimated value. The motivation and benefits of solving drifting problems are unclear to me. + +ii) Is the drifting phenomenon of the statistical network ubiquitous among the density ratio estimators? For example, does it exist in the density ratio estimator in logistic regression or in JS dual lower bound? 
If not, we can directly plug these non-drifting density estimators into MINE, instead of using regularized MINE. Another apparent remedy is to make the output of the statistical network $T$ zero-meaned by subtracting the online sample mean of $T$.

iii) The proposed ReMINE is motivated by the drifting phenomenon. But it can also alleviate the exploding outputs / bimodal distribution of the outputs, since ReMINE explicitly imposes an $L_2$ constraint on its output. The connection between the exploding outputs / bimodal distribution of the outputs and ReMINE is weak in the paper.

iv) The paper states that MINE must have a batch size proportional to the exponential of the true MI to control the variance. The statement is wrong. Yes, the variance of some mutual information estimators, like NWJ, is proportional to the exponential of the true MI, as proved in [1]. However, the variance of MINE is not proportional to the exponential of the true MI in the finite-sample case (asymptotically, maybe), due to the log function.

Minor:
a) I wonder whether the SMILE estimator (cited in the paper) implicitly solves the drifting problem, since the optimal statistical network cannot drift freely in SMILE.

[1] A Theory of Usable Information Under Computational Constraints, Xu et al., ICLR 2020. ",5,4.0,ICLR2021
prTYPzai0EO,1,7pgFL2Dkyyy,7pgFL2Dkyyy,CLASS NORMALIZATION FOR ZERO-SHOT LEARNING,"Summary: This paper presents a theoretical analysis of how normalization in model training affects model performance and training time. It proposes two normalization tricks, a normalize-and-scale trick and an attribute-normalization trick, and applies them to the zero-shot image classification task. The paper also shows that the two normalization tricks alone are not enough for variance control in a deep architecture. To address this problem, a new initialization scheme is introduced. Apart from the theoretical analysis and the new initialization scheme, the paper extends the zero-shot learning approach to a continual learning framework. This new framework is called continual zero-shot learning (CZSL), and corresponding evaluation metrics are provided. The experiments for CZSL are performed on two datasets, CUB and SUN. The paper experimentally shows the effectiveness of the initialization, normalization, and scaling tricks.

Strong Points: 1- The paper is well organized and easy to follow.

2- The paper takes an interesting problem and develops a compelling investigation of how normalization affects performance. The theoretical justification for the normalization tricks sounds interesting and makes sense.

3- It introduces two new normalization techniques and shows, via the theoretical analysis, that normalization alone is not sufficient for proper model training; in addition to the normalization tricks, an initialization scheme is introduced for good model training.

4- Using the normalization tricks and the new initialization scheme significantly reduces model training time compared to previous approaches. Training speed results are presented for several baseline approaches.

5- Innovative attempt at introducing a new ZSL problem, with several evaluation metrics proposed for continual ZSL.


Weaknesses: 1- This paper presents a solid analysis of normalization, initialization, and the scaling trick for ZSL, and it also extends ZSL to continual learning. I appreciate the authors' effort on this solid analysis, but I would expect a newly proposed model from the authors to make the paper stronger. 
2- Missing comparison: I recommend including paper [a] in the comparison table for CZSL. Approach [a] is the first proposed baseline for continual zero-shot learning, so it must be included in the comparison table.
3- Some recent state-of-the-art approaches are missing from the comparison table for ZSL. Please compare with the models of [b] and [c].

4- Why is the aPY dataset not included in the experiments? Does the model not perform well on aPY?
5- Could these normalization and scaling tricks also be applied to other tasks such as object detection, action recognition, and image retrieval?
6- I am puzzled by the training times you reported. I understand you have used quite a small neural network. Still, to have a clear view and a fair comparison, you should have compared the timings against other initializations and the default normalization and scaling tricks as well.

[a]- Lifelong Zero-Shot Learning, by Kun Wei et al. IJCAI 2020.
[b]- Episode-Based Prototype Generating Network for Zero-Shot Learning, by Yu et al. CVPR 2020.
[c]- Meta-Learning for Generalized Zero-Shot Learning, by Verma et al. AAAI 2020.

Rating Reason: This paper includes a well-detailed analysis and mathematical formulations for normalization and initialization in model training. But it does not propose a novel model, which limits the novelty of the work, and the CZSL formulation has already been explored in [a].",7,5.0,ICLR2021
D_I0_i7WbcD,3,JyDnXkeJpjU,JyDnXkeJpjU,"Overall a sound algorithm, but framed/motivated in a strange way and hardly supported by relevant/realistic experiments","=== Summary ===

The paper proposes a meta-learning method based on a notion of task similarity/dissimilarity. In particular, the paper motivates its proposed method TANML through a generalization of Meta-SGD wherein the learnable parameter-wise learning rate in the inner update of Meta-SGD is replaced by a quadratic pre-conditioner matrix.

The proposed method TANML closely resembles gradient-based meta-learners in the outer update, but replaces the inner update by the matrix-vector product of a kernel regression coefficient matrix and a task similarity vector based on a kernel function. In that, the kernel function effectively quantifies the similarity of the loss gradients of the different tasks, evaluated at a learnable parameter initialization. Overall, the coefficient matrix can be understood as a look-up matrix in which each row holds the learned parameter vector for one meta-training task; the final adapted parameters are a linear combination of these parameter vectors, weighted by the kernel between the current task and the meta-train tasks.

In two simple simulated experiments, the paper demonstrates that TANML is able to outperform MAML and Meta-SGD when the meta-train tasks are set up in a pathological way (e.g. by combining two dissimilar clusters of tasks or by adding outlier tasks).

=== Reviewer's main argument ===

Overall, the idea of incorporating a notion of task similarity into the meta-learner, and the particular proposal to use the kernel between the task loss gradients to quantify such similarity, is sound and is a valuable contribution in itself.

Unfortunately, the relationship between Generalized Meta-SGD and TANML is unclear. Usually the connection between linear regression (c.f. Eq. 1 in the paper) and kernel regression (Eq. 2) is established through the particular form of the kernel regression coefficients. 
However, since the coefficient matrix is (meta-)learned in the paper, it is unclear how TANML relates to Meta-SGD. In fact, TANML seems more like a learned linear combination of task parameters which does not resemble much commonalities with MAML. Overall, the connection to MAML seems a bit set-up/artificial. Discussing the particular relationship between MAML/Meta-SGD and TANML would improve the storyline of the paper. For instance, if TANML is a generalization of MAML, it would be good to state with which particular choices of $\theta_0$ and $\Psi$, we can recover MAML. + +The related work section is quite minimalistic. For instance, discussing how TANML is different from e.g. multi-task nonparametric methods (e.g. [1-2]) that also use a kernel between tasks, would better clarify how TANML relates to previous work. + +The numerical experiments are very simple / limited and designed in a pathological way. Thus, it is not surprising that MAML/Meta-SGD perform worse than TANML. How applicable the experimental results are in more realistic meta-learning setups is unclear. Despite the simplicity of the experiments, there is not enough information to properly reproduce the experiment. For instance, how are A and $\omega$ in experiment 2 sampled, how are the x in experiment 1 sampled and how many data points per task are used in experiment 1? The following would strengthen the experiment section: +- A real-world use case in which we expect to see a meta-training set with e.g. outliers similar to experiment 2 +- Experiments with real world meta-learning datasets. For real-world & small-scale meta-learning environments for regression, see e.g. [3]. +- An additional meta-learning setup without outliers / clusters of meta-learning tasks. This way one can assess how the proposed method compares to MAML/Meta-SGD in standard setting +- Adding missing details, e.g. to the appendix, which are necessary for reproducing the experiment. + +=== Overall assessment === + +I vote for rejecting the paper. In the current state, the storyline from MAML to TANML provides little value to me as a reader. The proposed algorithm resembles a classical kernel-weighted linear combination of parameters and the pathological toy experiments provide little value for assessing the actual usefulness of TANML in realistic meta-learning scenarios. However, using the kernel between the task loss gradients as a similarity metrics of task is a nice idea and is a valuable contribution. I highly encourage the authors to further improve the paper. Overall, TANML has scientific merit - when introduced with a convincing storyline and properly supported by realistic experiments and relevant baseline comparisons, this would be a clear accept. + +=== Minor remarks === + +- Section 2: Eq. 1: move the comma. It should be $[\theta_0^\top, \nabla_{\theta_0} \mathcal{L}$ … +- Section3: Either the $\Psi$ should be a $T \times D$ matrix, or there should be no transpose in Eq. 2 +- Section 3 Eq 2: The kernel in the sum should probably be between i and i’, not between i and i. +- Section 4.1, 2nd paragraph: “... could be ascribed [to] its linear nature …” + + +[1] Bonilla, Edwin V., Kian M. Chai, and Christopher Williams. ""Multi-task Gaussian process prediction."" Advances in neural information processing systems. 2008. + +[2] Micchelli, Charles A., and Massimiliano Pontil. ""Kernels for Multi--task Learning."" Advances in neural information processing systems. 2005. + +[3] Rothfuss, Jonas, Vincent Fortuin, and Andreas Krause. 
""PACOH: Bayes-Optimal Meta-Learning with PAC-Guarantees."" arXiv preprint arXiv:2002.05551 (2020). +",4,4.0,ICLR2021 +A7nTZc9o3jf,2,HO80-Z4l0M,HO80-Z4l0M,The paper presents a method for robustly handling long-tailed learning and demonstrated the impact on image classification. ,"Significance: +This article is a useful contribution to transfer learning for tasks where there is not enough data available, showing a modest improvement over the other methods that employ transfer learning in the classifier space. + +Novelty: +The main contribution of this paper is the improvement of weak classifiers when there is not enough data for a class by combining the weak classifiers with the most relevant strong classifiers. This method finds k closest strong classifiers to the weak classifier and then combines the weak classifier with existing classifiers without creating new classifiers or networks from scratch. + +Potential Impact: +The approach presented in this paper is well-evaluated in computer vision, but potentially useful in many other settings. + +Technical Quality: +The technical content of the paper appears to be correct. + +Presentation/Clarity : +The paper is generally well-written and structured clearly. While this method is a clear winner on Few classes, it is not performing as well in Medium classes, as shown in Table 1. An explanation about this issue could strengthen the paper. + +Reproducibility: +The paper describes all the algorithms in full detail and provides enough information for an expert reader to reproduce its results. I would suggest the authors release their code on GitHub or other sites to help other researchers reproduce their results.",8,4.0,ICLR2021 +SGqi9o6sY4P,3,MDsQkFP1Aw,MDsQkFP1Aw,Official Blind Review #1,"This paper proposed an unsupervised method for open-domain, audio-visual separation system. The proposed model was optimized using the newly suggested mixture invariant training (MixIT) together with a cross entropy loss function. The authors suggest to separately process the audio and video, and next align them with a spatiotemporal attention module. + +Unsupervised source separation, especially for open-domain is an interesting and important research direction. However, there are several concerns with this submission that need to be addressed first. + +My main concern is the contribution of this paper. The authors presented a fairly complicated system comprised of several modules. I would expect the authors to run an ablation study / analysis to better understand their contribution to the final model performance. For instance, why do we need attentional pooling? do we need it in both audio and video? When does the model fail? can we learn something from it? + +Second, I know the authors said this is the first system to do so, however, I would still expect the authors to compare to some baseline. Maybe a fully supervised one? Otherwise it is hard to interpret the numbers presented in Table 1. It is hard to understand how good is this system and how much room for improvement do we have. + +Regarding the samples, it is a bit hard to interpret this results. For every file there are 5 videos and 5 separated samples, some of them sound almost identical. Again, there is no baseline to compare against, so it is hard to understand how good the quality of the separations is. I suggest the authors to improve the samples page to better present emphasis their results. 
A question for the authors: since you treated this as an unsupervised task, did you try to run some subjective evaluations? Maybe let users annotate the sound files and compare the variance?",6,4.0,ICLR2021
BJlqUqW3tH,1,Byg1v1HKDB,Byg1v1HKDB,Official Blind Review #3,"Summary: the paper proposes a dataset for abductive language inference and generation. The dataset is generated by humans, while the test set is adversarially selected using BERT. The paper experiments with popular deep learning models on the dataset and observes shortcomings of deep learning on this task.

Comments: overall, the problem of abductive inference and abductive generation in language is very interesting and important. The dataset seems valuable, and the paper is simple and well written.

Concerns: I find the claim about deep networks somewhat irresponsible.
1. The dataset is adversarially filtered using BERT and GPT, which gives deep learning models a huge disadvantage. After all, the paper says BERT scores 88% before the dataset is attacked.
2. The human score of 91.4% is based on a majority vote, so it should be compared with an ensemble of deep learning predictions. For a fair comparison, the authors should use the average human score.
3. The ground truth is selected by humans.

On a high level, the main difficulty of abduction is searching the exponentially large space of hypotheses. Formulating the abduction task as a (binary) classification problem is less interesting; the generative task is a better option.

Decision: despite the seemingly unfair comparison, this task is novel. I vote for weak accept.",6,,ICLR2020
u9PVPNUIH1u,4,r1j4zl5HsDj,r1j4zl5HsDj,Differentiable instance dependent learning,"The paper ""Learning to Actively Learn"" proposes a differentiable procedure for designing algorithms for adaptive data collection tasks. The framework is based on the idea of using a measure of problem complexity (each problem is parametrized by a parameter theta) to solve a min-max objective over policies. The rationale behind this objective is that the policy resulting from solving this min-max objective should be robust to the problem complexity. The algorithm then proceeds by sampling problem instances and using a differentiable objective to find a policy parametrized by a parameter psi, which could be the parameters of a neural network.

One potential drawback is that the authors assume the dynamics of the problem-instance rewards are known by the learner (for example, that they are Gaussian), which is necessary for computing policy gradients through the policy parametrization. A second drawback lies in the problem tessellation over the theta space. As written, the method does not seem to scale beyond very low-dimensional problem instances, since otherwise the value N would have to be exponential in the dimension, and therefore intractably large. The paper falls within the differentiable ""meta-learning"" for bandits literature, and it does a good job of placing itself within that literature. It also has a convincing experimental section. Other works in the area have not tackled the problem that the authors set themselves to solve: designing algorithms that can adaptively perform well depending on the instance they are fed.

I also find particularly interesting the use of the model complexity balls, which can be defined using other existing results in the literature, such as in the case of transductive linear bandits. 
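To make my reading of the setup explicit (the notation below is mine, so please treat it as an assumption about the formulation rather than a quotation of it), the objective I have in mind is roughly

$$\min_{\psi}\;\max_{\theta \in \Theta_k}\;\mathbb{E}\big[\,\ell(\pi_{\psi}, \theta)\,\big],$$

where $\Theta_k$ is the $k$-th complexity ball of problem instances, $\pi_{\psi}$ is the adaptive data-collection policy parametrized by psi, and $\ell$ is the loss (e.g. a simple-regret-style quantity) incurred by running $\pi_{\psi}$ on the instance parametrized by theta. If this schematic is off, that only reinforces the next point about spelling out the notation earlier in the paper.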
I would suggest adding more explanation of what r is earlier in the paper, as it is hard on the reader after a first read. Overall I think this is nice work. The paper itself is more applied than theoretical, but I think it is appropriate for ICLR. ",7,4.0,ICLR2021
H1epcjMThm,2,ByN7Yo05YX,ByN7Yo05YX,"limited experiments, doubts about the method","The paper is written rather well; however, I find the experiments incomplete and have some reservations about the method. My main points of critique are:

1. Combining DT & NN
I have doubts that by combining DT & NN in this way you get the ""best of both worlds"". In some ways your architecture also shares the disadvantages of both:

1.1 Interpretability
Because each node in the tree can be a neural network (with arbitrary complexity), this approach loses one central advantage of DTs, namely the interpretability of the result. Each node in the tree can perform arbitrarily complex (and hierarchical) computations. The authors only show one particular example (Fig. 2a) where the model has learned a reasonable structure.

1.2 Complexity:
The whole architecture is much more complex than either a neural network or a decision tree. I therefore expect that training it is not easy, and expert knowledge in either DTs or NNs may not be enough to use this model.


2. Limited experiments

2.1 The authors only consider 2 experiments from vision (MNIST & CIFAR-10) while proposing a universal method. To show universality, the authors should use data sets from different domains (e.g. UCI data sets).

2.2 The authors argue that a strength of the method is that it uses a low number of parameters on average for a forward pass (compared to the total parameter count). I don't find this argument convincing. In the limit this would imply a high degree of memorization of the data. Also, a similar case can be made for a standard CNN when a particular filter is mostly inactive for some data points.

2.3 The interpretability of DTs compared to NNs, as I mentioned earlier. To make the argument that their method learns the hierarchical structure of the data, the authors should have added experiments to support this, where such a hierarchical structure is clearly present and can be evaluated empirically.


--

In light of the extended experiments w.r.t. 2.1, I increased my score from 5 to 6. Overall, I still have doubts about the interpretability and complexity of the proposed method.

Complexity: ""but all the intuitions needed would come solely from training NN"". I disagree with this response. The architecture is a mix between a tree (hard, decision-tree-like error surface, non-local) and a neural network (smooth, mostly convex error surface). This implies that the training process and its behavior will exhibit the patterns and challenges of both approaches.

Interpretability: I think the method misses ""priors"" that enforce credit assignment. Partitioning the problem into sub-problems should be done via the tree components, whereas processing (such as image filtering) should be done in the network nodes. However, the method does not enforce or encourage this behavior, for instance via constraints (a rough sketch of the kind of constraint I mean is appended at the end of this review): nodes can also do partitioning (because neural networks can approximate decision trees) and edges can do processing (e.g. decision trees can be used for MNIST).

So I still believe this to be a borderline paper; however, the experiments support a more general applicability. 
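P.S. To make the ""constraints"" remark above concrete, here is a minimal sketch of the kind of regularizer I have in mind. This is my own illustration, not something from the paper, and the tensor shapes and weighting are hypothetical: a penalty that pushes the tree's routing decisions toward hard, interpretable splits, so that partitioning stays in the tree nodes and processing stays in the edge networks.

```python
import torch

def routing_entropy_penalty(routing_probs, eps=1e-8):
    """Mean binary entropy of the soft routing decisions.

    routing_probs: (batch, num_internal_nodes) tensor with values in (0, 1),
    one Bernoulli routing probability per internal tree node. Low entropy
    means near-binary, decision-tree-like splits."""
    p = routing_probs.clamp(eps, 1.0 - eps)
    entropy = -(p * torch.log(p) + (1.0 - p) * torch.log(1.0 - p))
    return entropy.mean()

# Hypothetical usage: add the penalty to the task loss so that routing nodes
# are discouraged from doing the "processing" themselves.
routing_probs = torch.sigmoid(torch.randn(32, 7))  # toy stand-in for a model's routers
task_loss = torch.tensor(0.0)                      # stand-in for the cross-entropy term
total_loss = task_loss + 0.1 * routing_entropy_penalty(routing_probs)
print(total_loss.item())
```

Whether an entropy penalty is the right choice is debatable, but some explicit mechanism of this kind would let the authors test their interpretability claim directly.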
+",6,4.0,ICLR2019 +wYQfuiwrdPP,3,BIwkgTsSp_8,BIwkgTsSp_8,Well written but lack of theoretical discussions and weak empirical studies. ,"In this paper, the authors present a generative-model-based Laplace mechanism. By training the VAE on some dataset, the trained encoder can be used to privatize raw data towards epsilon, delta-LDP. Though the method is novel, the privacy guarantee of the proposed method is not clearly stated and proved. Related experiments are not convincing, either. + +**Strength** +The paper is well written with a clear motivation, explanation of methodology. To my knowledge, I believe the work is useful for the privacy research community. The proposed method is also novel. + +**Weakness** +- The motivation to use the Laplace mechanism is not very clear. At the beginning of Sec. 2, the authors reason the usage by ""as it provides strong theoretical privacy guarantees"". This is not convincing for readers especially for those who are not familiar with LDP. Since the Laplace mechanism directly comes from the CDP, I would wonder how does the Gaussian mechanism works. How does the Laplace mechanism guarantee privacy better than the Gaussian mechanism? Reference or proof is essential here. + +- In page 3, the authors briefly mention that the local version of the Laplace mechanism can be epsilon-LDP if the sensitivity is accordingly defined. This really lacks rigorousness. In the following sections, the authors refer to (Dwork and Roth, 2014) for the post-processing theorem. Since the work (Dwork and Roth, 2014) is mainly about CDP, I am not sure how the post-processing theorem can be adopted for LDP. Either reference or clear proof is required. + +- Meanwhile, there lacks an end-to-end proof of the privacy guarantee of the VAE. I am not sure if the proposed VLM training guarantees privacy. Either, the privacy of encoding is not very clear. Especially, there involves a non-private training on stage 1. + +- The experiments are run with pretty week baselines. Through this paper, the authors actively use the same conclusion from CDP (Dwork and Roth, 2014). Thus, I suppose the state-of-the-art CDP algorithms should also be applicable to the experimented tasks, e.g, classification. For the specific task, how well is the proposed compared to the SOTA CDP private learning algorithms? For example, (Abadi, et al., 2016), or (Phan, et al. 2017). Especially, (Phan, et al. 2017) also proposed an adaptive Laplace mechanism without depending on pre-training of the mechanism. + +- In page 4, the DP-Adam mentioned in Stage 2 is not stated or proved in (Abadi et al., 2016). Only DP-SGD was discussed. A strict proof is required for the DP-Adam which intensively re-uses private results to help improve the gradients. Thus, the privacy guarantee is not straightforward. + +- Seems the VLM training is using a non-DP optimizer at stage 1. Then how the whole training could guarantee privacy on the VLM training set. In experiments, the VLM training set is directly extracted from the private dataset (MNIST). Even though the author experiments with diverse D_1 D_2 distribution for VLM train/test in Sec 4.2, the two datasets are still from the same dataset. In practice, when such a D_2 is private, it is hard to find a D_1 to be non-private. I am afraid this could cause serious privacy leakage. Therefore, I doubt if the experimental results are useful for proving the effectiveness of a private algorithm. More realistic dates should be used. + +- In Sec 4.1, the authors run the experiments in two steps. 
First, the VLM is trained with 'a DP encoder using D_1'. I am not clear how the DP encoder comes from. Does the VLM is also trained with DP? The setting has to be clarified. + +- The experiment comparison seems not fair for baselines. For VLM, there are two datasets for training VLM and encoding classification train data. However, the baseline only has classification training data. The VLM encoder has additional information about the data distribution or the noise (by back-propagation in VLM training). The unfairness in the information could be the core reason for the difference in performance. How does the baseline perform if it is pre-trained and tuned (hyper-parameters) on another dataset? + +(Phan, et al. 2017). Adaptive Laplace Mechanism: Differential Privacy Preservation in Deep Learning",3,4.0,ICLR2021 +By_sZWcgz,2,r1RF3ExCb,r1RF3ExCb,Incremental work with unclear contribution,"This paper offers an extension to density estimation networks that makes them better able to learn dependencies between covariates of a distribution. + +This work does not seem particularly original as applying transformations to input is done in most AR estimators. + +Unfortunately, it's not clear if the work is better than the state-of-the-art. Most results in the paper are comparisons of toy conditional models. The paper does not compare to work for example from Papamakarios et al. on the same datasets. The one Table that lists other work showed LAM and RAM to be comparable. Many of the experiments are on synthetic results, and the paper would have benefited from concentrating on more real-world datasets.",5,2.0,ICLR2018 +SyGyMzqez,2,rkQkBnJAb,rkQkBnJAb,Design choices,"The paper presents a variant of GANs in which the distance measure between the generator's distribution and data distribution is a combination of two recently proposed metrics. In particular, a regularized Sinkhorn loss over a mini-batch is combined with Cramer distance ""between"" mini-batches. The transport cost (used by the Sinkhorn) is learned in an adversarial fashion. Experimental results on CIFAR dataset supports the usefulness of the method. + +The paper is well-written and experimental results are supportive (state-of-the-art ?) + +A major practical concern with the proposed method is the size of mini-batch. In the experiment, the size is increased to 8000 instances for stable training. To what extent is this a problem with large models? The paper does not investigate the effect of small batch-size on the stability of the method. Could you please comment on this? + +Another issue is the adversarial training of the transport cost. Could you please explain why this design choice cannot lead instability? +",6,2.0,ICLR2018 +BJx5sWN5hX,1,HJgXsjA5tQ,HJgXsjA5tQ,"Interesting experimental results, but less significant theoretical contribution","This paper presents a class of neural networks that does not have bad local valleys. The “no bad local valleys” implies that for any point on the loss surface there exists a continuous path starting from it, on which the loss doesn’t increase and gets arbitrarily smaller and close to zero. The key idea is to add direct skip connections from hidden nodes (from any hidden layer) to the output. + +The good property of loss surface for networks with skip connections is impressive and the authors present interesting experimental results pointing out that +* adding skip connections doesn’t harm the generalization. 
+* adding skip connections sometimes enables training for networks with sigmoid activation functions, while the networks without skip connections fail to achieve reasonable performance. +* comparison of the generalization performance for the random sampling algorithm vs SGD and its connection to implicit bias is interesting. + +However, from a theoretical point of view, I would say the contribution of this work doesn’t seem to be very significant, for the following reasons: +* In the first place, figuring out “why existing models work” would be more meaningful than suggesting a new architecture which is on par with existing ones, unless one can show a significant performance improvement over the other ones. +* The proof of the main theorem (Thm 3.3) is not very interesting, nor develops novel proof techniques. It heavily relies on Lemma 3.2, which I think is the main technical contribution of this paper. Apart from its technicality in the proof, the statement of Lemma 3.2 is just as expected and gives me little surprise, because having more than N hidden nodes connected directly to the output looks morally “equivalent” to having a layer as wide as N, and it is known that in such settings (e.g. Nguyen & Hein 17’) it is easy to attain global minima. +* I also think that having more than N skip connections can be problematic if N is very large, for example N>10^6. Then the network requires at least 1M nodes to fall in this class of networks without bad local valleys. If it is possible to remove this N-hidden-node requirement, it will be much more impressive. + +Below, I’ll list specific comments/questions about the paper. +* Assumption 3.1.2 doesn’t make sense. Assumption 3.1.2 says “there exists N neurons satisfying…” and then the first bullet point says “for all j = 1, …, M”. Also, the statement “one of the following conditions” is unclear. Does it mean that we must have either “N satisfying the first bullet” or “N satisfying the second bullet”, or does it mean we can have N/2 satisfying the first and N/2 satisfying the second? +* The paper does not describe where the assumptions are used. They are never used in the proof of Theorem 3.3, are they? I believe that they are used in the proof of Lemma 3.2 in the appendix, but if you can sketch/mention how the assumptions come into play in the proofs, that will be more helpful in understanding the meaning of the assumptions. +* Are there any specific reasons for considering cross-entropy loss only? Lemma 3.2 looks general, so this result seems to be applicable to other losses. I wonder if there is any difficulty with different losses. +* Are hidden nodes with skip connections connected to ALL m output nodes or just some of the output nodes? I think it’s implicitly assumed in the proof that they are connected to all output nodes, but in this case Figure 2 is a bit misleading because there are hidden nodes with skip connections to only one of the output nodes. +* For the experiments, how did you deal with pooling layers in the VGG and DenseNet architectures? Does max-pooling satisfy the assumptions? Or the experimental setting doesn’t necessarily satisfy the assumptions? +* Can you show the “improvement” of loss surface by adding skip connections? Maybe coming up with a toy dataset and network WITH bad local valleys will be sufficient, because after adding N skip connections the network will be free of bad local valleys. + +Minor points +* In the Assumption 3.1.3, the $N$ in $r \neq s \in N$ means $[N]$? 
+* In the introduction, there is a sentence “potentially has many local minima, even for simple models like deep linear networks (Kawaguchi, 2016),” which is not true. Deep linear networks have only global minima and saddle points, even for general differentiable convex losses (Laurent & von Brecht 18’ and Yun et al. 18’). +* Assumption 3.1.3 looked a bit confusing to me at first glance. You might want to add some clarification such as “for example, in the fully connected network case, this means that all data points are distinct.”",6,5.0,ICLR2019 +r1xpvEcpKH,2,BJxwPJHFwS,BJxwPJHFwS,Official Blind Review #3,"Summary: +This paper builds upon the CROWN framework (Zhang et al 2018) to provide robustness verification for transformers. The CROWN framework is based upon the idea of propagating linear bounds and has been applied to architectures like MLP, CNNs and RNNs. However, in Transformers, the presence of cross-nonlinearities and cross-position dependencies makes the backward propagation of bounds in CROWN computationally intensive. A major contribution of this paper is to use forward propagation of bounds in self attention layers along with the usual back-propagation of bounds in all other layers. The proposed method provides overall reduction in computational complexity by a factor of O(n). Although the fully forward propagation leads to loose bounds, the mixed approach (forward-backward) presented in this work provides bounds which are as tight as fully backward method. + + +Strength: +Use of forward propagation to reduce computational complexity is non-trivial +Strong results on two text classification datasets: +Lower bounds obtained are significantly tighter than IBP +The proposed method is an order of magnitude faster than fully backward propagation, while still maintaining the bounds tight. + +Weakness: +Experiments only cover the task of text classification. Experiments on other tasks utilizing transformers would have made the results stronger. +The paper makes the simplifying assumption that only a single position of an input sentence will be perturbed. They claim that generalizing to multiple positions is easy in their setup but that is not supported. The paper needs to declare this assumption early on in the paper (abstract and intro). As far as I could tell, even during the experiments they perturb a single word at a time. + +The paper is more technical than insightful. I am not at all convinced from a practitioners viewpoint that such bounds are useful. However, given the hotness of this topic, someone or the other got to work out the details. If the math is correct, then this paper can be it. + +The presentation requires improvement. Some parts, example, the Discussion section cannot be understood. + + +Questions: +The set-up in the paper assumes only one position in the input sequence is perturbed for simplicity. Does the analysis remain the same when multiple positions are perturbed? + +Suggestions: +A diagram to describe the forward and backward process would significantly improve the understanding of the reader. + +In Table 3, I am surprised that none of the sentiment bearing words were selected as the top-word by any of the methods. Among the ‘best’ words, the words chosen by their method does not seem better than those selected by the grad method. 
+Several typos in the paper: spelling mistakes in “obtain a safty guarantee”, poor sentence construction in “and independent on embedding distributions”, subject verb disagreement in “Upper bounds are discrete and rely on the distribution of words in the embedding space and thus it cannot well verify”. + +I have not verified the math to see if they indeed compute a lower bound. +",6,,ICLR2020 +rkKt2t2xz,3,ryvxcPeAb,ryvxcPeAb,Interesting study of the most intriguing but lesser studied aspect of adversarial examples.,"The problem of exploring the cross-model (and cross-dataset) generalization of adversarial examples is relatively neglected topic. However the paper's list of related work on that toopic is a bit lacking as in section 3.1 it omits referencing the ""Explaining and Harnessing..."" paper by Goodfellow et al., which presented the first convincing attempt at explaining cross-model generalization of the examples. + +However this paper seems to extend the explanation by a more principled study of the cross-model generalization. Again Section 3.1. presents a hypothesis on splitting the space of adversarial perturbations into two sub-manifolds. However this hypothesis seems as a tautology as the splitting is engineered in a way to formally describe the informal statement. Anyways, the paper introduces a useful terminology to aid analysis and engineer examples with improved generalization across models. + +In the same vain, Section 3.2 presents another hypothesis, but is claimed as fact. It claims that the model-dependent component of adversarial examples is dominated by images with high-frequency noise. This is a relatively unfounded statement, not backed up by any qualitative or quantitative evidence. + +Motivated by the observation that most newly generated adversarial examples are perturbations by a high frequency noise and that noise is often model-specific (which is not measured or studied sufficiently in the paper), the paper suggests adding a noise term to the FGS and IGSM methods and give extensive experimental evidence on a variety of models on ImageNet demonstrating that the transferability of the newly generated examples is improved. + +I am on the fence with this paper. It certainly studies an important, somewhat neglected aspect of adversarial examples, but mostly speculatively and the experimental results study the resulting algorithm rather than trying trying the verify the hypotheses on which those algorithms are based upon. + +On the plus side the paper presents very strong practical evidence that the transferability of the examples can be enhanced by such a simple methodology significantly. + +I think the paper would be much more compelling (are should be accepted) if it contained a more disciplined study on the hypotheses on which the methodology is based upon.",5,4.0,ICLR2018 +B1x_qtQNl,2,r1fYuytex,r1fYuytex,,"The paper proposes a sparsely connected network and an efficient hardware architecture that can save up to 90% of memory compared to the conventional implementations of fully connected neural networks. +The paper removes some of the connections in the fully connected layers and shows performance and computational efficiency increase in networks on three different datasets. It is also a good addition that the authors combine their method with binary and ternary connect studies and show further improvements. 
+The paper was hard for me to understand because of this misleading statement: In this paper, we propose sparsely-connected networks by reducing the number of connections of fully-connected networks using linear-feedback shift registers (LFSRs). It led me to think that LFSRs reduced the connections by keeping some of the information in the registers. However, LFSR is only used as a random binary generator. Any random generator could be used but LFSR is chosen for the convenience in VLSI implementation. +This explanation would be clearer to me: In this paper, we propose sparsely-connected networks by randomly removing some of the connections in fully-connected networks. Random connection masks are generated by LFSR, which is also used in the VLSI implementation to disable the connections. +Algorithm 1 is basically training a network with back-propogation where each layer has a binary mask that disables some of the connections. This explanation can be added to the text. +Using random connections is not a new idea in CNNs. It was used between CNN layers in a 1998 paper by Yann LeCun and others: http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf It was not used in fully connected layer before. The sparsity in fully connected layer decreases the computational burden but it is difficult to speed up. Also the author's VLSI implementation does not speed up the network inference. +How are the results of this work compared to Network in Network (NiN)? https://arxiv.org/abs/1312.4400 In NiN, the authors removed the fully connected layers completely and used a cheap pooling operation and also got improved performance. Are the results presented here better? It would be more convincing to see this method tested on ImageNet, which actually uses a big fully connected layer. + +Increased my rating from 5-6 after rebuttal. +",6,4.0,ICLR2017 +HklnD5Ke9r,2,HJepXaVYDr,HJepXaVYDr,Official Blind Review #3,"Summary: +The authors propose stochastic algorithms for AUC maximization using a deep neural network. Under the assumption that the underlying function satisfies the PL condition, they prove convergence rates of the proposed algorithms. The key insight is to use the equivalence between AUC maximization and some min-max function. Experiments results show the proposed algorithms works better than some baselines. + +Comments: +The technical contribution is to show stochastic optimization algorithms for some kind of min-max functions converge to the optimum under the PL condition. The proposed algorithms have better convergence rates than a naïve application of Rafique et al. The technical results rely on previous work on the PL condition and stochastic optimization of min-max functions. The techniques are not straightforward but not seem to be highly innovative, either. + +As a summary, non-trivial algorithms for AUC maximization with neural networks are presented, which could be useful in practice. + +Minor Comments: + +-How the validation data for tuning parameter are chosen in the experiments? This is absent in the descriptions for experiments. +",6,,ICLR2020 +rJxg-r9pFB,1,SyeyF0VtDr,SyeyF0VtDr,Official Blind Review #1,"This paper properly applied several technique from RNN and graph neural networks to model dynamically-evolving, multi-relational graph data. There are two key component: a RNN to encode temporal information from the past event sequences, and a neighborhood aggregator collects the information from the neighbor nodes. 
The contribution on RNN part is design the loss and parameterizes the tuple of the graph. The contribution of the second part was adapting Multi-Relational Aggregator to this network. The paper is well-written. Although I'm familiar with the dataset, the analysis and comparison seems thorough. + +I'm leaning to reject or give borderline for this paper because (1) This paper is more like an application paper. Although the two component is carefully designed, the are more like direct application. I'm not challenge this paper is not good for the target task. But from the point of view on Machine learning / deep learning, there is not much insight from it. The technical difficult was more from how to make existing technique to fit this new problem. This ""new"" problem seems more fit to data mining conference. (2) The experiments give tons of number but it lack of detailed analysis, like specific win/loss case of this model. As a more application-side paper, these concrete example can help the reader understand why this design outperform others. For example, it can show what the attention weights look like, and compare to the proposed aggregator. + +Some questions: +[This question is directly related to my decision] Does this the first paper to apply autoregressive to knowledge graph? from related work, the answer is no. Can the author clarify more on this sentence? + +""In contrast, our proposed method, RE-NET, augments a RNN with message passing procedure between entity neighborhood to encode temporal dependency between (concurrent) events (i.e., entity interactions), instead of using the RNN to memorize historical information about +the node representations."" + +The paper give complexity of this algorithm but no comments about how it compare with other method and how practical it is. + +It lacks of some details for the model: +(1) what is the RNN structure? +(2) For the aggregator, what is the detailed formulation of h_o^0? + ",3,,ICLR2020 +S1e4FUxTFB,1,rJecbgHtDH,rJecbgHtDH,Official Blind Review #1,"This paper proposes a new framework for defining Boolean algebra over the space of tasks in goal conditioned reinforcement learning and thereby achieving composition of tasks, defined by boolean operators, in zero-shot. The paper proves that with some assumptions made about a family of MDP’s, one can build Boolean algebra over the optimal Q-functions of the individual MDP and these Q-functions are equipped with all the mathematical operations that come with the Boolean algebra (e.g negation, conjunction). The paper verify their theoretical results by experiments in both the 4-room domain with standard Q-learning and in a simple video game domain with high-dimensional observation space and DQN. The proofs of all the theoretical results seem sound and the experiments support the theory. I enjoyed reading this paper as the paper is generally well written and the idea is quite neat. + +That being said, I have a few concerns and questions about the paper that I would like the authors to respond to so I am leaning towards rejecting this paper at the moment. However, I will raise my score if the revision addresses my concerns or provide additional empirical evidence. My concerns are the following: + + 1. My biggest concern is whether boolean algebra is the right abstraction/primitive for task level composition. Thus far, the most important application of boolean algebra has been in designing logic circuits where the individual components are quite simple. 
In the proposed framework, it seems that all of base tasks are required to be a well defined task which are already quite complex, so the utilities of composing them seems limited. For example, in the video game domain the author proposed, a very reasonable base task would be “collect white objects” -- this task when composed with the task “collect blue objects” is meaningless. This seems to be true for a large number of the MDP’s in the super-exponential composition. Furthermore, [1] also considers task level composition with sparse reward but I think these compositions cannot be expressed by boolean algebra. One of the most important appeal of RL is its generality so It would be great if the author can discuss the limitations of the proposed framework and provide an complex/real-world scenarios where composing these already complex base tasks are useful. Just writing would suffice as I understand setting up new environments can be difficult in short notice (Of course, actual experiments would be even better). + + 2. Does the maze not change in the environment setup? (It would be nice if source code is provided) If that is the case I would like to see additional experiments on different mazes (i.e. different placement of walls and objects). In my opinion, if there is only a single maze, then the only thing that changes is the location of the agent which makes the task pretty easy and do not show the full benefit of function approximators. I think it’d strengthen the results if the framework generalizes to multiple and possibly unseen mazes. + + 3. In the current formulation, a policy is discouraged to visit goals that are not in its current goal sets (receives lowest reward). While this could be just a proof artifact, it can have some performance implications. For example, in the 4 room domain, if I place a goal in the left corridor, then the agent in the bottom left room will need to take a longer route to reach top left (bottom left -> bottom right -> top right -> top left) instead of the shorter route (bottom left -> top left). From this perspective, it seems some non-trivial efforts need to be put into designing these ""basis"" tasks. I am curious about the discussion on this as well. + + 4. Haarnoja et al. 2018 and other works on composing Q values can be applied to high-dimensional continuous control using actor-critic style algorithms and relies on the maximum entropy principle. Can the method proposed in this paper be used with actor-critic style? Is the max-entropy principle applicable here as well? Discussion would be great and experiments would be even better. + +Out of all my concerns, 1 matters the most and I am willing to raise my score to weakly accept if it’s properly addressed. If, in addition, the authors could adequately address 2-4 I will raise my score to accept. + +======================================================================= +Minor comments that did not affect my decision: + - In definition 1, it would be nice to define r_min and r_max and g \ne s \in \mathcal{G} is also somewhat confusing. + - In definition 2, \pi_g is never defined + +Reference: +[1] Language as an Abstraction for Hierarchical Deep Reinforcement Learning, Jiang et al. 2019 +",3,,ICLR2020 +SylCPVKnFH,1,HylKvyHYwS,HylKvyHYwS,Official Blind Review #2,"This paper wants to study the problem of “learning with rejection under adversarial attacks”. It first naively extends the learning with rejection framework for handling adversarial examples. 
It then considers classical cost-sensitive learning, transferring the multi-class problem into binary classification problems through one-vs-all and using the proposed technique to reject predictions on non-important labels, naming this technique “learning with protection”. Finally, they present some experimental studies.

The paper does not show any connection between “learning with rejection” and “adversarial learning”. The method it proposes is also a naïve extension of existing methods. Neither the problem setting nor the technique is novel. The paper fails to realize that the motivating application is actually called “cost-sensitive learning” and has been studied for a long time. The paper also has problems in its writing. Finally, there is no comparison with any baseline; only empirical results of the proposed methods are shown. For all these reasons, there is still a long way to go before the paper can be published. I rate it a clear rejection.

More specifically,

The definition of “suspicious example” in Sec.3.1 has no relationship with adversarial examples. Does the paper focus on adversarial examples? If the definition has no relationship, it is classical learning with rejection.
In the last equation of Page 3, there is no definition of \tilde L. Actually, according to Figure 1, x’ is closer to the decision boundary, so it is an example that is harder to classify, which could also be “suspicious”.
In the definition of “suspicious example” at the beginning of Sec.3.1, are both x and x’ defined as suspicious examples in this way?
In the last equation of page 2, there is a rejection function, so minimizing this loss is a “separation-based approach”. However, at the end of Sec.2 the paper states they “follow a confidence-based approach”. Any comment on the inconsistency?

The motivating problem is not new. It is called cost-sensitive learning in machine learning and dates back to at least 2001:
Charles Elkan. The Foundations of Cost-Sensitive Learning. IJCAI 2001: 973-978.
There, the same problem is studied: misclassifying one class of data may cost much more than misclassifying another class. The current paper does not discuss any related work on cost-sensitive learning although it studies a problem in that field.

The writing should also be improved in the following aspects.
There are a lot of inaccurate statements in the paper. For example, “In Sections 3 and 4, we propose and describe our algorithm”: what is the difference between propose and describe? Or “an estimator \hat h might return result that differ greatly from h^* in a case with finite samples”: actually, there are rigorous theoretical results describing how the number of finite samples impacts the estimator \hat h on unseen data, for example,
Peter L. Bartlett, Shahar Mendelson. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. JMLR, 2002.
Such inaccurate or unclear statements that will mislead readers should be avoided.

The paper also lacks necessary references in many places, for example for “Learning with rejection is a classification scenario where the learner is given the option to reject an instance instead of predicting its label.”, “…classifies adversary attacks to two types of attacks, white-box attack and black-box attack.”, and “Methods for protecting against these adversarial examples are also being proposed.”. References are needed in these places.

The organization is also problematic. 
For example, in the second half of Sec.2 introducing two kinds of learning with rejection models, it should be included in a “related work” part. + +------------------------------------------ +Thank you for the rebuttal. I raised my score a little bit. But I still think this paper has not been ready to be published yet. +",3,,ICLR2020 +rJetvQrJcB,3,SJxUjlBtwB,SJxUjlBtwB,Official Blind Review #1,"- The authors proposed a novel method for cryo-EM reconstruction that extends naturally to modeling continuous generative factors of structural heterogeneity. To address intrinsic protein structural heterogeneity, they explicitly model the imaging operation to disentangle the orientation of the molecule by formulating decoder as a function of Cartesian coordinates. + +- The problem and the approach are well motivated. + +- This reviewer has the following comments: +1) VAE is known to generate blurred images. Thus, based on this approach, the reconstruction image may not be optimal with respect to the resolution which might be critical for cryo-EM reconstruction. What's your opinion? +2) What's the relationship between reconstructed performance, heterogeneity of the sample and dimensions of latent space? +3) It would be interesting to show any relationship, reconstruction error with respect to the number of discrete multiclass. +4) How is the proposed method generalizable?",6,,ICLR2020 +djwlPbXQcEQ,4,Cri3xz59ga,Cri3xz59ga,"The paper provides interesting theoretical insights in multi-task learning using common and specific parameters modeling framework and based on least-squares SVM. Especially, it is theoretically established that the standard MTL LS-SVM is biased. Thereon a method derived from the analysis is proposed to correct the bias and allows to achieve enhanced performances. Empirical evaluations highlight the effectiveness of the method.","Summary: +The paper provides interesting theoretical insights in multi-task learning using common and specific parameters modeling framework and based on least-squares SVM. Especially, it is theoretically established that the standard MTL LS-SVM is biased. Thereon a method derived from the analysis is proposed to correct the bias and allows to achieve enhanced performances. Empirical evaluations highlight the effectiveness of the method. + +Reasons for score: +Overall, I vote for accepting. The theoretical analysis highlight the intrinsic relation between task statistics/relatedness and the classification performances. The analysis helps to design an adequate MT models with improved performances. My major concern is about the clarity of the paper notations. Hopefully the authors can address my concern in the rebuttal period. + +Pros: +- The paper provides a asymptotical analysis of the decision function $g_i$ related to each task $i$ (learned using a linear MTL LS-SVM) by leveraging on random matrix theory and by assuming large scale $n$ and high dimension $p$ with limiting growth rate. The main result highlights the influence of the task data statistics and the MTL hyper-parameters on the decision function. Essentially the paper shows that the score provided by a task decision function $g_i(\mathbf{x})$ has a Gaussian distribution in the limit case, hence one can estimate its classification error. For me, the proposed derivation is of great interest in real applications. +- The derived statistical modeling of $g_i(\mathbf{x})$ allows to control the intercept of $g_i$ in order to minimize the classification error. 
The key to this error control is to appropriately assign the labels of each task samples according to the tasks relatedness and their data statistics which can be easily computed based on available training data. This leads to a practical and comprehensive MTL algorithm (that should be moved in the main paper). +- Experimental evaluations on synthetic and classical MTL datasets illustrate that the proposed method systematically ranks in the top two methods out of 5 compared algorithms. This makes the provided analysis convincing. + +Cons: +- The mathematical notations are dense and render the overall mathematical derivation hard to read. It might be valuable to expose the main concepts of the paper starting from a two-tasks MTL problem and then generalize to an arbitrary number of tasks. +- It might be useful to report the standard deviation along with the average empirical accuracies (Table 1 for instance) +- The analyzed framework relies on a binary MT classification problems. How the presented results transfer to the multi-class classification setting? +- Does the analysis change if instead of the LS-SVM one uses a logistic regression as a model? Also how the proposed approach lifts to non-linear models? + + +Other comments: +- Table 1 overpasses the page format. + +After rebuttal +- I read the response of the authors. The response addresses most of the concerns raised in the reviews.",7,3.0,ICLR2021 +MVl-mSVeuBm,4,BIwkgTsSp_8,BIwkgTsSp_8,"This work proposes an application-agnostic way to generate LDP representations of sensitive data or synthetic data that satisfies LDP. The proposed approach is effective for high-dimensional data. Downstream ML tasks can take these representations or synthetic data without worrying about privacy leakage, and achieve better accuracy than existing LDP solutions ","Strong point 1: The idea of putting noise insertion (via noisy data-generation models) and optimization of good representations together to obtain LDP representations and/or synthetic data seems to be effective. While (6) relies on some independency assumptions, it might be fine in most cases and empirical evidence is reported to support it + +Strong point 2: It is an application-agnostic approach and theoretically any downstream tasks and models can be supported... When there is a label, the privacy budget is split and random perturbation is used on labels + +Strong point 3: It outperforms naive LDP baselines (with noise added directly to features) a lot in experiments + +Weak point 1: The proof of the most important result is missing: It is said that ""sampling from $q_\phi(z|x)$ produces a representation $\tilde z$ of $x$ that satisfies $\epsilon$-LDP. I don't think it is a trivial result and the author needs to everything together (including the analysis of sensitivity, the optimization algorithm, and so on) to formally prove it + +Weak point 2: A minor issue: in figures of experiments, by ""clean accuracy"", do you actually mean ""accuracy"" (for some algorithms in the figures, it is privacy accuracy?) + +W1 is the main reason for the rating of 6 but not higher ones - highly encourage the authors to fix it before the publication ",6,5.0,ICLR2021 +Hk0nYO-El,2,HkIQH7qel,HkIQH7qel,Review,"This paper presents an architecture for answer extraction task and evaluates on the SQUAD dataset. The proposed model builds fixed length representations of all spans in the answer document based on recurrent neural network. It outperforms a few baselines in exact match and F1 on SQUAD. 
+ +It is unfortunate that the blind test results are not obtained yet due to the copyright issue. There are quite a few other systems/submissions on the SQUAD leader board that were available for comparison. + +Given that there's no result on the test set reported, the grid search for hyperparameters on the dev set directly is also a concern, even though the authors did cross validation experiments. +",6,5.0,ICLR2017 +5W6_ZHZO2iV,3,dx11_7vm5_r,dx11_7vm5_r,"Good paper, improves our understanding of OMWU, OGDA in zero-sum games","This paper studies the performance of optimistic multiplicative weights update (OMWU) and optimistic gradient descent (OGDA) in constrained zero-sum settings and provide linear convergence rate guarantees. For OMWU in bilinear games over the simplex, they show that when the equilibrium is unique, linear last-iterate convergence is achievable with a constant learning rate. In the case of projected OGDA algorithm, they introduce a sufficient condition under which it convergence fast with a constant learning learning rate. They show that bilinear games over any polytope satisfy this condition and OGDA converges exponentially fast even without the unique equilibrium assumption. + +This is overall a nice paper that extends and improves our understanding about optimistic versions of OMWU and OGDA especially in constrained bilinear zero-sum games. The paper does a good job at explaining technical improvements over prior results in the area and particularly the works by Daskalakis and Panageas and Hsieh et al. + +The experimental section could be slightly improved. For example in the case of OMWU it is seems hard to detect whether the error curve is best fit by an exponential even after the initial slower phase. It would be very interesting to see numerical estimation of the base of these exponents and see how close they match their theoretical bounds. Also the question about OMWU with a continuum of equilibria could be explored experimentally as well. Do experiments support fast convergence in this case? + +Overall, this is a nice paper and I recommend acceptance. + +Related references: +In terms of fast convergence in bilinear zero-sum games with fixed learning rates +[1] Proves that even with large fixed learning rates the average duality gap of alternating GDA in unconstrained bilinear zero-sum games converges to zero at a rate of O(1/t). [2] proves O(1/sqr{t}) convergence under arbitrarily large learning rates for a variant of GDA (Follow the regularized leader with Euclidean regularizer) in small constrained bilinear zero-sum games, despite divergence of the day-to-day behavior to the boundary. + +[3, 4] OMWU is shown to stabilize fast in bilinear constrained zero-sum games in a different sense by arguing exponentially fast shrinking of the volume of sets of initial conditions in the dual/payoff space. + + +[1] Bailey et al. Finite Regret and Cycles with Fixed StepSize via Alternating Gradient Descent-Ascent. COLT 2020. +[2] Bailey, Piliouras. Fast and Furious learning in zero-sum games: vanishing regret with non-vanishing step sizes. Advances in Neural Information Processing Systems. 2019. +[3] Cheung, Piliouras. Chaos, Extremism and Optimism: Volume Analysis of Learning in Games. arXiv preprint arXiv:2005.13996 (2020). +",7,4.0,ICLR2021 +H1efFRwPFr,1,rketraEtPr,rketraEtPr,Official Blind Review #2,"The authors aim at improving the accuracy of numerical solvers (e.g. 
for simulations of partial differential equations) by training a neural network on simulated reference data. The neural network is used to correct the numerical solver. For different tasks they set up an approximation scheme via minimizing a square loss plus a task specific regularization (e.g. volume preservation in the Navier-Stokes equation example). This is then trained in a supervised manner. They also explore an unsupervised version by back-progagating through a differentiable numberical solver. +The proposed method seems straight forward, but effective as the simulations seem to indicate.",6,,ICLR2020 +p_3u4RuwAw,2,fGF8qAqpXXG,fGF8qAqpXXG,An important study advancing our understanding of the optimization of neural networks,"Summary: This paper generalizes the results of Pilanci and Ergen (2020) showing that the non-convex optimization problem corresponding to the training of a one-hidden-layer muti-output ReLU neworks can be solved using convex programming. In particular they show that the problem has (I) a finite convex bidual that can be solved efficiently using some variant of the Frank-Wolfe algorithm and (II) a convex strong dual given by a copositive program. Unfortunately in the general case the complexity is exponential in the rank of the data matrix. A spike-free assumption is introduced that facilitates things and allows polynomial time algorithms, however this assumption is pretty strong and I doubt it is useful in practice as it makes the training 'almost' equivalent to training a linear classifier. Some references are missing as the problem is related to the low-rank matrix factorization problem. The notation in some critical parts of the paper is not clear and makes reading difficult. + +I would like to start saying that I am open to increasing my score (**Update: score updated after author feedback**) if the authors are able to clarify notation and add references/discussion, because I think that the essential contributions of this paper are important + +Pros: +1. Quality/Significance: The paper continues the vein of previous convex reformulations of the training of ReLU networks of Pilanci-Ergen (2020), extending the results of the single output case to multi-output which is arguably the interesting case in contemporary applications of Deep Learning. It opens the way to certified global optimality of solutions via convex programming of ReLU networks, and provides insight into the fundamental complexity of the optimization problem in nonconvex (factored) vs +convex form. + +2. Originality: There are few papers trying to understand the connections between the nonconvex formulation of the training of ReLU networks with convex formulations that provide certified optimality. For this reason I find the paper original although it builds upon previous work exploring this idea. Most papers deal with heuristic or ad-hoc arguments of convergence of GD/SGD in the non-convex formulation with many assumptions that might not hold in practice or that are cumbersome. + +3. Clarity: The ""text"" part of the paper is clearly written and the pace seems good. However there are crucial points where I could not understand the notation (see cons). + +Cons: +1. **This has been fixed in an updated version and is now not a problem** Clarity: I could not really understand the optimization problem (9) and (11) because the variable $i$ appears as an optimization variable $\min_{i \in [P]}$ but it does not appear in the optimization objective? 
In the objective, the letter $i$ is used only as an index in the summation $\sum_{i=1}^P$, which makes things really confusing. It appears that $i$ is not really an optimization variable, and perhaps what the authors meant is that the constraint in (9) and (11) should read $V_i \in K_i \forall i \in [P]$? At least this is what I would find most natural. This should be clarified and corrected if needed.

2. **This has been fixed in an updated version and is now not a problem** Clarity: It is not so clear how Algorithm 1 follows the Frank-Wolfe template. As I understand it, an initial value of $t$ is chosen and then the inner minimization problem is solved. Following this, a new value of $t$ is chosen, and so on. So in the inner minimization problem the constraint set (assuming my understanding in the previous point) is $\sum_{i=1}^P \|V_i\| \leq t$. Then step (a) looks like the LMO, but it is not clear where the constraint $\sum_{i=1}^P \|V_i\| \leq t$ is enforced. It looks like the FW update necessarily modifies only one $V_i$ and that the solution can be obtained from the LMO corresponding to only one constraint $\|V_i\| \leq 1$; is this the case? I think it is enforced through the constraint $\|u\|_2 \leq 1$. 
compared to SGD +do you think there is any benefit given that SGD seems to find good solutions in the overparametrized case? + +7. **Authors have answered this in the revision** Experiments: exp. 5.2. same as before: what happens with the misclassification error? what stops you from doing the computation on all the data? memory? +Does the red cross mean that the soft thresholded SVD is solved in less than one second? + +References: +1. Unifying Nuclear Norm and Bilinear Factorization Approaches for Low-Rank Matrix Decomposition +Ricardo Cabral, Fernando De La Torre, Joao P. Costeira, Alexandre Bernardino; Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013, pp. 2488-2495 +2. Geometry of Factored Nuclear Norm Regularization. Qiuwei Li, Zhihui Zhu, Gongguo Tang +3. Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization. Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo + ++ many references therein + +overall I would be happy to recommend acceptance if the previous issues were adressed in a succint way. + +**Update** After author feedback many of my concerns have been addressed, in particular the optimization problem +templates are much easier to understand now as well as the derivation of the algorithm. Also some important differences +between linear classifiers vs ReLU networks on 'Spike-free' data have been clarified. These were my main concerns and thus I am inclined to raise my score. ",7,5.0,ICLR2021 +#NAME?,1,paE8yL0aKHo,paE8yL0aKHo,Intuitvely reasonable and empirically beneficial exploration strategy,"**Contribution**: For better exploration, the authors propose to use curiosity to set state dependent target entropies with SAC, with the goal of inducing more diverse behavior at unfamiliar states. They use RND to provide a curiosity score, which after normalizing, is used to adjust the state dependent target entropy. Due to RND performing poorly as a curiosity measure with state-based representations instead of image based, introduce a variant X-RND which additionally uses a contrastive loss to improve the curiosity mechanism. They demonstrate benefits over regular SAC on standard Mujoco Gym benchmarks. + +**Prior work in exploration**: While increasing policy entropy in regions of low confidence makes intuitive sense, it is not actually clear simply being noisier in the face of uncertainty actually leads to good exploration (for example https://arxiv.org/abs/1306.0940 argues that these ""dithering"" style exploration is inefficient). It would be good to add additional comparisons to other methods for augmenting SAC to perform better exploration. For example, OAC https://arxiv.org/abs/1910.12807 learns upper confidence bounds on Q-values and uses the optimistic Q values for an exploration policy. It could be interesting to also explore how well CAT-SAC compares to just using the (X-)RND curiosity score directly as a bonus for exploration. In general, the related work is lacking in discussion of classical RL exploration methods like UCB-style bonuses and posterior sampling. + +**Conclusion**: Overall, I like the work. It appears technically sound, and presents a fairly simple and intuitive way to adjust exploration with SAC to better handle unfamiliar states (as far as any undirected exploration methods do at least). 
I would like to see a few more comparisons against other exploration techniques in deep RL, and perhaps some experiments ont asks outside of the standard gym benchmarks that focus more on the exploration problem itself rather than control (for example some sparse reward tasks). + +Fairly minor points: +LaTeX error: There are several instances of log in math mode that should use \log instead. +I would also like to see extended learning curves of CAT-SAC vs SAC, particularly until we see the performance of each method saturate. It would be interesting to see if the better exploration from CAT-SAC allows it to converge to higher performing policies as well as learning faster.",6,3.0,ICLR2021 +qKj2jEu4Ddh,2,1OCTOShAmqB,1OCTOShAmqB,Review,"This paper provides theoretical insight into the mechanisms by which a simplified attention model trained with gradient descent learns to allocate more mass to relevant words in the input. + +In a limited toy setting, the authors derive a closed form relationship between word-score and word embedding norm in a simplified, one layer attention model. The theoretical findings are verified empirically both on the toy task and on a more realistic sentiment classification benchmark. Due to the extreme simplicity of the setting considered, as well as the number of assumptions made, it is unclear to me what to make of these results. In particular, it seems that the setting considered (fixed query attention over bag of word embeddings) is very different from real use cases of attention. + +**Pros** +- The closed-form relationship between attention score and embedding norm during SGD training is novel as far as I know +- The theoretical results are well justified in experiments: in particular the predicted ""SEN"" relationship seems to match the prediction. + + +**Cons** +- Large number of assumptions, the validity of which is unclear in practice: in particular + 1. The assumption that the query vector is (1) a parameter and not a function of the inputs (as in self attention or cross-sentence attention) and (2) is fixed. I don't know of many ""real world"" attention networks that work this way, after all one of the main appeals of attention is its ""content-based"" nature + 2. Assumption 1 that the score and embeddings of non-topic words don't change during training. First, this seems like something that could be proven from the earlier assumption that the topic words are updated more frequently. And second it is unclear if it holds for a real task (and a different model where eg. the attention layer attends to higher layers rather than just the embeddings) +- Confusing notation makes the paper hard to follow (see remarks for examples) +- Unclear takeaway: what does this paper tell us about attention as it is used in practice? + + +**Remarks** +- 5.1: ""The “negative effect” cannot last long because the topic word embedding moves along the gradient much faster than the non-topic words due to its high occurrence rate"": This is true in the toy example in the paper, but is this the case in practice? For instance in sentiment classifications there are many words to describe sentiment that are infrequent (cue Zipf's law). Moreover, in realistic settings there will be non-topic words which appear very frequently (stop words such as ""the"", ""a"" in English). +- Lemma 1: while it is true that fixing q doesn't change the capacity of the model, it will definitely change its training dynamics (which is very much the theme of the paper as per the title). 
How important is it to fix q from this perspective? +- A lot of the math would be easier to read if the dependence of some variables (\hat v, Z,...) on a specific sentence was marked explicitly (eg. Z_S instead of Z) +- The notation in Lemma 2 was extremely confusing to me, due to the sudden introduction of the bracket notation and the awkward spacing with both equations on the same line. I would recommend at least putting both on separate line, and also reorganizing so that the LHS of the second equation is only ds_i/dt (move the mean to the other side) +- In 3. I think using ""\mathbf R"" for the dictionary is unfortunate (too similar to \mathbb R). Overall I found the separation between topic and non-topic words dictionaries confusing. Why not have a global Vocabulary V, a set of topic words T and refer to the remaining words as V\T? +- In 2. ""[Hahn and Brunner] show theoretically that the self-attention blocks are severely limited in their computational capabilities when the input sequence is long, which implies that self-attention may not provide the interpretability that one expects."": can you clarify this sentence? Limitation in computational capabilities does not seem to entail limited interpretability in general (see linear models for instance). +- Typo in citations in 2.: ""Hahn (Hahn, 2020) and Brunner (Brunner et al., 2019)"" -> ""Hahn (2020) and Brunner et al. (2019)"" +- Typo in 3. ""The training set [...] are"" -> ""The training set [...] is"" + +--- + +**Post Rebuttal** + +In my review, the main concerns were (1) validity of assumptions, (2) confusing writing/notation and (3) unclear takeaway. The rebuttal appropriately addressed (1), although I am looking forward to the revision to see how this is discussed in the paper itself. I cannot really say anything about any improvements on the writing (2) without seeing the revision, but I am confident that the authors can address most of the issues pointed out by myself and other reviewers. Regarding (3), unclear takeaway, after reading the authors' response as well as the other reviews, my concerns are somewhat assuaged (partly because the assumptions were addressed better), although I am still unsure how or if the results in this paper could be expanded to realistic attention models. + +There are additional issues I raised during the discussion (general lack of citations in particular), however this can be fixed fairly easily for the camera ready so I am willing to give the benefit of doubt and raise my score to 6 (borderline accept) + +",6,3.0,ICLR2021 +BkgHrnDiYB,1,S1lJv0VYDr,S1lJv0VYDr,Official Blind Review #1,"Review for ""Model Imitation for Model-Based Reinforcement Learning"". + +The paper proposes a type of model-based RL that relies on matching the distribution of (s,a,s') tuples rather than using supervised learning to learn an autoregressive model using supervised learning. + +I vote to reject the paper for three reasons. + +1. The motivation for matching distributions as opposed to learning the model the traditional autoregressive way is lacking. In particular, consider the table lookup case with discrete states and a single actions. Learning the model in this case corresponds to learning a stochastic matrix / Markov Chain. Call this chain P. Define a diagonal matrix whose diagonal contains the ergodic distribution of the chain \Xi. Your framework corresponds to learning a matrix \Xi P, while standard autoregressive models would just learn P. 
Knowing one gives information about the other - you can go from \Xi P to by normalizing the rows and go from P to \Xi P by computing the stationary distributions. On the other hand, you seem to claim in Figure 1 and in the introduction that your framework is qualitatively different from standard autoregressive models, but the above analysis suggests you are simply approximating a slightly different object, without much of an argument about why this is preferable. + +2. The theory section seems a bit underwhelming. In particular: +- Proposition 1 says that we will learn a perfect model given infinite data. That is true, but I am not sure how it helps motivate the paper. +- The presentation of Theorem 1 makes it unclear. In particular, in equation 1, you define R (the return) to depend on the transition model and the policy, but in Theorem 1, you seem to suggest that there is no dependence on the policy. + +3. In the experimental section, the Ant plot shows no learning for your method (MI). MI performs well when initialized and does not seem to learn anything (the curve is flat). Please justify why this happens. + +I will re-evaluate the paper if the above doubts are cleared up during the revision phase. + +Minor point: +Please have the paper proof-read. If you can't find help, please run it through an automated grammar checker. The current version has severe writing problems, which make the writing unclear. Examples: +""we analogize transition learning"" +""For deterministic transition, it (what?) is usually optimised with l2-based error"" +",6,,ICLR2020 +SJgxU3FO2X,2,HJl2Ns0qKX,HJl2Ns0qKX,Review," +Update: + +I’d like to thank the authors for their thoroughness in responding to the issues I raised. I will echo my fellow reviewers in saying that I would encourage the authors to submit to another venue, given the substantial modifications made to the original submission. + +The updated version provides a clearer context for the proposed approach (phychophysical experimentation) and avoids mischaracterizing GAIA as a generative model. + +Despite more emphasis being put on mentioning the existence of bidirectional variants of GANs, I still feel that the paper does not adequately address the following question: “What does GAIA offer that is not already achievable by models such as ALI, BiGAN, ALICE, and IAE, which equip GANs with an inference mechanism and can be used to perform interpolations between data points and produce sharp interpolates?” To be clear, I do think that the above models are inadequate for the paper’s intended use (because their reconstructions tend to be semantically similar but noticeably different perceptually), but I believe this is a question that is likely to be raised by many readers. + +To answer the authors’ questions: + +- Flow-based generative models such as RealNVP relate to gaussian latent spaces in that they learn to map from the data distribution to a simple base distribution (usually a Gaussian distribution) in a way that is invertible (and which makes the computation of the Jacobian’s determinant tractable). The base distribution can be seen as a Gaussian latent space which has the same dimensionality as the data space. 
+- Papers on building more flexible approximate posteriors in VAEs: in addition to the inverse autoregressive flow paper already cited in the submission, I would point the authors to Rezende and Mohamed’s “Variational Inference with Normalizing Flows”, Huang et al.’s “Neural Autoregressive Flows”, and van den Berg et al.’s “Sylvester Normalizing Flows for Variational Inference”. + +----- + +The paper title summarizes the main claim of the paper: ""adversarial training on latent space interpolations encourage[s] convex latent distributions"". A convex latent space is defined as a space in which a linear interpolation between latent codes obtained by encoding a pair of points from some data distribution yields latent codes whose decoding also belongs to the same data distribution. The authors argue that current leading approaches fall short of producing convex latent spaces while preserving the ""high-dimensional structure of the original distribution"". They propose a GAN-AE hybrid, called GAIA, which they claim addresses this issue. The proposed approach turns the GAN generator and discriminator into autoencoders, and the adversarial game is framed in terms of minimizing/maximizing the discriminator’s reconstruction error. In addition to that, interpolations between pairs of data points are computed in the generator’s latent space, and the interpolations are decoded and treated as generator samples. A regularization term is introduced to encourage distances between pairs of data points to be mirrored by their representation in the generator’s latent space. The proposed approach is evaluated through qualitative inspection of latent space interpolations, attribute manipulations, attribute vectors, and generator reconstructions. + +Overall I feel like the problem presented in the paper is well-justified, but the paper itself does not build a sufficiently strong argument in favor of the proposed approach for me to recommend its acceptance. I do think there is a case to be made for a model which exhibits sharp reconstructions and which allows realistic latent space manipulations -- and this is in some ways put forward in the introduction -- but I don’t feel that the way in which the paper is currently cast highlights this very well. Here is a detailed breakdown of why, and where I think it should be improved, roughly ordered by importance: + +- The main reason for my reluctance to accept the paper is the fact that its main subject is convex latent spaces, yet I don’t see that reflected in the evaluation. The authors do not address how to evaluate (quantitatively or qualitatively) whether a certain model exhibits a convex latent space, and how to compare competing approaches with respect to latent space convexity. Figure 2 does present latent space interpolations which help get a sense of the extent to which interpolates also belong to the data distribution, however in the absence of a comparison to competing approaches it’s impossible for me to tell whether the proposed approach yields more convex latent spaces. +- I don’t agree with the premise that current approaches are insufficient. The authors claim that autoencoders produce blurry reconstructions; while this may be true for factorized decoders, autoregressive decoders should alleviate this issue. They also claim that GANs lack bidirectionality but fail to mention the existing line of work in that direction (ALI, BiGAN, ALICE, and more recently Implicit Autoencoders). 
Finally, although flow-based generative models are mentioned later in the paper, they are not discussed in Section 1.2 when potential approaches to building convex latent spaces are enumerated and declared insufficient. As a result, the paper feels a little disconnected from the current state of the generative modeling literature. +- The necessity for latent spaces to ""respect the high-dimensional structure of the [data] distribution"" is stated as a fact but not well-justified. How do we determine whether a marginal posterior is ""a suboptimal representation of the high-dimensional dataset""? I think a more nuanced statement would be necessary. For instance, many recent approaches have been proposed to build more flexible approximate posteriors in VAEs; would that go some way towards embedding the data distribution in a more natural way? +- I also question whether latent space convexity is a property that should always hold. In the case of face images a reasonable argument can be made, but in a dataset such as CIFAR10 how should we linearly interpolate between a horse and a car? +- The proposed model is presented in the abstract as an ""AE which produces non-blurry samples"", but it’s not clear to me how one would sample from such a model. The generator is defined as a mapping from data points to their reconstruction; does this mean that the sampling procedure requires access to training examples? Alternatively one could fit a prior distribution on top of the latent codes and their interpolations, but as far as I can tell this is not discussed in the paper. I would like to see a more thorough discussion on the subject. +- When comparing reconstructions with competing approaches there are several confounding factors, like the resolution at which the models were trained and the fact that they all reconstruct different inputs. Removing those confounding factors by comparing models trained at the same resolution and reconstructing the same inputs would help a great deal in comparing each approach. +- The structure-preserving regularization term compares distances in X and Z space, but I doubt that pixelwise Euclidian distances are good at capturing an intuitive notion of distance: for example, if we translate an image by a few pixels the result is perceptually very similar but its Euclidian distance to the original image is likely to be high. As far as I can tell, the paper does not present evidence backing up the claim that the regularization term does indeed preserve local structure. +- Figures 2 and 3 are never referenced in the main text, and I am left to draw my own conclusions as to what claim they are supporting. As far as I can tell they showcase the general capabilities of the proposed approach, but I would have liked to see a discussion of whether and how they improve on results that can be achieved by competing approaches. +- The decision of making the discriminator an autoencoder is briefly justified when discussing related work; I would have liked to see a more upfront and explicit justification when first introducing the model architecture. +- When discussing feature vectors it would be appropriate to also mention Tom White’s paper on Sampling Generative Networks.",4,4.0,ICLR2019 +Skew_xxphQ,2,SkgVRiC9Km,SkgVRiC9Km,"Improving the robustness of deep Networks by modeling the manifold of hidden representations is original, efficient and well motivated","The method works by substituting a hidden layer with a denoised version. 
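As I understand it, the substitution looks roughly like this (a minimal sketch with invented names, assuming NumPy/PyTorch-style arrays and callable layers; this is my own reading, not code from the paper):

def forward_with_denoised_layer(x, layers, k, denoiser):
    # run the network normally up to layer k
    h = x
    for layer in layers[:k]:
        h = layer(h)
    # replace the hidden representation by its denoised version,
    # i.e. project it back towards the manifold of clean activations
    h_clean = denoiser(h)
    # the size of the correction can also serve as an anomaly score
    score = ((h - h_clean) ** 2).mean()
    # continue the forward pass from the cleaned representation
    for layer in layers[k:]:
        h_clean = layer(h_clean)
    return h_clean, score

If that reading is right, the anomaly score is what enables the detection behaviour described next.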
+Not only it enable to provide more robust classification results, but also to sense and suggest to the analyst or system when the original example is either adversarial or from a significantly different distribution. +Improvements in adversarial robustness on three datasets are significant. + +Bibliography is good, the text is clear, with interesting and complete experimentations.",9,4.0,ICLR2019 +7pVvjXOHa9,1,ODKwX19UjOj,ODKwX19UjOj,Recommend rejection,"This paper presents a method to unsupervisedly discover structure in unlabeled videos, by finding events at different temporal resolutions. + +### Strengths: +- The paper focuses on the important problem of exploiting weakly labeled video data, by exploiting its structure, for example by recovering temporal structure in an autoencoder fashion. +- Use of multiple modalities to cross-supervise each other. +- Code is available. + +### Weaknesses: +- The concept of hierarchy is not well defined or well motivated. While most hierarchical papers refer to hierarhies of concepts, the hierarchy considered in this paper is much weaker as a hierarchy, and it refers to subactions within longer actions (not actions that are specific instances of more abstract actions). While this way of understanding hierarchy can be valid, it is never explained or motivated in the paper, or even compared to the standard way of understanding it. +- The overall motivation for the method has a lot of gaps. For example: + - Regeneration of low level concepts from high level concepts: what are we expecting from a network that moves from a ""high in the hierarchy"" concept to a ""low in the hierarchy"" concept? Should we expect the network to randomly select one subconcept? The specific information is not there, the problem is ill-defined (for example, we can go from ""cat"" to ""animal"", but not from ""animal"" to ""cat""). How are we expecting any reconstruction? The paper does not provide any justification or intuition. + - Why are the authors using those specific pairs in the L_dyn term? (last line of page 4 -- I suggest adding equation numbers). Apart from no motivation, there are no ablations showing that those are the correct pairs. + - Why is the ""low"" case even necessary? Wouldn't it be possible to train only with the ""high"" one? This would probably imply rethinking some of the losses, but overall the method would look very similar. This is, the idea of hierarchy would disappear, but this idea is not used in the experiments anyway. + - What is the motivation for the two modalities? I can understand it can help, but it is not central to the method. This is not necessarily bad, but it requires some explanation. +- The explanation revolves around demonstration data. It is unclear why demonstration data is important for this method, and why it is not general for any human action. The introduction explaining demonstrations in a robotics scenario does not feel related to the content of the paper. For example, a lot of stress is given to ""agents"" interacting in ""environments"". +- Some terms introduced in the paper would benefit from a change. For example, a ""concept"" in the paper is actually an ""event"", not a ""concept"". This is more in line with the literature, for example the dataset they use labels that as ""event"". +- Results on chess are hard to believe. Do the authors think that the system has really learned (unsupervisedly) to identify interesting openings? It could instead be learning strong biases like length of openings in the dataset. 
+- Quantitative results are not convincing. How does FLAT and even the supervised method perform much worse than the two trivial baselines (random and equal division)? Does FLAT use text data? +- Conceptual assertions that I do not believe to be true: ""under a concept, the sequence of frames is nearly deterministic"". This is not true, there are nearly infinite ways of having a sequence of frames (video clip) depicting how to crack an egg, for example. Different background, different way of performing the action, different elements in the scene, speed of the action, point of view, etc. This is related to the ""regeneration"" point above. + +### Additional comments and questions: +- Figure 2 is hard to understand. What is it exactly representing? What should we learn from it? +- Does the algorithm have any ""motivation"" to not predict always a single concept per sequence? +- What is the relationship between u and v? +- Have you tried smaller networks? 8 layers and 8 heads just for the Encoder seem like a very big model for such a relatively simple text. +- In the first paragraph of page 8 the paper mentions that there is only marginal improvement in low level concepts. How are these evaluated? As far as I could understand, there was only GT available for the high level ones. + +### Final recommendation +Overall, I believe the paper as it stands is not ready to be presented to ICLR and I recommend a rejection.",4,4.0,ICLR2021 +D2pTcZ0iZo,2,23ZjUGpjcc,23ZjUGpjcc,Expert model provides better representations for transfer learning," +**Summary:** + +This paper presents a novel method for obtaining better representations for transfer learning. Specifically, instead of using a generic representation for various down-stream tasks, this paper proposed to create a family of expert models in pre-training, and selectively choose one expert to generate representation depending on the target transfer learning task. A simple yet effective k-nearest neighbor strategy is used for picking the best-fitting expert. This paper has extensive experiments including pre-training models on two large-scale image datasets, and evaluated on two transfer learning benchmarks (VTAB and datasets from DAT). + + +**Reasons for score:** + +The general idea of this paper (i.e. replacing generic representation with one from target-dependent expert model) is very intuitive, and the experimental validations are very solid. However, the novelty and technical contribution of this paper is only moderate. Overall, I think it's a good paper and may inspire future work on more efficient and effective transfer learning. + + +**Pros:** + +1. The proposed idea is intuitive, and empirically very effective. Different from focusing on architectures for transfer learning, this paper focused on using different representations to improve transfer learning quality. This is complementary with many existing techniques on transfer learning. +2. The experimental validations are extensive and solid, e.g. all reported accuracies have confidence interval so the comparison is more informative. +3. The paper is well-written and easy to follow. + +**Cons:** + +1. In Section 4.1 two variations of ""MoE family"" are proposed, i.e. Full ResNet Modules and Adapter Modules. For Adapter Modules, it seems all experts share blocks and only differ in adapter module (Fig.3 b). However, the effectiveness of constructing ""expert"" model in this way lacks supporting evidence, i.e. how well each ""expert"" performs on the corresponding/non-corresponding domains? +2. 
I am wondering if Table 1 should add performance comparison with the baseline model B, so it would be more straightforward whether the expert branch selected by the proposed strategy is more effective on the target datasets. +3. For ""All Experts"" in Table 2, I find it unclear as in Section 6.5 ""Combining All Experts"" it doesn't explain how this model works. Does it mean all experts are simultaneously selected, i.e. $x_i := F_i(x_{i-1} + \sum_e a_e^{(i)}(x_{i-1}))$? In that case, if each adapter module introduced 6% parameters (Section 4.1) the extra parameters will be non-negligible. +4. Please clarify: as shown in Figure 3 (a) ResNet-50 has four blocks and adapter module is added before each block; does the proposed system support choosing different adapter module at different block? E.g. for the first two blocks of a specific dataset $a^{(1)}_i$ and $a^{(2)}_j$, is it possible that $i != j$? In other words, the ""performance proxy"" strategy is applied to each block, or at the end of the entire network? +5. Section 6.1 mentioned that ImageNet21k use 50 experts and JFT use 240 experts, but in supplementary JFT seems to have 244 experts. Also from Table 4 in supplementary C. 6, some dataset choose ""baseline"" as the selected expert. Does the ""baseline"" serve as a standalone expert, or a base of all experts (e.g. in adapter modules setting do the adapter serve as residual to the baseline)? There seems to be some inconsistencies here. + + +**Questions during rebuttal period:** + +Please address my questions in the cons section. + + +**Some typos and minor issues:** + +-- Supplementary D.1, ""Unconditional pre-training"" the last sentence is incomplete.",7,4.0,ICLR2021 +gvJLIDDlYrL,3,awnQ2qTLSwn,awnQ2qTLSwn,An interesting paper on learning to share rewards,"Summary +The paper considers the cooperative MARL setting where agents get local rewards and they are interconnected as a graph where neighbors can communicate. The paper specifically considers the communication of reward sharing, that is, an agent shares (part of) its reward to its neighbors, such that each agent optimizes its local reward plus rewards from its neighbors. This motivates a bi-level optimization framework where the high-level policy decides how the rewards are shared and the low-level policy locally optimizes the shared rewards given the high-level’s decision. The paper’s flow motivates such a framework well. The experimental results demonstrate the method’s effectiveness. I think it is a strong paper (accept), but my confidence is low due to the following confusions I have. + +Comments/Questions + +1. I have a high-level comment on the reward sharing mechanism. It seems that the proposed method does not support multi-hop sharing because rewards can only be shared to neighbors. Why is this single-hop sharing effective in the experiments? Is it because of domain-specific reasons, or it’s because that single-hop sharing is in principle equally effective, why? + +2. The derivation of (18) using taylor expansion is unclear to me. Could the authors explain it with more details? + +3. I don’t fully understand the proof of Proposition 4.2. Specifically, does “phi can be learned in a decentralized manner” mean that the *optimal* phi can be based on only the local observation for each agent, instead of based on global state? Could the authors comment on the approximation error induced by the mean-field approximation? Why the proof begins with phi_i based on o_i and ends with phi_i based on global state s. + +4. 
In Equation (17) and (20), should phi^* be just phi (i.e. no * here)? + +5. The low-level policy is to optimize the shared rewards. My understanding is that any (single-agent) RL algorithm can be used for optimizing the shared rewards, e.g. DQN, DDPG, A2C, etc. Why would the authors choose DGN, a rather less popular RL algorithm? Have the authors tried more popular algorithms as the low-level policy? + +6. For fixed LToS, how do we determine the fixed sharing weights? + +--- +Thanks for the response. I've increased my confidence. ",8,3.0,ICLR2021 +BJlCvHrG9H,3,HygOjhEYDH,HygOjhEYDH,Official Blind Review #1,"The paper proposes to directly model the (conditional) inter-event intervals in a temporal point process, and demonstrates two different ways of parametrizing this distribution, one via normalizing flows and another via a log-normal mixture model. To increase the expressiveness of the resulting model, the parameters of these conditional distributions are made to be dependent on histories and additional input features through a RNN network, updating at discrete event occurrences. + +The paper is very well written and easy to follow. I also like the fact that it is probably among the first of those trying to integrate neural networks into TPPs to look at directly modeling inter-event intervals, which offers a different perspective and potentially also opens doors for many new methods to come. + +I have just three comments/questions. + +1. The log-normal mixture model has a hyper-parameter K. Similarly, DSFlow also has K, and SOSFlow has K and R. How are these hyper-parameters selected? I don't seem to find any explanation in the paper (not even in appendix F.1)? + +2. To better demonstrate that a more complicated (e.g. multi-modal) inter-event interval distribution is necessary and can really help with data modeling, I'd be interested to see e.g. those different interval distributions (learnt from different models) being plotted against each other (sth. similar to Figure 8, but with actual learnt distributions), and preferably with some meaningful explanations as to e.g. how the additional modes capture or reflect what we know about the data. + +3. Even though the current paper mainly focuses on inter-event interval prediction, I think it's still helpful to also report the model's prediction accuracy on marks in a MTPP. The ""Total NLL"" in Table 5 is one step towards that, but a separate performance metric on mark prediction alone would have been even clearer.",8,,ICLR2020 +PG0xY51egep,1,GbCkSfstOIA,GbCkSfstOIA,"The method is correct, and is simple and straightforward, which is not necessarily bad. However, the paper has several serious issues with the empirical evaluations. The paper is not at the bar to be accepted.","This paper attempts to address the semi-supervised learning topic by proposing a method based on an aggregated loss considering both cross-entry and Davies-Bouldin Index. Cross-entropy is used to ensure the maximum margin between classes and Davies-Bouldin Index is applied to the labeled data and to the whole dataset, respectively, to ensure a high quality of clustering. Evaluations in four small and simple datasets are reported to demonstrate the effectiveness of the proposed method. + +The idea behind the proposed method is pretty simple and straightforward. It makes sense and so I believe that the method proposed is correct conceptually. However, a correct method may not necessarily be effective, which would require an extensive evaluation. 
That is what the paper lacks. Let me elaborate in detail. + +First, all the four datasets used in the evaluations are small in scale and simple in class distributions. This is NOT the sate-of-the-art anymore. There are larger in scale and more complex datasets available and why did the authors fail to use these datasets instead? + +Second, the comparison studies reported in Fig. 4 are not clearly documented and possibly are not fair either. Which specific supervised learning method did you use in this comparison? Did you use the same number of labeled data samples for the supervised learning method? The same questions are also raised for the toy example experiment reported in Fig. 3. If the supervised learning method uses the same number of the labeled samples, it is not a fair comparison obviously as your method uses more regularization constraints in the loss function. If a different number of the labeled samples is used, it is not a fair comparison either. Note that more constraints you use in the loss function, the number of labeled samples required can vary more, depending upon the different data distributions. Consequently, the comparison studies as well as the ablation studies reported in Fig. 4 can be misleading, possibly just showing the best case. More datasets with varying complexity in distributions may need to further the comparison studies before any meaningful conclusion can be reached. + +Third, Davies-Bouldin Index is just one of the intrinsic evaluation metics for cluster analysis and there are others such as silhouette coefficient and Calinski-Harabasz score. In principle, you may also use these metics. I wonder whether you have investigated this and why you picked up Davies-Bouldin index. + +Finally, the paper has a lot of grammatical errors and typos that should have been fixed before the paper was sent in for review. + +As a minor comment, I don’t like the acronym MCMC for this proposed method. The same acronym is already in use as a very popular statistical machine learning method in the literature. The authors may want to change to another name.",4,5.0,ICLR2021 +2EI6ztnbJkq,3,xpx9zj7CUlY,xpx9zj7CUlY,Original idea with a lot of potential,"The authors introduce the novel idea of producing unbiased gradient estimates by Monte Carlo sampling the paths in the autodiff linearized computational graph. +Based on sampling paths it is possible to save memory due to not having to store the complete linearized computational graph (or the intermediate variables necessary to reconstruct it). +Memory is the main bottleneck for reverse mode autodiff for functions with lots of floating point operations (such as a numerical integrator that performs many time steps, or a very deep neural network). +The authors' idea can therefore potentially enable the gradient based optimization of objective functions with many floating point operations without check pointing and recomputing to reverse through the linearized computational graph. +The tradeoff made is the introduction of (additional) variance in the gradient estimates. + +The basic idea is simple and elegant: +The linearized computational graph of a numerical algorithm is obtained by +a) having the intermediate variables of the program as vertices +b) drawing directed edges from the right-hand side variables of an assignment to its left-hand side variable +c) labeling the edges by the (local) partial derivatives of assignments' left-hand side with respect to their right-hand side. 
+ +The derivative of an ""output"" y with respect to an ""input"" x of the function is the sum over all paths from x to y through the linearized computational graph taking the product of all the edges in the path. +The sum over all paths corresponds to the expectation of a uniform distribution over the paths times the number of paths. +That expectation can be Monte Carlo sampled. + +The authors suggest a way of producing the path sampling based on taking a chained matrix view of the computation graph (see e.g https://arxiv.org/abs/2003.05755) and injecting low rank random matrices. +Due to the fact that the expectation of the product of independent random variables is equal to the product of the expectations this injection is unbiased as well if the injected matrices have the identity matrix as expectation. + +To take advantage of the simple idea it is in practice necessary to consider the concrete shape of the computational graph at hand in order to decide where to best randomize and save memory without letting variance increase too much. + +The authors present a neural network case study where they show that for some architectures the suggested approach has a better memory to variance trade off than simply choosing a smaller mini-batch size. +Furthermore, they present a 2D PDE solver case study where their approach can save a lot of memory and still optimize well. + +I recommend to accept the paper. + +Remarks: + +I would love to see a more in depth analysis of the variance for example for simple but insightful toy examples. + +For exampl simple sketching with random variates v with E[vv^T] = I can be used to obtain an unbiased gradient estimate via E[gvv^T] = g, i.e. by evaluating a single forward-mode AD pass (or just a finite difference perturbation). +But of course the gradient estimate has such a high variance so as to not give any advantage over finite difference methods (since with each sample / evaluation we are only capturing one direction in the input space). +We are not gaining the usual benefit of reverse mode autodiff of getting information about the change of the function in all directions. + +In order for paths to be efficiently Monte Carlo-friendly it is probably necessary that they are correlated with other paths. +In practice this will perhaps have something to do with e.g. the regularity of the PDE solution (the gradient with respect to the solution is similar to that of its neighborhood). + +A simple example (ODE integrator): + +p1 = x +p2 = x +for i in range(n): + p1 = p1 + h * sin(p1) + p2 = p2 + h * sin(p2) + +y = 0.5 * (p1 + p2) + +The two paths in the program compute exactly the same values so leaving one path out randomly does not make any difference at all (if we correctly re-weight the estimate). + +Mini-batches are often like that: Independent samples from the same class give correlated computations, hence the variance is related to the variance in the data. + +But if two paths involve completely independent and uncorrelated computations then the variance is such that we do not gain anything. +We need at least two gradient steps to incorporate the information from both paths. +Since we do not systematically cycle through them but sample randomly, we are actually going to be less efficient. + +In terms of arguing about memory savings for machine learning applications it would be interesting to see a case study with a large scale architecture that does not fit into memory. 
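To be concrete about the kind of toy variance analysis I have in mind, here is a tiny self-contained example (my own code, not from the paper): estimating a fixed gradient g by naive random sketching is unbiased, but the per-coordinate variance grows with the dimension, so unbiasedness alone says little about optimization efficiency.

import numpy as np

rng = np.random.default_rng(0)
d = 50
g = rng.normal(size=d)                 # stand-in for the true gradient

def sketched(g, rng):
    v = rng.normal(size=g.size)        # E[v v^T] = I
    return (g @ v) * v                 # E[(g.v) v] = g, so the estimate is unbiased

est = np.stack([sketched(g, rng) for _ in range(20000)])
print(np.abs(est.mean(axis=0) - g).max())    # close to 0: the estimator is unbiased
print(est.var(axis=0).mean())                # roughly ||g||^2 ~ d: large per-coordinate variance

The same kind of script, with the paper's path-sampling estimator plugged in instead of sketched(), would show directly how much of that variance the correlation between paths removes.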
+ +The random matrix injection section could be clarified by moving the sentence ""the expectation of a product of independent random variables is the product of their expectation"" further to the front and state clearly the idea that: +E[A PP^T B QQ^T C] = A E[PP^T] B E[QQ^T] C = A I B I C = A B C + +In the PDE example you could clarify the notation used to properly distinguish between the continuous and the discretized solution. + +Also the PDE constrained optimization problem is not inherently stochastic (as can be argued for the empirical risk minimization setting in machine learning). +Therefore, it is possible to use non-SGD methods with linear or even super-linear convergence rates (quasi-Newton methods). +SGD with unbiased gradients has a sublinear rate of convergence. +But the ideas of the paper are of course still useful even when trying to find the optimum up to machine precision in finite time. +We can first use the RAD SGD approach in the early optimization and then go to the deterministic setting later in the optimization. + +- Page 3: eqn equation 1 -> Equation 1 +- Page 6: figure 5 -> Figure 5 +- Throughout: Perhaps clarify the meaning of batch vs mini-batch (in other papers batch can refer to full-batch) +- Figure 5 (a) has blue curves but blue is not in the legend of Figure 5 (c) +- Page 8: backpropogate -> backpropagate +",8,4.0,ICLR2021 +WddFG3GiRK,1,zI38PZQHWKj,zI38PZQHWKj,FROT is interesting but the analysis is suspicious,"*Summary*: +The authors try to solve a special kind of high-dimensional optimal transport problem. Specifically, they consider the cases when features are grouped and the grouping is known a-priori. The authors formulate the problem into the feature-robust optimal transport (FROT) problem. +The authors propose two solving algorithms, one based on the Frank-Wolfe method, and one based on linear programming. + +*Pros*: +The connection to the feature group sounds interesting to me, as it has a natural connection to the structure of deep learning models. +The presentation (other than the introduction) is easy to follow. + +*Cons*: +Note that the first point is the main contributing factor for my rating. + +1. Section 3.1 is very confusing, and it seems to me that the authors fail to establish the correct convergence guarantee. +As in page 4, the target is $min_{\pi} max_{\alpha} J(\Pi, \alpha)$ +If we fix $\pi$, we can solve for the optimal $\alpha$. Plug this optimal $\alpha$ back in and we obtain $G(\Pi)$. +Intuitively one may choose to solve for $\alpha$ and $\pi$ alternatingly. +However the convergence of $G(\Pi)$ says nothing more than, in a fixed iteration, one can solve exactly for the optimal $\alpha$ and up to $\epsilon$ accuracy for $\Pi$. +We still don't know if the solution of the algorithm indeed minimizes the said loss. +I checked the proof of proposition 4. It just invokes the standard FW-convergence analysis from Jaggi 2013, and argue nothing about the alternative part. Note that, even though the two subproblems (for $\alpha$ and for $\pi$) can be solved almost exactly, it could be non-trivial to set up the convergence of the entire alternating algorithm. +Alternatively, maybe the authors want to argue that solving $min_{\pi} max_{\alpha} J(\Pi, \alpha)$ is equivalent to solving $\max_{\pi} G(\Pi)$. However, this is also not obviously true for me. + +2. What are the other potential applications of FROT? 
While Semantic Correpondance is an interesting application, I find it hard to convince myself that FROT is better than Liu's 2020-CVPR work (requiring validation dataset is not a big problem - you can always to train-val split). With its similarity to group lasso, FROT might have more interesting applications. + +3. Presentation of the introduction can be improved. I find it hard to parse the introduction until I almost finished reading the entire paper. Putting figure 1 to page 2 only creates more questions in my head instead of offering intuitions. Also, it would be helpful if the author can list their contributions in a more organized way. + +4. I didn't quite get the high dimensional part. While 'high-dimensional' appears in the abstract, introduction, and conclusion section, I didn't find the correspondence in the main text. + +5. I didn't get the robust part, other than the empirical performance in the evaluation section. +",3,3.0,ICLR2021 +S1xk8af4x,1,ry7O1ssex,ry7O1ssex,Review,"This paper presents a bridging of energy-based models and GANs, where -- starting from the energy-based formalism -- they derive an additional entropy term that is not present in the original GAN formulation and is also absent in the EBGAN model of Zhao et al (2016, cited work). The relation between GANs and energy-based models was, as far as I know, first described in Kim and Bengio (2016, cited work) who also introduced the entropy regularization. It is also discussed in another ICLR submission (Dai et al. ""Calibrating Energy-based Generative Adversarial Networks""). There are two novel contribution of this paper: (1) VGAN: the introduction of a novel entropy approximation; (2) VCD: variational contrastive divergence is introduced as a novel learning algorithm (however, in fact, a different model). The specific motivation for this second contribution is not particularly clear. + +The two contributions offered by the authors are quite interesting and are well motivated in the sense that the address the important problem of avoiding dealing with the generally intractable entropy term. However, unfortunately the authors present no results directly supporting either contribution. For example, it is not at all clear what role, if any, the entropy approximation plays in the samples generated from the VGAN model. Especially in the light of the impressive samples from the EBGAN model that has no corresponding term. The results provided to support the VCD algorithm, come in the form of a comparison of samples and reconstructions. But the samples provided here strike me as slightly less impressive compared to either the VGAN results or the SOTA in the literature - this is, of course, difficult to evaluate. + +The results the authors do present include qualitative results in the form of reasonably compelling samples from the model trained on CIFAR-10, MNIST and SVHN datasets. They also present quantitative results int he form of semi-supervised learning tasks on the MNIST and SVHN. However these +quantitative results are not particularly compelling as they show limited improvement over baselines. Also, there is no reference to the many +existing semi-supervised results on these datasets. + +Summary: The authors identify an important problem and offer two novel and intriguing solutions. Unfortunately the results to not sufficiently support either +contribution. 
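To make the intractability point concrete (my own schematic notation, not necessarily the authors'): in the energy-based formulation the generator/sampler is trained to minimize

$E_{x \sim p_g}[E(x)] - H(p_g)$,

and it is the entropy term $H(p_g)$ that has no tractable estimator for an implicit sampler, which, as I read it, is where the proposed approximation and the VCD variant come in.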
+",4,5.0,ICLR2017 +B1e3G21qhm,2,rkxt8oC9FQ,rkxt8oC9FQ,"Simple idea, the presentation of the method and experiment results can be improved","Summary: +This paper proposed to extend TARNET (Shalit et al. 2017), a representation learning approach for counterfactual inference, in the following ways. + +First, to extend TARNET to multiple treatment setting, k head networks (instead of 2) were constructed following the shared MLP layers, where each head network modeled the outcome of one treatment. This extension seemed quite straightforward. + +Second, during training, for every sample in a minibatch, find its nearest neighbors from all other treatments and add them to the minibatch. The distance was measured by the propensity score, which was defined the probability of a sample being assigned to a treatment group and could be learned by a classification model (such as support vector machine used in this work). Therefore, 1) the augmented minibatch would contain the same number of samples for each treatment group; 2) different treatment group were balanced. + +Third, a model selection strategy was proposed by estimating the PEHE using nearest neighbor search. + +Comments: +This paper is well motivated. The key challenges in counterfactual inference is how to adjust for the bias in treatment assignment and the associated discrepancies in the distribution of different treatment groups. + +The main idea of this paper, i.e., augmenting the minibatch through propensity score matching for each sample, is well explained in Section 3. However, it could be better if the introduction of model architecture (in Appendix F) was presented in the method section. + +Did the author need to train (k choose 2) SVMs to compute the propensity scores for samples from k treatment groups? + +When comparing different approaches, as were shown in Table 3, 4 and Figure 3,4, did the author run any statistical test, such as t-test, to confirm the difference between those distributions were significant? The standard deviations of those errors seemed quite large so the difference could be non-significant. + +Could the author provide more explanations on why the proposed approach, i.e., minibatch augmentation using propensity score matching, can outperform the TARNET? In TARNET, each sample it only used to update the head network corresponding to the sample's treatment assignment, why would balancing samples in the minibatch can improve the estimation of treatment effect?",5,4.0,ICLR2019 +U22Azq1o6bG,3,60j5LygnmD,60j5LygnmD,Interesting findings on negative learning rates in meta learning,"Summary: + +This paper uses random matrix theory to study meta-learning in mixed linear regression and finds that the optimal learning rate for the inner loop/training is negative in order to minimize the test/adaption error. The results are interesting and novel. However, there are some concerns regarding the practical relevance and presentation of the results. + +Major comments: + +1. Implementation of negative learning rates in practice: This paper provides an interesting perspective that a negative learning rate could reduce the test error. My concern is with a negative learning rate, the training loss $\mathcal{L}^{(i)}$ in Eq. (3) increases and the algorithm may not converge (at least on the training sets). In practice, for example, on deep learning models motivated by this paper in the first paragraph, how do you decide when to stop training parameters $\theta^{(i)}$ and $\omega$? 
How could you use the results in this paper to provide some guidance? +2. Presentation of the main results: I would suggest the authors formally state the results in theorems or propositions. Currently, the main results are presented in Eq. (5), (9), (10), (15), and (16), that seem to be informal and need clarification. First, how is $\bar{\mathcal{L}}^{test}$ defined? Second, what does $\simeq$ mean in this context? If it means ````"" approximately equal to,"" then what is the order of the estimation error? Third, the results are derived using mean-squared loss as shown in Eq. (25) and (28) in the appendix. It is helpful to be explicit about the loss function in the main text. Fourth, the loss function does not seem to have a regularization term. In the overparameterized regime, would the model suffer from overfitting? +3. Experiments: Could you elaborate more on why the theory matches the experiment pretty well in Figures 1.a, 2.a, and 3.a, while not the case in Figures 1.b, 2.b, and 3.b (especially Figure 3.b)? If I understand correctly, the data generating process in the simulation follows the assumption in the main results. Is it because the estimation error in $\bar{\mathcal{L}}^{test}$ (the terms omitted in the RHS of $\simeq$) is not negligible in finite samples? Also, is the curve in Figures 1.b and 3.b robust to the choice of parameters? It may be helpful to include a few other simulation setups in the appendix. ",6,4.0,ICLR2021 +SyllU9Iq37,2,Hyewf3AqYX,Hyewf3AqYX,A method to produce adversarial attack using a Frank-Wolfe inspired method,"This paper provide a method to produce adversarial attack using a Frank-Wolfe inspired method. + +I have some concerns about the motivation of this method: + - What are the motivations to use Frank-Wolfe ? Usually this algorithm is used when the constraints are to complicated to have a tractable projection (which is not the case for the L_2 and L_\infty balls) or when one wants to have sparse iterates which do not seem to be the case here. + - Consequently why did not you compare simple projected gradient method ? (BIM) is not equivalent to the projected gradient method since the direction chosen is the sign of the gradient and not the gradient itself (the first iteration is actually equivalent because we start at the center of the box but after both methods are no longer equivalent). + - There is no motivations for the use of $\lambda >1$ neither practical or theoretical since the results are only proven for $\lambda =1$ whereas the experiments are done with \lambda = 5,20 or 30. + - What is the difference between the result of Theorem 4.3 and the result from (Lacoste-Julien 2016)? + +Depending on the answer to these questions I'm planning to move up or down my grade. + + In the experiment there is no details on how you set the hyperparameters of CW and EAD. They use a penalized formulation instead of a constrained one. Consequently the regularization hyperparameters have to be set differently. + + The only new result seem to be Theorem 4.7 which is a natural extension to theorem 4.3 to zeroth-order methods. + +Comment: +- in the whole paper there is $y$ which is not defined. I guess it is the $y_{tar}$ fixed in the problem formulation Sec 3.2. In don't see why there is a need to work on any $y$. If it is true, case assumption 4.5 do not make any sense since $y = y_{tar}$ (we just need to note $\|\nabla f(O,y_{tar})\| = C_g$) and some notation could be simplified setting for instance $f(x,y_{tar}) = f(x)$. 
+- In Theorem 4.7 an expectation on g(x_a) is missing + +Minor comments: +- Sec 3.1 theta_i -> x_i +- Sec 3.3 the argmin is a set, then it is LMO $\in$ argmin. + +===== After rebuttal ====== +The authors answered some of my questions but I still think it is a borderline submission. +",5,4.0,ICLR2019 +rXEN3-UyRhI,1,9wHe4F-lpp,9wHe4F-lpp,Official Blind Review #2,"---paper summary---: + +This paper proposes to improve the BNN’s discriminative ability by introducing additional non-linearities. In addition, the paper exploits the group convolution to enable a wider network, which can strengthen BNN’s representational ability, while keeping the total overhead unchanged. + + +---Pros---: + +This paper introduces some practical methods to improve the performance of BNNs. In particular, the additional FPReLU is convincing. Moreover, the paper shows that grouped convolutions can be applied to wider BNNs to increase the representational capability while keeping the same complexity. + +---Cons---: + +1: This paper is incremental with limited new technical insights. + +a) Adding the nonlinearity can improve the representational capability of BNNs has been extensively explored in the literature. Specifically, ReActNet [Liu et al. 2020b] inserts additional RPReLU after each binary convolution to increase the non-linearity; [A1; Martinez et al., 2020; Tang et al. 2017] add PReLU as the nonlinearity; Group-Net [Zhuang el al. 2019] and XNOR-Net [Rastegari et al., 2016] argue that adding additional ReLU is important to BNNs performance. +b) Varying width and/or depth has been studied in previous fixed-point quantization/BNNs literature, especially in NAS-based methods [Bulat et al. 2020; A2]. The original idea comes from EfficientNet. +c) Some other tricks such as replacing the 1x1 downsampling shortcut with pooling have been widely used in the community. + +2: Some arguments need theoretical proofs. For example, the authors argue that “despite the big quantization error made by quantization, the binary model can achieve much higher accuracy than the real-valued model with no quantization error”. In other words, minimizing the quantization error can hinder the discriminative ability of BNNs, which is the main point of this paper. This observation is interesting, but needs further theorems to further explore whether the quantization error has some relations with the predicted accuracy under some constraints. If zero quantization error cannot lead to a good result, then what should be the best trade-off? I encourage the authors to further explore this issue. At the current stage, it is far from enough. + +3: The experimental results in Table 4 may have mistakes. The paper claims that BOPs are converted to equivalent FLOPs with a factor of $1/64$. However, why do smaller BOPs correspond to larger FLOPs? + +4: The necessary “AND-count” operations may have technical issues. The AND-Count with values binarized to {0,1} should be equivalent to XNOR-Count with the values binarized to {-1, 1}, with a scalar difference. The authors can derive by themselves to verify this. However, the formulations in Eq. (3) and Eq. (4) are not equivalent if both values are binarized to {-1, 1}. + +5: More experimental results on deeper networks (e.g., ResNet-50, -101) on ImageNet are needed to justify the effectiveness of the method. In addition, the comparisons with RELU [Zhuang el al. 2019; Rastegari et al., 2016], PReLU [A1; Martinez et al., 2020; Tang et al. 2017], RPReLU [Liu et al. 2020b] (optional) should be included. 
+ +6: Some typos. For example, “inheriting exiting advanced structures” → “inheriting existing advanced structures”; “Base on this consideration” → “Based on this consideration”. + +References: + +[A1]: Bayesian Optimized 1-Bit CNNs, in ICCV2019 + +[A2]: Joint Neural Architecture Search and Quantization, arXiv:1811.09426v1 +",4,5.0,ICLR2021 +cIgCsqf7AEt,1,gBpYGXH9J7F,gBpYGXH9J7F,An incremental work which leverages known knowledge and techniques to prove refined regret bounds.,"The work studies three online learning problems with corrupted rewards as the feedback. The three problems are the stochastic multi-armed bandit, the linear contextual bandit, and the reinforcement learning of the Markov Decision Process optimization. + +The major contributions are three improved regret bounds for each of the problems, where the key to success is to replace the empirical mean in the arm/action selection scoring function by the median. Then, robust estimation bounds in the literature are leveraged to achieve the regret analysis. Experiments also support the theoretical findings. + +Concerns + +Besides the above positive contributions, following are some concerns: + +1. The target of optimization. All analyses bound the regrets with respect to the best arm/policy in the uncorrupted situation. This requires more elaboration to connect with the routing example in the introduction. If the actual routing time is the signal to measure the performance, then one does not need to take any ETA estimation into learning. If the actual routing time for the performance evaluation is the time affected by some noise, the regret definition makes no sense. It seems a reasonable situation is that the observations are corrupted but the performance feedback is not. The concern becomes stronger when it comes to the MDPs, as the trajectory of a policy in the corrupted environment will be different from that in the non-corrupted environment. It would be great if the authors can elaborate on the target chosen. Why is regret compared to the uncorrupted situation instead of the corrupted one? How does the trajectory from the corrupted observations affect or not affect the MDP analysis? + +2. The key modifications in the algorithms and the key inequalities in the regret analyses are from the literature. The analyses also mainly follow the similar steps in the previous works. Without observing novel technical contributions, the work is considered incremental. + +3. There is another work (http://proceedings.mlr.press/v32/seldinb14.html) also dealing with with the mixture of stochastic and the adversarial rewards which is worth mentioning or comparing. + +===================== + +Post Rebuttal + +I went through the authors' reply. My first concern is resolved by the reply. Form the authors' replies to all reviewers, I believe this is an incremental work. It is technically sound, but the lack of involved and novel technical contributions makes it more belong to an incremental work. Thus, I will keep my score unchanged.",5,4.0,ICLR2021 +k3b56qlIUhz,2,guEuB3FPcd,guEuB3FPcd,Impactful paper with strong empirical results,"## Summary +The authors propose AlgebraNets - a previously explored approach to replace real-valued algebra in deep learning models with other associative algebras that include 2x2 matrices over real and complex numbers. 
They provide a comprehensive overview of prior methods in this direction and motivate their work with potential for both parameter and computational efficiency, and suggest that the latter is typically overlooked in prior literature. The paper is very well-written and follows a nice narrative, and the claims are mostly backed empirically with experimental results. +## Pros +* Empirically justified with experiments on state-of-the-art benchmarks in both computer vision and NLP. +* Establishes that exploring other algebras is not just an exercise for mathematical curiosity but also practical, and encourages deep learning practitioners to extend the results. +* Perhaps the most useful aspect is that the experiments fit well into a standard deep learning framework – with conventional operations, initialization, etc. That is, the algebras do not require significant custom ops/modifications to achieve state-of-the-art results. +* Shows multiplicative efficiency (parameter count and FLOPs) in many cases +## Cons +* The authors motivate this work with computational efficiency; however, I did not find any discussion or comments on the total memory footprint. Do any of the algebras require us to keep track of partial computations/intermediate steps - subsequently increasing the total memory footprint? In the case of vision examples, which are dominated by the activations, what are the implications? If the memory footprint is indeed not consistent with a real-valued algebra, then are we trading model/input size for fewer parameters/efficient computation? +* An intuitive justification of the algebras used in these experiments, along with insight for future algebras might be a nice addition, although I wouldn't consider it a con. +* Are certain algebras more amenable to specific hardware architectures? If so, a brief discussion would enhance the paper overall.",7,4.0,ICLR2021 +pMvuMpXMbXS,1,7aogOj_VYO0,7aogOj_VYO0,Review,"This paper studied how to use the information that the gradient lies in a low dimension space to design more accurate differentially private SGD algorithm. Specifically, the authors proposed a new method which is called Gradient Embedding perturbation method. And they showed that compared with the previous DP-SDG method, their method has higher accuracy. +Actually, I tend to weakly reject the paper due to the following reasons. +1. As I know and also the authors mentioned in the related work part, there are also two other works study how to use the low dimensional gradient to get higher utilities for the Empirical Risk Minimization problem. However, I can see that in the other two works, they all have theoretical results for the utility, while in this paper there is just a variance analysis, so I wish to see the authors use Theorem 3.2 to get the excess error for their method when the loss functions are convex so that we can see the difference between the other two methods clearly. +2. In the experimental part, I think it will be better to have a comparison with the above two papers to see the superiority of the method in this paper. just comparing with the base-line is not enough. Moreover, the authors set $\delta=10^{-5}$ or $10^{-6}$, I think it will be better to set $\delta=\frac{1}{n}$. And I want to see more comments about the number of public data to get the anchor subspace.",5,5.0,ICLR2021 +rke4PHfTFB,1,ryxgJTEYDr,ryxgJTEYDr,Official Blind Review #3,"This paper is about a policy design, where the policy is expressed as a mixture of policies called primitives. 
Each primitive is made of an encoder and a decoder, mapping state to actions, rather than temporally extended actions (or options in RL). The primitives compete with each other to be selected in each state and thus do away with the need for a meta-policy to select the primitives. The selected primitive in each state trades between reward maximization and information content. + +The paper is well written and is enjoyable to read. It is helpful for me to have equation (3) in mind before reading about the explanation on the tradeoff between the reward and information, but this is a minor point. My concern is that by scaling the reward in proportion to L_k redistributes the rewards in a way that is not reflective of the underlying reward structure of the MDP. If so, the constructed policy \pi could place a high probability on the suboptimal actions. How do we know if the action selected according to policy \pi will indeed lead to high rewards?",6,,ICLR2020 +AT7jJ1BlQzN,1,uFk038O5wZ,uFk038O5wZ,Official blind review,"Summary: This paper proposes a knowledge graph enhanced network to improve abstractive dialog summarization with graphs constructed from the dialog structure and factual knowledge. The dialog graph is composed of utterances as nodes and 3 heuristic types of edges (such as utterances of the same speaker, adjacent utterances). The factual graph is constructed via openIE and dependency parsing, which the authors claim are complementary as the triplets (results of openIE) are not always available. + +--- +Pros: ++ The proposed method outperforms all the compared baselines on two dialog summarization datasets. ++ Human evaluation shows that the proposed method leads to increased relevance and readability. + +--- +Cons: +- In the ablation study (Table 3), the performance of each variant is close to the full model, and removing either module still outperforms the compared baselines. Given such performance, I wonder if the difference between the ablated variants and the full model is statistically significant. Also, what is the performance of the proposed method without graph information? Is it effectively the same as a Pointer-Generator? +- Details of human evaluation are lacking. Who are the annotators and how many annotators are there? What is the inter-annotator agreement? +- The paper is poorly represented with unclear descriptions (not only typos/grammar but also definitions of various concepts), which makes it hard to follow. To name a few, in 3.2, the definition of edges gives one the impression that it is a fully connected graph. In 3.2.1, “keyword” is defined after the description of the use of the keyword, and I can’t really tell what is the “keyword neighborhood”. A general suggestion is to define them beforehand (when they first appear) instead of describing them later or in the footnote. + +--- +Comments for rebuttal and revised paper + +Thanks for providing a detailed response and an improved version of the paper. One thing that I am still concerned with is how come the updated ablation study is so different from the initial results. Originally, the differences between KGEDCg and KGEDCg-GE and KGEDCg-FKG were very minor (one of my questions above), but now the margins are as large as 7+ pts. Given such discrepancies without explanation, I'd hold my original evaluation. +",5,4.0,ICLR2021 +SJlnCEcaFS,2,rJeIGkBKPS,rJeIGkBKPS,Official Blind Review #3,"** post rebuttal start ** + +After reading reviews and authors' response, I decided not to change my score. 
+I recommend to strengthen their theoretical justification or make their method scalable to improve their work. + + +Detailed comments: + +2. ""Moreover, their results on some of the gray-scale experiments are significantly worse compared to ours."" +-> If you are talking about the comparison in MNIST-variants, please note that experimental results on MNIST cannot be seriously taken unless there is a strong theoretical background; especially, MNIST-variants are too small to talk about the scalability of the method. It is hard to convince readers only with results in MNIST-variants, unless the method has a strong theoretical justification. +However, if your claim is true for general gray-scale images, e.g., preprocessing CIFAR to be in gray scale, then you may add supporting experiments about it. + +4. Again, if the method is only applicable to MNIST-variants due to its computational complexity while it has no strong theoretical justification, I can't find benefits from it. + +** post rebuttal end ** + + + +- Summary: +This paper proposes to improve confident-classifiers for OOD detection by introducing an explicit ""reject"" class. Although this auxiliary reject class strategy has been explored in the literature and empirically observed that it is not better than the conventional confidence-based detection, the authors provide both theoretical and empirical justification that introducing an auxiliary reject class is indeed more effective. + + +- Decision and supporting arguments: +Weak reject. + +1. Though the analysis is interesting, it is not applicable to both benchmark datasets and real-world cases. Including the benchmark datasets they experimented, the input to the model is in general bounded, e.g., natural images are in RGB format, which is typically normalized to be bounded in [0,1]. Therefore, the polytopes would not be stretched to the infinity in most cases. +On the other hand, note that softmax classifiers produce a high confidence if the input vector and the weight vector of a certain class are in the same direction (of course feature/weight norm also matters, but let's skip it for simplicity). Therefore, if there is an auxiliary reject class, only data in the same direction will be detected as OOD; in other words, OOD is ""modeled"" to be in the same direction with the weight vector of the auxiliary reject class. However, the conventional confidence-based detection does not model OOD explicitly. Since OOD is widely distributed over the data space by definition, modeling such a wide distribution would be difficult. Thus, the conventional approach makes more sense to me. + +2. The experiment is conducted only on MNIST variations, so it is unclear whether their claim is true on large-scale datasets and real-world scenario. +Why don't you provide some experimental results on other datasets commonly used in other OOD detection papers, such as CIFAR, SVHN, TinyImageNet, and so on? + + +- Comments: +1. In section 4, the authors conjectured the reason why the performance of reject class in Lee et al. (2018a) was worse is that the generated OOD samples do not follow the in-distribution boundaries well. I think Appendix E in the Lee et al.'s paper corresponds to this reasoning, but Lee et al. actually didn't generate OOD samples but simply optimized the confidence loss with a ""seen OOD."" Lee et al. didn't experiment on MNIST variations but many natural image datasets. So, it is possible that the auxiliary reject class strategy is only effective in MNIST variations. 
I suggest the authors to do more experiments on larger datasets to avoid this criticism.",3,,ICLR2020 +rJeMg4_gqB,3,H1ezFREtwH,H1ezFREtwH,Official Blind Review #3,"This paper presents an approach in which new tasks can be solved by an attention model that can weigh the contribution of different base policies conditioned on the current state of the environment and task-specific goals. The authors demonstrate their method on a selection of RL tasks, such as an ant maze navigation task and a more complicated “ant fall” task, which requires the agent to first move a block to fill a gap in the environment before it is able to reach the goal. + +I found the paper interesting and well written. My primary concern is that the primitive policies are learned independently of the composite policies, which might limit the application of this approach to more complex problems. Additionally, it would be great to also see the concurrent and sequential form of skill combination for the more complex tasks, and not just the point navigation task shown in Figure 7. + +Standard errors on Figures 5 and 6 seem to be missing. Additionally, I was curious that in Figure 4a and Figure 6a, the composite’s performance is already a lot better than the other methods after 0 training steps. Maybe the authors can elaborate on that. Maybe the performance at step 0 is just hard to make out in the graphs? + +I would suggest accepting the paper but there could be a more detailed analysis of how the pre-trained sub-modules are used and learning both composite and sub-policies together would make the paper stronger. + +Additional comments: +- Where is the training graph for the Composite-SAV applied to the “ant fall” task? Maybe I missed it? +- Algorithm 1 should probably be moved to the main text. + +####After rebuttal#### +The authors' response and the revised paper address my already minor concerns. ",6,,ICLR2020 +NkTLPf0DCy0,3,O7ms4LFdsX,O7ms4LFdsX,The proposed method has theoretical foundations and shows promising results.,"Summary: +This paper proposes R-WAE to learn disentangled representations. Benefit from the characteristics of WAE, this paper shows that R-WAE can better disentangle the sequential data into content space and motion space. R-WAE achieves state-of-the-art performance in both disentanglement and unconditional generation. + + +Reasons for score: +Overall, I vote for acceptance. The proposed method has theoretical foundations and shows excellent results. + + +Pros: + +-- The paper provided strict theoretical formulations, like the comparison between WAE and VAE, how WAE can generalize to the sequential format, and the connection between mutual information(MI) and the objective function of R-WAE. The experiments support the theorems. + +-- The authors provide sufficient experiments on multiple datasets. +The experiments cover tasks of various domains, including video and audio, which indicate the proposed method could be easily generalized to different tasks. + +-- Many architectures are investigated, including WAE-MMD, WAE-GAN, and simple/complex encoder. + + +Cons: + +-- When generation (Fig. 1 (a)), the dependency between h in different time steps is considered. However, during the inference phase (Fig. 1 (b)), the dependency is ignored. Any good reason? + +-- The illusions of Figure 5 do not keep the same across frames, while z^m shouldn’t change illusion. More analysis and discussion are probably needed for this result. + +-- The comparison between R-WAE(GAN) and R-WAE(MMD) can be further discussed. 
The comparison is shown in Table 3. The results show that R-WAE(GAN) performs better, but the reason is unclear. + +-- For the audio experiments, ASR results or phoneme classification might be needed to support that z^m keeps the local information. It would be better to provide audio demos of your cross reconstruction result shown in Figure 11. + +-- Figure 6 shows that WAE usually gives a tighter gap between classes of z^c, since WAE computes divergence between Q(Z^c) and P(Z^c), which causes z^c spread over the entire latent space; while VAE gives a noticeable gap between different classes of contents (such as the one shown in Figure 3 of scalable FHVAE), which leaves space for the unseen contents. Would the tighter gap cause issue when training and testing data are mismatched? If the testing data has some contents that have never been seen during the training, what would happen? (for example, training on MUG than testing on VoxCeleb video data) Will the z^c of the unseen content data be forced mapped to the content in the training set instead of keeping its own information? + +",7,3.0,ICLR2021 +pzAPksk4OoO,3,W1G1JZEIy5_,W1G1JZEIy5_,Review: MIROSTAT: A NEURAL TEXT DECODING ALGORITHM THAT DIRECTLY CONTROLS PERPLEXITY,"In the context of neural text generation, the authors study how perplexity varies with top-$k$ and top-$p$ sampling and propose a sampling algorithm that uses Zipf's law to dynamically adjust $k$ in order to control per-sequence perplexity. + +Overall, the theoretical analysis and relationship between log probability and repetition was interesting, but there are several concerns with the method and experimental evaluation, detailed below. The idea is interesting and I hope the authors continue down this line, but in its current form I would not recommend acceptance. (edit: see discussion below, I have adjusted the score to above the acceptance threshold) + + +#### Pros +- The theoretical analysis of cross-entropy growth with top-$k$ versus top-$p$ was interesting (e.g. summarized in Figure 1). +- Nice empirical demonstration of repetition correlating with log probability. + +#### Clarity +- The presentation in Section 2 could be simplified or made more concrete - overall it seems like this section is building up to a standard definition of perplexity in a complicated way. + - are these generic definitions of cross-entropy rate and perplexity (defined using the Shannon-McMillan-Breiman theorem) needed? I don't see them used in the main text, so it would be helpful to shorten this section, or concretely say how each step corresponds to a language model. + - Equation (3) assumes that $P_N$ is a stationary ergodic source. Why can a neural language model be considered a stationary ergodic source? + - Why *surprise* instead of *information content*? *Surprise rate* is not used again in the text. + - In the abstract you say ""target value of perplexity"" but then $\tau$ is called the ""target surprise value"", and in Appendix A it reports ""target *average* surprise value"". In this review I'll use 'perplexity', but it would be helpful to check whether there are inconsistencies in the paper. + +#### Method + +- **Dependence on hyperparameter.** The authors mention that for low values of $k$ and $p$, perplexity drops, leading to repetition, while for high values of $k$ and $p$, perplexity increases, leading to incoherence. The authors claim that *Mirostat avoids both traps*. However it requires setting a target value ($\tau$). 
What is the difference between having to choose $k$ or $p$ versus choosing $\tau$? Wouldn't Mirostat fall into the traps with low $\tau$ or high $\tau$ (e.g. Figure 4.d)? Since you showed that perplexity grows linearly with $p$ (Fig 1), why is Mirostat needed versus using top-$p$? + +- **Fixed perplexity per continuation.** Mirostat enforces the average token log-probability of each *individual continuation* to be near a hyperparameter $\tau$. However, won't the ""ideal"" perplexity vary based on the prefix? I.e. for some prefixes there may be low conditional entropy in the true distribution, meaning a small number of high-probability continuations are much more reasonable than others. In this case, generating a sequence with perplexity based on $\tau$ would filter out these high-probability continuations. Could the authors comment on this issue? The underlying assumption of the method is that it is a good idea to have a fixed perplexity for all continuations. + +- **Zipf's motivation.** While I understand that the Zipf's law assumption was needed to derive the theoretical results, it's unclear why Zipf's law is used to motivate the practical method (Algorithm 1). Why would we want to estimate the zipf's exponent on the top-100 words at each timestep, and choose k using (7)? This motivation, and a comment on the guarantees it gives us on full sequences, should be more clearly stated. + +#### Experiments + +- **Simple 'perplexity target baselines'.** Related to the ""Zipf's motivation"" comment above, what's missing is evaluating different ways of controlling the perplexity, in order to evaluate that the proposed method based on Zipf's law is the best (for some definition of best). If the goal is to control the perplexity of each sequence, why not sample several sequences with top-$k$ and choose one that has perplexity close to a target $\tau$? What is the performance of adjust $k$ or $p$ based on a different heuristic, e.g. absolute difference between the perplexity of the sequence-so-far and $\tau$? + +- **Human evaluation.** Human evaluation is required to get a full measure of the generated text's quality; currently the paper just argues for quality/coherence by showing a few examples. It's possible that this form of dynamic $k$ adjustment introduces some artifacts. For instance, the behavior of the ""surprise"" in Figure 5.d under Mirostat doesn't resemble that of humans, so it's possible that some odd behavior is introduced. + +- **Misc.** In figure 5, why is mirostat preferable to top-k or top-p? Why are $k,p$ of 0.4 and 1.0 and $\tau=1.8$ selected here?",6,5.0,ICLR2021 +9MvVOSfxl2,3,q3KSThy2GwB,q3KSThy2GwB,"iterative work on adding sparsity to RNNs, need more clarification","## Second Review +The author's thoughtful response has clarified most of the missing details in the paper. It is true that idea is interesting and theoretical analysis are promising. However, I still have issues understanding failure conditions. If the method is purely based on trial and error to determine optimal threshold for sparsity, then it requires many engineering tricks. Thus I would request authors to provide such details, such that young researchers can extend this work to create better training paradigm for RNNs. Other issue as pointed out by other reviewers is using only 3 runs to report results. I really appreciate explanation about difference between 2 variants of GRU and how it adds sparsity to the model. Additionally JAX code helped in understanding many key apsects of the work. 
I would encourage the authors to make it public and to provide key insights for training RNNs with snap. Nonetheless, the proposed method and theoretical analysis are insightful, and I believe this to be a first step towards building scalable RNNs that efficiently address the credit assignment issue. This paper also adds significant pedagogical value, which can benefit complex tasks such as grammatical inference. I'm increasing my score from 5 to 7 and I hope this paper is accepted.

## Summary
The paper introduces snap, which adds sparsity to the influence matrix extracted from RTRL and thereby acts as a practical approximation of RTRL. Snap is an extension of prior work on snap-1 used to train LSTMs, and the authors show that one can train dense as well as sparse RNNs using snap, achieving performance similar to BPTT on real as well as synthetic datasets. A few clarifications on how snap works and some key information about its parameters are missing.

## Clarification

How does one determine the level of sparsity required for a given task? At what n (sparsity ratio) is the optimal performance observed? It is well known that full RTRL (forward-mode propagation) helps compared with backward propagation, especially in the case of continual or online learning [Ororbia and Mali 2020] and the copy task (KF-RTRL, UORO). Does the current sparsity measure work with such variants of RTRL? And how do you determine the top k used to select the k elements that form the sparse matrix? Is it chosen arbitrarily, e.g., 70-80-99%? Many key details are missing, and the authors are requested to provide more information so that the model's flow can be better understood.

In the appendix the authors talk about the "modeling performance of the two variants of GRU. which has been shown to be largely the same, but the second variant is faster and results in sparser D_t and I_t". I am confused: what is the relationship between sparsity and the two variants? Please provide some numbers explaining how sparsity is increased by moving the reset gate after the matrix multiplication.

How is the sparsity measure introduced in this work? Does the model stay consistent when regularization approaches such as zoneout or dropout are introduced? Do you observe similar performance? Does the network roughly converge to similar performance with the optimal sparsity, or does the sparsity measure change as other regularization approaches are introduced? Did you do a grid search for the language modelling or copy task (besides the learning rate)? If so, please provide details. A citation and comparison are missing for Sparse Attentive Backtracking, which in theory can work with sparse networks and whose temporal credit assignment mechanism can help in introducing sparsity [Ke and Goyal 2018].

The authors state that "In order to induce sparsity, we generate a sparsity pattern uniformly at random and fix it throughout training". What is the range for the uniform random sampling? Is the model sensitive to the sparsity pattern being changed during training (say, every epoch or every k epochs)? How can one ensure that the sparsity pattern chosen at the start is the optimal one for a given network? Does a similar pattern work for GRUs, LSTMs, and vanilla RNNs, or does one need to adapt the scheme based on the architecture? (A small sketch of what I mean by a fixed random pattern versus a top-k pattern is included after my questions below.)

What is the advantage of snap-2 and snap-3 over snap-1? Snap-1 is similar to the (Hochreiter & Schmidhuber, 97) work on training LSTMs; what modification is introduced over snap-1 besides training it on GRUs and on sparse networks? It is still unclear what advantage these 3 variants add.
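To make the question about the sparsity pattern concrete, here is a small sketch of the two masking strategies I am asking about (a fixed random pattern sampled once versus a data-dependent top-k pattern). This is purely illustrative and not the authors' implementation; the matrix here is just a stand-in for the influence matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_params = 8, 32
influence = rng.normal(size=(n_hidden, n_params))  # stand-in for the RTRL influence matrix
sparsity = 0.9                                     # e.g. drop 90% of the entries

# (a) Fixed random pattern: sampled once, then reused at every training step.
fixed_mask = rng.random(influence.shape) > sparsity

# (b) Magnitude top-k: keep the k largest-magnitude entries, recomputed each step.
k = int(influence.size * (1 - sparsity))
thresh = np.sort(np.abs(influence), axis=None)[-k]
topk_mask = np.abs(influence) >= thresh

print(fixed_mask.mean(), topk_mask.mean())         # both keep roughly 10% of entries
sparse_influence = influence * fixed_mask          # what a sparse update would use
```

My question is essentially whether the reported results depend on which of these two regimes is used and on how the kept fraction is chosen.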
It is important to show speed (with various sparsity, convergence plots or else these variants would have similar performance and memory requirement compared with vanilla RTRL + + +[Ke and Goyal 2018] Ke, N.R., GOYAL, A.G.A.P., Bilaniuk, O., Binas, J., Mozer, M.C., Pal, C. and Bengio, Y., 2018. Sparse attentive backtracking: Temporal credit assignment through reminding. In Advances in neural information processing systems (pp. 7640-7651). + +[Ororbia and Mali 2020] Ororbia, A., Mali, A., Giles, C.L. and Kifer, D., 2020. Continual learning of recurrent neural networks by locally aligning distributed representations. IEEE Transactions on Neural Networks and Learning Systems",7,3.0,ICLR2021 +r1q6Vx9lM,2,BkXmYfbAZ,BkXmYfbAZ,Deep MTL through Soft Layer Ordering Review,"Summary: This paper proposes a different approach to deep multi-task learning using “soft ordering.” Multi-task learning encourages the sharing of learned representations across tasks, thus using less parameters and tasks help transfer useful knowledge across. Thus enabling the reuse of universally learned representations and reuse them by assembling them in novel ways for new unseen tasks. The idea of “soft ordering” enforces the idea that there shall not be a rigid structure for all the tasks, but a soft structure would make the models more generalizable and modular. + +The methods reviewed prior work which the authors refer to as “parallel order”, which assumed that subsequences of the feature hierarchy align across tasks and sharing between tasks occurs only at aligned depths whereas in this work the authors argue that this shouldn’t be the case. They authors then extend the approach to “permuted order” and finally present their proposed “soft ordering” approach. The authors argue that their proposed soft ordering approach increase the expressivity of the model while preserving the performance. + +The “soft ordering” approach simply enable task specific selection of layers, scaled with a learned scaling factor, to be combined in which order to result for the best performance for each task. The authors evaluate their approach on MNIST, UCI, Omniglot and CelebA datasets and compare their approach to “parallel ordering” and “permuted ordering” and show the performance gain. + +Positives: +- The paper is clearly written and easy to follow +- The idea is novel and impactful if its evaluated properly and consistently +- The authors did a great job summarizing prior work and motivating their approach + +Negatives: +- Multi-class classification problem is one incarnation of Multi-Task Learning, there are other problems where the tasks are different (classification and localization) or auxiliary (depth detection for navigation). CelebA dataset could have been a good platform for testing different tasks, attribute classification and landmark detection. 
(TODO) I would recommend that the authors test their approach in such a setting.
- Figure 6 is a bit confusing: the authors do not explain why the “Permuted Order” performs worse than the “Parallel Order”. Their assumptions and the results of this section should be consistent, i.e., soft order > permuted order > parallel order > single task.
(TODO) I would suggest that the authors follow up on this result, which would be beneficial for the reader.
- Figures 4(a) and 5(b) show results as validation loss; how about test error, similar to Figure 6(a)? How about results for the CelebA dataset? It could be useful to visualize them as was done for MNIST, Omniglot and UCI.
(TODO) I would suggest that the authors make the results consistent across all datasets and use the same metric, so that it is easy to compare.

Notation and Typos:
- Figure 2 is a bit confusing: why does the accuracy decrease with an increasing number of training samples? Please clarify.
1- If I assume that the Y-Axis is incorrectly labeled and it is actually Training Error, then the permuted order is doing worse than the parallel order.
2- If I assume that the X-Axis is incorrectly labeled and the numbering is reversed (starting from the max and ending at 0), then I think it would make sense.
- Figure 4 is very small and the text is not easy to read. Does single task mean average performance over the tasks?
- In eq. (3), choosing \sigma_i for a task-specific permutation of the network is a bit confusing, since it could be mistaken for a sigmoid function; I suggest using a different symbol.
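To make the soft-ordering computation concrete, here is a minimal sketch of how I understand the task-specific soft layer mixing; the shapes and names are my own illustration, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, T = 4, 16, 3          # depths / shared layers, hidden width, tasks

# Shared layer parameters, reused at every depth and by every task.
W = [rng.normal(scale=0.1, size=(H, H)) for _ in range(D)]

# Task-specific mixing weights S[t, d, l]: at depth d, task t combines the
# outputs of all D shared layers with softmax-normalized scaling factors.
logits = rng.normal(size=(T, D, D))
S = np.exp(logits) / np.exp(logits).sum(axis=2, keepdims=True)

def forward(x, task):
    h = x
    for d in range(D):
        outs = [np.tanh(W[l] @ h) for l in range(D)]        # apply every shared layer
        h = sum(S[task, d, l] * outs[l] for l in range(D))  # soft, task-specific mix
    return h

x = rng.normal(size=H)
print(forward(x, task=0)[:4])   # the same layers, mixed differently per task
print(forward(x, task=1)[:4])
```

A hard, task-specific permutation (the paper's \sigma_i) corresponds to S[task, d, :] being one-hot, which is exactly why the \sigma notation is easy to misread as a sigmoid.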
Conclusion: I would suggest that the authors address the concerns mentioned above. Their approach and idea is very interesting and relevant, and addressing these suggestions will make the paper strong for publication.",7,4.0,ICLR2018 +4pBc6e8_UNd,1,A-Sp6CR9-AA,A-Sp6CR9-AA,Review,"The paper presents a modification to conditional batch normalization, wherein an extra affine layer is introduced between the standardization and the conditional affine layer. The empirical results show the benefits of introducing this extra ""sandwich"" affine layer. + +The intuition behind the approach makes sense, but the formulation makes it difficult to see why SaBN is in fact beneficial compared to CCBN. It does not appear that the approach imposes any restrictions or regularization on the CCBN affine parameters; therefore the only difference between CCBN and SaBN seems to be a different parameterization. I would then expect both CCBN and SaBN to reach the same optimal training loss. It may be that the reparameterization provided by SaBN yields the optimization trajectories that lead to better-generalizing solutions, but it is not explained in the paper whether, or why, this happens. One way to probe this might be to reparameterize each gamma in BN or CCBN as $\gamma=\gamma_1 \gamma_2$ (or even $\gamma = \gamma_1^2$) and study the behavior of the resulting model. + +The paper proposes to measure the heterogeneity in the found representations (between different branches of CCBN) via the CAPV measure. In its definition, I could not find what the overbars signify but I assume it means the average over the channels. Also, the indexing of gammas from 0 to N should probably be from 1 to C, for consistency with Eq. (2). The definition of CAPV as the variance of gammas seems problematic, however, in ReLU models with batch normalization: models whose gammas differ by a constant factor represent the same function, so the variance of gammas can be arbitrarily changed without affecting the model. A more useful measure of heterogeneity would need to take this scale invariance into account. + +The paper shows several empirical studies, including one for architecture search -- with the main paper using DARTS, and the appendix using GDAS. This choice seems suboptimal, given that in the NAS-Bench-201 paper (https://openreview.net/pdf?id=HJxyZkBKDr), Table 5 seems to indicate that DARTS performs much worse than GDAS. In this work, the appendix devoted to GDAS seems to indicate that (1) there is no consistency on using or not using affine, between CIFAR-100 and Imagenet, and (2) GDAS-SaBN is not statistically significantly better than the better of GDAS and GDAS-affine. + +The results in this paper are encouraging, but I believe the paper needs to explain more clearly why SaBN is expected to work (given that it preserves the hypothesis class as well as the minimum of the training objective). 
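To illustrate the point about the hypothesis class with a quick numeric check (my own sketch with generic per-channel parameters, not the paper's exact formulation): stacking the extra affine layer on the standardized activations and then applying the conditional affine collapses into a single affine transform.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 8                            # number of channels
x_hat = rng.normal(size=C)       # standardized activations after BN

g1, b1 = rng.normal(size=C), rng.normal(size=C)   # extra (sandwich) affine
g2, b2 = rng.normal(size=C), rng.normal(size=C)   # class-conditional affine
sandwich = g2 * (g1 * x_hat + b1) + b2

g, b = g1 * g2, g2 * b1 + b2     # equivalent single conditional affine
single = g * x_hat + b

print(np.allclose(sandwich, single))   # True: same function class as CCBN
```

So any gap between SaBN and CCBN has to come from how this reparameterization changes optimization, not from added expressivity.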
Additionally, since it appears to amount to a per-dimension reparameterization, the reader might expect that some of the other reparameterizations could have a similar effect (including such simple interventions as changing the learning rates for some of the gamma or beta parameters), and compellingly demonstrating that the specific reparameterization given by SaBN outperforms such alternatives would make the paper stronger.",3,4.0,ICLR2021 +ryx1A71lqB,2,H1loF2NFwr,H1loF2NFwr,Official Blind Review #1,"This paper studies an important problem, evaluating the performance of existing neural architecture search algorithms against a random sampling algorithm fairly. + +Neural architecture search usually involves two phases: model search and model tuning. In the search phase, best architectures after limited training are selected. In model tuning, the selected architectures are trained fully. However, it has been noticed that best architectures after limited training may not translate to globally best architectures. Although previous research has tried comparing to random sampling, such as Liu et al. 2019b, but the random architectures were not trained fully. The authors train random architectures fully before selecting the best one, which turns out to perform as well or better than the sophisticated neural architecture search methods. The paper also identifies that parameter sharing turns out to be a major reason why the sophisticated NAS methods do not really work well. + +The insights are obviously important and valuable. The insight on parameter sharing is even a bit disheartening. Parameter sharing is the main reason why NAS can scale to very large domains. Without it, is NAS still practical or useful? On the other hand, it is a bit unsatisfactory that the paper does not provide or even suggest solutions to remedy the identified issues. + +Another comment is it is a stretch to consider the evaluation done in the paper a new framework. It is simply a new baseline plus a new experiment design. + +About Equation (1) in Appendix A.2, it seems to simplify to p=(r/r_max)^n. Is the formula correct?",6,,ICLR2020 +BygDCwcstH,2,HkeO104tPB,HkeO104tPB,Official Blind Review #2,"The paper tackles the problem of self-supervised reinforcement learning through the lens of goal-conditioned RL, which is in line with recent work (Nair et al, Wade-Farley et al, Florensa et al, Yu et al.). The proposed approach is a simple one - it uses the relabeling trick from (Kaelbling, 1993; Andrychowicz et al., 2017) to assign binary rewards to the collected trajectories. They apply two simple tricks on top of relabeling: + +1. Reward balancing: Balancing the number of 0/1 rewards used for training the policy. +2. Reward filtering: A heuristic that rejects certain negative-reward transitions for learning if the q value for the transition is greater than some threshold q_0. + +While I like the simplicity of the proposed approach when compared to competing methods, my overall recommendation is reject based on the current state of the paper, because of the following reasons: + +1. The technical novelty is quite limited - the paper mostly uses the framework from Andrychowicz et al., 2017 with a specific choice of epsilon (=0) for giving positive rewards. The method does reward balancing, but similar goal-sampling have been used in prior work like Nair et al., 2018, and is not essential to obtaining good results (Appendix C). 
The main technically novel component is likely the reward filtering mechanism, but I find it to be somewhat ad-hoc since it assumes that the Q-values learned by the Q-network to be reasonably good during training time, which is not the case for most modern Q-learning based methods [1, 2]. +2. The provided analysis is not particularly illuminating, see my detailed notes below. +3. The experiments are underwhelming, see my detailed notes below. + +I would be willing to overlook 1 or 2 if the authors did a more thorough experimental evaluation which showed the method working well when compared to alternatives, but that is not the case right now. + +Note on Section 6 (analysis) +The authors provide a simple analysis in Section 6 to bound the suboptimality of the learned policy. Unless I’m missing something, the resulting bound of t_3 <= t_1 + d is trivially true, since d is defined to be the diameter of O_{+}(o_g), and t_1 is the number of timesteps taken by an optimal policy to go from o_t to O_{+}(o_g). As a result, I don’t find this analysis illuminating or interesting - perhaps the author can provide counter arguments here to change my mind. + +Note on Section 7 (experiments) +For sim experiments, two of the tasks are extremely simple (free space reaching in 2D and 3D, respectively) where essentially everything works - the proposed method, baselines and ablations. The third task of rope manipulation is fairly interesting at a first look - but it appears to have been greatly simplified. The authors consider a high-level action space and an episode of only three timesteps. Further, the authors make the simulated robot arm invisible in the supplementary videos, which greatly simplifies the problem visually. Since the entire motivation is about learning well in real world settings, I feel this is a bit underwhelming. Figure 1 is misleading, since it shows a visible robot arm in front of a rope. This also appears to hint that the method did not work well with realistic visuals, highlighting a major limitation of the proposed approach. I think it would be valuable to include such failures (and discussions around them) in future submissions. + +For the real world experiments, the task being considered is extremely simple (free space reaching), and does not even require pixel observations. Even for this simple task, the error achieved by the method is 10cm (starting error was 20cm), which is quite poor - robotic arms like the Sawyer should be able to achieve much lower errors. Even the oracle reward achieves an error of 10cm, which might indicate a bug in the author’s real world robotic setup. In comparison, prior work such as Nair et al. is able to tackle harder problems in the real world (like non-prehensile pushing). + +Minor points +- Section 4 contains a nice discussion on false positives and false negatives when using non-oracle reward functions for reinforcement learning, where they also perform a simple experiment to show how false positives can negatively impact learning much more severely than false negatives. This does a good job of motivating the method (i.e. avoiding false positives), but also undermines the motivation behind reward filtering, which is perhaps the main technically novel component of the proposed approach. +- Section 2.3 (i.e. 
related work on deformable object manipulation) states that ""Our approach applies directly to high-dimensional observations of the deformable object and does not require a prior model of the object being manipulated.”, and only cites prior work that assumes access to deformable object models. However, there is recent work that enable similar manipulation skills without access to such models. For example, Singh et al. [3] are able to learn to manipulate a deformable object (i.e. a piece of cloth) directly from high-dimensional observations using deep RL in the real world, and do not require object models (or ground truth state), but do require other forms of sparse supervision. +- Typo: On page 7, “is kept the same a other approaches.” -> “is kept the same as other approaches.” + +[1]: Diagnosing Bottlenecks in Deep Q-learning Algorithms. Fu et al., ICML 2019 +[2]: Double Q-learning. V. Hasselt. NIPS 2010 +[3]: End-to-End Robotic Reinforcement Learning without Reward Engineering. Singh et al., RSS 2019. +",1,,ICLR2020 +H1ejMTzTtB,2,BkxFi2VYvS,BkxFi2VYvS,Official Blind Review #2,"- This paper proposes a semi-supervised learning strategy for semantic segmentation of road scenes. Specifically, authors propose to include an auxiliary network that will predict the confidence (at pixel-level) of the predictions on unlabeled images. These confidence values will be used to generate a new auxiliary ground-truth to retrain the network using the unlabeled images. +- Even though the idea is somehow interesting and results seem to improve with respect to the baselines, this paper is very similar to standard semi-supervised learning approaches for natural images that employ image proposals (i.e., EM-based methods). In those works, unlabeled images are segmented with the network trained on labeled images, generating some proposals. These proposals are later employed as a fake ground truth to re-train the network employing both labeled and unlabeled images. The only difference in this work is to employ the virtual confidence map to mask-out some pixels (those with lowest confidence values). +- Related work section is extremely weak. Authors merely mention few papers (some other relevant papers are missing), and throw a sentence for each one, without making connections between works. This makes difficult to place their work among the literature (e.g., which limitations of previous approaches the current method intend to address?). Authors should significantly improve this section. +- I am not sure about the fact that employing only the probability maps as input to generate a confidence map is reliable, if no other information is employed. These predictions (e.g., confidence map) will be based only on the probabilities obtained by the first network. While this is already a good indicator of the confidence of the network to make those predictions, I believe that input images should also be included. The intuition behind this is that there may exist some regions with similar probabilities (from first network), which are incorrectly classified (leading to 0-masked pixels on the auxiliary ground-truth) in some cases, while correctly classified in other situations (leading to 1-masked pixels on the auxiliary ground-truth). +- Eq.1) is basically the standard cross-entropy loss, with the difference of the weighting terms. +- a_{h,w} is the softmax of the auxiliary network, isn’t it? +- Further, authors threshold the values of the confidence map to generate the new auxiliary ground truth. 
Why not to use the raw values so that each pixel is weighted differently according to its importance? +- Authors make some claims which were never demonstrated. For example, they mention that the proposed approach performs better on small targets than previous approaches. Nevertheless, only mean results (over all the classes) are shown. To this end authors should report per-class performances, instead of the mean. +- Furthermore, authors make several over claims, misleading information. For example, they mentioned that they proposed a highly efficient segmentation method. Nevertheless, from Table 2 it can be observed that the proposed method ranks in the middle in terms of both speed and parameters, compared to other state-of-the-art models. Similarly, authors mention that their model is equipped with a carefully designed auxiliary loss function during training, while they basically employ a standard cross-entropy weighted by some values to account for imbalance between classes and between positive and negative pixels within the same class. +- The paper contains many grammatical errors. + +",3,,ICLR2020 +k0xNV0z_iUi,3,3tFAs5E-Pe,3tFAs5E-Pe,"Neural based method for continuous Wasserstein-2 barycenters, with good performance","This work introduces a new Wasserstein-2 barycenter computation method. The authors first derive the dual formulation of the Wasserstein-2 barycenter problem, and then parametrize the convex potentials by ICNNs. The congruent and conjugacy conditions are enforced by regularization terms, respectively. They then show that the algorithm can find a good barycenter if the objective function is properly minimized. + +Pros: +1. The algorithm does not introduce bias. +2. The algorithm does not require minimax, which is efficient. +3. The empirical performance is much better than existing methods, probably due to the above two reasons. + +Areas to improve: +1. It is good that the empirical analysis include how the performance change w.r.t. D. It would be better if there is a similar analysis to N. Furthermore, since 2N ICNNs are needed to be trained, it would be better if the training time is also reported, so that we can have a more comprehensive understanding of the method. Will there be a setting that discrete method can be faster than the proposed method to enforce comparable approximation error (say, large N for 3D applications)? +2. Since the congruent and conjugacy conditions are enforced by regularizations, they are not guaranteed to be satisfied. Therefore, it would be better if there is an experiment showing that how the conditions are satisfied. +3. The first section of related work should also briefly include https://arxiv.org/abs/1605.08527 and https://arxiv.org/abs/1905.00158. + +After rebuttal: + +The additional experiment results provided in the rebuttal stage suggests the efficiency of the proposed method, as well as the congruent and conjugacy conditions are approximately satisfied. I therefore believe this paper should be accepted.",6,4.0,ICLR2021 +DfFPE_dTKsE,1,j1RMMKeP2gR,j1RMMKeP2gR,This paper is a solid study of the problem with lots of potential research directions for the future work,"The paper presents theoretical analysis of MDPs with execution delay together with an algorithm that achieves better performance on the task than the baselines. The main theoretical result highlights the need for non-stationary Markov policies that is different from standard MDPs that can be solved using stationary Markov policies. 
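As I understand the Delayed-Q idea (keep the m pending actions in a queue and choose each new action greedily for the state predicted m steps ahead), the control loop would look roughly like the sketch below. The environment and model interfaces here are hypothetical, and this is my own illustration rather than the authors' pseudocode.

```python
from collections import deque

def delayed_control_loop(env, q_values, forward_model, m, n_steps):
    # q_values(s) -> array of action values; forward_model(s, a) -> next state.
    # Both are assumed to be learned elsewhere; the only point here is how the
    # m-step execution delay is handled.
    s = env.reset()
    pending = deque(env.default_action() for _ in range(m))  # actions already in flight
    for _ in range(n_steps):
        s_pred = s
        for a in pending:                       # roll forward through queued actions
            s_pred = forward_model(s_pred, a)
        a_new = int(q_values(s_pred).argmax())  # act for the predicted future state
        pending.append(a_new)
        s, _, done = env.step(pending.popleft())  # env executes the action chosen m steps ago
        if done:
            s = env.reset()
            pending = deque(env.default_action() for _ in range(m))
```

With this picture, the sensitivity to the accuracy of the forward prediction (and hence to the stochasticity of the environment) becomes the natural thing to quantify.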
+ +*Quality* +The authors conducted a solid theoretical study of MDPs with the execution delay. The presented claims showcase why the existing approach based on augmenting the state space is not feasible for large delays. The suggested algorithm is a based on a simple idea to estimate the state of MDP m steps in the future, but it seems to work quite well when the MDP is not too stochastic. Overall, this paper is a solid study of the problem with lots of potential research directions for the future work. + +*Clarity* +The paper is well-written in general. + +*Originality* +To my best knowledge, the results are new and the need for non-stationary policies is a novel highlight. + +*Significance* +rather significant, execution delay is a common issue in practice and the paper lays foundations for analysis of MDPs with execution delay. + +Pros +* Theoretical analysis of ED-MDPs that guides the presented algorithm +* Great results on Tabular Maze and Physical domain problems + +Cons +* No analysis on how the stochasticity of environment affects the performance of Delayed-Q +* Atari results use the simulator to predict the future state",8,4.0,ICLR2021 +mHSjWDP76Xn,1,ol_xwLR2uWD,ol_xwLR2uWD,Lack of novelty,"This paper proposes to use orthogonal weight constraints for autoencoders. The authors demonstrate that under orthogonal weights (hence invertible), more features could be extracted. The theory is conducted under linear cases while the authors claim it can be applied to more complicated scenarios such as higher dimension and with nonlinearity. The experiments demonstrate the performance of proposed model on classification tasks and generative tasks. Several baselines are compared. + +The paper is poorly written. It is full of inconsistent and irrelevant claims. The method is not clarified. All experiments are in low quality. + ++ves: + ++ This paper discusses its connection to several of topics such as mutual information, greedy learning, SVD etc. + + +Concerns: + +- No novelty. Using orthogonal weight regularization has been widely studied. + +- The theory does not apply to higher dimensional cases or nonlinear cases. The discussion seems trivial. There are no connection of corresponding theory and the model in the experiments. + +- Throughout the paper, the ""pretraining"" process is not clarified. + +- The authors claims applying their method to GAN but I don't see how they combine their model with it. + +- The classification and generation experiment results are not convincing. In Figure 6, 7, the difference are in range of error bar. In Figure 8, 9, there is no advantage from proposed method. + +=====POST-REBUTTAL COMMENTS======== + +I would like to thank the authors for their response. The authors have clarified the method in their response. I appreciate all the experimental details in the appendix. I tend to agree this is a promising idea and worth explored. + +However, this paper clearly cannot be accepted in its current form. The paper is poorly organized and poorly written. The method in the paper needs to be clarified. Their theory goes nowhere and proves nothing. In terms of the experimental results, the authors choose unclear baselines (which they claim state-of-the-art) and report improvement in terms of percentage increase (percentage over percentage). None of this is convincing to me. + +I slightly increased my score. 
+ + + +",4,4.0,ICLR2021 +SkglAno0YS,2,B1xwcyHFDr,B1xwcyHFDr,Official Blind Review #1,"In this paper, the authors extend the Information Bottleneck method (to build robust representations by removing information unrelated to the target labels) to the unsupervised setting. Since label information is not available in this setting, the authors leverage multi-view information (e.g., using two images of the same object) , which requires assuming that both views contain all necessary information for the subsequent label prediction task. The representation should then focus on capturing the information shared by both views and discarding the rest. A loss function for learning such representations is proposed. The effectiveness of the proposed technique is confirmed on two datasets. It is also shown to work when doing data augmentation with a single view. + +Overall the paper is well motivated, well placed in the literature and well written. Mathematical derivations are provided. Experimental methodology follows the existing literature, seem reasonable and results are convincing. I do not have major negative comments for the authors. This is however not my research area and have only a limited knowledge of the existing body of work. + +Comments/Questions: +- How limiting is the multi-view assumption? Are there well-known cases where it doesn't hold? I feel it would be hard to use, say, with text. Has this been discussed in the literature? Some pointers or discussion would be interesting. +- Sketchy dataset: Could the DSH algorithm (one of the best prior results) be penalized by not using the same feature extractor you used? +- Sketchy dataset: Can a reference for the {Siamese,Triplet}-AlexNet results be provided? +- Sketchy dataset: for reproducibility, what is the selected \beta? +- I find it very hard to believe that the accuracy stays constant no matter the number of examples per label used. How can an encoder be trained on 10 images? Did I misunderstand the meaning of this number? Can this be clarified? +- Again for reproducibility, listing the raw numbers for the MNIST experiments would be nice. +- If I understood the experiments correctly, ""scarce label regime"" is used for both the MIR-Flickr and MNIST datasets, meaning two different things (number of labels per example vs number of examples per label), which is slightly confusing. + +Typos: +Page 1: it's -> its +Page 6: the the -> the +Page 7: classifer -> classifier +Page 8: independently -> independent +",8,,ICLR2020 +MZjm7AIw6dZ,1,5mhViEOQxaV,5mhViEOQxaV,The proposed technique does not match with the announced contribution.,"This paper proposes a novel controllable Pareto multi-task learning framework, which aims to learn the whole Pareto optimal front for all tasks with a single model. The motivation is straight forward and the proposed method is inspiring. However, the proposed technique does not match with the announced contribution. +1. This paper announces that it proposes a novel Pareto solution generator that can learn the whole Pareto front for MTL. However, actually, this paper adopts fixed shared parameters (pretrained feature extractor) and only optimizes a part of parameters of a MTL, which degrade the solution space and the solutions may not be the Pareto solutions. The Pareto front generated by such method may not be the real Pareto front for MTL. +2. Using fixed shared parameters conflicts with the essence Pareto MTL. 
In MTL, the tasks mutually regularized by the feature extractor, which improve the generalization ability of each task. + +This paper is globally well organized and clearly written. However, some important details are missing. +1. The details about the hypernetwork are unclear. +2. The paper lacks of analysis on the experimental result. +3. Some notations are not clear, e.g., does the loss used in the paper denotes empirical loss?",4,5.0,ICLR2021 +KgWY2lcpI9I,4,EsA9Nr9JHvy,EsA9Nr9JHvy,An Insightful Paper to be accpeted,"This paper gives a theoretical study of the tail behavior of the SGD in a quadratic optimization problem and explores its relationship with the curvature, step size and batch size. To prove their results, the authors approximate the SGD recursion by a linear stochastic recursion and analyze the statistical properties by the tools from implicit renew theory. Under this setting, they show that the law of the SGD iterates converge to a heavy-tailed stationary distribution depending on the Hessian structure of the loss function at the minimum and choices of the step size and batch size. They take a further step to clarify the relationship and study the moment bounds and convergence rate. + +Overall, I vote for accepting. I think the study of heavy tail phenomenon in SGD is quite a novel field and full of interest. +1. This paper is the first one to study the originating cause of heavy tail phenomenon in SGD and give a rigorous proof of the relationship between tail index and the choice of step size and batch size. This helps a better understanding of the generalization properties of SGD. +2. The paper provides comprehensive experiments to support their result. Experiments are conducted on both synthetic data and neural networks. The design of experiments is reasonable, and the results of the experiment not only support the claim that the tail index is deeply linked to the curvature of the loss and the ratio of step size and batch size but also give an insight on more general cases besides the quadratic optimization. + +However, I am still concerned about the setting in the theoretical frame. This paper completes its proof under the settings of quadratic optimization and infinite streaming data, which may limit the applicability of the theoretical result. Although these issues have been discussed in this paper, whether or why this extension is feasible remains questionable. +",7,4.0,ICLR2021 +rkgzpVtf3X,1,HygTE309t7,HygTE309t7,Interesting promising solution to outlier detection; application of proposed scheme to general outlier detection seems limited,"Pros +---- + +[Originality/Clarity] +The manuscript presents a novel technique for outlier detection in a supervised learning setting where something is considered an outlier if it is not a member of any of the ""known"" classes in the supervised learning problem at hand. The proposed solution builds upon an existing technique (deep neural forests). The authors clearly explain the enhancements proposed and the manuscript is quite easy to follow. + +[Clarity/Significance] +The enhancements proposed are empirically evaluated in a manner that clearly shows the impact of the proposed schemes over the existing technique. For the data sets considered, the proposed schemes have demonstrated significant improvements for this scoped version of outlier detection. 
+ +[Significance] +The proposed scheme for improving the performance of the ensemble of the neural decision trees could be of independent interest in the supervised learning setting. + +Limitations +----------- + +[Significance] +Based on my familiarity with the traditional literature on outlier detection in an unsupervised setting, it would be helpful for me to have some motivation for this problem of outlier detection in a supervised setting. For example, the authors mention that this outlier detection problem might allow us to identify images which are incorrectly labelled as one of the ""known"" classes even though the image is not a true member of any of the known classes, and might subsequently require (manual) inspection. However, if this technique would actually be used in such a scenario, the parameters of the empirical evaluation, such as a threshold for outliers that considers 5000 images as outliers, seem unreasonable. Usually number of outliers (intended for manual inspection) are fairly low. Empirical evaluations with a smaller number of outliers is more meaningful and representative of a real application in my opinion. + +[Significance] +Another somewhat related question I have is the applicability of this proposed outlier detection scheme in the unsupervised scheme where there are no labels and no classification task in the first place. Is the proposed scheme narrowly scoped to the supervised setting? + +[Comments on empirical evaluations] +- While the proposed schemes of novel inlier-ness score (weighted sum vs. max route), novel regularization scheme and ensemble of less correlated neural decision trees are extremely interesting and do show great improvements over the considered existing schemes, it is not clear to me why the use of something like Isolation Forest (or other more traditional unsupervised outlier detection schemes such as nearest/farthest neighbour based) on the learned representations just before the softmax is not sufficient. This way, the classification performance of the network remains the same and the outlier detection is performed on the learned features (since the learned features are assumed to be a better representation of the images than the raw image features). The current results do not completely convince me that the proposed involved scheme is absolutely necessary for the considered task of outlier detection in a supervised setting. +- [minor] Along these lines, considering existing simple baselines such as auto-encoder based outlier detection should be considered to demonstrate the true utility of the proposed scheme. Reconstruction error is a fairly useful notion of outlier-ness. I acknowledge that I have considered the authors' argument that auto-encoders were formulated for dimensionality reduction. + +[Minor questions] +- In Equation 10, it is not clear to me why (x,y) \in \mathcal{T}. I thought \mathcal{T} is the set of trees and (x,y) was the sample-label pair. +- It would be good understand if this proposed scheme is limited to the multiclass classification problem or is it also applicable to the multilabel classification problem (where each sample can have multiple labels). +",5,4.0,ICLR2019 +HydZAdHgG,1,S1LXVnxRb,S1LXVnxRb,Review,"SUMMARY. + +The paper presents a cross-corpus approach for relation extraction from text. +The main idea is complementing small training data for relation extraction with training data with different relation types. 
+The model is also connected with multitask learning approaches where the encoder for the input is the same but the output layer is different for each task. In this work, the output/softmax layer is different for each data type, while the encoder is shared. +The authors tried two different sentence encoders (cnn-based and tree-lstm), and final results are calculated on the low resource dataset. + +Experimental results show that the tree-rnn encoder is able to capture valuable information from auxiliary data, while the cnn based does not. + +---------- + +OVERALL JUDGMENT +The paper shows an interesting approach to data augmentation with data of different type for relation extraction. +I would have appreciated a section where the authors explain briefly what relation extraction is maybe with an example. +The paper is overall clear, although the experimental section has to be improved I believe. +From section 5.2 I am not able to understand the experimental setting the authors used, is it 10-fold CV? Did the authors tune the hyperparameters for each fold? +Are the results in table 3 obtained with tree-lstm? +What kind of ensembling did the authors chose for those experiments? +The author overstates that their model outperforms the state-of-the-art models they compare to, but that is not true for the EU-ADR dataset where in 2 out of 3 relation types the proposed model performs on par with the state-of-the-art model. +Finally, the authors used only one auxiliary dataset at the time, it would be interesting to see whether using all the auxiliary dataset together would improve results even more. + +I would suggest the author also to check and revise citations (CNN's are not Collobert et al. invention, the same thing for the maximum likelihood objective) and more in general to improve the reference on relation extraction literature.",4,4.0,ICLR2018 +S1eAQNnPnm,1,B1xf9jAqFQ,B1xf9jAqFQ,The paper presents a new speed reading model by combined several existing ideas. The idea is novel and the results are good.,"The paper presents a novel model for neural speed reading. In this new model, the authors combined several existing ideas in a nice way, namely, the new reader has the ability to skip a word or to jump a sequence of words at once. The reward of the reader is mixed of the final prediction correctness and the amount of text been skipped. The problem is formulated as a reinforcement learning problem. The results compared with the existing techniques on several benchmark datasets show consistently good improvements. + +In my view, one important (also a little surprising) finding of the paper is that the reader can make jump choices successfully with the help of punctuations. And, blindly jumping a sequence of words without even lightly read them can still make very good predictions. + +The basic idea of the paper, the concepts of skip and jump, and the reinforcement learning formulation are not completely new, but the paper combined them in an effective way. The results show good improvements majorly in FLOPS. + +The way of defining state, rewards and value function are not very clear to me. Two value estimates are defined separately for the skip agent and the jump agent. Why not define a common value function for a shared state? Two values will double count the rewards from reading. Also, the state of the jump agent may not capture all available information. For example, how many words until the end of the sentence if you make a jump. Will this make the problem not a MDP? 
+ +Overall, this is a good paper. + +I read the authors' response. The paper should in its final version add the precise explanation of how the two states interact and how a joint state definition differs from the current one.",7,4.0,ICLR2019 +S1eLOUDJn7,1,HyzMyhCcK7,HyzMyhCcK7,Interesting idea but novelty may not be enough,"After the rebuttal: + +1. Still, the novelty is limited. The authors want to tell a more motivated storyline from Nestrove-dual-average, but that does not contribute to the novelty of this paper. The real difference to the existing works is ""using soft instead of hard constraint"" for BNN. + +2. The convergence is a decoration. It is easy to be obtained from existing convergence proof of proximal gradient algorithms, e.g. [accelerated proximal gradient methods for nonconvex programming. NIPS. 2015]. + +--------------------------- +This paper proposes solving binary nets and it variants using proximal gradient descent. To motivate their method, authors connect lazy projected SGD with straight-through estimator. The connection looks interesting and the paper is well presented. However, the novelty of the submission is limited. + +1. My main concern is on the novelty of this paper. While authors find a good story for their method, for example, +- A Proximal Block Coordinate Descent Algorithm for Deep Neural Network Training +- Training Ternary Neural Networks with Exact Proximal Operator +- Loss-aware Binarization of Deep Networks + +All above papers are not mentioned in the submission. Thus, from my perspective, the real novelty of this paper is to replace the hard constraint with a soft (penalized) one (section 3.2). + +2. Could authors perform experiments with ImageNet? + +3. Could authors show the impact of lambda_t on the final performance? e.g., lambda_t = sqrt(t) lambda, lambda_t = sqrt(t^2 lambda",5,4.0,ICLR2019 +J_RcevSrOQj,4,punMXQEsPr0,punMXQEsPr0,A robust and effective pre-training strategy for document understanding and is independent of optimal order information but lacks of some details,"> Summary: + +The paper studies the problem of large-scale pre-training for semi-structured documents. It proposes a new pre-training strategy called BERT relying on Spatiality (BROS) with area-masking and utilizes a graph-based decoder to capture the semantic relation between text blocks to alleviate the serialization problem of LayoutLM. + +It points out that LayoutLM fails to fully utilize spatial information of text blocks and will face difficulties when text blocks cannot be easily serialized. + +The three drawbacks of LayoutLM are listed: +* X-axis and Y-axis are treated individually with point-specific embedding +* Pre-training is identical to BERT so does not consider spatial relations between text blocks +* Suffer from the serialization problem + +The proposed three corresponding methods of BROS are: +* Continuous 2D positional encoding +* Area-masking pre-training on 2D language modeling +* Graph-based decoder for solving EE & EL tasks + +> Strength: + +* The paper makes incremental advances over past work (LayoutLM) and the proposed BROS models achieves SOTA performance on four EE/EL datasets (i.e., FUNSD, SORIE*, CORD, and SciTSR) + +* The paper is generally easy to follow and could be better if provide more important details in Section 3.2 & 3.3 + +* The experiment and discussion for Section 5.3 are quite convincing. 
BROS could achieve robust and consistent performances across all the four permuted version datasets, which demonstrates that BROS is adaptive to documents from the practical scenarios. + +> Major concerns: + +- For Section 3.2, the author didn’t even provide the pre-training objective for the area-masked language model. I think the author should include the objective and define the exponential distribution explicitly. + +* I’m confused about why the performance difference in Table 4 between original LayoutLM and BROS is larger than that in Table 1. In the original LayoutLM, the 2D position encoding method is tied with token embedding. This applies to both Table 1 and Table 4. However, in Table 4 the performance difference on FUNSD EE is 42.46 vs 74.44, while in Table 1 the performance comparison is 78.89 vs 81.21. Could the author give some explanations on this? + +- In Table 4, it is better for the author to clearly indicate each ablation’s components. The author should also give the performance of the original LayoutLM and performances after gradually adding each new component to the original LayoutLM: such as original LayoutLM + Sinusoid & Linear, original LayoutLM + Sinusoid & Linear + untied with token embedding, etc. + +* For encoder design in Eq. (2), the second term is used to model the spatial dependency given the source semantic representation. How about adding a symmetric term to model the semantic dependency given the source spatial representation. My question is simply that why not adopt a symmetric design for Eq. (2)? + +* Can the author give the reason behind the design of $\mathbf{t}^{ntc}$ in Eq.(4)? I’m not so clear about the function of $\mathbf{t}^{ntc}$. Does the $\mathbf{f}_{ntc}$ output a distribution of the probability to be the next token over all N tokens? + +* Could the author give a detailed analysis on which strategy contributes the most to BROS’ robustness against permuted order information? From the results of Table 4, it is not the SPADE decoder and the most important factor seems to be calculating semantic and position attentions separately, i.e., untied with token embedding and explicitly modeling semantic/position relations between text blocks. Do the authors agree with my conjecture? + +> Minor concerns: + +* Although SPADE is not the most important component of BROS, I believe including details of the grade-based decoder will help the readers to understand the model much better. + +* I’m curious about the performance of SpanBERT on the four datasets since the author said that area-masking of BROS is inspired by SpanBERT. + +* In Table 3, the value of LayoutLM - FUNSD should be 78.89 since all other p-FUNSD & FUNSD related values are consistent with Table 1 & 2.",6,4.0,ICLR2021 +YdiWfQ9Iex,1,ZTFeSBIX9C,ZTFeSBIX9C,Official Blind Review #4 ,"In *non-autoregressive* neural machine translation (NMT), learning from the predictions of *autoregressive* teacher models through sequence-level knowledge distillation (Kim and Rush, 2016) has been an important step to improve the performance of the non-autoregressive student models. Despite the success and prevalence of this knowledge distillation (KD) technique, this paper hypothesises---and empirically validates---that this KD procedure has a detrimental side-effect of propagating **lexical choice errors** that the autoregressive teacher model makes by mistranslating **low-frequency words**, into the non-autoregressive student model. 
+ +To overcome this issue, this paper proposes a way to incorporate lexical choice **prior knowledge** from the *raw parallel text* (as opposed to the autoregressive teacher's output that may propagate lexical choice errors for low-frequency words). More concretely, this work specifies two prior distributions: (i) a word alignment distribution that specifies a list of plausible target words for each token in the source sentence, as obtained from automatic word alignment tools, and (ii) a target distribution from an identical non-autoregressive teacher model trained on the raw data (i.e. *self-distillation* or born-again network), which does not contain the same lexical choice error propagation that the autoregressive teacher's model output has. The student model is then trained to minimise KL divergence with respect to these prior distributions, in conjunction with the standard sequence-knowledge distillation loss that learns from the autoregressive teacher model's output, as determined by an interpolation coefficient with logarithmic decay (Eq. 6). + +**Pros:** +1. This paper does a great job of motivating the problem of lexical choice error propagation on low-frequency words from the autoregressive teacher to the non-autoregressive student. The paper clearly states its hypothesis, provides a nice motivating example, and runs extensive preliminary experiments that convincingly confirm the existence of the lexical choice problem for low-frequency words. + +2. The proposed prior knowledge approach is simple to implement, and yet provides decent improvements across all four datasets. The improvements are also consistent across different autoregressive teacher model sizes and language pairs, and the two kinds of prior knowledge can be combined to produce stronger results. + +3. The paper features a fairly comprehensive analysis (including in preliminary experiments) and ablation studies that help better understand where exactly the improvements are coming from. + +**Cons:** +Despite the positive aspects above, I still have two major concerns regarding the paper: +1. In Eq. 6 (page 5), the interpolation rate $\lambda$ controls how much weight is assigned to learning from the prior knowledge vs from the autoregressive teacher. But the proposed logarithmic decay function does not make sense to me. Let $i$ be the number of steps. At the very beginning of training, $i=0$, so based on Eq. 6, $\lambda = 1$. This makes sense since, at the beginning of training, the model only learns from the prior knowledge. But according to Eq. 6, $\lambda$ will actually **get larger** as training progresses (up to $i \leq I/2$). In the case where $i=I/2 - 1$, Eq. 6 will translate into $\lambda = 2$. This does not make sense for three reasons. First, $\lambda$ is an interpolation coefficient that therefore should be between 0 and 1. Second, $\lambda=2$ means that the interpolation weight assigned to distilling the autoregressive teacher is -1. Third, based on the description, $\lambda$ is designed to get smaller as more training steps are done, instead of getting larger. + +2. My second concern is that there is a much simpler way of injecting the prior knowledge. For instance, what if we simply provide a decaying learning schedule (i.e. a curriculum) where, at the beginning of training, the non-autoregressive student is trained to learn more from the *raw dataset*, while at the later stages of training, the non-autoregressive student is trained to learn more from the *autoregressive teacher's output*? 
This can potentially accomplish the exact same goal of learning the prior knowledge from the raw dataset first, and then move on to learn more from the teacher model's output. This simpler baseline should at least be compared against. + +3. This is a minor concern, but there are some grammatical mistakes and presentational suggestions in the paper that can be modified to improve clarity, as detailed below. + +**Question** +1. In page 4, it is mentioned that: ""The chosen procedure [to get the ""gold"" lexical choice for each word] is as follows: ..."". How accurate is this procedure? Did you examine the output and double check whether the ""gold"" lexical choice corresponds well to human judgment or a dictionary entry for each word? + +**Presentation/Typos/Grammatical Errors:** +1. In page 3, section 2.2, paragraph ""Datasets."", ""... to avoid unknown *works*..."" -> ""words"". +2. In page 4, just under Eq. 3, it is mentioned that ""$V$ is the source side vocabulary in test set"". If I understand correctly, $f$ is a token on the source sentence, so shouldn't $\mathbf{V}^{test}_{src}$ be the list of *word tokens* (rather than source side vocabulary) on the source sentence? +3. In page 5, ""Through fine-grained *analyzing*"" -> ""analysis"". +4. In page 5, ""...becomes significantly accurate (AoLC).."" -> ""significantly more accurate"". +5. In page 6, ""...signal of the *priority* information"" -> ""prior"". +6. In Table 3, I would suggest displaying the AoLC performance *just on rare words* (rather than overall AoLC), since that is the problem that the paper is trying to solve. +7. In Table 5, mention what the column ""Iter."" means. +8. In page 7, ""... It is *worthy* noting ..."" -> ""worth"". +9. In page 7, ""... by substantially *increase* the lexical choice accuracy ..."" -> ""increasing"". +10. In section 4.3, I suggest saying a bit more about how the human judgment is collected. +11. In page 8, ""For AT models, ..."" -> remove ""For"". + +-----Update after the authors' response----- +Thank you for the detailed authors' response, and for meticulously taking the feedback into account. The response has addressed most of my concerns. + +Some further comments: + +1. In Eq. 6, I think ""up to $i \leq I/2$"" should be replaced with ""up to $i \leq (I/2 - 1)$"", since $\lambda$ would be negative with when $i=I/2$. Other than this, the equation looks good to me. +2. I look forward to the addition of the results with the ""decay curriculum"" into the main paper. It is encouraging that the proposed approach outperforms this simpler ""decay curriculum"" baseline. + +Since the authors have addressed most of my concerns, I am therefore raising my recommendation to a ""6"". + + +",6,4.0,ICLR2021 +G5aAtvF1LyH,1,37nvvqkCo5,37nvvqkCo5,This papers presents a general loss function for long-tail classification with several previous work as its special cases.,"This papers presents a general loss function for long-tail classification with several previous work as its special cases. + +This is a well-written paper, and the results are impressive. The approach builds upon prior work and a general framework is presented. The proposed approaches are eveluated on several commonly used datasets and show some improvements. + +My one major technical concern are as follows: +1. The originality of this paper is not very high since the proposed framework and its components are not novel (there might be some minor novelty such as the fisher consistency property of the objective); +2. 
Regarding the post-hoc logit adjustment, I am supposing it is sensitive to $\pi_y^\tau$, which is not very much similar with weight normalization; +3. For the balanced error, I am interested in why it is supposed to a better performance measure, given that the test data distribution is uniform (as per datasets used in experiments); +4. In the experiments, e.g., Table 2, in my humble opinion, some of resfults for comparison methods are incorrect. Since prior work (including Weight normalisation, Adaptive, Equalised) report Top-1 classification error, instead of the balanced error. Hence, I guess that the comparison is not fair at all. + +========== after reading the authors feedback ========= + +Thanks the authors for addressing my concerns and I am convinced that this work is very much different from prior literature. In addition, the evaluation metrics are correct in the studied problem setup. Based on that, I would like to raise my score from 5 to 6.",6,4.0,ICLR2021 +SJxf1Z_85B,4,SkxMjxHYPS,SkxMjxHYPS,Official Blind Review #4,"This paper presents a simple methodological study on the effect of the distribution of convolutional filters on the accuracy of deep convolutional networks on the CIFAR 10 and CIFAR 100 data sets. There are five different kind of distributions studied: constant number of filters, monotonically increasing and decreasing number of filters and convex/concave with a local extremum at the layer in the middle. For these distributions, the total number of filters is varied to study the trade-off between running-time vs. accuracy, memory vs. accuracy and parameter count vs. accuracy. + +Although the paper is purely experimental without any particular theoretical considerations, it presents a few surprising observations defying conventional wisdom: +- The standard method of increasing the number of filters as the number of convolutional nodes is increasing is not the most optimal strategy in most cases. +- The optimal distribution of channels is highly dependent on the network architecture. +- Some network architectures are highly stable with respect to the distribution of channels, while others are very sensitive. + +Given that this paper is easy to read and presents interesting insights for the design of convolutional network architectures and challenges mainstream views, I would consider it to be a generally valuable contribution, at least I enjoyed reading it. + +Despite the intriguing nature of this paper, there are several weaknesses which make me less enthusiastic about the quality of the paper: +- The experiments are done only on CIFAR-10 and CIFAR-100. These benchmarks are somewhat special. It would be useful to see whether the results also hold for more realistic vision benchmarks. Even if running all the experiments would be costly, I think that at least a small selection should be reproduced on OpenImages or MS-Coco or other more realistic benchmarks to validate the findings of this paper. +- It would be interesting to see whether starting from the best channel distributions, applying MorphNet would end up with different distributions. In general: whether MorphNet would end up with similar distributions automatically. +- The paper does not clarify how the channel sizes for Inception were distributed, since proper balancing of the 1x1 and more spread out convolutions is a key part of that architecture. This is not clarified in this paper. +- The grammar of the paper is poor, even the abstract is hard to read and interpret. 
+- The paper presents itself as a methodology for automatically generating the optimal number of channels, while it is more of a one-off experiment and observation than a general purpose method. + +Another small technical detail regarding the choice of colors in the diagrams: the baseline distribution and constant distribution are very hard to distinguish. This is especially critical because these are the two best distributions on average. Also the diagrams could benefit from more detailed captions. + +The paper presents interesting, valuable experimental findings, but it is not extremely exciting theoretically. Also its practical execution is somewhat lacking. If it contained at least partial results on more realistic data sets, I would vote for strong accept, but in its current form, I find it borderline acceptance-worthy. + ",6,,ICLR2020 +nbL49zr2SGR,4,3hGNqpI4WS,3hGNqpI4WS,Review,"The authors highlight the problem of iterated batch RL (and thus its deployment efficency) and propose an algorithm along +with novel evaluation schemes. + +There are several things I like about this paper: + +- The topic is very important and underexplored, especially the deployment efficency w.r.t. the number of batch collections +- Lots of relevant work is cited +- The experiment make sense given the research question. Especially evaluations such as in Figure 2. +- The results show significant improvements of the proposed method over state of the art +- The method is ""simple"", in a good way, and has not that much moving pieces, compared to other RL algorithms + +The are a few things I think could be improved: + +- Because the paper considers ""repeated batch"" it should touch the subject of ""if I can collect x batches, what batches would tell me the most?"". I.e the algorithm as written right now is ""greedy"" in the sense that the policy will only act in the way it considers most optimal (and is not completely different from the behavior policy). Batch exploration strategies would be highly interesting +and as ""iterative batch RL"" is the domain of this paper, these need to be addressed. + +- Related: While epistemic uncertainty is modeled (via Ensembles of dynamic models) it is not explicitly utilized. Eq. (4) contains a ""safety mechanism"" via the trust region approach. Why not include a risk criterion over the ensemble spread? + +- Why use only deterministic models? Using something like [1,2] that enables modeling stochastic effects would improve the +universality of the approach further, not being restricted to deterministic problems. + +I am a bit unsure about the novelty of Section 4 -- most pieces of the algorithm are already known or ""obvious"". To me the +biggest novelty lies in identifying the ""repeated batch"" as an important problem, that is underexplored and shows how +classical RL method underperform strongly here. From this evaluations like shown in Figure 2 are novel and make a lot of sense. + + +[1] Chua, Kurtland, et al. ""Deep reinforcement learning in a handful of trials using probabilistic dynamics models."" Advances in Neural Information Processing + +[2] Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. ""Simple and scalable predictive uncertainty estimation using deep ensembles."" Advances in neural information processing systems. 2017. +",7,4.0,ICLR2021 +Iynm0T7BFVZ,1,V1N4GEWki_E,V1N4GEWki_E,"The idea is novel and interesting, while the experiments design is poor. I give borderline reject. I expect the response from the authors. 
If all concerns are addressed, I will raise my scores.","Overview: + +Summary: +This paper tries to answer the following two questions: i) why training unstructured sparse networks from random initiation perform poorly? 2) what makes LTs and DST the exception? The authors show the following findings: +1. Sparse NNs have poor gradient flow at initialization. They show that existing methods for initializing sparse NNs are incorrect in not considering heterogeneous connectivity. Improved methods are sample initialization from a dynamic gaussian whose variance is related to the fan-in numbers. fan-in = fan-out rule plays an important role here and improves the gradient flow. +2. Sparse NNs have poor gradient flow during training. They show that DST based methods achieving the best generalization have improved gradient flow. +3. They find the LTs do not improve gradient flow, rather their success lies in re-learning the pruning solution they are derived from. + + +Strength bullets: +1. The idea is very interesting. I appreciate the novel analysis. The proposed methods are well-motivated. +2. The paper is well written and easy to understand. +3. The finding is surprising but the experiment design is poor which I will list more detailed limitations in the weakness sections. I like the idea, I will raise my score if the authors can completely address my confusion and concerns. + + +Weakness bullets: +1. For Table 1, a Strong baseline is missing. Why not compare with the performance of the lottery ticket setting? I think it is a more natural baseline than SET and RigL. +2. In my opinion, there is a must-do experiment: Lottery ticket mask + proposed initialization and compare it to LT and random tickets. Because the LT mask + random reinitialization = random tickets fail in the previous literature. According to the explanation in the paper, it can also be the problem of random reinitialization. Thus, strong supportive evidence is that show proposed modified random reinitialization + LT mask can surpass random ticket performance. +3. Missing details what is the pruning ratio of each stage in iterative magnitude pruning? The appendix only tells me the author using 95% and 80% sparsity, why pick these two sparsity? Because this sparsity gives the extreme matching subnetworks? And the author uses iterative magnitude pruning, if they follow the original LTH setting, pruning 20% for each time. Then the sparsity should be 1-0.8^i, how to achieve 95% and 80%? +4. What is the definition of ""pruning solution""? Is it the obtained mask or initialization or subnetworks contains both mask and initialization? Super confused +5. Conflicted experiments results with Linear Mode Connectivity and the Lottery Ticket Hypothesis paper, ResNet 50 IMP LT on ImageNet without Early weight rewinding can not have good linear mode connectivity. However, the pruning solution and LT solution have good linear mode connectivity. It is wired, even for two LTs (ResNet 50 IMP LT on ImageNet) trained with the same initialization in different data orders, they do not have a linear path where interpolated training loss is flat, as evidenced in figure 5 in the paper ""Linear Mode Connectivity and the Lottery Ticket Hypothesis"". Early weight rewinding is needed for the presented results while I think the author did not use it. +6. The comparison in Table 2 is unfair. Scratch settings are trained from five different random initialization, while LT settings are trained from the same initialization with different data orders. 
LT setting results should also be from different initialization, otherwise can not achieve the conclusion that ""Lottery Tickets Learn Similar Functions to the Pruning Solution"". + +Minor: +1. The definition of LTH in 3.3 ""perform as well as O^N(f,\theta)*M"", why there is M? It should be the full dense model without the mask, right? + +------ Post Rebuttal------ + +Thanks to the authors for the extra experiments and feedback! + +[Lottery baseline for Table-1] Although RigL does not need dense network training, it cost more to find the mask (Table 2 of the RigL paper). + +[Random tickets] Random Ticket = LT mask + random re-initialization rather than random pruning + random init. The front one will be much more interesting. ""Because the LT mask + random reinitialization = random tickets fail in the previous literature. According to the explanation in the paper, it can also be the problem of random reinitialization. Thus, strong supportive evidence is that show proposed modified random reinitialization + LT mask can surpass random ticket performance."" I personally do the experiment that performing proposed initialization on random tickets and the performance is unchanged. Of course, there may exist lots of reasons for the results. I will not degrade the paper according to my experiments. + +Other concerns are will-addressed. Thanks! + +Although I do like the idea of this paper, I think it might need to be revised and resubmitted, incorporating the extensive discussion presented by all the reviewers. I tend to keep my scores unchanged. But I don’t think this is 100% a clear reject and depending on the opinions of the other reviewers I would not feel that accepting this paper was completely out of bounds. +",5,4.0,ICLR2021 +BklSdgXQtH,1,r1gc3lBFPH,r1gc3lBFPH,Official Blind Review #3,"Overview + +This paper presents a very interesting application of speech keyword spotting techniques; the aim is to listen to continuous streams of community radio in Uganda in order to spot keywords of interest related to agriculture to monitor food security concerns in rural areas. The lack of internet infrastructure results in farmers in rural areas using community radio to share concerns related to agriculture. Therefore accurate keyword spotting techniques can potentially help researchers flag up areas of interest. The main engineering challenge that the study deals with is the fact that there isn’t a lot of training data available for languages like Lugandu. + +Detailed Comments + + Section 3 + +1. The word “scrapped” should be replaced with “scraped”. +2. Its not entirely clearly what is meant by “as well as an alternative keyword in form of a stem”. What is meant by “stem” in this context? Maybe a citation or an example would make it more clear. +3. The corpus of keywords supposedly contains 193 distinct keywords however the models in Section 4 are only trained to discriminate between 10 randomly sampled keywords. I don’t understand why this is. Training on the full corpus would allow the network to see more training data and consequently might result in more accurate models. + +Section 4 + +1. The authors refer to Equation 3 in Section 4.1, but I think the reference is to Equation 2. +2. The 1-D CNN has 14 layers but it is not clear to me what the size of the final network is. It would be useful to provide more details of the architecture and a comment about the network size either in terms of number of parameters, number of computations or size of the network weights on disk. +3. 
The figure for the Siamese network should be numbered. +4. The authors say that the inputs to the ResNets are of shape 32x100. What do these dimensions refer to? Are their 32 frequency bins? Are their 100 frames as inputs? What are the parameters of the FFT computation? Do 100 frames correspond to a second of audio? +5. How many parameters are their in the Siamese network? +6. Its not clear to me how the Siamese networks are used during inference. The authors say they use the L1 distance to determine if a given test example is similar to a given keyword. How are the examples for the keyword of interest selected? Are all training examples used and the scores average? Are the training examples averaged first to form a centroid vector for the keyword? + +Comments on the methodology + +One of the main challenges in this work is the fact that there isn’t a large number of training examples available. However the authors still train relatively large acoustic models with 14 layers for 1-D CNN and a ResNet for the Siamese architecture. There are several studies for keyword spotting on mobile devices that aim to train tiny networks that consume very little power [1] [2]. I think it would be more appropriate to start with architectures similar to these due to the small size of the training dataset. Additionally, it would also be very useful to try and identify languages that are phonetically similar to Lugandu and that have training datasets available for speech recognition. Acoustic models trained on this data can then be adapted to the keyword spotting task using the training set collected for this task. Note that the languages need to have only a small amount of phonetic overlap in order for these acoustic models to be useful starting points for training keyword spotting models. + +Comments on the evalation + +The keyword spotting models presented here are trained with the intention of applying them to streaming radio data. However, the models are trained and tested on fixed chunks of audio. A big problem with this experimental design is that the models can overfit to confounding audio cues in the training data. For example, if all training data are in the form of 1 second audio chunks and all chunks have at least some silence at the start, then the models learn that a keyword is always preceded by a certain duration of silence. This is not the case when keywords occur in the middle of sentences in streaming audio data. The fact that the evaluation is also performed on individual chunks of audio fails to evaluate how the trained models would behave when presented streaming audio. I am certain that when applied to streaming audio these models would false trigger very often, however this fact isn’t reflected by measuring precision and recall on fixed chunks of test audio. + +A more principled evaluation strategy would be to present results in the form of detection-error tradeoff (DET) curves [1], [2]. Here the chunks with examples of the keyword in question can be used to measure the number of false rejects by the system. And long streams of audio data that do not contain the keyword should be used to measure the false alarms. Given that its relatively easy to collect streams of radio data, the false alarms can be measured in terms of the number of false alarms per unit time (minutes, hours or days). This evaluation strategy roughly simulates the streaming conditions in which this model is intended to be deployed. 
Furthermore, DET curves present the tradeoff between false alarms and false rejects as a function of the operating threshold, which provides much more insight into the performance/accuracy of the trained models. + +Summary + +I think the area of application of this work is extremely interesting, however the training and evaluation methodologies have to be updated in order to realistically measure the way this system might perform in real-world test conditions. + +References + +[1] Small-footprint Keyword Spotting Using Deep Neural Networks +[2] Efficient Voice Trigger Detection for Low Resource Hardware +",1,,ICLR2020 +38RwnwiLlqQ,4,QQzomPbSV7q,QQzomPbSV7q,A simple sampling manner for diverse and distinct sub-classes in metric learning,"The authors find that the popular triplet loss will force all same-class instances to a single center in a noisy scenario, which is not optimal to deal with the diverse and distinct sub-classes. After some analyses, the authors propose a simple sampling strategy, EPS, where anchors only pull the most similar instances. The method achieves good visualization results on MNIST and gets promising performance on benchmarks. + +Avoiding class collapse is meaningful and important in metric learning when dealing with some tasks. The analyses in the paper provide insights. Here are some possible issues of this paper. +1. The authors should discuss when we need to avoid such class collapse. Maybe in some cases, pulling all similar instances to a single point leads to more discriminative embeddings. Even some methods are designed following that consideration. Some examples and demonstrations are required. +2. It's better to write a sketch of the analysis on how to extend it to multi-class cases and analyze will the definition of the noise influence the final results. +3. Maybe the authors need to find another real-world dataset with multiple meanings in one class and show the advantage of the proposed method. We can find the improvement of performance on the benchmarks, but the numbers are hard to illustrate the effect of the method.",6,5.0,ICLR2021 +SyXWKhaxM,3,SJ-C6JbRW,SJ-C6JbRW,Interesting data collection scheme,"The paper provides an interesting data collection scheme that improves upon standard collection of static databases that have multiple shortcomings -- End of Section 3 clearly summarizes the advantages of the proposed algorithm. The paper is easy to follow and the evaluation is meaningful. + +In MTD, both data collection and training the model are intertwined and so, the quality of the data can be limited by the learning capacity of the model. It is possible that after some iterations, the data distribution is similar to previous rounds in which case, the dataset becomes similar to static data collection (albeit at a much higher cost and effort). Is this observed ? Further, is it possible to construct MTD variants that lead to constantly improving datasets by being agnostic to the actual model choice ? For example, utilizing only the priors of the D_{train_all}, mixing model and other humans' predictions, etc. + + + +",8,5.0,ICLR2018 +S1eRWHHgTQ,3,rkxraoRcF7,rkxraoRcF7,"Interesting approach, somewhat artificial setup, limited interpretation of ""disentangling representation learning""","The authors address the problem of representation learning in which data-generative factors of variation are separated, or disentangled, from each other. 
Pointing out that unsupervised disentangling is hard despite recent breakthroughs, and that supervised disentangling needs a large number of carefully labeled data, they propose a “weakly supervised” approach that does not require explicit factor labels, but instead divides the training data in to two subsets. One set, the “reference set” is known to the learning algorithm to leave a set of generative “target factors” fixed at one specific value per factor, while the other set is known to the learning algorithm to vary across all generative factors. The problem setup posed by the authors is to separate the corresponding two sets of factors into two non-overlapping sets of latents. + +Pros: + +To address this problem, the authors propose an architecture that includes a reverse KL-term in the loss, and they show convincingly that this approach is indeed successful in separating the two sets of generative factors from each other. This is demonstrated in two different ways. First, quantitatively on an a modified MNIST dataset, showing that the information about the target factors is indeed (mostly) in the set of latents that are meant to capture them. Second, qualitatively on the modified MNIST and on a further dataset, AffectNet, which has been carefully curated by the authors to improve the quality of the reference set. The qualitative results are impressive and show that this approach can be used to transfer the target factors from one image, onto another image. + +Technically, this work combines and extends a set of interesting techniques into a novel framework, applied to a new way of disentangling two sets of factors of variation with a VAE approach. + +Cons: + +The problem that this work solves seems somewhat artificial, and the training data, while less burdensome than having explicit labels, is still difficult to obtain in practice. More importantly, though, both the title and the start of the both the abstract and the introduction are somewhat misleading. That’s because this work does not actually address disentangling in the sense of “Learning disentangled representations from visual data, where high-level generative factors correspond to independent dimensions of feature vectors…” What it really addresses is separating two sets of factors into different parts of the representation, within each of which the factors can be, are very likely are, entangled with each other. + +Related to the point that this work is not really about disentangling, the quantitative comparisons with completely unsupervised baselines are not really that meaningful, at least not in terms of what this work sets out to do. All it shows is whether information about the target factors is easily (linearly) decodable from the latents, which, while related to disentangling, says little about the quality of it. On the positive side, this kind of quantitative comparison (where the authors approach has to show that the information exists in the correct part of the space) is not pitted unfairly against the unsupervised baselines. + +=== +Update: +The authors have made a good effort to address the concerns raised, and I believe the paper should be accepted in its current form. I have increased my rating from 6 to 7, accordingly. ",7,4.0,ICLR2019 +JSfNIp_7hv3,2,QHUUrieaqai,QHUUrieaqai,"Interesting, creative, and well-motivated approach for giving mathematical inductive biases to a model.","In this work, the authors introduce a method called LIME for imparting certain mathematical inductive biases into a model. 
The structure of the approach is to first pretrain the model on synthetic tasks that are designed around three principles of mathematical reasoning: deduction, induction, and abduction. Each of these pretraining tasks is a sequence-to-sequence mapping involving 3 basic components of reasoning (Rule, Case, and Result), where two of these three components are provided as input and the third component is the target output. After pretraining on these tasks, the authors fine-tune on 3 different proof datasets, and find that the pretraining almost always improves performance, sometimes by a large margin. + +Strengths: +1. This approach is creative and thought-provoking; pretraining is an important topic in ML nowadays, and this paper gives several interesting insights about how to structure pretraining. Therefore, publishing this paper at ICLR could help inspire others to use and develop improved variations of pretraining. + +2. One aspect of the pretraining that I found particularly impressive was how the authors found such clear improvements from such small amounts of pretraining. This is in stark contrast to the usually massive pretraining datasets that are used, and stands as an especially strong piece of evidence for the model’s usefulness. + +3. The experimental setup is well-motivated, drawing on a principled analysis of the problem domain. + +4. The paper is overall clearly written and clearly structured. + +5. There were some interesting discussion points and ablation studies analyzing the approach in more detail. I particularly liked the discussion about how loading the vocabulary weights had little effect, showing that the inductive biases that were imparted were abstract in nature. It was also useful to see that LIME was more useful than other pretraining tasks, ruling out the possibility that you could get similar improvements from just any pretraining task. + +Weaknesses: + +1. Part of the paper’s motivation for imparting inductive bias through a dataset, rather than through an architecture, is that designing an architecture “strongly requires human insight.” This is true, but LIME also seems to strongly rely on human insight, so this point is not a benefit for LIME over architectural approaches. This is not a huge problem, but it does not seem like a great motivation for LIME. + +2. Related to the previous point, it would be good to discuss the fact that the usefulness of LIME may be limited by the need to design the right pretraining task(s). As Table 4 shows, the nature of the pretraining task is very important; and although the authors were able to create some successful pretraining tasks for mathematical reasoning, it might be harder to create similarly useful tasks for larger-scale tasks in, e.g., language or vision. Again, this is not a huge problem, but I think it at least deserves some discussion. + +3. Though the goal of the approach (if I am understanding correctly) is to give inductive biases for induction, deduction, and abduction, the paper gives no direct evidence that it has done so: The authors create an approach *intended* to impart certain inductive biases, and this approach improves performance on 3 tasks that plausibly would benefit from those biases. But this result does not necessarily mean that the model has the inductive biases that were intended to be imparted; it’s possible that LIME imparted some other inductive biases that are also useful for mathematical reasoning but that are not related to induction, deduction, and abduction. 
Thus, there is a bit of a gap between the motivation and the actual experiments. + +4. It’s not entirely clear to me that the specific tasks (Deduct, Induct, Abduct) will necessarily enforce the types of reasoning that they are intended to enforce. For instance, consider the following input/output example: {A : a, B: b, C: d+e} A A + B = C -> a a + b = d + e. Such an example is intended to show deduction, but it could instead be viewed as induction (where A A + B = C is the Result, a a + b = d + e is the Rule, and the Case dictionary should be read in reverse, treating the values as keys and the keys as values). Thus, related to the previous point, I think there is some concern that the LIME tasks may not necessarily encode the intended primitives. The results show that the LIME tasks clearly encode something useful, but it’s not clear exactly what useful things they encode. + +Recommended citations: (you definitely don’t need to include all of these or even any of these, but I’m pointing to them just in case they’re useful): + +1. You already cite the GPT-3 paper (Brown et al.), But it might make sense to cite it in a second place as well, for the sentence where you say “However, there is another potential advantage of pre-training--it may distill inductive biases into the model that are helpful for training on downstream tasks.” Another paper you can cite for this point is this one: Can neural networks acquire a structural bias from raw linguistic data? https://arxiv.org/pdf/2007.06761.pdf + +2. Like your approach, the following paper also uses carefully-constructed synthetic datasets as a way to impart targeted inductive biases into a model. (However, they use these tasks for meta-training, not pre-training): Universal LInguistic Inductive Biases via Meta-Learning. https://arxiv.org/pdf/2006.16324.pdf. This paper might also be useful as an example of how you can address the last two points I listed under weaknesses, as this paper gives examples of how to test whether a model has some specific inductive biases; the paper I linked to in the previous bullet (Warstadt and Bowman) also does this. (However, adding such analyses might be more work than would be doable for a camera-ready). + +3. It might be good to cite Peirce when first mentioned in the intro; right now, the citation to Peirce is buried deep in the paper, after he has already been discussed at length. + +4. Some more potentially-relevant examples of architecturally encoding inductive biases for math: https://arxiv.org/pdf/1910.02339.pdf, https://arxiv.org/pdf/1910.06611.pdf + +Other comments (these are not things that have affected my assessment. Instead, they are just comments that I think might be helpful in revising): + +1. Note that there is another approach in ML called LIME, which could potentially cause confusion. It’s completely up to you, but I would consider renaming to avoid confusion. Here is the other LIME by Ribeiro, Singh, and Guestrin: https://dl.acm.org/doi/pdf/10.1145/2939672.2939778?casa_token=VrGSeKoqOnkAAAAA:tmzXq2uCWkUVyPdd9ytCNK4LSdRfIwsIeX4hd8EMkjnjevZ4d-rCeIIM7acIRWGtQlQemUqDlAJx-Q + +2. Abstract: “neural architecture” should be “neural architectures” + +3. Abstract: “on three large very different mathematical reasoning benchmarks” should be “on three very different large mathematical reasoning benchmarks” + +4. Abstract: I did not understand what “dominating the computation” meant until I read the rest of the paper. 
+The intro says “It is commonly believed that the benefit of pre-training is that the model can learn world knowledge by memorizing the contents of the natural language corpus.” This statement seems strong - I am more inclined to think that much of the benefit comes from learning linguistic structure, not world knowledge. So it might be safer to reword as saying “One plausible explanation for the benefit of pretraining is…” + +5. Page 3 says “the BERT pretraining objective,” which suggests that BERT is the objective. But BERT is a model, not an objective; the objectives are masked language modeling and next-sentence prediction. + +6. Table 1: The formatting of the table makes it look like the first two rows are numbers copied from Li et al. But from the prose of your paper, and from looking at Li et al, I’m pretty sure that these numbers are from your own re-implementation. Is that correct? If so, it might be best to format the table different - using the citation within the body of the table gives a strong suggestion that the numbers come from Li et al., in my opinion. + +7. Table 4 and Table 5: In the caption, say what task these results are for, so that the table can be understood on its own. + +8. Please double check the references: Several of them seem to only list authors, title, and year when there is at least an arXiv version that could be listed as well. E.g., “Mathematical reasoning via self-supervised skip-tree training”, “Enhancing sat solvers with glue variable predictions”, “transformers generalize to the semantics of logics”. Also, where possible, cite a paper’s actual publication venue instead of arXiv - e.g., the Raffel et al. T5 paper appeared in JMLR, not just arXiv. + +Summary: Overall, I am rating this an 8 because I find the strengths compelling but think that the weaknesses in framing hold the paper back from an even higher score. I would consider increasing the score if those weaknesses were addressed, though those weaknesses are deep enough that it would be hard to properly address them in time. +",8,4.0,ICLR2021 +r1lB7pU2FS,1,SkxlElBYDS,SkxlElBYDS,Official Blind Review #1,"The method proposes a method for continual learning. The method is an extension of recent work, called orthogonal weights modification (OWM) [Zheng,2019]. This method aims to find gradient updates which are perpendicular to the input vectors of previous tasks (resulting in less forgetting). However, the authors argue, that the learning of new tasks is happening in the solution space of the previous tasks, which might severely limit the ability to adapt to new tasks. The authors propose a ‘principal component’-based solution to this problem. The method is considering the ‘task continual learning’ scenario (also known as task-aware) which means that the task label is given at inference time. + +Conclusion: + +1. The paper is not well-positioned in related works. I think the work is more related to works with ‘parameter isolation methods’ such as Piggyback, Packnet, HAT. These methods reserve part of the capacity of the network for tasks. I think the authors should relate their work with these methods, and provide an argument of the problem with these previous methods, which is addressed by their approach. I can see that rather than freezing weights (PackNet) or features (HAT) , the method freezes linear combinations of features. But it is for me not directly clear that that is desirable. 
In HAT the backpropagated vector is projected on the mask vector which coincides with the neurons (activations). + +2. The experimental verification of the paper is too weak, and only comparison to EWC and OWM (not well known) are provided. At least a comparison with the more related works PackNet and HAT should be included. For more recent method for task-aware CL see also ‘Continual learning: A comparative study on how to defy forgetting in classification tasks’. Also results seem bad. For example on CIFAR10, 5 tasks in TCL setting is two-class problem per task; I would expect better results. + +3. The authors claim that OWM is effective if tasks are similar, but not when dissimilar. And the proposed PCP solves this problem. However, all experiments are on similar tasks, and no cross domain tasks are considered, e.g. going from MNIST (task1) to EMNIST-26 (task2) etc. This would empirically support the claim. Also, the authors expect the difference between PCP and OWM to be even larger then. + +4. Some more analysis of the success of PCA in representing the distribution would be appreciated, e.g. the percentage of total energy which is captured (sum of selected eigenvalues divided by sum of all eigenvalues). Such an analysis of P_l^k as a function of the tasks (and for several layers) would be interesting to see, for example for EMNIST-47(10 tasks). + +5. Novelty with respect to OWM is rather small. + +6. The authors should mention that the method is pretrained on ImageNet in section 4.3. Given these datasets, I think it makes more sense to train from scratch and I would like to see those results. + +Minor remarks: +- I wonder if you use OWM or PCP you discard the possibility of positive backward transfer. Maybe the authors could comment on that. + +- The authors write that ‘TCL setting the classification results are usually better than those of the CCL’ is that not per definition true ? Anything correctly classified under CCL is correctly classified under TCL but not the other way around. +",3,,ICLR2020 +SkH7fQYez,2,SkJd_y-Cb,SkJd_y-Cb,Nice idea but weak experimental section,"The paper presents a method to use non-linear combination of context vectors for learning vector representation of words. The main idea is to replace each word embedding by a neural network, which scores how likely is the current word given the context words. This also allowed them to use other context information (like POS tags) for word vector learning. I like the approach, although not being an expert in the area, cannot comment on whether there are existing approaches for similar objectives. + +I think the experimental section is weak. Most work on word vectors are evaluated on several word similarity and analogy tasks (See the Glove paper). However, this paper only reports numbers on the task of predicting next word. + +Response to rebuttal: + +I am still not confident about the evaluation. I feel word vectors should definitely be tested on similarity tasks (if not analogy). As a result, I am keeping my score the same. ",4,4.0,ICLR2018 +ry5vY1teG,2,SJICXeWAb,SJICXeWAb,"Review of ""Depth separation and weight-width trade-offs for sigmoidal neural networks""","This paper proves a new separation results from 3-layer neural networks to 2-layer neural networks. The core of the analysis is a proof that any 2-layer neural networks can be well approximated by a polynomial function with reasonably low degrees. 
Then the authors constructs a highly non-smooth function can be represented by a 3-layer network, but impossible to approximate by any polynomial-degree polynomial function. + +Similar results about polynomial approximation can be found in [1] (Theorem 4). To me, the result proved in [1] is spiritually very similar to propositions 3-4. The authors need to justify the difference. + +The main strength of the new separation result is that it holds for a larger class of input distributions. Comparing to Daniely (2017) which requires the input distribution to be spherically uniform, the new result only needs the distribution to be lower bounded by 1/poly(d) in a small ball of radius 1/poly(d). Conceptually I don't think this is a much weaker condition. For a ""truly"" non-uniform distribution, one should allow its density function to be very close to zero at certain regions of the ball. Nevertheless, the result is a step forward from Daniely (2017) and the paper is well written. + +I am still in doubt of the practical value of such kind of separation results. The paper proves the separation by constructing a very specific function that cannot be approximated by 2-layer networks. This function has a super large Lipschitz constant, which we don't expect to see in practice. Consider the function f(x)=cos(Nx). When N is chosen large enough, the function f can not be well approximated by any 2-layer network with polynomial size. Does it imply that the family of cosine functions is rich enough so that it is a better family to learn than 2-layer neural networks? I guess the answer would be negative. In addition, the paper doesn't show that any 2-layer network can be well approximated by a 3-layer network, which is a missing piece in justifying the richness of 3-layer nets. + +Finally, the constructed ""hard"" function has order d^5 Lipschitz constant, but Theorem 7 assumes that the 2-layer networks' weight must be bounded by O(d^2). This assumption is crucial to the proof but not well justified (especially considering the d^5 factor in the function definition). + +[1] On the Computational Efficiency of Training Neural Networks, Livni et al., NIPS'14",5,4.0,ICLR2018 +HJiml81-z,3,B1ZvaaeAZ,B1ZvaaeAZ,Review of WRPN,"This paper presents an simple and interesting idea to improve the performance for neural nets. The idea is we can reduce the precision for activations and increase the number of filters, and is able to achieve better memory usage (reduced). The paper is aiming to solve a practical problem, and has done some solid research work to validate that. In particular, this paper has also presented a indepth study on AlexNet with very comprehensive results and has validated the usefulness of this approach. + +In addition, in their experiments, they have demonstrated pretty solid experimental results, on AlexNet and even deeper nets such as the state of the art Resnet. The results are convincing to me. + +On the other side, the idea of this paper does not seem extremely interesting to me, especially many decisions are quite natural to me, and it looks more like a very empirical practical study. So the novelty is limited. + +So overall given limited novelty but the paper presents useful results, I would recommend borderline leaning towards reject.",5,4.0,ICLR2018 +rylH3T2dKB,1,SJecKyrKPH,SJecKyrKPH,Official Blind Review #3,"*Paper summary* + +The authors build an input transformation invariant CNN using the TI-Pooling architecture of (Laptev et al., 2016). 
They make some modifications to this, namely 1) they replace the convolutional filters, with input-dependent convolutional filters, and 2) they add a decoder network from the final representation, which reconstructs an input transformation rectified image, to encourage the final representation to be fully transformation invariant. + +*Paper decision* + +I have decided to assign this paper a reject because of two main reasons. + +*Supporting arguments* + +One reason is that the base architecture is not novel. This in itself is not a key issue, but I would expect the authors to have done some in depth analysis or experimentation otherwise to compensate for this. I regret, the authors may just not have known that the ideas were already explored in the literature. The second reason is that the work is not well placed in context with prior works. This is both evident in the lack of referenced works (see below for a list) and the lack of sufficient baselines, against which they compare. For instance, if the authors had considered “Learning Steerable Filters for Rotation Equivariant CNNs” by Weiler et al. (2018) they would have known that their MNIST-rot-12k results are not state of the art as they state. In Weiler et al., the authors report 0.714 test set error on MNIST-rot-12k compared to ICNN’s 0.98. + +This all said, I think the paper is well-written and very clear. The structure is straightforward and the experiments seem repeatable from the descriptions made. The stated aims of the paper are also clear: to learn input transformation invariant CNNs using input-conditioned filters. +Unfortunately a lot of supporting material and prior work has been missed. I list a lot of them here. + +Works on input-conditioned filters and invariance. These are the most important + +-Dynamic Steerable Frame Networks, Jacobsen et al., 2017 +-Dynamic Steerable Blocks in Deep Residual Networks., Jacobsen et al., 2017 + +Works on input-conditioned filters: + +-HyperNetworks, Ha et al., 2016 +-Dynamic Filter Networks, de Brabandere et al., 2016 + +Works on invariance: + +-Invariance and neural nets, Barnard and Casasent, 1991 +-Group Equivariant Convolutional Networks Cohen and Welling (2015) +-Harmonic Networks: Deep Translation and Rotation Equivariance: Worrall et al. (2017) +-Steerable CNNs, Cohen and Welling (2017) +-Spherical CNNs, Cohen et al. (2018) +-CubeNet: Equivariance to 3D Rotation and Translation, Worrall and Brostow (2018) +-Learning steerable filters for rotation equivariant CNNs, Weiler et al. (2018) +-Gauge Equivariant Convolutional Networks and the Icosahedral CNN, Cohen et al. (2019) + +*Questions/notes for the authors* + +- Please address the missing references +- Are the input-conditioned filters conditional on position in the activations, or are they shared across all spatial locations of the image? This is not clear from the text. +- The image reconstruction reminds me of Transforming Auto-encoders (Hinton et al., 2016) and Interpretable Transformations with Encoder-Decoder Networks (Worrall et al., 2017). How is your setup different? + + + +",1,,ICLR2020 +n0_EQ7h4Kkh,2,rmd-D7h_2zP,rmd-D7h_2zP,Interesting approach using pre-trained representations to encode domain-slot pairs for DST,"Summary: +This paper showcases how pre-training can help with Dialogue State Tracking. The authors explicitly model +the relationship between domain-slot pairs. 
With their encoding and using strong pre-trained initializations +they are able to improve the joint goal accuracy by almost 1.5 points which is impressive. + +Reasons for score: +This is a very well written paper and will be a good resource for people working on the task of Dialogue State Tracking. +The authors show how they can model relationships between domain-slot pairs and how they can encode them effectively +using pre-trained representations. +I am hoping that the authors can address some of the cons during the rebuttal period. + +Pros: +1. Good dialogue representation which helps with the task of state tracking +2. Simple model consisting of encoders and 2 classifiers which are well explained. +3. Clear ablation study showing the value of 1) pre-training and 2) modeling relationship between domain-slot values + + +Cons: +1. This approach, like other popular approaches, suffers from the problem of having a fixed output vocabulary for slot values - hence limiting its scalability. While this cannot be addressed in this work, this is a drawback of this approach. +2. Some of the design decisions are stated but not well explained +- Only one pre-training method compared +- Authors mention they drop ""dontcare"" from slot gating but don't show the affect with or without it. +- Not much details on the setup and how it was trained. +3. Not much qualitative analysis. + +Please address and clarify the cons above + +Typos/Areas for improvement: +1. Section 3.2 and 3.3 can be shortened a lot. I would suggest showing more analysis. +- More examples of type of mistakes fixed. +- Which turn in the dialogue does the error decrease the most. +- How much is the training time/ accuracy tradeoff +2. Adding another layer to make DS-split work should be trivial, there is no reason to leave that to future work. +Could you show how the results look with that? + +------ + +Updating score based on authors' response.",7,4.0,ICLR2021 +r1lZZNN5nQ,2,SJeFNoRcFQ,SJeFNoRcFQ,Interesting idea but not made rigorous/precise,"The paper analyzes the empirical spectral density of DNN layher matrices and compares them with the traditionally-regularized statistical models, and develop a theory to identify 5+1 phases of training based on it. The results with different batch sizes illustrate the Generalization Gap pheneomena, and explains it as being causes by implicit self-regularization. + +However, the paper seems a little bit handwavy to me, without any serious theoretical justification. For example, why are \mu=2 and 4 chosen as the threshold between weakly/moderately/very heavy-tailed? In addition, the paper is build upon o the 5+1 model as in Figure 2 and the graphical comparison between the empirical ESD and the expected ESD of the five models in Table 1, and they lack any mathematical/rigorous definition---see table 2. The simulations are performs over a particular data set and a particular setting, and I wonder if the observations would be different for a different data set and a different setting. + +As a result, it may give some important intuition, but the content is not sufficiently rigorous to my knowledge. +",4,1.0,ICLR2019 +uj51zpUtNTK,4,9FWas6YbmB3,9FWas6YbmB3,Interesting model; missing insights,"Summary: This work proposes a modified DARTS optimiser for NAS which assumes a factorised dirichlet distribution over architecture parameters. It uses pathwise derivatives to learn an MLE estimate of these concentration parameters and adds appropriate regulariser terms to stabilise the training. 
Furthermore, it employs a protocol for progressively increasing channel fraction to stabilise training within a computation budget. The paper is easy to follow, and relevant experiments have been included. + +- With the given probabilistic formulation of this work, it would be useful to include details on how the factorisation of appropriate distributions varies between PARSEC and DrNAS? It might be useful to ground the modelling assumptions within a prior and approximate-posterior framework for better clarity. +- On the same line it would be useful to get insights on how the obtained models qualitatively differ from ProxylassNAS, PARSEC and SNAS. +- The interplay of progressive learning with modelling assumption is a bit unclear. The number of parameters and test error in Table 2 are inversely correlated across SNAS, ProxylessNAS, PARSEC and DrNAS which is perhaps not as surprising. I was wondering if authors have any insights on what aspect of the algo (with and without the two stage progressive learning) contributed to network size. +- Would it be possible to employ the progressive learning policy mentioned in section 4.1 across other DART flavours and understand its impact on model performance? + +",6,2.0,ICLR2021 +hVJ3L5xXl-P,1,UQz4_jo70Ci,UQz4_jo70Ci,"The proposed method is effective and technically sound, but lacks novelty and originality","#### 1.summary +In this paper, the authors introduce an cross-channel attrention mechanism and the anchor-free box regression branch with Diou-loss to deal with the clutters during the tracking procedure. +#### 2.strengths +The experimental results show that this method has excellent performance on several public benchmark datasets. +#### 3.weaknesses +- The novelty of the paper is deficient , the proposed method such as cross-attention mechanism [1], anchor-free regression [2, 3] have been previously exploited in existing models. +- The proposed method is obviously based on the SiamBAN[2], however, there exist much reduplicate content which has already mentioned in [2]. +- In experiments part, there are missing some important algorithms to compare, such as SiamBAN which is the baseline method for this paper. +- the analysis of the proposed method is deficient. + + *[1] Y. Yu, Y. Xiong, W. Huang, M. R. Scott, Deformable siamese attention networks for visual object tracking, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6728– 6737.* + + *[2] Zedu Chen, Bineng Zhong, Guorong Li, Zhang Shengping, and Ji Rongrong. Siamese box adaptive network for visual tracking. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6668–6677, 2020.* + + *[3] Yinda Xu, Zeyu Wang, Zuoxin Li, Ye Yuan, and Gang Yu. SiamFC++: Towards robust and accurate visual tracking with target estimation guidelines. In AAAI, pages 12549-12556, 2020.* + + +#### 4. Correctness +The claims, method, and empirical methodology are correct. +#### 5. clarity +The paper writing is clear and easy to follow. But there exist some grammar faults. Moreover ,in sec 3.3, there are two same subtitles and similar content, which may be confusing. Besides, in figure 4, the arrangement of the pictures does not seem to match the description text. +#### 6. Relation to prior work +It is an incremental work based on the prior work. +#### 7. Reproducibility +Yes. 
+ + +",4,5.0,ICLR2021 +HkHqRoIEe,3,BJh6Ztuxl,BJh6Ztuxl,Review," +The authors present a methodology for analyzing sentence embedding techniques by checking how much the embeddings preserve information about sentence length, word content, and word order. They examine several popular embedding methods including autoencoding LSTMs, averaged word vectors, and skip-thought vectors. The experiments are thorough and provide interesting insights into the representational power of common sentence embedding strategies, such as the fact that word ordering is surprisingly low-entropy conditioned on word content. + +Exploring what sort of information is encoded in representation learning methods for NLP is an important and under-researched area. For example, the tide of word-embeddings research was mostly stemmed after a thread of careful experimental results showing most embeddings to be essentially equivalent, culminating in ""Improving Distributional Similarity with Lessons Learned from Word Embeddings"" by Levy, Goldberg, and Dagan. As representation learning becomes even more important in NLP this sort of research will be even more important. + +While this paper makes a valuable contribution in setting out and exploring a methodology for evaluating sentence embeddings, the evaluations themselves are quite simple and do not necessarily correlate with real-world desiderata for sentence embeddings (as the authors note in other comments, performance on these tasks is not a normative measure of embedding quality). For example, as the authors note, the ability of the averaged vector to encode sentence length is trivially to be expected given the central limit theorem (or more accurately, concentration inequalities like Hoeffding's inequality). + +The word-order experiments were interesting. A relevant citation for this sort of conditional ordering procedure is ""Generating Text with Recurrent Neural Networks"" by Sutskever, Martens, and Hinton, who refer to the conversion of a bag of words into a sentence as ""debagging."" + +Although this is just a first step in better understanding of sentence embeddings, it is an important one and I recommend this paper for publication.",8,5.0,ICLR2017 +r1xmrgpU9H,2,BkeaEyBYDB,BkeaEyBYDB,Official Blind Review #3,"Update: I thank authors for the rebuttal. I agree that direction of exploring personalization in FL is interesting. With a stronger methodological contribution, this could become a good paper. + +---------------------------------------------------------------------------------------------------------------- +The main contribution of this paper is to notice the connection between Federated Averaging (FedAvg) and Model Agnostic Meta Learning algorithms (MAML). Authors also consider an algorithm that first trains with FedAvg and then continues training using Reptile. + +Pros: + +Interpretation of FedAvg as a meta-learning algorithm is interesting. + +Cons: + +Very limited methodological contribution. Proposed algorithm is essentially two existing algorithms applied one after another. + +Experiments are not conducted rigorously enough. There are many arbitrary hyperparameter choices which may bias the conclusions made in the paper. Statement ""We tried SGD with a range of other learning rates, and in all cases we observed this value to work the best."" is alarming suggesting that authors tried a variation of settings observing test data performance and reported a few selected runs. 
Although ""each experiment was repeated 9 times with random initialization"", the train/test split of the clients was fixed. Randomizing over the client train/test split could help improve the reliability of the results. + +EMNIST-62 is the only dataset analyzed in some detail. This dataset has drastically varying P(y|x) across clients, i.e. some people write zeros the way others write 'o's. This suggests that it is very hard to train a good global model and that personalization is necessary. However, this doesn't mean that the Shakespeare dataset ""does not provide any interesting insights"". Perhaps it is indeed more interesting and challenging, demanding more advanced methodology. + +In Figure 1, the number of communication rounds may be impractical for FL (considering also the additional 200 Reptile rounds). On Shakespeare, the FedAvg paper reports 54% accuracy achieved in under 50 communication rounds in one of the settings. There are also recent works on improving communication efficiency that were not discussed or studied for personalization quality, e.g. FedProx from ""Federated Optimization in Heterogeneous Networks"" and PFNM from ""Bayesian Nonparametric Federated Learning of Neural Networks"". + +Questions about the Figure 2 experiments: +1. Fine-tuning requires 200 extra epochs over the initially trained model. What is the initial model accuracy when FedAvg is further trained with the Adam optimizer for 200 extra communication rounds? +2. The personalized test accuracy with FedAvg and Reptile fine-tuning reaches the same value in 10 update epochs, even when Reptile fine-tuning gets 200 extra initial training epochs. Does Reptile fine-tuning provide additional benefits to the initial model compared to running FedAvg for more epochs?",1,,ICLR2020 +HJxe-1m22Q,2,r1gOe209t7,r1gOe209t7,interesting heuristics but no justification either theoretically or empirically,"This paper proposes a special dropout procedure for DenseNet. The main argument is that the standard dropout strategy may impede feature reuse in DenseNet, so the authors propose a pre-dropout technique, which applies dropout before the nonlinear activation function so that it can be fed to later layers. Other tricks are also discussed, for example channel-wise dropout and a probability schedule that assigns different probabilities to different layers in a heuristic way. + +To me this is a mediocre paper. No theoretical justification is given for why their pre-dropout structure should be beneficial compared to standard dropout. Why is impeding feature reuse under the standard dropout strategy bad? Actually I am not quite sure that reusing the features is the true reason DenseNet works well in applications. + +A heuristic is fine if enough empirical evidence is shown, but I do not think the experimental part is solid either. The authors only report results on CIFAR-10 and CIFAR-100. Those are relatively small data sets. I would expect more results on larger sets such as ImageNet. + +CIFAR-10 is small, and most networks work fairly well on it. Showing a slight improvement on CIFAR-10 (less than 1 point) does not impress me at all, especially given the much more complicated dropout procedure. + +The result of pre-dropout on CIFAR-100 is actually worse than the original DenseNet paper using standard dropout: DenseNet-BC (k=24) has an error rate of 19.64, while pre-dropout is at 19.75. + +Also, the result is NOT state-of-the-art.
Wide-ResNet with standard dropout has better result on both CIFAR-10 and CIFAR-100, but the authors did not mention it. +",3,3.0,ICLR2019 +A50BpS-ipA0,2,34KAZ9HbJco,34KAZ9HbJco,A large disconnect between the proposed additions and the problems the paper tries to solve,"The paper proposes three additions to improve a monolithic multilingual end-to-end ASR system. The problem of training a monolithic multilingual ASR system is that using data from multiple languages does not necessary improve over individual monolingual systems. The three additions are a large multilingual language model, the use of language adapters, and smoothing on the token probabilities. Mixing the three additions in a specific way helps improve the average word error rates. + +There are two major problems in the paper. One is the imprecise use of words, and the other is the disconnect between the additions and the problems they try to solve. Details are as follows. + +The paper contains a lot of imprecise use of words. The term ""long tail"" is used throughout the paper, but it is never clearly defined. The long tail of a distribution refers to a significant total amount of probability mass spread on a large support. In the context of this paper, when the paper talks about the long-tail problem, what distribution are we talking about? Is it a distribution that captures how likely a phone or a word piece is used in all of the world's languages? + +While the long-tail problem is not properly defined, the class imbalance problem more or less is. There is still a certain amount of ambiguity. For example, what are the classes? Are the classes languages, phones, or word pieces? + +Given that the long-tail problem is not defined, it is hard to see why the proposed additions solve the problem. I can understand using a larger language model would help the final performance, but how does this solve the long-tail problem and the class imbalanced problem? The same applies to language adapters. The smoothing technique does have a effect on generalizing to low frequency or even unseen tokens, but the paper does not mention the connection or cite the proper papers. + +The paper also ignores the relationships among languages. For example, it is obvious that none of the word pieces in Mandarin are shared with the other languages. It is also the only tonal language. As another example, Tatar is Turkic but uses the Cyrillic script; Turkish is also Turkic but it uses the Latin alphabet; Russian is not Turkic but uses the Cyrillic script. These relationships are important in interpreting the results when training multiple languages together. + +Here are a list of detailed comments. + +> x \in R^{T,F} + +T,F is a rather unconventional notation. I would suggest T \times F. + +> KL(y_{ATTN} || y) + +Are the y's labels? This is also an unconventional (if not wrong) notation. It should be the the KL of distributions, not labels. Later on, for example in equation (3), y is used as labels. + +> equation (3) + +\mathcal{Y} is undefined. + +> Figure 7 depicts ... + +Figure 7 is in the appendix. The main content without the appendix should be as self-contained as possible. + +> Let t denote the current time step. + +This is confusing. It's actually not the time in the actual speech, but the t-th token. + +> A natural adjustment is to scale the raw logits ... + +The term logit is misused. Please look it up, stop misusing it, and define the symbols properly. + +> equation (6) + +The symbol * should really be \times. 
+ +> equation (9) + +It is confusing to denote the probability as y_t^{adj}. Again, because the bold face y is used as a sequence of labels else where, such as equation (11). + +> ... and 2 times gradient accumulation in a single GPU ... + +What does this mean exactly? Please elaborate. + +> This is due to the human languages share some common sub-phonetic articulatory features (Wang & Sim, 2014) ... + +1. This sentence is ungrammatical. 2. This is a well-known fact, and the citation cannot be this recent. 3. No evidence in this paper is shown that this is the actual cause of the improvement. Please state it clearly if this is only a speculation. + +> ... even MT models improve the performance of the low-resource languages significantly. + +This is not exactly true. For example, the performance on Mandarin actually degrades quite significantly. + +> ... compared to the MT, the tail classes ... However, the head classes suffer ... + +Are the terms tail classes and head classes defined? + +> ... and possibly model overfitting to the tail classes. + +This is easy to check. What's the performance on the training set? + +> The gains ... of the head languages, although tail languages ... + +Again, what are head and tail languages?",4,4.0,ICLR2021 +BGxzL7l0hbm,4,R4aWTjmrEKM,R4aWTjmrEKM,Interesting work but exposition makes it hard to assess.,"########################################################################## +Summary: + +The paper provides an interesting approach to speeding up the convergence time of the Policy-Space Response Oracles framework by re-using the Q-functions of past best-responses to transfer knowledge across epochs. +########################################################################## +Reasons for score: + +Overall, I am low confidence on my assessment of this paper due to the exposition in the algorithm section being relatively confusing. The experimental results are interesting which suggests that the method has value but there is key missing information on how the best response policies are constructed that make it difficult to assess the paper and lead to my not wanting to recommend its acceptance. I would highly recommend being more detailed in Sec. 3 to allow me to reassess the paper. I would certainly be willing to update my score if the paper was clearer to read. + ##########################################################################Pros: + +1. The experimental results on Leduc Poker are very speedy in terms of time-steps. + +2. The idea of reusing the prior Q functions and just mixing them together rather than re-learning all of the policies is very good. + + +########################################################################## +Cons: + +1. The algorithm boxes are so high level that I am struggling to understand how the algorithms work. I would not be able to implement it from reading the paper. + +######################################################################### +Things that would improve readability: + +- It would be nice in the algorithm boxes to connect Q-mixing to how the best policy is explicitly output. I was not able to understand how Q-mixing was connected to either Algorithm 2 or 3 and subsequently had difficulty following the paper. +- \lambda does not appear to be defined anywhere but appears in the Mixed Oracles algorithm box +- What is the OpponentOracle and the TransferOracle? They are defined in the algorithm boxes but are not clearly defined elsewhere. 
+- The specific example of RPS in section 3.2 does not provide useful intuition by going through the numerics, it may be more helpful to walk through a more high level description. +- It would probably be useful to move more of the experimental results to an appendix to leave room for the exposition of the algorithms.",7,4.0,ICLR2021 +ByRJLo-Vg,2,SJQNqLFgl,SJQNqLFgl,Review,"The authors have grouped recent work in convolutional neural network design (specifically with respect to image classification) to identify core design principles guiding the field at large. The 14 principles they produce (along with associated references) include a number of useful and correct observations that would be an asset to anyone unfamiliar with the field. The authors explore a number of architectures on CIFAR-10 and CIFAR-100 guided by these principles. + +The authors have collected a quality set of references on the subject and grouped them well which is valuable for young researchers. Clearly the authors explored a many of architectural changes as part of their experiments and publicly available code base is always nice. + +Overall the writing seems to jump around a bit and the motivations behind some design principles feel lost in the confusion. For example, ""Design Pattern 4: Increase Symmetry argues for architectural symmetry as a sign of beauty and quality"" is presented as one of 14 core design principles without any further justification. Similarly ""Design Pattern 6: Over-train includes any training method where the network is trained on a harder problem than necessary to improve generalization performance of inference"" is presented in the middle of a paragraph with no supporting references or further explanation. + +The experimental portion of this paper feels scattered with many different approaches being presented based on subsets of the design principles. In general, these approaches either are minor modifications of existing networks (different FractalNet pooling strategies) or are novel architectures that do not perform well. The exception being the Fractal-of-Fractal network which achieves slightly improved accuracy but also introduces many more network parameters (increased capacity) over the original FractalNet. + + +Preliminary rating: +It is a useful and perhaps noble task to collect and distill research from many sources to find patterns (and perhaps gaps) in the state of a field; however, some of the patterns presented do not seem well developed and include principles that are poorly explained. Furthermore, the innovative architectures motivated by the design principles either fall short or achieve slightly better accuracy by introducing many more parameters (Fractal-Of-Fractal networks). For a paper addressing the topic of higher level design trends, I would appreciate additional rigorous experimentation around each principle rather than novel architectures being presented.",3,4.0,ICLR2017 +r1H61czNl,3,H13F3Pqll,H13F3Pqll,,"In this work, the authors propose to use a (perhaps deterministic) retrieval function to replace uniform sampling over the train data in training the discriminator of a GAN. +Although I like the basic idea, the experiments are very weak. There are essentially no quantitative results, no real baselines, and only a small amount of not especially convincing qualititative results. It is honestly hard to review the paper- there isn't any semblance of normal experimental validation. + +Note: what is happening with the curves in fig. 
6?",3,4.0,ICLR2017 +BJeMLJhw2Q,1,H1eRBoC9FX,H1eRBoC9FX,An interresting but rushed paper,"*Summary:* The present paper proposes to use Model Agnostic Meta Learning to (meta)-learn in an unsupervised fashion the reward function of a Markov decision process in the context of Reinforcement Learning (RL). The distribution of tasks corresponds to a distribution of reward functions which are created thanks to random discriminators or diversity driven exploration. + +*Clarity:* The goal is well stated but the presentation of the method is confusing. + +There is a constant switch between caligraphic and roman D. Could you homogenize the notations? + +Could you keep the same notation for the MDP (eg in the introduction and 3.5, the discount factor disappeared) + +In the introduction the learning algorithm takes a MDP in mathcal{T} and return a policy. In the remaining of the paper mathcal{D} is used. Could you clarify? I guess this is because only the reward of the MDP is meta-learned, which is itself based on D_phi? + +you choose r = log(Discriminator). Could you explain this choice? Is there alternative choices? + +In subsection 3.4, why the p in the reward equation? + +Algorithm 1 is not clear at all and needs to be rewritten: + - Could you specify the stopping criterion for MAML you used? + - Could you number the steps of the algorithm? + +Concerning the experiments: + +In my opinion the picture of the dataset ant and cheeta is irrelevant and could be removed for more explainations of the method. + +It would be very nice to have color-blind of black and white friendly graphs. + +In the abstract, I don't think the word demonstrate should be used about the experimental result. As pointed out in the experimental section the experiment are here rather to give additional insight on why and when the proposed method works well. + +Your method learns faster than RL from scratch on the proposed dataset in terms of iteration. What about monitoring the reward in terms of time, including the meta-learning step. Is there any important constant overhead in you the proposed method? How does the meta training time impact the training time? Do you have examples of datasets where the inductive bias is not useful or worst than RL from scratch? If yes could you explain why the method is not as good as RL from scratch? + +The presentation of the result is weird. +Why figure 4 does not include the ant dataset? Why no handcrafted misspecified on 2D navigation? +Figure 3 and 4 could be merged since many curves are in common. + +How have you tuned your hyperparameters of each methods? Could you put in appendix the exact protocol you used, specifying the how hyperparameters of the whole procedured are chosen, what stopping criterion are used, for the sake of reproducibility. A an internet link to a code repository used to produce the graphs would be very welcome in the final version if accepted. + +In the conclusion, could you provide some of the questions raised? + +*Originality and Significance:* As I'm not an expert in the field, it is difficult for me to evaluate the interest of the RL comunity in this work. Yet to the best of my knowledge the work presented is original, but the lack of clarity and ability to reproduce the results might hinder the impact of the paper. + +Typos: +Eq (2) missing a full stop +Missing capital at ´´ update´´ in algorithm 1 +End of page 5, why the triple dots? 
+",4,2.0,ICLR2019 +BkeaQvJptS,2,SklgfkSFPH,SklgfkSFPH,Official Blind Review #2,"This paper propose a second-order approximation to the empirical loss in the PAC-Bayes bound of random neural networks. Though the idea is quite straightforward, the paper does a good job in discussing related works and motivating improvements. + +Two points made about the previous works on PAC-Bayesian bounds for generalization of neural networks (especially Dziugaite & Roy, 2017) are: +* Despite non-vacuous, these results are obtained on ""significantly simplified"" datasets and remain ""significantly loose"" +* The mean of q after optimizing the PAC-Bayes bound through variational inference is far different from the weights obtained in the original classifier. + +These points are valid. But it's unclear to me that the proposed method fixes any of them. My concerns are summarized below: +* The inequalities are rather arbitrary and not convincing to me. BY Taylor expansion one actually get a lower bound of the right hand side, However the authors write it as first including the higher-order terms, which results in an upper bound, then throwing the higher-order term and arguing the final equation as an approximate upper bound. I believe this can be incorrect when the higher-order terms plays an nonnegligible role. +* The theorems are easy algebras and better not presented as theorems. +* The proposed diagonal and layer-wise approximation to hessian are very rough estimate of the original Hessian and it is not surprising that it doesn't give meaningful approximation of the original bound. +* There is no explicit comparison with previous methods using the same dataset and architecture. It would be much more convincing if the authors include the results of previous works using the same style of figures as Figure 2/3. + +Minor: +* I understand using the invalid bound (optimizing prior) as a sanity check. But the presentation in the paper could better be improved by explaining why doing this. +* Do the plots in Figure 2 correspond to the invalid or valid bound? +* Many papers are complaining that Hessian computation is difficult in autodff libs without noticing this is a fundamental limitation of these reverse-mode autodiff libraries and no easy fix exists. +* I believe MCMC is not used and the authors are refering to MC (page 7, first paragraph). +",3,,ICLR2020 +SkgrWz6n2m,3,HyVbhi0cYX,HyVbhi0cYX,Complexity of training ReLU Neural Networks,"This paper claims results showing ReLU networks (or a particular architecture for that) are NP-hard to learn. The authors claim that results that essentially show this (such as those by Livni et al.) are unsatisfactory as they only show this for ReLU networks that are fully connected. However, the authors fail to criticize their own paper for only showing this result for a network with 3 gates. For the same reason that the Livni et al. results don't imply anything for fully connected networks, these results don't imply anything for larger networks. Conceivably certain gadgets could be created to ensure that the larger networks are essentially forced to ignore the rest of the gates. This line of research isn't terribly interesting and furthermore the paper is not particularly well written. + +For learning ReLUs, it is already known (assuming conjectures based on hardness of improper PAC learning) that functions that can be represented as a single hidden layer ReLU network cannot be learned even using a much larger network in polynomial time (see for instance the Livni et al. 
paper, etc.). Proving NP-hardness results for proper isn't as useful as they usually are very restricted in terms of architectures the learning algorithm is allowed to use. However, if they do want to show such results, I think the NP-hardness of learning 2-term DNF formulas will be a much easier starting point. + +Also, I think there is a flaw in the proof of Lemma 4.1. The function f *cannot* be represented by the networks the authors claim to use. In particular the 1/\eta outside the max(0, x) term is not acceptable.",3,5.0,ICLR2019 +BJlp2oCh2Q,3,H1lJws05K7,H1lJws05K7,"Some good ideas, many issues","The authors prove some theoretical results under the mean field regime and support their conclusions with a small number of experiments. Their central argument is that a correlation curve that leads to sub-exponential correlation convergence (edge of chaos) can still lead to rapid convergence if the rate is e.g. quadratic. They show that this is the case for ReLU and argue that we must ensure not only sub-exponential convergence, but also have a correlation curve that is close to the identity everywhere. They suggest activation functions that attain conditions as laid out in propositions 4/5 as an alternative. + +The paper has many flaws: +- the value of the theoretical results is unclear +- the paper contains many statements that are either incorrect or overly sweeping +- the experimental setup and results are questionnable + +Theoretical results: +**Proposition 1: pretty trivial, not much value in itself +**Proposition 2: Pretty obvious to the experienced reader, but nonetheless a valuable if narrow result. +**Proposition 3: Interesting if narrow result. Unfortunately, it is not clear what the ultimate takeaway is. Is quadratic correlation convergence ""fast""? Is it ""slow""? Are you implying that we should find activation function where at EOC convergence is slower than quadratic? Do those activation functions exist? It would be good to compare this result against similar results for other activation functions. For example, do swish / SeLU etc. have a convergence rate that is less than quadratic? +**Proposition 4: The conditions of proposition 4 are highly technical. It is not clear how one should go about verifying these conditions for an arbitrary activation function, let alone how one could generate new activation functions that satisfy these conditions. In fact, for an arbitrary nonlinearity, verifying the conditions of proposition 4 seems harder than verifying f(x) - x \approx 0 directly. Hence, proposition 4 has little to no value. Further, it is not even clear whether f(x) - x \approx 0 is actually desirable. For example, the activation function phi(x)=x achieves f(x) = x. But does that mean the identity is a good activation function for deep networks? Clearly not. +**Proposition 5: The conditions of prop 5 are somewhat simpler than those of prop 4, but since we cannot eliminate the complicated condition (ii) from prop 4, it doesn't help much. +**Proposition 6: True, but the fact that we have f(x) - x \approx 0 for swish when q is small is kind of obvious. When q is small, \phi_swish(x) \approx 0.5x, and so swish is approximately linear and so its correlation curve is approximately the identity. We don't need to take a detour via propposition 4 to realize this. + +Presentation issues: +- While I understand the point figures 1, 2 and 4b are trying to make, I don't understand what those figures actually depict. They are insufficiently labeled. 
For example, what does each axis represent? +- You claim that for ReLU, EOC = {(0,\sqrt{2})}. But this is not true. By definition 2, EOC is a subset of D_\phi,var. But {(0,\sqrt{2})} is not in D_\phi,var, because it simply leaves all variances unchanged and does not cause them to converge to a single value. You acknowledge this by saying ""For this class of activation functions, we see (Proposition 2) that the variance is unchanged (qal = qa1) on the EOC, so that q does not formally exist in the sense that the limit of qal depends on a. However,this does not impact the analysis of the correlations."" Section 2 is full of complicated definitions and technical results. If you expect the reader to plow through them all, then you should really stick to those definitions from then on. Declaring that it's fine to ignore your own definitions at the beginning of the very next section is bad presentation. This problem becomes even worse in section 3.2, where it is not clear which definition is actually used for EOC in your main result (prop 4), making prop 4 even harder to parse than it already is. + +Correctness issues: +- ""In this chaotic regime, it has been observed in Schoenholz et al. (2017) that the correlations converge to some random value c < 1"" Actually, the correlation converges deterministically, so c is not random. +- ""This means that very close inputs (in terms of correlation) lead to very different outputs. Therefore, in the chaotic phase, the output function of the neural network is non-continuous everywhere."" Actually, the function computed by a plain tanh network is continuous everywhere. I think you mean something like ""the output can change drastically under small changes to the input"". But this concept is not the same as discontinuity, which has an established formal definition. +- ""In unreported experiments, we observed that numerical convergence towards 1 for l ≥ 50 on the EOC."" Covergence of a sequence is a property of the limit of the sequence, and not of the 50th element. This statement makes no sense. Also if you give a subjective interpretation of those experimental results, you should present the actual results first. +- ""Tanh-like activation functions provide better information flow in deep networks compared to ReLU-like functions."" This statement is very vague and sweeping. Also, one could argue that the fact that ReLU is much more popular and tends to give better results than tanh in practice disproves the statement outright. +- ""Tanh-like activation functions provide better information flow in deep networks compared to ReLU-like functions. However, these functions suffer from the vanishing gradient problem during back-propagation"" At the edge of chaos, vanishing gradients are impossible! As Schoenholz showed, at the edge of chaos, \chi_1=1, but \chi_1 is also the rate of growth of the gradient. Pascanu et al (2013) discussed vanishing gradients in RNNs, which is a different story. +- ""Other activation functions that have been shown to outperform empirically ReLU such as ELU (Clevert et al. (2016)), SELU (Klambauer et al. (2017)) and Softplus also satisfy the conditions of Proposition 4 (see Supplementary Material for ELU)."" Firstly, SeLU does not satisfy proposition 4. f(x) \approx x requires \phi to be close to a linear function in the range where the pre-activations occur. Since SeLU has a kink at 0, it cannot be close to a linear function no matter how small the pre-activations are. 
Secondly, softplus also doesn't satisfy proposition 4, as \phi(0) = 0 does not hold. Thirdly, this statement is too sweeping. If ELU / SELU / Softplus ""outperform"" ReLU, why is ReLU still used in practice? At best, those nonlinearities have been shown to outperform in a few scenarios. +- ""We proved in Section 3.2 that the Tanh activation guarantees better information propagation through the network when initialized on the EOC."" Prop 4 only applies in the limit as \sigma_b converges to 0. So you can't claim that you showed tanh as ""better information propagation"" in general. +- ""However, for deeper networks (L ≥ 40), Tanh is stuck at a very low test accuracy, this is due to the fact that a lot of parameters remain essentially unchanged because the gradient is very small."" But in figure 6b the accuracy for tanh is decreasing rapidly, so therefore the parameters are not remaining ""essentially unchanged"", as this would also cause the accuracy to remain essentially unchanged. Also, if the parameter changes are too small ... why not increase the learning rate? +- ""To obtain much richer priors, our results indicate that we need to select not only parameters (σb , σw ) on the EOC but also an activation function satisfying Proposition 4."" Prop 4 only applies when \sigma_b is small, so you additionally need to make sure \sigma_b small. +- ""In the ordered phase, we know that the output converges exponentially to a fixed value (same value for all Xi), thus a small change in w and b will not change significantly the value of the loss function, therefore the gradient is approximately zero and the gradient descent algorithm will be stuck around the initial value."" But you are using Adam, not gradient descent! Adam explicitly corrects for this kind of gradient vanishing, so a small gradient can't be the reason for the lack of training success. + +Experimental issues: +- ""We use the Adam optimizer with learning rate lr = 0.001."" You must tune the learning rate independently for each architecture for an ubiased comparison. +- In figure 6b, why does tanh start with a high accuracy and end up with a low accuracy? I've never seen a training curve like this ... This suggests something is wrong with your setup. +- You should run more experiments with a larger variety of activation functions. + +Minor comments: +- ""Therefore, it is easy to see that for any (σb , σw ) such that F is increasing and admits at least one fixed point,wehaveKφ,corr(σb,σw) ≥ qwhereqistheminimalfixedpoint;i.e. q := min{x : F(x) = x}."" I believe this statement is true, but I also think it requires more justification. +- At the end of page 3, I think \epsilon_r should be \epsilon_q + +There are some good ideas here, but they need to be developed/refined/polished much further before publication. The above (non-exhaustive) list of issues will hopefully be helpful for this. + + +### Addendum ### +After an in-depth discussion with the authors (see below), my opinion on the paper has not changed. All of my major criticisms remain: (1) There are far easier ways of achieving f(x) ~ x than propositions 4/5/7, i.e. we simply have to choose \phi(x) approximately linear. (2) The experiments are too narrow, and learning rates are badly chosen. (3) The authors do not discuss the fact that as f(x) gets too close to x, performance actually degrades as \phi(x) gets too close to a linear function. (Many other criticisms also remain.) + +The one criticism that the authors disputed until the end of the discussion is criticism (1). 
Their argument seems to hinge on the fact that their paper provides a path to construct activation function that avoid ""structural vanishing gradients"", which they claim 'tanh' suffers from. While they acknowledge that tanh does not necessarily suffer from ""regular"" vanishing gradients (as shown by ""Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice"" and ""Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks""), they claim it suffers from structural vanishing gradients. I do not believe that there is such a thing as structural vanishing gradients. However, even if such a concept did exist, it falls on the the authors to provide a clear definition / explanation, which they neither do in the paper nor the rebuttal.",3,5.0,ICLR2019 +SJxGhB28hX,2,S1en0sRqKm,S1en0sRqKm,Insightful empirical study of the effect of batch size for convergence speed,"This paper empirically investigates the effect of batch size on the convergence speed of the mini-batch stochastic gradient descent of popular deep learning models. The fact that there is a diminishing return of batch size is not very surprising and there is a well-known theory behind it, but the theory doesn't exactly tell when we will start to suffer from the diminishing return. Therefore, it is quite valuable for the community to have an empirical analysis across popular ML tasks and models. In this regard, however, It would've been even nicer if the paper covered more variety of popular ML models such as Machine Translation, Speech Recognition, (Conditional) Image Generation, etc which open source implementations are readily available. Otherwise, experiments in this paper are pretty comprehensive. The only additional experiment I would be interested in is to tune learning rate for each batch size, rather than using a base learning rate everywhere, or simple rules such as LSR or SRSR. Since the theory only gives us asymptotic form of the optimal learning rate, empirically you should be tuning the learning rate for each batch size. And this is not totally unrealistic, because you can use a fraction of computational time to do cross-validation for searching the learning rate. + +pros: +* findings provide us useful direction for future research (that data-parallelism centered distributed training is going to hit the limit soon) +* extensive experiments across 5 datasets and 6 neural network architectures + +cons: +* experiments are a bit too much focused on image classification +* error bars in figures could've provided greater confidence in robustness of findings",8,4.0,ICLR2019 +rygzGj-p2X,3,rJedV3R5tm,rJedV3R5tm,Interesting work which makes Gumbel-softmax relaxation work in GAN-based text generation using a relational memory,"Overall: +This paper proposes RelGAN, a new GAN architecture for text generation, consisting of three main components: a relational memory based generator for the long-distance dependency modeling, the Gumbel-Softmax relaxation for training GANs on discrete data, and multiple embedded representations in the discriminator to provide a more informative signal +for the generator updates. + +Quality and Clarity: +The paper is well-written and easy to read. + +Originality : +Although each of the components (relational memory, Gumbel-softmax) was already proposed by previous works, it is interesting to combine these into a new GAN-based text generator. +However, the basic setup is not novel enough. 
The model still requires pre-training the generator using MLE. The major difference are the architectures (relational memory, multi-embedding discriminator) and training directly through Gumbel-softmax trick which has been investigated in (Kusner and Hernandez-Lobato, 2016). + +Significance: +The experiments in both synthetic and real data are in detail, and the results are good and significant. + +------------------- +Comments: +-- In (4), sampling is known as non-differentiable which means that we cannot get a valid definition of gradients. It is different to denote the gradient as 0. +-- Are the multiple representations in discriminator simply multiple “Embedding” matrices? +-- Curves using Gumbel-softmax trick + RM will eventually fall after around 1000 iterations in all the figures. Why this would happen? +-- Do you try training from scratch without pre-training? For instance, using WGAN as the discriminator + + +Related work: +-- Maybe also consider to the following paper which used Gumbel-softmax relaxation for improving the generation quality in neural machine translation related? +Gu, Jiatao, Daniel Jiwoong Im, and Victor OK Li. ""Neural machine translation with gumbel-greedy decoding."" arXiv preprint arXiv:1706.07518 (2017). +",6,4.0,ICLR2019 +rkMXszYgM,1,SJCPLLpaW,SJCPLLpaW,Needs more data to support,"This paper develops a framework for parallelization of convolutional neural nets. In the framework, parallelism on different dimensions are explored for convolutional layers to accelerate the computation. An algorithm is developed to find the best global configuration. + +The presentation needs to be more organized, it is not very easy to follow. + +1. Computation throughput is not defined. + +2. Although the author mentions DeePa with Tensorflow or Pytorch several times, I think it is not proper to make this comparison. The main idea of this paper is to optimize the parallelization scheme of CNN, which is independent of the framework used. It is more useful if the configuration searching can be developed on tensorflow / pytorch. + +3. The per layer comparison is not very informative for practice because the data transfer costs of convolution layers could be completely hidden in data parallelization. In data parallelism, the GPU devices are often fully occupied during the forward pass and backward pass. Gaps are only in between forward and backward, and between iterations. Model parallelism would add gaps everywhere in each layer. This could be more detrimental when the communication is over ethernet. To be more convincing, it is better to show the profile graph of each run to show which gaps are eliminated, rather than just numbers. + +4. The batch size is also a crucial factor, difference batch size would favor different methods. More comparisons are necessary.",4,5.0,ICLR2018 +aMdBiXX42P7,1,E3UZoJKHxuk,E3UZoJKHxuk,leveraging causality to learn invariant models across domains ,"This paper proposes a VAE based model for learning latent causal factors given data from multiple domains. Similar to [Kingma and Hyv¨arinen, 2020], it utilizes additional labels as supervision signals and learns the model using a Bayesian optimization approach given a fixed hypothetical causal structure. The identifiability is obtained by assuming the casual mechanism to be domain invariant, which is partially supported by some empirical experiments. +I have three concerns for the current version of the paper. +1. 
Directly mitigating the identifiability result of [Kingma and Hyv¨arinen, 2020] to this model seems to be inappropriate. The result of [Kingma and Hyv¨arinen, 2020] shows that the sufficient statistics of the latent code are recoverable. However, it does not mean the causal structure is unique: additional transformations may be allowed to be applied to the adjacency matrix. As a result, it is in question whether your causal mode learns the correct factors under given structure. +2. The symbols are a bit confusing. Some important concepts stay intuitive and lack rigorous mathematical definitions. For example, “output-causative” “cross-domain causal effect” stability measure etc should be defined. A clear table listing the symbols and its meaning would be helpful. +3. The empirical results of table 1 do not fully support your conclusion “true causal factors are learnable”. Simply computing a MCC to the original factors is not enough to me. Some experiments, like examining the vulnerability and performance of the system under the condition that the latent factors are controlled or intervened, for the claimed “invariant causal model” are better to be included to convince readers. +",5,4.0,ICLR2021 +rkpVV6H4l,3,BJuysoFeg,BJuysoFeg,trivially simple yet effective,"This paper proposes a simple domain adaptation technique in which batch normalization is performed separately in each domain. + + +Pros: + +The method is very simple and easy to understand and apply. + +The experiments demonstrate that the method compares favorably with existing methods on standard domain adaptation tasks. + +The analysis in section 4.3.2 shows that a very small number of target domain samples are needed for adaptation of the network. + + +Cons: + +There is little novelty -- the method is arguably too simple to be called a “method.” Rather, it’s the most straightforward/intuitive approach when using a network with batch normalization for domain adaptation. The alternative -- using the BN statistics from the source domain for target domain examples -- is less natural, to me. (I guess this alternative is what’s done in the Inception BN results in Table 1-2?) + +The analysis in section 4.3.1 is superfluous except as a sanity check -- KL divergence between the distributions should be 0 when each distribution is shifted/scaled to N(0,1) by BN. + +Section 3.3: it’s not clear to me what point is being made here. + + +Overall, there’s not much novelty here, but it’s hard to argue that simplicity is a bad thing when the method is clearly competitive with or outperforming prior work on the standard benchmarks (in a domain adaptation tradition that started with “Frustratingly Easy Domain Adaptation”). If accepted, Sections 4.3.1 and 3.3 should be removed or rewritten for clarity for a final version.",6,3.0,ICLR2017 +HJNRbAbVe,2,r1Usiwcex,r1Usiwcex,Review,"The paper presents a way to model the distribution of four-part Bach chorales using Convolutional Neural Networks. Furthermore it addresses the task of artificial music generation by sampling from the model using blocked Gibbs sampling and shows + +The CNN model for the distribution seems very appropriate for the data at hand. Also the analysis of the proposed sampling schemes with the analogy between Gibbs sampling and human music composition are very interesting. +I am not too sure about the evaluation though. Since the reported likelihoods are not directly comparable to previous work, I have difficulties judging the quality of the quantitative results. 
For the human evaluation I would like to see the data for the direct comparisons between the models. E.g. How did NADE vs. Bach perform. Also I find the question: ‘what piece of music do you prefer’ a stronger test than the question ‘what piece is more musical to you’ because I don’t really know what ‘musical’ means to the AMT workers. + +Finally, while I think the Bach Chorales are interesting musical pieces that deserve to be subject of the analysis but I find it hard to judge how well this modelling approach will transfer to other types of music which might have a very different data distribution. + +Nevertheless, in conclusion, I believe this is an exciting model for an interesting task that produces non-trivial musical data. +",6,3.0,ICLR2017 +HylyhLLF3X,1,rJNwDjAqYX,rJNwDjAqYX,Nice experimental paper on curiosity based RL,"In this paper, the authors presented a large experimental study of curiosity-driven reinforcement learning on various tasks. In the experimental studies, the authors also compared several feature space embedding methods, including identical mapping (pixels), random embedding, variational autoencoders and inverse dynamics features. The authors found that in many of the tasks, learning based on intrinsic rewards could generate good performance on extrinsic rewards, when the intrinsic rewards and extrinsic rewards are correlated. The authors also found that random features embedding, somewhat surprisingly, performs well in the tasks. + +Overall, the paper is well written with clarity. Experimental setup is easy to understand. The authors provided code, which could help other researchers reproduce their result. + +Weaknesses: + +1) as an experimental study, it would be valuable to compare the performance of curiosity-based learning versus learning based on well-defined extrinsic rewards. The author is correct that in many tasks, well-behaved extrinsic rewards are hard to find. But for problems with well-defined extrinsic rewards, such a comparison could help readers understand the relative performance of curiosity-based learning and/or how much headroom there exists to improve the current methods. + +2) it is surprising that random features perform so well in the experiments. The authors did provide literature in classification that had similar findings, but it would be beneficial for the authors to explore reasons that random features perform well in reinforcement learning.",7,3.0,ICLR2019 +Hke2kYg2nm,3,ByeTHsAqtX,ByeTHsAqtX,"This paper describes an interesting phenomenon, but some of the experimental evidence is a bit lacking. ","This paper describes an interesting phenomenon: that most of the learning happens in a small subspace. However, the experimental evidence presented in this paper is a bit lacking. The authors also cook up a toy example on which gradient descent exhibits similar behavior. Here are a few detailed comments: + +1. The Hessian overlap metric is suitable for showing the gradient lies in an invariant subspace of the Hessian, but does not show it lies in the dominant invariant subspace. +2. There are well-established notions of distances between subspaces in linear algebra, and I suggest the authors comment on the connection between their notion of overlap between subspaces and these established notions. +3. The authors make a few statements along the lines of ``the Hessian is small, so the objective is flat''. 
This is a bit misleading as it is possible for the gradient to be large but the Hessian to be small.",4,3.0,ICLR2019 +KnW5S5EUtr2,1,Rw_vo-wIAa,Rw_vo-wIAa,"Interesting method, missing ablations, some lack of detail, potentially unfair experimental comparison. ","This works follows the footsteps of other centralized value function approaches for multi-agent learning, building directly on the counterfactual value function proposed in COMA. + +The main idea is to sample additional joint actions to variance-reduce the advantage estimate for each of the agents at each time step. +The paper also introduces a KL penalty that is supposed to ensure that the policies are updated slowly, making the advantage estimate more accurate. + +A few comments: +1) "", we consider using centralized critic to predict joint Q values, and the marginal advantages can be directly calculated with these Q values, which avoids interactive simulation."". This directly contradicts Figure 2, where it is mentioned that ""one step simulations are executed"" for each of the samples. This point is relevant since extra simulation steps would drastically change the sample requirements for the method. Furthermore, this assumes that there is a simulator that can be set to arbitrary state transitions, which is different from the standard RL assumption. +2) ""which means the counter-factual advantage in equation(5) can be replaced by any form of advantage function used in SARL."". This is generally true. How to estimate the advantage is orthogonal to single agent vs multi-agent learning. In particular, the COMA critic (Q-function) could easily be replaced with a GAE or similar. On a related note, while using a critic that depends only on the Q(s,u) (ie. the central state and joint action) is common practice in this line of work, it is not generally appropriate. Clearly, in Dec-POMDPs, the future actions will depend on the action-observation histories (AOHs), $\tau$ of all the agents. As a consequence, in general the critic should condition on these AOHs. +3) ""And in addition, the last actions of ally units are accessable for all agents."" This is different form the standard SMAC setting, where only the last actions of the *visible* agents are observable by each of the agents. Please clarify this rationale. +4) Please specify what advantage estimation is being used in this paper. My understanding is that the paper uses the joint-Q function similar to the COMA paper. Is this accurate? +5) Overall the results look competitive, but are currently not very informative due to missing ablations. It's impossible to tell if the PPO-like KL penalty or the multi-sample variance reduction are responsible for the improved performance. +6) The training curves look very odd. In particular, they look unstable but have very narrow errorbars, indicating that many of the runs drop at the same time. This suggests that the experiments may have accidentally reused the same random seed. +7) MMM2 is the only ""superhard"" scenario used from SMAC. Did the method fail on the other ones? + + +",5,4.0,ICLR2021 +H1wZwQwef,1,B1jscMbAW,B1jscMbAW,Neural networks enriched with divide and conquer strategy,"This paper proposes to add new inductive bias to neural network architecture - namely a divide and conquer strategy know from algorithmics. Since introduced model has to split data into subsets, it leads to non-differentiable paths in the graph, which authors propose to tackle with RL and policy gradients. 
The whole model can be seen as an RL agent, trained to do splitting action on a set of instances in such a way, that jointly trained predictor T quality is maximised (and thus its current log prob: log p(Y|P(X)) becomes a reward for an RL agent). Authors claim that model like this (strengthened with pointer networks/graph nets etc. depending on the application) leads to empirical improvement on three tasks - convex hull finding, k-means clustering and on TSP. However, while results on convex hull task are good, k-means ones use a single, artificial problem (and do not test DCN, but rather a part of it), and on TSP DCN performs significantly worse than baselines in-distribution, and is better when tested on bigger problems than it is trained on. However the generalisation scores themselves are pretty bad thus it is not clear if this can be called a success story. + +I will be happy to revisit the rating if the experimental section is enriched. + +Pros: +- very easy to follow idea and model +- simple merge or RL and SL in an end-to-end trainable model +- improvements over previous solutions + +Cons: +- K-means experiments should not be run on artificial dataset, there are plenty of benchmarking datasets out there. In current form it is just a proof of concept experiment rather than evaluation (+ if is only for splitting, not for the entire architecture proposed). It would be also beneficial to see the score normalised by the cost found by k-means itself (say using Lloyd's method), as otherwise numbers are impossible to interpret. With normalisation, claiming that it finds 20% worse solution than k-means is indeed meaningful. +- TSP experiments show that ""in distribution"" DCN perform worse than baselines, and when generalising to bigger problems they fail more gracefully, however the accuracies on higher problem are pretty bad, thus it is not clear if they are significant enough to claim success. Maybe TSP is not the best application of this kind of approach (as authors state in the paper - it is not clear how merging would be applied in the first place). +- in general - experimental section should be extended, as currently the only convincing success story lies in convex hull experiments + +Side notes: +- DCN is already quite commonly used abbreviation for ""Deep Classifier Network"" as well as ""Dynamic Capacity Network"", thus might be a good idea to find different name. +- please fix \cite calls to \citep, when authors name is not used as part of the sentence, for example: +Graph Neural Network Nowak et al. (2017) +should be +Graph Neural Network (Nowak et al. (2017)) + +# After the update + +Evaluation section has been updated threefold: +- TSP experiments are now in the appendix rather than main part of the paper +- k-means experiments are Lloyd-score normalised and involve one Cifar10 clustering +- Knapsack problem has been added + +Paper significantly benefited from these changes, however experimental section is still based purely on toy datasets (clustering cifar10 patches is the least toy problem, but if one claims that proposed method is a good clusterer one would have to beat actual clustering techniques to show that), and in both cases simple problem-specific baseline (Lloyd for k-means, greedy knapsack solver) beats proposed method. I can see the benefit of trainable approach here, the fact that one could in principle move towards other objectives, where deriving Lloyd alternative might be hard; however current version of the paper still does not show that. 
+ +I increased rating for the paper, however in order to put the ""clear accept"" mark I would expect to see at least one problem where proposed method beats all basic baselines (thus it has to either be the problem where we do not have simple algorithms for it, and then beating ML baseline is fine; or a problem where one can beat the typical heuristic approaches). + +",6,3.0,ICLR2018 +CDeGCVgEh_H,4,MpStQoD73Mj,MpStQoD73Mj,A library for Differentiable Weighted Finite-State Transducers,"Summary: + +The authors introduce a library for differential weighted finite-state transducers. WFST are commonly used in speech or handwriting recognition systems but are generally not trained jointly with the deep neural networks components such as ConvNN. This is not due to theoretical limitation of WFST but rather to a lack of available implementation and the need of important computational power to train them. The authors show that this new library can be used to encode the ASG criterion, by combining the emission graph (coming from a NN for example), the token graph (base recognition units) and the label graph (the sequence annotation) on one hand and the emission graph and a language model graph on the other hand. The authors show how word pieces decomposition can be learnt through marginalisation. Finally, convolution Wfst are rapidly presented. Preliminary experiments are reported on wSj data base for speech recognition and IAM database for handwriting recognition. + + +########################################################################## + +Reasons for score: + + + I am very pleased to see an implementation of the GTN approach which has been proposed more than 20 years ago. WFST approaches have been shown to be more effective (and more elegant) than ad-hoc implementation for both speech and handwriting recognition. If efficient, this library will certainly have a major impact on future ASR and HTR systems. However, implementation details are not given or explained and experiments are still preliminary. Despite its importance and impact, this work seems to be in a too early stage to be accepted to ICLR this year. + + + +##########################################################################Pros: + +Pros: + +* first implementation of a differentiable WFST library +* experiments both on ASR and HTR with interesting results for learning WFST parameters +* a new convolutional WFST is introduced + + +########################################################################## + +Cons: + +* we dont' know to what extent the operations on WFST needed to build a real ASR/HTR application are available (determinisation, minimization, weight pushing, etc) +* ASR/HTR systems are not compared to state of the art, to measure the remaining progress to reach SOTA. +* As said by the authors, include WFST in a differentiable stack of layers needs a lot of computation. Is it trackable for large scale systems ? Table 2 gives epoch times for 1000 word pieces (which is small) and for bigrams only. Is it on TPU or CPU ? +* the section 4 on learning algorithms is not very generic as only an implementation of ASG is first presented then a comparison to CTC. +* section 4.3 on conv. WFST is too short to be really understand the proposed model. Maybe this part should be dropped to leave more room to basic algorithms presentation. 
+ +",6,5.0,ICLR2021 +U4IqNAxpJPQ,3,qFQTP00Q0kp,qFQTP00Q0kp,"Interesting architecture for self-supervised temporal representation learning, but the novelty is limited compared to contrastive learning.","**Summary** +This paper presents a general Self-supervised Time Series representation learning framework. It explores the inter-sample relation reasoning and intra-temporal relation reasoning of time series to capture the underlying structure pattern of the unlabeled time series data. The proposed method achieves new state-of-the-art results and outperforms existing methods by a significant margin on multiple real-world time-series datasets for the classification tasks. + +**Contributions** +1. The paper is well written and easy to follow. The organization is good. +2. The architecture is well motivated. It is reasonable to use unsupervised temporal relations to learn video features. +3. The qualitative results are numerous, insightful, and convincing on multiple datasets. The authors conduct extensive experiments to demonstrate the effectiveness of the proposed method, including inter-sample relation reasoning and intra-temporal relation reasoning. + +**Details** +1. The novelty of the proposed method. +The Inter-sample relation reasoning is very similar to SimCLR, which also maximizes agreement between different views of augmentation from the same sample via a contrastive loss. Considering this, the novelty is relatively incremental. + +2. Additional video recognition experiments. +The used dataset is small-scale that makes the task simple. I would have wanted to see results on large-scale classification tasks, such as video action classification. The performance of action classification is closely related to the temporal feature modeling. So the effects on this task can make the proposed method more convincing. + +**Conclusion** +overall, this paper proposes an interesting architecture for self-supervised temporal modeling. But the novelty is relatively limited compared to the recent SimCLR work. And it requires additional experiments on harder video classification task and datasets to show the effects and robustness of the proposed method. + +**After rebuttal** + +Thanks for the detailed response. +This paper can be seen as an interesting attempt to use self-supervised on time series data. +Although the basic idea is similar to SimCLR, It is still interesting work considering the computation complexity and new loss function. +So I update my score to 6. + +",6,3.0,ICLR2021 +moVB59VNzVL,2,lbc44k2jgnX,lbc44k2jgnX,Interesting upper and lower bounds of Random Coordinate LMC under various smoothness assumptions,"Post rebuttal update: +I read the other reviewers' responses, and, although I am still positive about this paper, I agree with R2 and R4 that safely fixing the theoretical proofs would require a full revision. For this reason, I am lowering my score to 6. + +%%%%%%%%%%%%%%%%%%%%%%%%% + +The authors propose a variant of Unadjusted Langevin Algorithm by replacing the full gradient of the log-density by the gradient of a single coordinate selected at random according to some chosen probability distribution phi. When the log-density of the target distribution is gradient Lipschity and strongly convex, and the step size for updating a coordinate is inversly proportional to the probability of selecting it, the authors show approximate convergence in 2-Wasserstein distance of Random Coordinate LMC (RC-LMC) to the target distribution. 
The convergence guarantees, in terms of cost, match the ones of classical LMC in terms of dimension and accuracy dependence. + +In the case where all dimensional Lipschitz constants are known, the authors propose a new choice of coordinate selection probability distribution phi and step sizes that yield similar convergence guarantees, but with d^2 kappa dependence replaced by (sum_i kappa_i) where kappa is the global condition number, and kappa_i's are the dimensional condition numbers. Hence, this yields improved convergence guarantees in the case of high dimensional and highly skewed log-density. + +In the case where, in addition, the log-density has a Lipschitz Hessian, and under a proper choice of the coordinate selection distribution phi and step sizes depending on the dimensional gradient and Hessian Lipschitz constants, the authors show convergence of RC-LMC in 2-Wasserstein distance with rate O(d^3/2 epsilon), improving upon the best known rate for this setting. + +The authors show a lower bound for RC-LMC in the gradient and Hessian Lipschitz case, matching the previously shown upper bound. Such a lower bound is appreciated, especially in the literature of Langevin dynamics based sampling algorithm where convergence upper bounds are plentiful and not much is known about lower bounds (especially in the deterministic gradient setting). + +Finally, the authors perform a numerical experiment in which they estimate the expectation of some test function of some randome variable following a skewed Gaussian distribution from N samples. They demonstrate that RC-LMC, with our without the knowledge of the dimensional Lipschitz constant, converge faster than classical LMC. + +Concerning this experiment, the various methods converge to different saturation thresholds. However, since the objective is strongly log-concave, all methods should converge arbitratily close to the target distribution when choosing the step size small enough (or using a decaying step size). The only limitant factor should then only be due to N being finite, which is common to all methods. Could you please mention and argue upon the choice of step sizes for each method? It would also be nice to plot the optimal saturation threshold for estimating the expecation of the test function from N samples, which should be computable exactly for a Gausian target distribution.",6,4.0,ICLR2021 +HJyvnP74e,2,SypU81Ole,SypU81Ole,A mixture of many things,"This paper proposed a set of different things under the name of ""sampling generative models"", focusing on analyzing the learned latent space and synthesizing desirable output images with certain properties for GANs. This paper does not have one single clear message or idea, but rather proposed a set of techniques that seem to produce visually good looking results. While this paper has some interesting ideas, it also has a number of problems. + +The spherical interpolation idea is interesting, but after a second thought this does not make much sense. The proposed slerp interpolation equation (page 2) implicitly assumes that the two points q1 and q2 lie on the same sphere, in which case the parameter theta is the angle corresponding to the great arc connecting the two points on the sphere. However, the latent space of a GAN, no matter trained with a uniform distribution or a Gaussian distribution, is not a distribution on a sphere, and many points have different distances to the origin. 
The author's justification for this comes from the well known fact that in high dimensional space, even with a uniform distribution most points lie on a thin shell in the unit cube. This is true because in high-dimensional space, the outer shell takes up most of the volume in space, and the inner part takes only a very small fraction of the space, in terms of volume. This does not mean the density of data in the outer shell is greater than the inner part, though. In a uniform distribution, the data density should be equal everywhere, a point on the outer shell is not more likely than a point in the inner part. Under a Gaussian model, the data density is on the other hand higher in the center and much lower on the out side. If we have a good model of data, then sampling the most likely points from the model should give us plausible looking samples. In this sense, spherical interpolation should do no better than the normally used linear interpolation. From the questions and answers it seems that the author does not recognize this distinction. The results shown in this paper seem to indicate that spherical interpolation is better visually, but it is rather hard to make any concrete conclusions from three pairs of examples. If this is really the case then there must be something else wrong about our understanding of the learned model. + +Aside from these, the J-diagram and the nearest neighbor latent space traversal both seems to be good ways to explore the latent space of a learned model. The attribute vector section on transforming images to new ones with desired attributes is also interesting, and it provides a few new ways to make the GAN latent space more interpretable. + +Overall I feel most of the techniques proposed in this paper are nice visualization tools. The contributions however, are mostly on the design of the visualizations, and not much on the technical and model side. The spherical interpolation provides the only mathematical equation in the paper, yet the correctness of the technique is arguable. For the visualization tools, there are also no quantitative evaluation, maybe these results are more art than science.",5,3.0,ICLR2017 +RgmaLuaZ1w3,4,RB0iNPXIj60,RB0iNPXIj60,simple method to refine detection results.,"Summary: +This paper presents a simple yet powerful and flexible framework to refine the predictions of a two-stage detector. The approach can produce more precise predictions by using mixture data of image information and the objects' class and center. They showed a simple scheme can increase the mAP of the SOTA models and it is able to produce predictions that are more precise than ground truth. + +Weakness: ++ The idea of this paper is to use a refinement module to boost the performance of the two-stage detectors. I find this work to contain very limited novelty that other researchers can use/build on. The proposed method simply uses a naive refine module to extract the Region feature from the crop. In my opinion, this simple module is similar to Cascade R-CNN. The only difference is that it extracts features from the crop of the images. It does not advance the understanding of this field although is reasonable to me. + ++ The ablation experiments are weak and inadequate. Only one experiment is provided to compare the performance of the refine module. The author should do more ablation studies to support his contribution. e.g. (1) the comparison with the Cascade R-CNN which extracts the feature from the region feature maps rather than the images. 
(2) the architecture or the refinement module number. + ++ The claim of run in real-time on standard hardware, without any time cost or FPS results in this paper. However, the speed of the refinement module may be slow owing to the extracting feature from the crops. For the two-stage detector, e.g. FPN, the proposals of the detector are 512 under the common setting. + ++ The results would have been more complete if results were shown in a setting where the region feature is used without the use of the original crops. In other words, an ablation study on the effect of the feature extraction strategies. + ++ How important is the crop size to the proposed method? Considering the paper states that this is required to get a good crop, some ablation studies on showing the crop strategies would be useful for understanding. + ++ In Abstract, the author of this paper provides his code which is non-anonymous. It shows that the repository of BBREFINEMENT is the ""IRAFM AI"" and the author's name is easy to be found. This behavior violates the rules of the anonymous code mentioned in the Author Guide. + +Finally, I suggest rejecting the paper. BTW, The author should pay attention to the rules the next time. +",4,4.0,ICLR2021 +BJlxflLD3X,2,r1MSBjA9Ym,r1MSBjA9Ym,"Good, interesting results - presentation improved","The paper studies failure modes of deep and narrow networks. I find this research extremely valuable and interesting. In addition to that, the paper focuses on as small as possible models, for which the undesired behavior occurs. That is another great positive, too much of a research in DL focuses on the most complex and general cases in my opinion. I would be more than happy to give this paper a very strong recommendation, if not for numerous flaws in presentation. If those get improved, I am very eager to increase my rating. Here are the things that I think need an improvement: +1. The formulation of theorems. +The paper strives for mathematical style. Yet the formulations of the theorems are very colloquial. Expression ""by assuming random weights"" is not what one wants to see in a rigorous math paper. The formulations of the theorems need to be made rigorous and easy to understand, the assumptions need to be clearly stated and all concepts used strictly defined. +2. Too many theorems +9 (!) theorems is way too much. Theorem is a significant contribution. I strongly suggest having 1-2 strong theorems, and downgrading more technical lemmas to a lemma and proposition status. +In addition - the problem studied is really a study of bad local minimas for neural networks. More mentions of the previous work related to the topic would improve the scientific quality additionally, in my opinion. +",7,5.0,ICLR2019 +SyxMi84Z6Q,3,HJehSnCcFX,HJehSnCcFX,"This paper proposes an algorithm for missing data problem in continuous time events data (ie, point processes) where both past and future events are helpful. ","This paper tackles a very important and practical problem in event stream planning. The problem is very interesting and the approach taken is standard. + +The presentation of the paper is not clear enough. The notations and definitions and methods are presented in a complicated way. It's difficult to follows. + +From the contribution point of view the paper looks like to be a combination of several existing and well developed approach: Neural Hawkes Process + particle smoothing + minimum bayes risk + alignment. It's not very surprising to see these elements together. 
It would have helped if the authors made it clear why each part is chosen and clearly state what is the novelty and contributed of the paper to the field. + +The paper in its current format is not ready for publication. But it's a good paper and can be turned to a good paper for the next venue.",5,4.0,ICLR2019 +rJ1RyB7Ng,2,SygGlIBcel,SygGlIBcel,lacks experimental evidence,"this paper proposes a model for representing unseen words in a neural language model. the proposed model achieves poor results in LM and a slight improvement over a baseline model. + +this work needs a more comprehensive analysis: +- there's no comparison with related work trying to address the same problem +- an intrinsic evaluation and investigation of why/how their work should be better are missing. +- to make a bolder claim, more investigation should be done with other morphologically rich languages. Especially for MT, in addition to going from En-> Language_X, MRL_X -> En or MRL_X -> MRL_Y should be done. +",2,5.0,ICLR2017 +r1lL6hgdKH,3,rJeINp4KwH,rJeINp4KwH,Official Blind Review #3,"The authors propose another method of doing population-based training of RL policies. During the training process, there are N workers running in N copies of the environment, each with different parameter settings for the policies and value networks. Each worker pushes data to a shared replay buffer of experience. The paper claims that a natural approach is to have a chief job periodically poll for the best worker, then replace the weights of each worker with the best one. Whenever this occurs, this reduces the diversity within the population. + +In its place, the authors propose a soft-update in the chief. At every merging cycle, the chief queries which worker performs best. If that worker is worker B, it emits pi_B's parameters to each of the other workers. Instead of replacing the parameters exactly, worker i's loss is then augmented by beta * D(pi_i, pi_B), where D is some distance measure that is measured over states sampled from the replay buffer. The ""soft"" update encourages individual workers to match pi_B without directly replacing their parameters, which maintains diversity in the population. In this work, pi is always represented by a deterministic policy and D is the mean-squared-error in action space (this is argued as equivalent to the KL divergence between the two policies if the policies were represented by Gaussian with the same, constant standard deviation). The beta parameter is updated online using heuristics based on how D(pi_i, pi_B) compares to D(pi_i, old_pi_i). Using TD3 as a base algorithm, the population-based version performs better, and there are ablations for various parts of the population algorithm. + +I thought this paper was interesting, but thought it was strange that there were very few comparisons to other population / ensemble-based training methods. In particular they mention the copying problem as a downside of population-based training (PBT), but do not compare against PBT at all. Additionally, my understanding of PBT is that when they replace bad agents with the best agent, they only replace the worst performing agents (not all of them), and they additionally add some random perturbations to their hyperparameter settings. This goes counter to the claim that they collapse the population to a single point- by my reading the exploration step avoids this collapse. + +An experiment I'd like to see is trying PBT, where different workers do in fact use different hyperparameters. 
My understanding is that in P3S-TD3 there is a single hyperparameter setting shared across all workers (plus some hyperparameters deciding the soft update). + +I'd also like to see ablations for the Resetting variant (Re-TD3), where only the bottom half or 2/3rds of the workers are reset. This would give empirical evidence for the ""population collapse"" intuition - we should expect to see some improvements if we avoid totally collapsing the population, while still copying enough to partially exploit the current best setting. + +Many inequalities in the paper are argued by compare the expectation of negative Q of one policy to the negative Q of another - I believe the derivations would be much easier to follow if the authors simply multiplied all sides by -1 and adjusted inequalities accordingly. It is much easier to think about Q-value-1 > Q-value-2 rather than -Q-value-1 < -Q-value-2 when trying to interpret what the equation is saying. + +For related work, papers on evolutionary strategies and the various self-play-in-a-population papers seem relevant, since these often take the form of having each worker i do a different perturbation that is later merged by a chief. + +In Figure 4 it feels weird that results are the regular Mujoco envs for 2 problems and the delayed envs for the other 2 problems. When looking at the appendix, it's rather clearly cherry picked to show the best results in favor of PS3-TD3. I would prefer the Delayed MuJoCo experiments be in a separate figure, or to include the TD3/SAC/ACKTR/PPO/etc. results for the delayed envs as well (these don't appear to be in the appendix) + +On the theoretical results: the 1st assumption seems very strong. The first assumption argues that pi_B is always 1-step better than pi_old for every state. That assumption already takes you very far towards arguing ""updating pi_old to pi_B is good"". The 2nd assumption is more reasonable but I'm confused how rho and d play into the theoretical results. Do they play any role in how much the policy is expected to improve, or do the constants just need to exist? + +The last comment on the theory side is that I still don't understand the intuition for why we want to learn beta such that + +KL(pi_new || pi_b) = max {rho * KL_max(pi_new || pi_old), d} + +In the practical algorithm, beta is updated online to increase / decrease the importance of the ""match pi_B"" term if the ratio between the two strays too far from 1 (with the threshold set to [1/1.5, 1.5] in a manner similar to PPO's approach). But why should it be important for the two values to be close to one another? Let me write out the derivation continuing from Eqn (57) in the appendix. + +With a substitution that doesn't use (c) to drop the beta * (KL - KL) term, we get + +E_{pi_b}[-Q_new] >= E_{pi_new}[-Q_new] + beta * (KL - KL) +--> +E_{pi_new}[Q_new] >= E_{pi_b}[Q_new] + beta * (KL - KL) + +Then, in Theorem 1, we recursively apply this inequality, accumulating a number of beta * (KL - KL) terms. In the end we get + +Q_new >= (discounted sum rewards from pi_b) + (discounted sum of beta * (KL - KL) with expectation over states from pi_b) += Q_pi_b + (sum of beta *(KL - KL) terms) + +By my reading, shouldn't this mean we want KL(pi_new || pi_b) - max {rho * KL_max(pi_new || pi_old), d} to be as large as possible, rather than 0? The more positive this term is, the more improvement we get between Q_new and Q_pi_b. 
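+ +To make the algebra behind that last question explicit (this is only my reading of the bound, not necessarily the authors' intended argument, and I may be missing the step that motivates the adaptation rule), the chain I have in mind is
+$$\mathbb{E}_{a\sim\pi_{new}}[Q_{new}(s,a)] \;\ge\; \mathbb{E}_{a\sim\pi_B}[Q_{new}(s,a)] + \beta\Big(\mathrm{KL}(\pi_{new}\,\Vert\,\pi_B) - \max\{\rho\,\mathrm{KL}_{max}(\pi_{new}\,\Vert\,\pi_{old}),\, d\}\Big),$$
+and unrolling it along trajectories of $\pi_B$ (with $\mathrm{KL}_t$ denoting the divergence evaluated at $s_t$) gives
+$$Q_{new} \;\ge\; Q_{\pi_B} + \mathbb{E}_{\pi_B}\Big[\sum_{t\ge 0}\gamma^{t}\,\beta\big(\mathrm{KL}_t - \max\{\rho\,\mathrm{KL}_{max},\, d\}\big)\Big],$$
+so the guaranteed improvement over $\pi_B$ grows with how positive the bracketed term is, which is why driving its ratio toward one does not obviously seem like the right target.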
+ +-------------------------- + +Overall, this feels like a good paper, but I'm not too familiar with prior empirical results for population-based RL methods. The ablations suggested that pretty much any reasonable population-based method outperformed using a single worker, and because of this it seems especially important to have ablations to other population-based prior work, rather than just variants of its own method. + +I would be okay with this paper as-is despite some of its flaws, but think it could be better pending rebuttal. + +Edit: I have read the author reply and other reviews. I do not plan to change my rating but do think the paper is improved by the added baselines and better explanation of what the beta adaptation rule is doing. I would ask the authors to make sure this description is as clear as possible, as the argued improvement gap seems central to the work.",6,,ICLR2020 +S1eSaJc53Q,2,HyesW2C9YQ,HyesW2C9YQ,A renewed attempt for adapting dialog responses to emotional context,"The paper describes a new study about how to make dialogs more empathetic. +The work introduced a new dataset of 25k dialogs designed to evaluate the +role that empathy recognition may play in generating better responses +tuned to the feeling of the conversation partner. Several model +set-ups, and many secondary options of the set-ups are evaluated. + +Pros: + +A lot of good thoughts were put into the work, and even though the techniques +tried are relatively unsophisticated, the work represents a serious attempt +on the subject and is of good reference value. + +The linkage between the use of emotion supervision and better relevancy is interesting. + +The dataset by itself is a good contribution to the community conducting studies in this area. + +Cons: + +The conclusions are somewhat fuzzy as there are too many effects +interacting, and as a result no clear cut recommendations can be made +(perhaps with the exception that ensembling a classifier model trained +for emotion recognition together with the response selector is seen +as having advantages). + +There are some detailed questions that are unaddressed or unclear from +the writing. See the Misc. items below. + +Misc. + +P.1, 6th line from bottom: ""fro"" -> ""from"" + +Table 1: How is the ""situation description"" supposed to be related to the +opening sentence of the speaker? In the examples there seems to be substantial +overlap. + +Figure 2, distribution of the 32 emotion labels used: +this is a very refined set that could get blurred at the boundaries between similar emotions. +As for the creators of those dialogs, does everyone interpret the same emotion label the same way? +e.g. angry, furious; confident, prepared; ...; will such potential ambiguities impact the work? +One way to learn more about this is to aggregate related emotions to make a coarser set, +and compare the results. + +Also, often an event may trigger multiple emotions, which one the speaker chooses to focus on +may vary from person to person. How may ignoring the secondary emotions impact the results? +To some extent this is leveraged by the prepending method (with top-K emotion predictions). +What about the other two methods? + +P. 6, on using an existing emotion predictor: does it predict the same set of emotions +that you are using in this work? + +",7,4.0,ICLR2019 +SJlq8XkRFB,1,S1gtclSFvr,S1gtclSFvr,Official Blind Review #3,"This paper presents a phrase-based encoder-decoder model for machine translation. The encoder considers all possible phrases (i.e. 
word sequences) up to a certain length and compute phrase representations using bidirectional LSTMs from contextual word embeddings computed with another bidirectional LSTM layer. The decoder also considers possible segmentations and computes contextual representations for the previously generated segments. Each word in the current segment is generated by a Transformer model by attending to all phrases in the source sentence. The authors present a dynamic programming method for considering all possible segmentations in decoding. They also present a method for incorporating a phrase-to-phrase dictionary built by Moses into the decoding process. + +I like the idea of phrase-to-phrase translation and the relatively simple architecture proposed in the paper. At the moment, however, I am not quite sure how practical their approach is. One reason is the experimental setting. Both of the datasets used in the experiments are quite small and it is not clear how the proposed model performs when several millions of sentence pairs are available for training. + +Another reason is that the computational cost of the proposed model is not really clear. The authors state that it is much more efficient than NPMT but it is not clear how it compares to the standard Transformer approach. It seems to me that the computational cost of their model is highly dependent on the value of P (maximum length of phrases). + +At first, I thought the decoder was implemented with LSTMs, but I realized that it was actually implemented with a Transformer by reading the appendix. I think this should be explained in the main body of the paper. I am also wondering how the authors’ model compares to a standard seq-to-seq model whose decoder is implemented with a Transformer. + +The equation in section 2.2 seems to suggest that the model prefers segmentations with small numbers of segments. I am wondering if there is any negative effect on the translation quality. + +Here are some minor comments: + +p.2 valid of -> valid +p.4 lookup -> look up? +p.4 forr -> for +p.4 indict -> indicate? +p.5 Table 1 -> Table 1",3,,ICLR2020 +S1xgzttTFH,2,HyePberFvH,HyePberFvH,Official Blind Review #2,"The authors propose a scalable method based on Monte Carlo arithmetic for quantifying the sensitivity of trained neural networks to floating point rounding errors. They demonstrate that the loss of significance metric K estimated from the process can be used for selecting networks that are more robust to quantization, and compare popular architectures (AlexNet, ResNet etc.) for their varying sensitivities. + +Strengths: +- The paper tackles an important problem of analyzing sensitivity of networks to quantization and offers a well-correlated metric that can be computed without actually running models on quantized mode +- Experiments cover a wide range of architectures in image recognition + +Weaknesses: +- The proposed method in Section 4.2 appears to be a straightforward modification to MCA for NN +- Experiments only demonstrate model selection and evaluating trained networks. Can this metric be used in optimization? For example, can you optimize for lowering K (say with fixed t) during training, so you can find a well-performing weight that also is robust to quantization? 1000 random samples interleaved in training may be slow, but perhaps you can use coarse approximation. This could significantly improve the impact of the paper. Some Bayesian NN literatures may be relevant (dropout, SGLD etc). 
+ +Other Comments: +- How is the second bullet point in Section 4.1 addressed in the proposed method? +- Can you make this metric task-agnostic or input-distribution-agnostic (e.g. just based on variance in predictions over some input datasets)? (e.g. you may pick a difference loss function or different test distribution to evaluate afterwards) +- Does different t give different K? If so, what’s the K reported? (are those points on Figure 3)?",3,,ICLR2020 +SJgOQ8pTKr,1,rJe9fTNtPS,rJe9fTNtPS,Official Blind Review #2,"*** Summary *** +MinHash is a well-known method for approximating set similarities in terms of the jaccard similarity. The idea is to use k random permutations (hashes) on all elements of the sets and check how often two sets hash into the same bucket. Larger values of k yield more accurate estimates of the set similarities but require more time to compute. One permutation hashing (OPH) aims to reduce the number of hash computations per element to 1 and maintaining bins. However, some of those bins may remain empty which negatively influences the similarity estimate. Optimal OPH (OOPH) hashes empty bins to non-empty bins and effectively reuses bins. This paper proposes Amortization Hashing (AHash) which reduces the occurrence of empty bins and thus leads to better similarity estimates. + +*** Evaluation *** +The paper proposes an interesting idea for approximating set similarities much faster than MinHash. However, I have some issues with the submission. + +I believe that the manuscript has a limited impact. The approach performs on par or marginally better as OOPH within the first and only reasonable experiment (4.2). As the authors state themselves on the page break 9/10, the advantage of AHash vanishes for small and large values of k. Hence, AHash only benefits of moderate choices of k. Moreover, I can see that OOPH might have a minor problem of estimating the set similarities which AHash aims to fix, but why should it outperform MinHash in terms of accuracy? Why are the pairs of set only chosen from RCV1? Why those particular set sizes? Why does no plot show standard deviation/error? + +The remaining experiments yield very limited insight. Considering a linear SVM on standard datasets where the test error is >99.8% seems to be obsolete. In addition, the most important parameter k is held fix to an arbitrary value. Same holds for b. Since AHash only benefits from moderate sizes of k, why was k chosen in favor of AHash? The performance should definitely be shown in dependence of k. Instead, the most unimportant parameter (C) is varied. This should have been done in a proper cross-validation. Similar arguments hold for the near neighborhood search. What is the query set being used? + +There are more flaws within the manuscript. The mathematical presentation is rather poor. The theorems lack text and assumptions and solely consist of equations. The corresponding proofs are also short on text and hard to follow. Unfortunately, there is no analysis of the expected error as a function of k. The proof of Theorem 3 is almost two pages and should be moved to the supplementary material since it does not provide much insight; it just distracts the reading flow. In addition, every equation is numbered but none is ever referenced. The citation style (numbers in round brackets) is really uncommon and can be easily confused with equation numbers. Most importantly, I want to note that a different font was used and that the spacing was clearly tricked in several places (e.g. 
within Section 4). This makes it especially hard to judge whether the manuscript has the correct length. + + +*** Further Comments *** +- The font was changed. It does not match the font of the other submissions. +- The spacing is tricked in several places, especially in Section 4. +- Links [1,29] should not be references but footnotes. +- Citations should never be in round brackets like (1), because they can be confused with equation numbers. Instead they should be in square brackets like [1] or, more preferably, the natbib package should be used as in the ICLR style guidelines. +- What does OOPH stand for? It is never stated. +- Every equation has a number, but none is ever referenced. +- Math/Equations are part of the text and should be treated as such, i.e., there should be proper punctuation marks. +- What is a 2-universal hashing? +- Algorithm 1: ""output range"" sounds like an interval whereas the number of distinct hash values is meant. +- ""(14) proposed"", no past tense +- Instead of ""(11) proposes"", please use ""Shrivastava and Li [11] propose"" +- Why ""Theorem 1"" and ""Proof 3.1""? +- Why are the theorems lacking the assumptions and text? They basically consist of equations. +- Theorem 3 should have a ""less or equal"" instead of a ""strictly less"". +- Eq. (40): ""0andm"" +- Why does Proof 3.3 have a end of proof sign (not right-aligned) but the other proofs don't? +- None of the experimental results shows standard deviations/errors although the experiments are repeated several times. Why? It would be also nice to see whether the approximation tends to over- or underestimate J. This could be done with a violin plot. +- How are the pairs of sets in Section 4.2 chosen and why only from RCV1? This seems to be the most important experiment. +- Why is k (and b) fixed to an arbitrary value in the remaining experiments? Please select C within a proper cross-validation. +- There are a lot of enumerations which unnecessarily make the manuscript longer, e.g. in Sections 1.4, 2.1 and 4.1. In addition, the proof of Theorem 3 almost takes two pages but is not super informative. It should be moved to the supplementary material. This in combination with the font mismatch makes it difficult to determine the real length of the submission. +- It is nice that the source code is published online, but uncommented c++ code is not really helpful. +",1,,ICLR2020 +BkeZKd3atr,2,Hkex2a4FPr,Hkex2a4FPr,Official Blind Review #1,"The paper ""On Variational Learning of Controllable Representations for Text without Supervision"" tackles the problem of latent vacancy of text representation via variational text auto-encoders. Based on the observation that a single factor of the sentence encoding gathers most of relevant information for classifying the sentence as positive or negative (sentiment classification), authors study the impact of manipulating this factor in term of the corresponding decoded sentences. They reasonnably claim that if such a manipulation fails at decoding accurate sentences, it is because we fall in representation areas that the decoder never seen during training. Thus they propose a way to constrain the posterior mean to a +learned probability simplex and only perform manipulation within the probability simplex. + +The tackled problem is important. Variationnal text auto-encoding is a very challenging task, for which no perfect solution has been proposed yet. A important issue is that usually the posterior collapses, with the auto-regressive decoder eventually ignoring the codes. 
Another problem is that indeed the representation space can contain many holes, in codes have never been seen during training. The authors propose to cope with both problems by encouraging a part of the code mean to live in a simplex, which prevents from posterior collapsing. Next, to ensure that information is filled in this constrained part, they define a pairwise ranking loss which enforce the mean of each sentence to be more similar to the output of the encoding lstm than the mean of other sentences. For this part, more intuition justification is needed to well understand the effect of this additional loss. In what sense does it ensure that the space does not contain holes ? What ensures that the constrained part of the code is actually used by the decoder ? + +My main concern is with regards to the experiments, which are clearly not enough detailled. First, I cannot understand what NLL is considered in the preliminary experiments. Authors study the effect of code manipulation on an NLL. But the likelihood of what ? Of the encoded sentence ? If yes it is natural that the NLL is impacted since we move the representation so the manipulated representation encodes another sentence... Or maybe it is w.r.t. a generated sentence from the obtained code ? But what sense does it make to assess the nll of the generated sentence ? Ok if the distribution is to flat, the NLL would not be good, but is it really what we want to observe ? Also, authors compare the impact of modifications on the representations of $\beta$-VAE with modifications on their model, but these are not the same modifications. What ensure that they have the same magnitude ? I cannot understand the paragraph on vp and vn in the experimental setup. Comparisons with metrics on text style transfer are also difficult to understand to me. What is the reference sentence ? + +Minor questions: + - z(1) is said to be parametrized by a MLP with sentences representations before eq2 and is said to be encoded by a LSTM after eq 2. What am I missing ? + - Please better detail fig2 + + ",3,,ICLR2020 +SJluQZ2kor,2,S1e0ZlHYDB,S1e0ZlHYDB,Official Blind Review #2,"The paper proposes using progressive encoding of images and re-arrange of data blocks in images to improve reading speed and therefore training speed. + +To fully analyze the maximum possible speed of training, it would be great to the measure upper bound of images/sec, when avoiding reading from disk and just using images from memory. + +Decoding a typical progressive JPEG image usually takes about 2-3 times as much time as decoding a non-progressive JPEG, for full resolution, analyzing the time to read vs time to decode the images would be great. It is not clear how changing the number of total groups would affect the image size and the reading speed. + +Based on the current experiments it is not clear what is the impact of the batch size when creating PCRs and when reading the image blocks, or the impact of the batch size on the training speed. + +Figure 3 is really hard to read and compare times to convergence, authors should provide a table with times to X% accuracy. Although time to convergence is the key metric, it would be great to know the difference in images/sec of different settings. + +Using ImageNet 100 classes (not clear how the 100 classes were chosen) instead of the usual 1000 classes, can distort the results, since it is not clear if higher resolution would be needed to distinguish more classes or not. 
+ +Have the authors considered other image compression formats like WebP? How tie is the proposed record encoding with the image compression? ",3,,ICLR2020 +YSrVSP67rBH,4,p8agn6bmTbr,p8agn6bmTbr,Studies minimality of neural network representations using a simple neuroscience-motivated task,"The authors contribute to the recent research on whether neural network training (in particular, SGD) favors minimal representations, in which irrelevant information is not represented by deeper layers. They do so by implementing a simple neuroscience-inspired task, in which the network is asked to make a decision by combining color and target information. Importantly, the network's output is conditionally independent of the color information, given the direction decision, so the color information is in some sense irrelevant at the later stages. Using this, the authors quantify the 'relevant' and 'irrelevant' information in different layers of the neural network during training. Interestingly, the authors show that minimal representation are uncovered only if the network is started with random initial weights. Information is quantified using a simple decoder network. + +The article is clearly written and has a simple (in a good way) and interesting message. However, I also have some criticisms, especially regarding the conceptual underpinnings. + +When any neural network is predicting a deterministic function f : X -> Y, *all input features* are irrelevant to the output distribution when conditioned on the output itself. In other words, the minimal representation in a deterministic task is simply the output itself. (The situation is different when the task involves predicting a non-degenerate probability distribution P(Y|X), in which case the minimal representation -- i.e., the sufficient statistics -- can have an arbitrary amount of information.) In the information bottleneck community, this was mentioned in https://arxiv.org/pdf/1703.00810.pdf (section 2.4) and explored in https://arxiv.org/abs/1808.07593. + +In motivating the paper, the authors appear to confuse two types of ""irrelevant features"": + (1) when an input feature is useless for prediction, i.e., changing it does not change the predictions, and + (2) when information about an input feature is independent of the output distribution, when conditioned on the output. +For a deterministic prediction task, all features type 2, but not all features are type 1. The authors have the following text: + ""We believe this task ... captures key structure from deep learning tasks. For example, in image classification, consider classifying an image as a car, which take on various colors. A representation in the last layer is typically conditionally independent of irrelevant input variations (i.e., the representation does not change based on differences in color)."" +If I understand the example, this is building off the intuition that ""color of car"" is irrelevant because it is a type 1 feature (not useful for prediction). In fact, it can be conditionally independent because it is type 2. Moreover, in the authors' task ""color of checkerboard"" is not type 1 (it is very relevant for the output -- changing it changes the output) but it can also be conditionally independent (since it is type 2). + +Given the above arguments, the degree to which features are conditionally independent in middle layers does not necessarily reflect how useful they are for prediction. 
+ +I have two other, more minor comments: +1) The notion of ""direction information"" is somewhat confusing, as one can think about two kinds of direction information: (1) information about which targets (i.e., directions) correspond to which colors (which is provided as part of the input), and (2) information about the final reaching direction (i.e., the output). Given the points made above, if I understand correctly, information about which targets correspond to which colors is just as irrelevant as the color information, when conditioned on the output. I would suggest referring to the second kind of information (the one mainly discussed in the paper) as ""output information"". +2) The authors should probably cite (and may be interested in) https://arxiv.org/abs/2009.12789 (NeuroIPS 2020), which also proposes to estimate mutual information using a practical family of decoders. +",7,4.0,ICLR2021 +eZUi0uRWs7U,1,zcOJOUjUcyF,zcOJOUjUcyF,Review,"Summary: This paper combines semi-supervised learning with active learning, arguing that we should try to focus on actively choosing to label points that improve the convergence rate of the model after adding that example to the training set. They argue that especially in pool-based active learning, where an unlabeled pool of candidate data points are available to choose from to label, methods should also use this unlabeled pool for semi-supervised learning. To optimize convergence rate, they try to select points that maximize the smallest eigenvalue of the empirical NTK over the final layer only, as an approximation (which the authors show seems to do similarly to computing the full NTK, when the next training episode of the active learning is warm-started with the weights of the previous episode. + +Strengths: +- The method is general to any SSL method, and the authors consider one of the more recent SSL methods, FixMatch. +- The use of SSL in the pool-based active learning setup makes good sense. +- They use a nice tractable formulation of the convergence rate optimization objective through the eigenvalues of the NTK on the final layer only. +- They show empirically that more positive eigenvalues of the last layer NTK leads to convergence in fewer training epochs. +- The paper suggests a subtle difference between randomly initializing the networks between each phase of active learning and warm-starting from the previous model, noting that the confidence scores, etc. used to select examples for active sampling could be dependent on the random initialization, noise during optimization, among others, and thus may not transfer to the next phase if the model is then randomly re-initialized. + +Weaknesses: +- The authors argue that since active learning and semi-supervised learning both have the potential to improve sample efficiency exponentially, their combined benefit is ""most likely marginal"". However, there is not much theoretical or empirical evidence for this beyond some intuitions. It's quite unclear whether the combined benefit is only marginal, since the assumptions on the data that are needed for semi-supervised learning and active learning are not really the same. This would seem to suggest that combining them still makes sense to robustify the gains in sample complexity, even ignoring potential stacking of the gains. +- The convergence rate control method seems to be motivated by wanting the active learning component to ""optimize something different"" to the SSL component (which is optimizing for test generalization). 
It's unclear whether faster convergence rate is really different from improving sample complexity/test generalization. Consider running a learning algorithm on a stream of data examples sampled IID. Consider an algorithm that can learn to low test error with few samples; this means that with few samples, the model can get low loss on unseen samples from the distribution, leading to low loss if you continued optimization. This would mean that the training has more or less converged. Now consider a deep learning algorithm with fast convergence rate; this means that with few iterations (read: few samples seen), the algorithm converges to a solution that continues to have low loss if you continue optimization (and thus low error on IID samples from the same distribution => low test error). Could this method of selecting examples to improve the convergence rate be actually optimizing something similar to improving generalization, instead of doing something alternative? +- In Algorithm 2, it seems like Q/G random groups of unlabeled data are considered, and the Q/G best groups are added to the dataset. Doesn't this mean that all the samples in the Q/G random groups will be added regardless of the NTK eigenvalue calculation? +- Empirically, it's unsatisfying that even with more labeled images than FixMatch (with 250 labeled examples, they get 94\%), this FixMatch + active learning method gets worse accuracy (85\%, Table 3, 300 labeled examples). What is the performance of FixMatch without active learning using the training setup that the authors used, just for comparison? It seems surprising that some of the Fixmatch + active learning methods get only about 50% accuracy when Fixmatch itself can get >90\%. If the labeled seed set were balanced, would we see results surpassing Fixmatch? + +Other: +- Is it true that reinitializing the networks hurts CRC but doesn't hurt the Entropy method much? Why is that? ",5,3.0,ICLR2021 +SJlg24cQcr,3,rkxVz1HKwB,rkxVz1HKwB,Official Blind Review #4,"The work addresses an important problem of robustness of interpretation methods against adversarial perturbations. The problem is well motivated as several gradient-based interpretations are sensitive to small adversarial perturbations. + +The authors present a framework to compute the robustness certificate (more precisely, a lower bound to the actual robustness) of any general saliency map over an input example. They further propose variants of SmoothGrad interpretation method which are claimed to be more robust. + +The empirical validation of the underlying theory and use of the sparsified (and relaxed) SmoothGradient interpretation methods is unconvincing because of the following reasons: + +1. In the demonstrated experiment, the proposed alternative to SmoothGrad involves setting the lowest 90% of the saliency values to zero, and the top 10% (for sparsified SmoothGrad) or top 1% (in the case of relaxed sparsified SmoothGrad) to one. The problem with clamping most of the lower values to zero and the remainder (or most of the remainder) higher values to one is that it defeats the purpose of having a saliency map in the first place, which exist to characterize the relative importance of the input features. + +2. The paper claims that the proposed variant maintains the high visual quality of SmoothGrad, however, the claim is unsubstantiated. With the current setup, there is a clear trade-off between robustness and fidelity of interpretation, which the paper fails to acknowledge. 
In principle, one can always build extremely sparse or dense interpretation methods (close to all zeros or all ones), which would produce high robustness certificates but would be much less meaningful as they are not faithful to the underlying mechanism of prediction, and the characteristics of the input. + +3. The authors present empirical evidence on just one set of sparsification parameters and K. It would be more conclusive to evaluate the robustness of the proposed variations with different values of sparsification parameters, and K. +",3,,ICLR2020 +oTbIk6nG4Bq,2,_HsKf3YaWpG,_HsKf3YaWpG,review,"In this paper, the authors claimed that uniformity in embedding space if the key for good generalization, and then propose an adversarial training based method to improve the uniformity of feature space. The claim is from previous work, thus the key contribution is the way to impose such regularization. The method itself makes sense to me. + +One lacking aspect is that the authors provide no evidence on how the method works, neither quantitatively (the distance between uniform distribution and learned feature distribution) nor qualitatively (e.g. t-sne visualization on the learned feature). + +Another point could improve is that I suspect the effect of uniformity is quite like imposing margin on loss function (such as AM-softmax, arcface, etc), it is better to discuss and compare with them. + +I am not familiar with the dataset and SOTA performance used in evaluation. The results look reasonable to me, and could demonstrate the effectiveness of the proposed method. + +Above all, I think this paper may contain some ideas that publishable, however the authors fail to dig deeper into it, and lack sufficient ablation experiments to demonstrate the method works as expected. ",6,3.0,ICLR2021 +rV1pv8kEVg,1,sI4SVtktqJ2,sI4SVtktqJ2,I am not an expertise of the area of this paper. But I think the idea is interesting.,"This paper proposes a method based on denoising to protect classifiers from adversarial attacks. Unlike existing methods based on randomized smoothing with various noise distributions to retrain several classifiers, the proposed one uses denoising as the preprocess of the classifier. The experimental results demonstrate the proposed method has good performance.",6,1.0,ICLR2021 +H1g_B4n6FS,2,rJxvD3VKvr,rJxvD3VKvr,Official Blind Review #3,"The paper considers the impact of initialization bias on test error in strongly overparameterized neural networks. The study uses tools from recent literature on the generalization of overparameterized neural networks, i.e. neural tangent kernels and interpolating kernel method, to provide useful insights on how the variance of weights initialization affects the test error. I have a few questions about theoretical results, but the paper has a convincing experiment that supports its theoretical claims. Addressing the following points will improve the exposition of the paper. +1. Please provide a little hint on how Lemma 2 rewrites the equation (13) for linearized function for easier readability without referring to the Appendix. +2. In the case of cross-entropy error, would the effect be similar? Could this be verified with a similar experiment as for MSE? +3. To what extent this result is observed in not as strongly overparameterized settings? In other words, it would be interesting to see what happens if you fix the architectural choice while increasing the number of training parameters, how long does the test error effect persist? 
+ +Minor remark: +- a few typos are present on pages 4, 5, 7, 8",6,,ICLR2020 +HJxnbHqiFB,1,r1gV3nVKPS,r1gV3nVKPS,Official Blind Review #3,"This paper proposed a new diffusion operation for the graph neural network. Specifically, the ballistic graph neural network does not require to calculate any eigenvalue and can propagate exponentially faster comparing to traditional graph neural network. Extensive experiments have been conducted to verify the performance of the proposed method. + +1. The motivation of this method is to accelerate the diffusion speed in a graph. However, as we know, a very severe issue of graph neural network is the over-smoothness issue. The reason is that, in the high layer, the node feature is diffused to far neighbours. When using the proposed ballistic filter, node features diffuse much faster than the regular GNN. Thus, the over-smoothness will appear in the shallow layer very fast. As a result, we cannot use many layers so that the non-linearity of deep neural networks cannot be fully utilized. Thus, is it necessary to accelerate the diffusion speed for graph neural network? + +2. There is only one dataset for the comparison of the performance of different graph neural networks. More datasets are needed to thoroughly verify the performance of the proposed ballistic graph neural network. + +3. Is it possible to slow down the diffusion speed with the proposed ballistic filter? + + +",3,,ICLR2020 +r1lG531aKr,2,HJxp9kBFDS,HJxp9kBFDS,Official Blind Review #3,"This paper shows empirically that rotational invariance and l infinity robustness are orthogonal to each other in the training procedure. However, the reviewer has the following concerns, + +It is already shown in (Engstrom et al., 2017) that models hardened against l infinity-bounded perturbations are still vulnerable to even small, perceptually minor departures from this family, such as small rotations and translations. What is the message beyond that paper that the authors would like to convey? +The experiments are only on MNIST and CIFAR-10. Training on a larger dataset like imagenet would make the experiments more convincing. +Going beyond the observation, what shall we do to improve against different perturbation simultaneously? Or is it an impossible task to improve on both? + +",1,,ICLR2020 +B1ljgG7inm,2,BkloRs0qK7,BkloRs0qK7,"An empirical study of CF, but more recent methods could have been also added for the study","Thanks for the updates and rebuttals from the authors. + +I now think including the results for HAT may not be essential for the current version of the paper. I now understand better about the main point of the paper - providing a different setting for evaluating algorithms for combatting CF, and it seems the widespread framework may not accurately reflect all aspects of the CF problems. + +I think showing the results for only 2 tasks are fine for other settings except for DP10-10 setting, since most of them already show CF in the given framework for 2 tasks. Maybe only for DP10-10, the authors can run multiple tasks setting, to confirm their claims about the permuted datasets. (but, I believe the vanilla FC model should show CF for multiple permuted tasks.) + +I have increased my rating to ""6: Marginally above acceptance threshold"" - it could have been much better to at least give some hints to overcome the CF for the proposed setting, but I guess giving extensive experimental comparisons could be valuable for a publication. 
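+ +As a concrete footnote to the permuted-task point above: the construction I have in mind is the standard one sketched below (illustrative shapes, random arrays standing in for MNIST, not the authors' code). With five to ten such permutations visited sequentially, a plain fully-connected model should already show clear forgetting, which is why conclusions drawn from only two tasks feel premature to me.
+```python
+# Permuted-pixel task sequence: same data and labels, one fixed task-specific
+# pixel permutation per task; placeholder arrays instead of a real dataset.
+import numpy as np
+
+rng = np.random.default_rng(0)
+x = rng.random((1000, 784))            # stand-in for flattened images
+y = rng.integers(0, 10, size=1000)     # stand-in labels
+
+num_tasks = 10
+perms = [rng.permutation(784) for _ in range(num_tasks)]
+tasks = [(x[:, p], y) for p in perms]
+
+for t, (xt, yt) in enumerate(tasks):
+    # train the model on (xt, yt) here, then re-evaluate on tasks[:t]
+    # to measure how much accuracy on earlier permutations has dropped
+    pass
+```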
+ +===================== +Summary: + +The paper evaluates several recent methods regarding catastrophic forgetting with some stricter application scenarios taken into account. They argue that most methods, including EWC and IMM, are prone to CF, which is against the argument of the original paper. + +Pro: +- Extensive study on several datasets, scenarios give some intuition and feeling about the CF phenomenon. + +Con: +- There are some more recent baselines., e.g., Joan Serrà, Dídac Surís, Marius Miron, Alexandros Karatzoglou, ""Overcoming catastrophic forgetting with hard attention to the task"" ICML2018, and it would be interesting to see the performance of those as well. +- The authors say that the permutation based data set may not be useful. But, their experiments are only on two tasks, while many work in literature involves much larger number of tasks, sometimes up to 50. So, I am not sure whether the paper's conclusion that the permutation-based SLT should not be used since it's only based on small number of tasks. +- While the empirical findings seem useful, it would have been nicer to propose some new method that can get around the issues presented in the paper. ",6,5.0,ICLR2019 +bT9_nAK8G2I,1,V3o2w-jDeT5,V3o2w-jDeT5,"An interesting, well-written paper introducing a method of reweighting labelled pairs from multiple sources to optimize a machine learning model for an unlabelled dataset.","Strong points: +- Well-written and easy to read and follow. +- Results show the method does what it promises to. + +Weaknesses: +- While it was easy to follow the paper's rationale, I found it difficult to motivate. Until the final experiment, I was left wondering what kind of application would have these constraints but also where the assumptions would be reasonable. This point I think should be easy to address given a small rework of the intro or perhaps a running example to periodically come back to. +- After reading about the last experiment, I found myself wondering why this solution is billed as a hyperparameter optimization solution; it sounds to me that the parameters of the model are also being optimized along the way. Again, this is a question of clarification and can easily be addressed. +- The fact that the Naive method beats the other two baselines and performs comparably to the unbiased proposed method makes me wonder whether (a) these are challenging enough tasks, or (b) those are competitive baselines. +- The density estimator and the divergence estimator are moving parts the errors of which perhaps warrant quantifying. For example, in both experiments the true labels were known and the authors can measure the error in the divergence estimate. + +I'd be happy to increase my score if the above points are addressed.",5,4.0,ICLR2021 +ByxWj_pFnQ,2,S1xjdoC9Fm,S1xjdoC9Fm,Confusing and unsatisfactory,"This paper proposes the use of Bayesian inference techniques to mitigate the issues of miscalibration of modern Deep and Conv Nets. + +The presentation form of the paper is unsatisfactory. The paper seems to imply that Bayesian Deep Nets are used to calibrate Deep/Conv Nets, so I was expecting something like post-calibration using Bayesian Deep Nets. After reading through the paper a few times, it seems that the Authors are proposing the use of Bayesian inference techniques to infer parameters of Deep/Conv Nets in order to improve their calibration compared to non-Bayesian counterparts. This is the only contribution of the paper, and I believe it is insufficient. 
Guo et al., (2017) already points out that regularization of modern Deep/Conv nets improves calibration, so the fact that Bayesian Deep/Conv Nets are calibrated is not surprising, giving that the prior over the parameters act as a regularizer. + +It is surprising to see ECE values above one - unless these have been scaled by a factor of 100 - but this is not mentioned anywhere. + +Previous work shows that Monte Carlo Dropout for Conv Nets offers well calibrated predictions (https://arxiv.org/abs/1805.10522), so I think a comparison against this inference method should be included in the paper. + +The paper makes a number of imprecise claims/statements. A few examples: + +- ""Bayesian statistics make use of the predictive distribution to infer a random variable by computing the expected value of all the possible likelihood distributions. This is done under the posterior distribution of the likelihood parameters"" - very unclear and imprecise explanation of Bayesian inference + +- ""When using neural networks to model the likelihood"" - the likelihood is a function of the labels given model parameters",3,4.0,ICLR2019 +r1ex9g6pKS,3,Bkgq9ANKvB,Bkgq9ANKvB,Official Blind Review #3,"This paper studies the problem of learning classifiers from noisy data without specifying the noise rates. Inspired by the literature of peer prediction, the authors propose peer loss. First, a scoring function is introduced, minimizing which we can elicit the Bayes optimal classifier f*. Then the authors use the setting of CA to induces a scoring matrix, and then the peer loss. Moreover, this paper explores the theoretical properties of peer loss when p=0.5. In particular, the authors propose \alpha weighted peer loss to provide strong theoretical guarantees of the proposed ERM framework. The calibration and generalization abilities are also discussed in section 4.3. Finally, empirical studies show that the propose peer loss indeed remedies the difficulty of determining the noise rates in noisy label learning. + +This paper is well written. The theoretical properties of the proposed peer loss are thoroughly explored. The motivation is rational with a good theoretical guarantee, i.e. Theorem 1. Moreover, the tackled problem, i.e. avoiding specifying the noise rates, is significant to the community. + +Nevertheless, Some parts of this paper may be confusing: +- The computation of the scoring matrix delta is not that clear. Can the authors provide the detailed computation steps of the example? +- In the proof of Lemma 6, can the authors provide a proof sketch of the equivalence of the last two equations? +- Third, where is the definition of p? + +In the experiments, the authors propose to tuning the hyperparameter alpha. I would be appreciated if the authors provide the sensitivity experiments of alpha to show its fluence for the final prediction. + +Though I'm not that familiar with learning from noisy labels, I think it is a good paper and I suggest to accept.",8,,ICLR2020 +KDaHrdGBxiK,1,uUlGTEbBRL,uUlGTEbBRL,review,"This paper formulates CNNs with high-order inputs into an explicit Tucker model, and provides sample complexity analysis to CNNs as well as compressed CNNs via tensor decomposition. Experiments support their theoretical results. Sample complexity analysis of CNNs and compressed CNNs is an interesting topic. This paper is well written and is easy to follow. + +Cons: + +1. Technical novelty is limited. It has been well understood that CNNs can be formulated as a tensor model, see Lebedev et al. 
(2015); Hayashi et al. (2019); Kossaifi et al. (2019). By assuming a realization model, the sample complexity analysis of CNNs and compressed CNNs was transferred to the sample complexity analysis of tensor regression model and low-rank tensor regression model. The latter analysis is not new given a rich literature on this topic, see e.g., Bahadori et al. (2014); Yu and Liu (2016); Kossaifi et al. (2020). + +Lebedev, V., et al. ""Speeding-up convolutional neural networks using fine-tuned CP-decomposition."" 3rd International Conference on Learning Representations, ICLR 2015-Conference Track Proceedings. 2015. + +Hayashi, Kohei, et al. ""Exploring Unexplored Tensor Network Decompositions for Convolutional Neural Networks."" Advances in Neural Information Processing Systems. 2019. + +Kossaifi, Jean, et al. ""T-net: Parametrizing fully convolutional nets with a single high-order tensor."" Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. + +Bahadori, Mohammad Taha, Qi Rose Yu, and Yan Liu. ""Fast multivariate spatio-temporal analysis via low rank tensor learning."" Advances in neural information processing systems. 2014. + +Yu, Rose, and Yan Liu. ""Learning from multiway data: Simple and efficient tensor regression."" International Conference on Machine Learning. 2016. + +Kossaifi, Jean, et al. ""Tensor regression networks."" Journal of Machine Learning Research 21.123 (2020): 1-21. + + +2. The sample complexity analysis is for the global minimizer from a non-convex optimization, see (5). It would be more interesting to study the sample complexity analysis for the estimator from some polynomial algorithm. + + +",5,3.0,ICLR2021 +HJeDsmmIqS,3,rylrdxHFDr,rylrdxHFDr,Official Blind Review #3,"Review for ""State alignment-based imitation learning."" + +Summary: + +This paper addresses the problem of learning from demonstrations where the demonstrator may potentially have different dynamics from the learner. To address this, the paper proposes to align state-distributions (rather than state-action distributions) between the demonstrations and the learner. To alleviate issues arises from doing this alone, they also propose to use a local learned prior to guide their policy to select actions that take it back to the original demonstrations. The paper shows a number of experiments on control tasks in support of their claims. + +Pros: ++ The problem that the paper solves is fairly relevant, and some experiments (such as cross morphology imitation learning) are promising in concept. ++ The paper is mostly well written (save some small improvements that could be made in clarity) and can be followed. ++ The paper presents a series of experiments across several agents and some other baselines. + +Cons (and primary concerns): + +1) The idea of matching state distributions in the context of learning behaviors is not new. In particular, [1] also uses similar ideas of matching state distributions in the imitation learning context, if via different machinery. [4] points towards such ideas as well (noted on page 43). Works such as [2, 3] also use this idea in the context of Reinforcement Learning. Further, ideas of deviation correction in the imitation learning domain have been addressed before in [5]. The paper would benefit from a more thorough treatment of these related works, and how the proposed work differs from these. 
+ +2) The choice of approach (in particular, the use of the Wasserstein distance to match state distributions, and the manner of learning a local prior by training an autoregressive Beta VAE) are lacking motivation, and it is unclear if or why these choices are the best way to approach the problem. + +3) While the paper presents a large number of comparisons, the analysis of the relative performance of the proposed approach against the baselines is lacking. For example, in section 5.1.1., vanilla BC seems to do very well - why is it the proposed approach only marginally outperforms BC on several of these tasks? In section 5.2, why is SAIL able to outperform other IL techniques on same-dynamics tasks? What about SAIL provides this performance benefit? Similarly, in section 5.3.3, what is it about the Wasserstein objective and the KL that together enables good learning? This ablation seems crucial to assessing the paper, and is lacking a deeper analysis. Further, the relevance of section 5.3.1 is questionable - as no new insight is provided over the original Beta-VAE paper. + +Other Concerns: + +1) The paper ultimately uses a form of prior that is defined over actions, and not states (so that it may be used in the KL divergence term). How is the choice of form of prior made? It is unclear why it is better to have a prior learned over states converted to actions via eq. 7, versus a similarly designed prior over actions. + +2) It is unclear why the expression of the reward function (Eq. 4) is necessary - if it is possible to compute the Wasserstein distance (and hence the cummulative reward), it is possible to update the policy purely from this cummulative reward. +Is the function of the per-timestep reward simply to provide a denser signal to the policy optimization? + +3) The authors claim to introduce a ""unified RL framework"" in their regularized policy objective. It appears that this is simply the addition of the KL between the policy and the prior $p_a$ to the global alignment objective (subsumed into $L_{CLIP}$), hence the reviewer questions whether this can indeed be treated as a novel contribution of the paper. + +4) The problem this paper addresses (and the fundamental thesis for its approach) is that action-predictive methods are likely to suffer from deviation from the original demonstrations, as compared to state-predictive methods. +What purpose does section 5.3.2 serve beyond reiterating this point? + +5) State-matching (and the implied use of inverse models) means the feasibility of retrieved actions is not guaranteed, as compared to models that predict actions directly. + +Minor Points: + +1) Explaining the various phases of training (as observed in the algorithm) would be useful. + +2) Discussing how states are compared in the cross morphology experiments (Section 5.1.2.) would also be useful. + +Cited Literature: + +[1] State Aware Imitation Learning, Yannick Shroecker and Charles Isbell, https://papers.nips.cc/paper/6884-state-aware-imitation-learning.pdf +[2] Efficient Exploration via State Marginal Matching, Lisa Lee et. al., https://arxiv.org/abs/1906.05274 +[3] State Marginal Matching With Mixtures Of Policies, Lisa Lee et. al., https://spirl.info/2019/camera-ready/spirl_camera-ready_25.pdf +[4] An Algorithmic Perspective on Imitation Learning, Takayuki Osa et. al., https://arxiv.org/pdf/1811.06711.pdf +[5] Improving Multi-step Prediction of Learned Time Series Models, Arun Venkatraman et. 
al., https://www.ri.cmu.edu/pub_files/2015/1/Venkatraman.pdf + +Initial Decision: Weak reject + +####### +Post Rebuttal Comments: + +Considering the authors' motivations of approach and the additional analysis provided in the comments below, I change my decision to weak accept. I would like to encourage the authors to include the details listed below in their paper. ",6,,ICLR2020 +Fm9JEvNOmHX,1,yEnaS6yOkxy,yEnaS6yOkxy,Interesting Simple Idea but not Convinced about Motivation and Results,"**Overview**: The paper presents a simple regularizer term that aims to force a GAN to generate samples following a uniform distribution over different classes. The regularizer depends on a classifier that works well on an imbalanced or long-tailed dataset. The paper presents experiments on CIFAR-10 and LSUN that were synthetically long-tailed or imbalanced. The results show that the proposed term generates samples that follow a more uniform distribution over classes. + +*Pros*: +- Interesting idea as it can help a generative algorithm to remove an imbalance from a dataset. +- The proposed regularizer is simple but depends on a classifier (see below for more details). + +*Cons*: +- The regularization term depends on a classifier that works well already on the imbalanced dataset. Getting a classifier to work on long-tailed datasets is not an easy task and people are still investigating the development of techniques to learn from imbalanced datasets (see for example I). From a practical point of view, this is a hard requirement that can reduce the chances of adoption. + +- Proposed loss may have conflicting terms. The final loss composed of the relativistic loss and the regularizer may be conflicting. According to the text (below Eq. 3), this loss follows the training distribution which in the context of the paper is long-tailed. However, the proposed regularizer penalizes the GAN to generate samples following a long-tailed distribution. Aren't these two terms then conflicting? If so, can this conflict impact the convergence of the network? + +- Insufficient experiments. While the experiments show good results on two small and synthetically long-tailed datasets, it is unclear if this method can work on naturally long-tailed datasets (e.g., iNaturalist). Unfortunately, the CIFAR-10 and LSUN datasets have a small set of classes in them. How does this method work on naturally long-tailed (e.g., iNaturalist) and/or large-scale datasets with a larger set of classes (e.g., ImageNet-LT)? Also, how do the generated images look like? Does this method still preserve a good perceptual image? + +- Lack of clear impact on applications. After reading the introduction, I did not have a clear application where this component can be crucial to either enable a new application or solve a bottleneck. The discussion section briefly mentions a few applications. However, I think the paper would've been stronger if it showed experiments using the proposed approach and showing its impact on a clear application. + +References: +I. Liu et al. Large-Scale Long-Tailed Recognition in an Open World. CVPR 2019. + +Minor comments: +1. The contribution list of the Introduction section uses terms that have not been defined, i.e., FID and UAP. +2. If using latex, please use \min, \max, \log to properly display the operators. 
+ +---------------------------------------------------- +Post Rebuttal Update + +While I think the idea is interesting, I still think the proposed loss is not consistent as I still think the two terms in the loss collide with each other, its practical value is limited mainly because making a GAN to work on various datasets is a challenging task, and that the experiments now raised more questions than answers. For these reasons I still lean towards rejection as I believe the paper can benefit from a revision.",4,4.0,ICLR2021 +BJxm08CJcr,2,rkxZCJrtwS,rkxZCJrtwS,Official Blind Review #1,"This paper shows how the derivatives from a differentiable +environment can be used to improve the convergence rate of +the actor and critic in DDPG. +This is useful information to use as most physics simulators +have derivative information available that would be useful +to leverage when training models. +The empirical results show that their method of adding +this information (D3PG) slightly improves DDPG's +performance in the tasks they consider. +As the contribution of this work is empirical is nature, +I think a very promising future direction fo work is to +add derivative information to and evaluate similar +variants of some of the newer actor-critic methods +such as TD3 and SAC. + +I have two minor questions: +1) Figure 2(a) shows the convenrgence of regularizing states, + actions, and both states and actions and the text + describing the figure states that this is + ""expected to boost the convergence of Q."" + However the figure shows that regularizing both states and + actions results in a slower convergence than doing + them separately. Why is this? +2) How should I interpret the visualization of the + learned Q surface in Figure 2(f) in comparison to + the true Q function in Figure 2(g)? + It does not look like a good approximation.",3,,ICLR2020 +XhbXpSxPOrQ,3,GafvgJTFkgb,GafvgJTFkgb,Paper deals with Bias Amplification aspect of FairML but lacks a more comprehensive study of the metric and discussion of the normative perspectives. ,"The paper builds on the ""bias amplification"" aspect of fairness in machine learning literature i.e. the tendency of models to make predictions that are biased in a way that they amplify societal correlations. The paper claims three major contributions: a metric, discussion about the dependence of bias measurements on randomness, and a normative discussion about the use of bias amplification ideas in different domains. + +Overall I find the metric as the only major contribution of the paper, and below I will explain why. + +The BiasAmp metric makes a significant contribution in terms of fixing the drawbacks of the previously proposed metric from Zhao et al. 2017(BiasAmp_MALS). It would be more effective if the work also included a study such as Zhao et al demonstrating how to mitigate the bias as measured by the BiasAmp measure. + +The discussion around--the usage of error bars because of the Rashomon effect seems incomplete and almost trivial. First, in my opinion, it is indeed necessary to have error bars if there are metrics whose measurements vary across different runs--regardless of whether that measure was being optimized on or not. But a more nuanced idea would be either: (a) usually a fairness metric is used as a guidance of whether a specific model is deployable or not, rather than being a property of a training method unless however, an ensemble of a number models trained by the same method is being used in deployment. 
(b) Fairness metrics may be used as a model selection criterion: since the fairness metric is not being optimized upon but a proxy of accuracy, it just means that, out of the models with equally high accuracy, we need to choose the model with the least bias (as measured by a trustworthy metric). + +I find the discussion around the 'consistency' of different metrics incomplete, where average precision (AP) seems to rank models consistently while BiasAmp and others don't. BiasAmp is a pretty sophisticated metric and because of the use of conditional probabilities of all kinds prone to pitfalls such as Simpson's paradox when measuring bias. I would be curious if it were robustly tested e.g. another experiment such as Fig 4 where the authors control the amount of the bias by tuning at the source of the bias. + +The discussion around the use of bias amplification in terms of prediction problems where the outcome is chance-based is a little confusing and it does not provide a fresher perspective of discussion already in the fairness literature around understanding the significance of inherent 'uncertainty' or 'randomness' in application domains other than vision (and probably language). For example, the Fair ML book (Barocas, Hardt, and Narayanan 2017) does mention uncertainty in the context of such applications (on page 34, 56). + +Overall, I am not very convinced that the paper should be accepted unless fellow reviewers think strongly otherwise. However, I think the paper has the potential to be a more complete and important contribution with a more comprehensive study around the technical contributions and clearer discussion about the normative contributions.",5,4.0,ICLR2021 +NUJ7CxAwUy,3,uVnhiRaW3J,uVnhiRaW3J,"review for ""Learning Safe Policies with Cost-sensitive Advantage Estimation""","The authors' rebuttal has addressed some of my confusion regarding the paper, which is greatly appreciated. The additional baseline of early termination would still be interesting to have, though I agree it's not critical for the presented line of work. In general, I think the work is interesting and will keep my current score (6). + +================================= + +The paper introduces a learning algorithm for training a control policy to complete certain tasks while satisfying some safety constraints. The main ideas in this work are to first use a cost-sensitive advantage function, where the advantage values for the unsafe states are set to zero. Second, a more conservative estimation of the safety cost is proposed to further improve the safety of the robot during the learning process. The authors demonstrated that with the proposed modified reward function, the algorithm would obtain a controller that completes the task while being safe. The proposed algorithm is evaluated on a set of simulated control problems and with both proposed components used, the algorithm achieves better performance in terms of both task completion and safety than prior methods. + +The paper solves an important problem of training a performant control policy while taking the safety of the robot into consideration. The paper is well written and the experiments show good results compared to prior methods. + +However, I do have a few questions about the paper: +1. The paper mentioned first about zeroing the advantage for the unsafe states and showed the equivalence of doing that to a shaped reward which replaces the original reward with the shaped reward. I'm wondering if this equivalence also holds for the safe states? 
For example, within a trajectory that ends in an unsafe state, a prior safe state's advantage will take the original reward of the unsafe state into consideration if only the unsafe state's advantage is modified, which might be different from the shaped reward version. It would be great if the authors could help me understand this part better. +2. In the implementation of the work, is it using the modified advantage function in Eq. 4, or the shaped reward in Eq. 5? It's a bit confusing in that the title of the paper suggests the use of Eq. 4, while section 4.2 seems to be saying it's using the shaped reward? If it's using the shaped reward version, is there a reason not to use Eq. 4 directly? +3. It seems a simple baseline to compare to is to terminate the rollout when the robot enters an unsafe state, as is done in various OpenAI Gym tasks. There seems to be some analogy in this strategy and the proposed one in that when the rollout is terminated, the policy will not have any updates regarding the unsafe states, i.e. zero gradient, and thus corresponds to zero advantage in policy gradient formulation. I do note that there are still many differences between the two methods, but I feel it's worth trying given the simplicity of the approach and how it's been effective in teaching the robots to run upright in existing problems. + +Overall, I think it's an interesting approach with good results. + + +",6,3.0,ICLR2021 +H1lZuixYcS,3,H1lZJpVFvr,H1lZJpVFvr,Official Blind Review #3,"The paper is interested in robustness w.r.t. adversarial exemples. + +The authors note that: +* features reflecting the global structure are more robust wrt adversarial perturbations, but generalize less; +* features reflecting the local structure generalize well, but are less robust wrt adversarial perturbations. +In hindsight, these claims are intuitive: adversarial perturbations and unseen shape variations are of the same flavor; one should resist to both or handle both, with the difference that the latter is bound to occur (and should be handled) and the former is undesired (and should be resisted). + +The goal thus becomes to define local features that are robust. + +The proposed approach is based on +* enforcing the invariance of the intermediate representation through shuffling the blocks of the training images; +* building normal adversarial images x' and deriving the block shuffling RBS such that the x' and RBS(x') are most similar w.r.t. the logit layer +* adding these RBS(x') to the training set; + +The idea is nice; the experiments are well conducted and convincing (except for the addition of uniform noise, which is unrealistic; you might consider instead systematic noise mimicking a change of light); +I'd like more details about: +* The computational cost of line 7 in algo (deriving the best RBS). + +You might want to discuss the relationship between the proposed approach and the multiple instance setting (as if the image was a bunch of patches). ",8,,ICLR2020 +SJeJvkj5hX,2,HkNDsiC9KQ,HkNDsiC9KQ,"Novel idea of learning rules for unsupervised learning, need more theory/evidences on what/why meta objectives are sufficient for learning the unsupervised learning rules","This work brings a novel meta-learning approach that learns unsupervised learning rules for learning representations across different modalities, datasets, input permutation, and neural network architectures. The meta-objectives consist of few shot learning scores from several supervised tasks. 
The idea of using meta-objectives to learn unsupervised representation learning is a very interesting idea. + +Authors mentioned that the creation of an unsupervised update rule is treated as a transfer learning problem, and this work is focused on learning a learning algorithm as opposed to structures of feature extractors. Can you elaborate on what aspect of learning rules and why they can be transferable among different modalities and datasets? For this type of meta-learning to be successful, can you discuss the requirements on the type of meta-objectives? Besides saving computational cost, does using smaller input dimensions favor your method over reconstruction type of semi-supervised learning, e.g. VAE? + +In the section ""generalizing over datasets and domains"", the accuracy of supervised methods and VAE method are very close. This indicates those datasets may not be ideal to evaluate semi-supervised training. + +In the section ""generalizing over network architectures"", what is the corresponding supervised/VAE learning accuracy? + +In the experimentation section, can you describe in more details how input permutations are conducted? Are they re-sampled for each training session for tasks? If the input permutations are not conducted, will the comparison between this method, supervised and VAE be different? + +After reviewing the author response, I adjusted the rating up to focus more on novelty and less on polished results.",8,3.0,ICLR2019 +I5Mj9xZGkog,3,XQQA6-So14,XQQA6-So14,review for #665,"This work investigates a new class of parameterizations for spatio-temporal point processes +which uses Neural ODE to enable flexible, high-fidelity models of discrete events that are localized in continuous time and space. + +Strengths: this work is essentially an extension of the Neural Jump SDEs [Jia & Benson (2019)] where the temporal dynamics is modeled as an ODE and the spatial pdf is modeled as a history dependent Gaussian mixture distribution. In this work, the spatial pdf is further extended to an ODE based dynamics. For this purpose, three different continuous normalizing flow models are proposed (Time-Varying CNF, Jump CNF, Attentive CNF). Also, a large number of experiments are conducted and baselines are compared to validate the conclusion. + +I recommend rejection at the current stage for the reasons below. + +Weakness: A major concern is, if my understanding is right, every mark x^(i) is modeled as an ODE of x^(i)_t on [0, t_i] in the in Time-Varying CNF and Attentive CNF, so there are N (the number of points) ODEs in the model. This setup is problematic because any points except the 1st are impossible to happen at time 0, so they impossibly possess a mark x^(i) at time 0 (in fact, any time before t_(i-1) is impossible). A more reasonable way to characterize the dynamics of x^(i) is to model the ODE on [t_(i-1), t_i] which is used in the Jump CNF. I understand this setup contributes to the parallel computation with the reparameterization trick. In fact, this is reason why both Time-Varying CNF and Attentive CNF can be computed in parallel, but Jump CNF cannot. The Attentive CNF can be seen as a generalized version of Time-Varying CNF due to the introduction of history dependence, but the Jump CNF is a different model as stated above. +Also, the jump CNF can model the abrupt change of the spatial pdf but the Time-Varying CNF and Attentive CNF cannot. 
Theoretically speaking, the jump CNF should have a more powerful fitting capability (assuming other parts are same) compared with those two models. Why does the Attentive CNF model achieve a better or close performance than jump CNF in most experiments? Does that mean the dynamics in most datasets have no discontinuity? Maybe a simple synthetic experiment with discontinuity in dynamics can help prove this. + +Some specific concerns: some synthetic data experiments with specific setup (e.g. discontinuity) are needed to give a deep understanding of the two proposed spatial CNF models. + +Typo: or-->of, the second line from the bottom in the first page. + 0-->1, the second line of Eq.(19). ",5,4.0,ICLR2021 +S1riI7OxM,3,rk49Mg-CW,rk49Mg-CW,Convincing demonstration of stochastic video predictions on real data,"The submission presents a method or video prediction from single (or multiple) frames, which is capable of producing stochastic predictions by means of training a variational encoder-decoder model. Stochastic video prediction is a (still) somewhat under-researched direction, due to its inherent difficulty. + +The method can take on several variants: time-invariant [latent variable] vs. time-variant, or action-conditioned vs unconditioned. The generative part of the method is mostly borrowed from Finn et al. (2016). Figure 1 clearly motivates the problem. The method itself is fairly clearly described in Section 3; in particular, it is clear why conditioning on all frames during training is helpful. As a small remark, however, it remains unclear what the action vector a_t is comprised of, also in the experiments. + +The experimental results are good-looking, especially when looking at the provided web site images. +The main goal of the quantitative comparison results (Section 5.2) is to determine whether the true future is among the generated futures. While this is important, a question that remains un-discussed is whether all generated stochastic samples are from realistic futures. The employed metrics (best PSNR/SSIM among multiple samples) can only capture the former, and are also pixel-based, not perceptual. + +The quantitative comparisons are mostly convincing, but Figure 6 needs some further clarification. It is mentioned in the text that ""time-varying latent sampling is more stable beyond the time horizon used during training"". While true for Figure 6b), this statement is contradicted by both Figure 6a) and 6c), and Figure 6d) seems to be missing the time-invariant version completely (or it overlaps exactly, which would also need explanation). As such, I'm not completely clear on whether the time variant or invariant version is the stronger performer. + +The qualitative comparisons (Section 5.3) are difficult to assess in the printed material, or even on-screen. The animated images on the web site provide a much better impression of the true capabilities, and I find them convincing. + +The experiments only compare to Reed et al. (2017)/Kalchbrenner et al. (2017), with Finn at al. (2016) as a non-stochastic baseline, but no comparisons to, e.g., Vondrick et al. (2016) are given. Stochastic prediction with generative adversarial networks is a bit dismissed in Section 2 with a mention of the mode-collapse problem. 
+ +Overall the submission makes a significant enough contribution by demonstrating a (mostly) working stochastic prediction method on real data.",7,4.0,ICLR2018 +SJXrqMPgf,1,Hk2MHt-3-,Hk2MHt-3-,This work proposed a reconfiguration of the existing state-of-the-art CNN model using a new branching architecture.,"This work proposed a reconfiguration of the existing state-of-the-art CNN model architectures including ResNet and DensNet. By introducing new branching architecture, coupled ensembles, they demonstrate that the model can achieve better performance in classification tasks compared with the single branch counterpart with same parameter budget. Additionally, they also show that the proposed ensemble method results in better performance than other ensemble methods (For example, ensemble over independently trained models) not only in combined mode but also in individual branches. + +Paper Strengths: +* The proposed coupled ensembles method truly show impressive results in classification benchmark (DenseNet-BC L = 118 k = 35 e = 3). +* Detailed analysis on different ensemble fusion methods on both training time and testing time. +* Simple but effective design to achieve a better result in testing time with same total parameter budget. + +Paper Weakness: +* Some detail about different fusing method should be mentioned in the main paper instead of in the supplementary material. +* In practice, how much more GPU memory is required to train the model with parallel branches (with same parameter budgets) because memory consumption is one of the main problems of networks with multiple branches. +* At least one experiment should be carried out on a larger dataset such as ImageNet to further demonstrate the validity of the proposed method. +* More analysis can be conducted on the training process of the model. Will it converge faster? What will be the total required training time to reach the same performance compared with single branch model with the same parameter budget? +",6,4.0,ICLR2018 +peTyZOxb7gq,2,MCe-j2-mVnA,MCe-j2-mVnA,"Interesting large-scale analysis of learned optimizers, but unfortunately no barriers are overcome","Summary + +This paper attempts to address the fundamental barriers of learned optimization. The authors identify three barriers: computational requirements, number of training tasks and lack of inductive bias. A “large-scale” evaluation and comparison of learned optimizers is then carried out using many (1024) multi-core CPUs. A simple modification of an existing learned optimizer is also proposed that involves adding more input features. Unfortunately the results don’t seem to reveal any new insights. + +Strengths +- The paper proposes a new, simple hierarchical learned optimizer that outperforms existing learned optimizers. The proposed model is very simple in theory but the implementation seems to still require quite a bit of “hand-engineering” in terms of selecting features etc. + + +- The experimental investigation reveals some interesting (albeit unsurprising) insights of large scale training of learned optimizers. These include things like training with more tasks improves performance, that learned optimization performs well in the hyper-parameter regime in which it was trained, that it learns some form of regularization and that it outperforms Adam when a non-optimal learning rate is used. + + +Concerns: +- My main concern with the paper is that some claims are over-blown. 
Although it is not clear at all that the current generation of learned optimizers can outperform hand-crafted optimizers, the paper makes misleading claims that can easily be taken out of context. Statements like, “We see this final accomplishment as being analogous to the first time a compiler is complete enough that it can be used to compile itself.” and “we believe learned algorithms will transform how we train models"" are too strong given the current evidence of the performance of the learned optimizers. I would suggest the authors tone down these claims. + + +- The proposed hierarchical learned optimizer (as well as existing ones) seem to be more fragile than hand-crafted approaches such as Adam. For example, on CIFAR-10 in Figure 5 the learned optimizer fails even for batch sizes in the training regime. Is there any reason why this might be the case, especially considering that it has access to all the same information as Adam and Adam’s “hand-crafted” operations are quite simple? + + +- The disadvantages of the proposed learned optimizer still seems to outweigh the benefits. For example, the “careful tuning of learning rate schedules and momentum timescales” is traded instead for the selection of design of a sufficient range of tasks on which to train the optimizer. This seems to be a far more difficult task than just tuning a few hyperparameters. In addition, although hand-crafted optimizers “do not leverage alternative sources of information beyond the gradient”, the learned optimizers do not do much better and just seem to learn very simple regularization strategies. Currently the discussion on the advantages and disadvantages is completely separate. I think these need to be contrasted and compared on the grounds of what properties a user would prefer in an optimizer. + + +- The contribution of the paper in terms of new insight or knowledge is not clear. The “large-scale” training on a wide range of tasks and many unrolled steps is interesting but I’m not sure what new insights can be inferred from this? Furthermore, the hierarchical optimizer seems to be a small improvement of the mode proposed in (Wichrowska et al., 2017) with some additional input information (like validation loss). +",4,3.0,ICLR2021 +S1gKct_DoX,1,HkeILsRqFQ,HkeILsRqFQ,"Impressive theme and motivation, but limited contribution","This paper insists layer-level training speed is crucial for generalization ability. The layer-level training speed is measured by angle between weights at different time stamps in this paper. To control the amount of weight rotation, which means the degree of angle movement, this paper proposes a new algorithm, Layca. This algorithm projects the gradient vector of SGD (or update vector of other variants) onto the space orthogonal to the current weight vector, and adjust the length of the update vector to achieve the desirable angle movement. This paper conducted several experiments to verify the helpfulness of Layca. + +This paper have an impressive theme, the layer-level training speed is important to have a strong generalization power for CNNs. To verify this hypothesis, this paper proposes a simply SGD-variant to control the amount of weight rotation for showing its impact on generalization. This experimental study shows many insights about how the amount of weight rotation affect the generalization power of CNN family. However, the contribution of this paper is limited. I thought this paper lacks the discussion of how much the layer-level training speed is important. 
This paper shows the Figure 1 as one clue, but this figure shows the importance of each layer for generalization, not the importance of the layer-level training speed. It is better to show how and how much it is important to consider the layer-level training speed carefully, especially compared with the current state-of-the-art CNN optimization methods or plain SGD (like performance difference). + +In addition, figures shown in this paper are quite hard to read. Too many figures, too many lines, no legends, and these lines are heavily overlapped. If this paper is accepted and will be published, I strongly recommend authors choose some important figures and lines to make these visible, and move others to supplementary material.",5,2.0,ICLR2019 +BJx-6PRtnQ,1,rJl-HsR9KX,rJl-HsR9KX,Method motivation and experimental results are not very convincing,"This paper presents a new approach to an active learning problem where the idea is to train a classifier to distinguish labeled and unlabeled datapoints and select those that look the most like unlabeled. + +The paper is clearly written and easy to follow. The idea is quite novel and evokes interesting thoughts. I appreciated that the authors provide links and connections to other problems. Another positive aspect is that evaluation methodology is quite sound and includes comparison to many recent algorithms for AL with neural networks. The analysis of Section 5.5 is quite interesting. +However, I have a few concerns regarding the methodology. First of all, I am not completely convinced by the fact that selecting the samples that resemble the most unlabeled data is beneficial for the classifier. It seem that in this case just the data from under-explored regions will be selected at every new iteration. If this is the purpose, some simpler methods, for example, relying on density sampling, can be used. Could you elaborate how you method would compare to them? I can see this method as a way to measure the representativeness of datapoints, but I would see it as a component of AL, not an AL alone. What would happen it is combined with Uncertainty and you use it to labeled the points that are both uncertain and resemble unlabeled data? +Besides, the proposed approach does not take the advantage of all the information that is available to AL, in particular, it does not use at the information about labels. I believe that labels contain a lot of useful information for making an informed selection decision and ignoring it when it is available is not rational. +Next, I have conceptual difficulties understanding what would happen to a classifier at next iteration when it is trained on the data that was determined by the previous classifier. Seems that the training data is non-iid and might cause some strange bias. In addition to this, it sounds a bit strange to use classification where overfitting is acceptable. +Finally, the results of the experimental evaluation do not demonstrate a significant advantage of the proposed method and thus it is unclear is there is a benefit of using this method in practice. + +Questions: +- Could you elaborate why DAL strategy does not end up doing just random sampling? +- Nothing restrict DAL from being applied with classifiers other than neural networks and smaller problems. How do you think DAL would work on simpler datasets and classifiers? +- How does the classifier (that distinguished between labeled and unlabeled data) deal with very unbalanced classes? I suppose that normally unlabeled set is much bigger than labeled. 
What does 98% accuracy mean in this case? +- How many experiments were run to produce each figure? Are error bars of most experiments so small that are almost invisible? + +Small comments: +- I think in many cases citep command should be used instead of cite. +- Can you explain more about the paragraph 3 of related work where you say that uncertainty-based approach would be different from margin-based approach if the classifier is neural network? +- Last sentence before 3.1: how do you guarantee in this case that the selected examples are not similar to each other (that was mentioned as a limitation for batch uncertainty selection, last paragraph on page 1)? +- It was hard to understand the beginning of 5.5, at first it sounds like the ranking of methods is going to be analysed. +- I am not sure ""discriminative"" is a good name for this algorithm. It suggested that is it opposite to ""generative"" (query synthesis?), but then all AL that rank datapoints with some scoring function are ""discriminative"".",4,4.0,ICLR2019 +rQ5mjcmKv31,2,HkUfnZFt1Rw,HkUfnZFt1Rw,The paper empirically evaluate the distance measure between nodes in a graph under the setting of kernel k-means and LFR data generation.," +1. No good reasons to choose settings of evaluation, particularly kernel k-means and LFR. + +2. I think this paper may think about something very obvious. The setting is kernel k-means, and if the similarity measure is given as some kernel, it might be more clearly shown what kernel is good under what condition, under kernel k-means. In other words, we may find some connection between a kernel and the condition of data generation under the kernel k-means already before doing some experiments. I think this type of investigation is missing in this paper. + +3. So what is the reason why SCCT is the best and/or why highly-ranked methods are so? I think that would be simply connected to the scheme of data generation of LFR or another feature in generating data or noise. I think this point might be obvious but might become some good contribution, while just data generation and comparison would not be something people can say contribution. + + ",5,2.0,ICLR2021 +81o2hcEfXDE,2,szUsQ3NcQwV,szUsQ3NcQwV,Review,"This paper proposes an observation factorization method to avoid the influence of the irrelevant part on value estimation. Specifically, they design an entity-wise attention network with a masking procedure. This network is used to filter the irrelevant part of the original observation of each agent. Then the output is used to estimate the individual q-value, as well as input to the mixing network to generate the Q_tot. Two kinds of Q_tot are trained together by combing two loss functions linearly with a hyper-parameter. Experimental results show REFIL combined with QMIX surpasses vanilla QMIX and VDN in several SMAC scenarios. + +This paper is related to the topics of ICLR. However, I think the related work is not sufficient to cover the background. More detail comments can be found below. + +*****Some specific comments:***** + +It is not clear that what is the initialization of two masks, and how to update the masks. + +The authors mentioned there are two groups of entities. However, the entity type is also unclear. I guess SMAC only contains two entity types: alive agents and died agents? How to represent an entity inactive? + +One question is why just consider two kinds of groups, what would happen if there exist more than two groups for all entities. 
In SMAC or soccer, it does contain more than two common patterns. Furthermore, it seems that the masking procedure is hard to extend to the situation with a larger number of groups. + +Actually, I think REFIL is similar to ROMA [1] and ASN [2] in different ways. First, REFIL considers two kinds of groups corresponding to a simple version of ROMA which has two roles. Second, REFIL does the same thing as ASN that learns the value estimation by considering a more useful part of the observation. ASN directly divide the observation based on the action semantics, while REFIL tries to learn a suitable observation factorization through entity-wise attention with masking. However, these two very relevant works are not discussed and compared in this paper. + +Some suggestions, + +I think current experiments could not well support motivation. If authors show some examples in SMAC that what kinds of common patterns agents learn would be better to support this idea. + +Since REFIL can be integrated into current MARL algorithms, it is better to consider more recent published MARL methods as baselines, such as QTRAN, QATTEN, QPLEX. + +[1] Roma: Multi-agent reinforcement learning with emergent roles. ICML. 2020. + +[2] Action Semantics Network: Considering the Effects of Actions in Multiagent Systems. ICLR. 2020. +",5,3.0,ICLR2021 +SklW-AE5nQ,3,Hkesr205t7,Hkesr205t7,application of multimodal VAE for zero shot learning,"This paper proposes a multimodal VAE model for the problem of generalized zero shot learning (GZSL). In GZSL, the test classes can contain examples from both seen as well as unseen classes, and due to the bias of the model towards the seen classes, the standard GZSL approaches tend to predict the majority of the inputs to belong to seen classes. The paper proposes a multimodal VAE model to mitigate this issue where a shared manifold learning learn for the inputs and the class attribute vectors. + +The problem of GZSL is indeed important. However, the idea of using multimodal VAE for ZSL isn't new or surprising and has been used in earlier papers too. In fact, multimodal VAEs are natural to apply for such problems. The proposed multimodal VAE model is very similar to the existing ones, such as Vedantam et al (2017), who proposed a broad framework with various types of regularizers in the multimodal VAE framework. Therefore, the methodological novelty of the work is somewhat limited. + +The other key issue is that the experimental results are quite underwhelming. The paper doesn't compare with several recent ZSL and GZSL approaches, some of which have reported accuracies that look much better than the accuracies achieved by the proposed method. The paper does cite some of these papers (such as those based on synthesized examples) but doesn't provide any comparison. Given that the technical novelty is somewhat limited, the paper falls short significantly on the experimental analysis.",4,5.0,ICLR2019 +kPn6zkx0bwg,1,rQYyXqHPgZR,rQYyXqHPgZR,"Theoretical analysis partly known, partly erroneous. Algorithmic contribution needs comparison to simpler alternatives and more extensive evaluation.","------------------------------------------ +POST-REBUTTAL COMMENTS + +Thanks for your comments. + +Re: A2, time augmentation in finite-horizon settings increases the size of the state space you need to keep in memory by at most a factor of 2... But in any case, further discussion of this issue will have to wait until the revision with the additional experiments that you mentioned. 
+ +Re: A5, regarding ""D"" being analogous to ""gamma"" -- fair enough, but this approach is still a meaningful baseline to compare against. + +Regarding the statement ""contrary to this, initializing state values pessimistically requires task-specific knowledge"", this isn't accurate. In a MAXPROB instance, initializing the the value function at all states to 0 is a problem-independent pessimistic initialization. Of course, task-specific knowledge may help design a better one, but it isn't _necessary_. + +In conclusion, as the original review mentioned, I believe that the presented ""loop penalty"" idea may well have conceptual merit, but I encourage you to think more carefully how you ""sell"" it, because so far neither the original submission nor the rebuttal present convincing arguments that it is better than the alternatives either theoretically or empirically. + +--------------------------------------------------- + + +The paper argues that in many RL settings the expect discounted reward criterion common in RL is less appropriate than undiscounted success rate maximization, which this paper claims to introduce. The paper argues that the two result in different solutions, points out that success rate maximization can result in instability of existing RL approaches due to introducing ""uniformity"" of state values within state loops, and proposes modified losses for PPO, Monte-Carlo, and Q-learning algorithms that allows them to optimize the success rate more reliably than they are able to using their standard losses. + +The paper's motivation is sound: the discounted-reward criterion is indeed conceptually less appropriate than success rate maximization for goal-directed decision-making problems. Unfortunately, however, the paper's claimed contributions in addressing this issue are not novel, and largely flawed: + + +1) Contrary to the paper's claimed contribution, success rate maximization in MDPs isn't new. Paper [1] introduced this criterion, calling it ""MAXPROB"", and analyzed it mathematically. Without an official name this criterion was known even before that (see, e.g., [1a]). The analysis in [1] focuses on MDP planning, i.e., assumes known model, but the mathematical properties pointed out there that affect value iteration convergence on these MDPs hold in the RL setting just the same. + +2) The submission's claims about the success rate/MAXPROB criterion causing the instability of value iteration based approaches are partly imprecise and partly outright mistaken. In the intro, the paper states, ""this expression belongs to undiscounted problems and the convergence of value iteration +often cannot be guaranteed (Xu et al., 2018)"". First of all, I couldn't find any such claims in that paper. Second, that paper doesn't deal with *finite-horizon* MDPs, whereas this submission does, and the problem is that in finite-horizon MDPs succcess rate maximization poses no issues for value iteration at all. The reason is that value iteration in finite-horizon MDPs such as those in Section 3.1 essentially operates on time-augmented state space, needs a single backward pass from states with 0 steps-to-go to states with T steps-to-go in order to compute the optimal value function, and its convergence doesn't depend on the properties of the reward function. +Note that Section 3.2.C and Figure 3 that it refers to doesn't talk about value iteration's convergence difficulties during success rate maximization in finite-horizon MDPs from Section 3.1 anymore. 
It just states in a hand-wavy way that there are some difficulties with this criterion, but doesn't explain exactly what they are. + +3) The concept of uniformity, as flawed as it is for finite-horizon MDPs defined in Section 3.1, is actually not novel and is subsumed by previously published analysis, again from paper [1]. [1] analyzes the success rate/MAXPROB criterion in *infinite-horizon* undiscounted MDPs with absorbing non-goal states, where vanilla value iteration truly has difficulties with this criterion, and shows that these difficulties are indeed caused by state values being uniform in loops/strongly connected components of the MDP's transition graph. In particular, as shown there, this uniformity introduces additional fixed points of the Bellman backup operator that value iteration relies on, and value iteration can converge to any of these fixed points as a result. + +4) Several of the submission's other theoretical claims are quite sloppy as well. For instance, in the intro there is a statement ""We believe that success rate is different from expected discounted return"". I think this notion of difference should be made sufficiently precise so as to take faith out of the equation. Same goes for the loop penalty surrogate criterion in Section 4.1. Does optimizing it, at least in tabular MDPs using vanilla value iteration, result in obtaining a policy that maximizes success rate? + +5) By itself, the loop penalty criterion looks new. However, a) as mentioned above, it's not clear whether it is a heuristic or has actual optimality guarantees w.r.t. success rate optimization and b) more importantly, papers [1] and [2] suggest at least two alternatives to fixing value iteration's convergence for success rate maximization (in infinite-horizon undiscounted MDPs): + + (a) As shown there, for value iteration/Q-learning-like methods, initializing state values *inadmissibly* (i.e., pessimistically in the face of uncertainty) and amending the Bellman backup operator to deal with ""value uniformity"" yields an optimal algorithm for this criterion. + + (b) Turn the success rate maximization MDP into a stochastic shortest path MDP (see [3] or almost any textbook by Bertsekas and Tsitsiklis) by assigning a positive cost to every action, introducing a ""cap"" D on the highest possible state cost, and minizing the undiscounted expected cost (see stochastic shortest path MDPs with finite dead-end penalty in [2]). D is a hyperparameter, and for *any* such cost assignment there exists a D s.t. the optimal policy for this surrogate MDP will that maximizes success rate in the original MDP. Results from [4] may even help prove something about convergence rate of this approach, although empirically convergence speed and resulting policy (note that there are generally many success rate-maximizing policies) will depend on the specific cost function choice. + +At least method (b) is conceptually simpler than the loop disorientation penalty, and may even be theoretically similar to the latter, and provides a natural baseline for the proposed approach. + +6) Last but not least, the empirical evaluation is too limited to be able to assess the merits of the proposed approach. While there are certainly problems where optimizing for success rate directly is preferable to optimizing the discounted reward, the use of discount factor in RL is important in many problems for mitigate the effects of estimation errors -- see, e.g., Xu et al. 2018 and [5]. 
Therefore, to get a better picture of whether success rate optimization is worth it in practice, one would need a more extensive evaluation on benchmarks such as goal-directed Atari or Procgen games and/or more complex robotics scenarios. + + +Thus, despite studying an interesting topic, I think this work needs to be significantly revised and extended before publication. + + + +[1] Kolobov, Mausam, Weld, Geffner. ""Heuristic search for generalized stochastic shortest path MDPs"" ICAPS-2011 + +[1a] Little, Thiebaux ""Probabilistic Planning vs Replanning"" An ICAPS-2007 workshop + +[2] Kolobov, Mausam, Weld. ""A Theory of Goal-Oriented MDPs with Dead Ends"" UAI-2012 + +[3] Bertsekas, Tsitsiklis. ""An Analysis of Stochastic Shortest Path Problems"" Mathematics of Operations Research, 1991 + +[4] Yu, Bertsekas. ""On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems"" Mathematics of Operations Research, 2013 + +[5] Jiang, Kulesza, Singh, Lewis. ""The Dependence of Effective Planning Horizon on Model Accuracy"" AAMAS-2015",2,5.0,ICLR2021 +y7FGxG175rE,4,aCgLmfhIy_f,aCgLmfhIy_f,Clarity and presentation issues; Unclear whether the gains are from the pre-training stage or not,"This paper proposes a method for learning representations for relation statements as well as classes (prototypes) for relation extraction tasks. I think the key insight is that the relation statements and classes (corresponding to relation types) can be learned jointly using contrastive training objectives. The paper also proposes to use the similarity metric as 1 / (1 + exp(cos(u, v))) and claims that this will provide a clear geometric interpretation and more interpretability. In the experiments, they first train this framework on distantly-supervised data constructed from Wikidata + Wikipedia and then fine-tune it on several downstream tasks: FewRel, SemEval, and a new dataset they created focused on identifying false positives in distantly supervised data. + +Overall, I think the proposed approach is quite reasonable and also seems to work well in the evaluation tasks (learning prototype embeddings instead of taking the average of instance embeddings and adopting contrastive losses between relation statements and prototypes). However, I think the paper has many clarity and presentation issues that make it difficult to evaluate the significance of the work. + +First of all, I think this pre-training stage on weakly-supervised data is very crucial and the details of the data collection (which relations and how many instances have been used) should be moved to the main body of the paper instead of the Appendix. In realistic few-shot scenarios, you only have a very small number of examples for a new relation k so it is difficult to learn the r_k embedding from only a few examples. My interpretation is that for the FewRel evaluation, the relations must have been already seen in the pre-training stage (given both the weakly-supervised data and FewRel are collected based on Wikidata) unless I have misunderstood something (especially that the model can achieve a good accuracy when there is no training data used in Figure 3 & 4). For the SemEval evaluation, I assume the prototype embeddings must have been learned from the training data but it has 6k training examples so it is fine. I think this point really needs to be clarified and can be a weakness of the approach, especially in few-shot settings. 
+ +For the results in Table 1 & 3, it is not very clear to me whether the numbers of previous approaches are from their papers or re-run by the authors. This should be clarified. My main concern is the pre-training data used in this paper can be different from what has been used in (Soares et al, 2019 -- they didn’t use Wikidata and only consider Wikipedia and the links) and it makes the comparisons unfair. + +The authors claimed that this similarity metric is crucial but there is no ablation study or comparison to other alternatives… How if you just compute the dot product between the two embeddings? I think the comparisons to commonly used similarity metrics need to be added to justify why this design choice is important. + +I also don’t know what the L_{CLS} training loss is used for. If z_k is just a set of learnable embeddings (one embedding per relation) and if it is used to predict the relation k, isn’t it just multiplied by another set of K embeddings? What is the benefit here? + +Also, I don’t understand the fixed prototype baseline (denoted as IND). What does “fixed prototypes z that are pre-computed by vectorize extracted patterns for each relation” mean? + +To sum up, I can’t recommend the acceptance of the paper if the above issues cannot be addressed. I am also concerned whether the highlight of the approach is the contrastive losses and prototype embeddings, or it has to be coupled with some type of pre-training (or even specifically pre-training on distantly-supervised data). + +Minor: +- Equation (7): The second S2Z should be S2Z’. +",4,4.0,ICLR2021 +S1KIF7olf,3,ByJWeR1AW,ByJWeR1AW,Interesting empirical study but unclear outcome and results do not seem to support the conclusions,"This paper presents an empirical study of whether data augmentation can be a substitute for explicit regularization of weight decay and dropout. It is a well written and well organized paper. However, overall I do not find the authors’ premises and conclusions to be well supported by the results and would suggest further investigations. In particular: + +a) Data augmentation is a very domain specific process and limits of augmentation are often not clear. For example, in financial data or medical imaging data it is often not clear how data augmentation should be carried out and how much is too much. On the other hand model regularization is domain agnostic (has to be tuned for each task, but the methodology is consistent and well known). Thus advocating that data augmentation can universally replace explicit regularization does not seem correct. + +b) I find the results to be somewhat inconsistent. For example, on CIFAR-10, for 100% data regularization+augmentation is better than augmentation alone for both models, whereas for 80% data augmentation alone seems to be better. Similarly on CIFAR-100 the WRN model shows mixed trends, and this model is significantly better than the All-CNN model in performance. These results also seem inconsistent with authors statement “…and conclude that data augmentation alone - without any other explicit regularization techniques - can achieve the same performance to higher as regularized models…” +",5,4.0,ICLR2018 +NQNzTH8uHIt,1,HHSEKOnPvaO,HHSEKOnPvaO,Novel idea of using random graphs for continual learning,"Summary: + +The paper proposes a novel way of using random graphs to improve task-free continual learning method. 
It builds two random graphs, G and A, based on the similarity between images stored in the memory and those of the current task, and utilizes this relative information to build representations of the images and make predictions. The idea is well-formulated and carried out in a sound way. The graph regularization term resembles knowledge distillation, as the authors also mention, but it serves a different purpose of preserving the covariance structure of the outputs of the image encoders.

The experimental results look quite strong and the ablation study also looks sound. Overall, I find the paper quite strong.

Cons & Questions:

- It seems like there are 4 hyperparameters: the ""temperature"" for the Gumbel-Softmax relaxation and 3 lambdas. Fig 6 shows the effects of $\lambda_G$, but what about the others? How sensitive is the performance with respect to the other hyperparameters? Since the problem setting is a single-pass setting, how are the hyperparameters selected?

- The inference time is a bit longer (though not much longer than others) since the algorithm has to sample graphs 30 times. What happens if the number of samples is less than 30 so that the inference time becomes similar to ER? Would the performance degrade significantly? How critical is the sample size 30?",7,4.0,ICLR2021
Bkf8UgS4l,3,ryMxXPFex,ryMxXPFex,Review,"Paper proposes a novel Variational Encoder architecture that contains discrete variables. The model contains an undirected discrete component that captures the distribution over disconnected manifolds and a directed hierarchical continuous component that models the actual manifolds (induced by the discrete variables). In essence, the model clusters the data and at the same time learns a continuous manifold representation for the clusters. The training procedure for such models is also presented and is quite involved.
Experiments illustrate state-of-the-art performance on public datasets (including MNIST, Omniglot, Caltech-101).

Overall the model is interesting and could be useful in a variety of applications and domains. The approach is complex and somewhat mathematically involved. It's not exactly clear how the model compares or relates to other RBM formulations, particularly those that contain discrete latent variables and continuous outputs. As a prime example:

Graham Taylor and Geoffrey Hinton. Factored conditional restricted Boltzmann machines for modeling motion style. In Proc. of the 26th International Conference on Machine Learning (ICML), 1025–1032, 2009.

Discussion of this should certainly be added.
",8,2.0,ICLR2017
3J_u0t9-3tX,2,N5Zacze7uru,N5Zacze7uru,Difficult to understand and lack of evaluation,"This work is a theoretical paper investigating the sufficient conditions for an equivariant structure to have the universal approximation property. It shows that the recent works Tensor Field Networks and Fuchs et al. are universal under the proposed framework. A simple theoretical architecture is presented as another universal architecture.

Pros:
- Achieving rotation equivariance is important for point cloud networks, as it is the key to improving the expressiveness of point cloud features. Hence the study of the universality of networks with such a property is important to the community.
- Overall, the proposed proof looks plausible to me.
- A minimal universal architecture is proposed that satisfies the D-spanning property. This provides the theoretical starting point to design a more advanced and complex equivariant point cloud network.

Cons:
- The paper is quite difficult to follow. I'm not an expert in group theory and had a difficult time understanding some of the theorems and proofs.
It would be great if the writing could be broken down into more fundamental modules and more illustrations were provided.
- In addition, the paper doesn't provide any evaluation of the proposed new universal architectures. Though this is a theoretical paper, it would be nice to show that the proposed theory has some practical use. For instance, it would be great to provide a simple implementation of the minimal universal architecture and show that it indeed achieves rotation-equivariant features on point cloud data. That would make this work much stronger and more practical.

Minor:
There are some typos in the inset figure on page 1: ""Equivarint"" -> ""Equivariant"".",6,2.0,ICLR2021
SkeXaQV0cr,3,SyeyF0VtDr,SyeyF0VtDr,Official Blind Review #3,"The paper proposes a recurrent and autoregressive architecture to model temporal knowledge graphs and perform multi-time-step inference in the form of future link prediction. Specifically, given a historical sequence of graphs at discrete time points, the authors build a sequential probabilistic approach to infer the next graph using a joint distribution over all previous graphs, factorized into conditional distributions of subject, relation and object. The model is parameterized by a recurrent architecture that employs multi-step aggregation to capture information within the graph at a particular time step. The authors also propose a sequential approach to perform multi-step inference. The proposed method is evaluated on the task of future link prediction against several baselines, both static and dynamic, and an ablation analysis is provided to measure the effect of each component in the architecture.

The authors propose to model temporal knowledge graphs with the key contribution being the sequential inference and the augmentation of the RNN with multi-step aggregation. The paper is well written in most parts and provides adequate details, with some exceptions. I appreciate the extended ablation analysis as it helps to segregate the effect of each component very clearly. However, there are several major concerns which make this paper weaker:

- The paper approaches temporal knowledge graphs in a discrete-time fashion where multiple events/edges are available at each time step. While this is intuitive, the authors fail to position the paper in light of various existing discrete-time approaches that focus on representation learning over evolving graphs [1,2,3,4,5]. Related work mentions that [1] learns evolving representations, but all these methods can do future link prediction and hence this is a big miss for the paper. A discussion and comparison with these approaches is certainly required, as most of the static and dynamic baselines currently compared also focus on learning representations; hence that is not a valid argument to miss the comparison.

- The baselines tested by the authors either support static graphs, support interpolation or support continuous-time data. However, as the authors explicitly propose a discrete-time model starting from Section 3, it is important to perform experiments on at least a few of the discrete-time baselines to demonstrate the efficacy of the proposed method. For instance, the authors can augment the relation as an extra feature or use their encoders and optimization function to perform experiments, e.g. Evolve-GCN only requires replacing the GCN with an R-GCN.
+ +- From the ablation it is clear that aggregation is the most important component as without it, the performance drops much closer to ConvE which is a static baseline and significantly worse than other RE-Net variants. However, the aggregation techniques are not novel contributions but augmentation to the RNN architecture. Hence it is important to show how augmenting aggregation module with other baselines (for instance, ConvE and TA-DistMult)) and the above mentioned discrete baselines would affect the performance of these baselines. + +- While the authors describe attentive Pooling Aggregator, the experiments only show mean aggregator and multi-step one. Is there a reason Attentive pooling is not used for any experiments? +-It appears that global vector H_t is not playing significant role based on ablation study. Can the authors explain why that si the case? Also, what aggregation is used to compute H_t? Is it sum over all previous h_t's? + +- Algorithm 1 is not very clearly explained. When the authors mention that they only use one sample, does that mean a single subject is sampled at each time point t'? If so, how do you ensure the coverage is good across subjects in the newly generated graph? I admit I am not clear on this and would recommend the authors to elaborate in response and also in the paper. Also, the inference computation complexity is concerning. While it seems fine for the provided dataset, most real-world graphs have billion of nodes and I all of E, L and D would be larger for such graphs. This seems to put a strict limitation on scalability of inference module. + +- It is not clear what is the difference between RE-NET and RE-NET w. GT. Could the authors elaborate this more? It seems the authors do not update history when they perform RE-NET w/o multi-step. However, in the RE-NET w. GT, where is the ground truth history used in Algorithm 1? + +- The time span expansion for WIKI and YAGO is very unnatural and it is not clear if these experiments provide any value. For instance, can the authors show that in multi-step inference scheme, they can actually predict events at multiple time points corresponding to time span events in actual dataset? As multiple triplets can appear at consecutive time points, the current modification just makes them equivalent which doesn't seem correct. + +I am willing to revisit my score if the above concerns are appropriately addressed and requested experiments are provided. + +[1] Evolve-GCN: Evolving Graph Convolutional Networks for Dynamic Graphs, Pareja et. al. +[2] DynGEM: Deep embedding method for dynamic graphs, Goyal et. al. +[3] dyngraph2vec: Capturing network dynamics using dynamic graph representation learning, Goyal et. al. +[4] Dynamic Network Embedding by Modeling Triadic Closure Process, Zhou et. al. +[5] Node Embedding over Temporal Graphs, Singer et. al.",3,,ICLR2020 +HJ-Dj_fVe,3,Bkul3t9ee,Bkul3t9ee,"Simple well motivated approach, but requires better references and comparisons to existing methods","The paper tries to present a first step towards solving the difficult problem of ""learning from limited number of demonstrations"". The paper tries to present 3 contributions towards this effort: +1. unsupervised segmentation of videos to identify intermediate steps in a process +2. 
define reward function based on feature selection for each sub-task + +Pros: ++ The paper is a first attempt to solve a very challenging problem, where a robot is taught real-world tasks with very few visual demonstrations and without further retraining. ++ The method is well motivated and tries to transfer the priors learned from object classification task (through deep network features) to address the problem of limited training examples. ++ As demonstrated in Fig. 3, the reward functions could be more interpretable and correlate with transitions between subtasks. ++ Breaking a video into subtasks helps a video demonstration-based method achieve comparable performance with a method which requires full instrumentation for complex real-world tasks like door opening. + +Cons: +1. Unsupervised video segmentation can serve as a good starting point to identify subtasks. However, there are multiple prior works in this domain which need to be referenced and compared with. Particularly, video shot detection and shot segmentation works try to identify abrupt change in video to break it into visually diverse shots. These methods could be easily augmented with CNN-features. +(Note that there are multiple papers in this domain, eg. refer to survey in Yuan et al. Trans. on Circuits and Systems for video tech. 2007) + +2. The authors claim that they did not find it necessary to identify commonalities across demonstrations. This limits the scope of the problem drastically and requires the demonstrations to follow very specific set of constraints. Again, it is to be noted that there is past literature (video co-segmentation, eg. Tang et al. ECCV'14) which uses these commonalities to perform unsupervised video segmentation. + +3. The unsupervised temporal video segmentation approach in the paper is only compared to a very simple random baseline for a few sample videos. However, given the large amount of literature in this domain, it is difficult to judge the novelty and significance of the proposed approach from these experiments. + +4. The authors hypothesize that ""sparse independent features exists which can discriminate a wide range of unseen inputs"" and encode this intuition through the feature selection strategy. Again, the validity of the hypothesis is not experimentally well demonstrated. For instance, comparison to a simple linear classifier for subtasks would have been useful. + +Overall, the paper presents a simple approach based on the idea that recognizing sub-goals in an unsupervised fashion would help in learning from few visual demonstrations. This is well motivated as a first-step towards a difficult task. However, the methods and claims presented in the paper need to be analyzed and compared with better baselines.",4,4.0,ICLR2017 +H1e970N2Fr,1,BJgEd6NYPH,BJgEd6NYPH,Official Blind Review #3,"This paper proposes ellipsoidal trust region methods for optimization on neural networks. This approach is motivated by the adaptive gradient methods and classical trust region methods. The idea of the design is reasonable, but the theoretical and empirical results are not strong. I can not support acceptance for current version. + +Major comments: + +1. Section 4.2 says “Algorithm 1 with A_rms ellipsoids converges with the classical rate O(\epsilon^{-2}, \epsilon^{-3}) thanks to Proposition 2 above and Theorem 6.6.8 in Conn et al. (2000).” I have two questions about this statement. + +a) What is (\epsilon^{-2}, \epsilon^{-3})? Some subscript or min/max notation looks missing. 
+ +b) I have checked Theorem 6.6.8 in Conn et al. (2000). It only shows the limit point of the sequence of iterates is a second order critical point. How to obtain the convergence rate of Algorithm 1 seems not directly. I hope the authors provide detailed derivation and present Theorem 6.6.8 here (Maybe the version of the book I have seen is not identical to yours). + +2. Is there any stop criterion of the sub-problem solver? How the precision of the sub-problem affects the global convergence rate? + +3. The experimental section only reports “log-loss”, which is not enough to deep learning applications. It would be interesting to validate weather the proposed method achieves lower test error than baselines. + + +Minor comments: + +1. The values of kappa in the second and the third sub-figures of Figure 1 should be different. + +2. The box of Proposition 3 is spited into page 14 and 15. + +3. Equation (37) is too long and does not fit within the margins. + +=================================== + +The authors have provided detailed derivation of the key theorem. I decide to raise my rating to 3. +",3,,ICLR2020 +45G0C9kZyiJ,1,enhd0P_ERBO,enhd0P_ERBO,RL approach to solve various VRPs,"Summary +-------------- +The paper presents a reinforcement learning approach to learn a routing policy for a family of Vehicle Routing Problems (VRPs). More precisely, the authors train a model for the min-max capacitated multi vehicle routing problem (mCVRP), then use it to solve variants of the problem that correspond to various VRP problems (with a single vehicle, no capacity constraints, no fueling stations, etc). They use a GNN to represent the states and the PPO algorithm to learn the policy. They validate their approach on both random instances and literature benchmarks. + +Strong points +------------------- +1. The goal of using a single policy to solve a variety of routing problems +2. First RL-based approach to tackle the multiple vehicle setting of VRPs +3. Extensive numerical experiments on both randomly generated and benchmark instances + + +Weak points +----------------- +4. There are a lot of imprecisions/typos/lack of definitions/mathematical imprecisions (see Feedback to improve the paper) +5. According to the definition of the rewards, the expected return of the policy would be: $\sum_v \sum_t (r^v_{visit} + r^v_{refuel})$. It is important to note that this does not correspond to the objective function of the mCRVP problem. This choice of reward does not look natural to me and it would be useful to better motivate it. +6. Sec 3.3: “When an vehicle node i reaches the assigned customer node, an event occurs and the vehicle node i computes its node embedding and selects one of its feasible action…”. This is crucial but not clear to me. At a step t, some vehicles might still be in between two cities. How is that taken into account in the state s_t? With the definition of the transition function, I understand that when an action is taken at t, the vehicle arrives at destination at t+1. Does it mean that you sequentially assign only one vehicle to a city and then “wait for it to arrive” before computing the next assignment? In that case what are the events about? +7. Tables 1, 2, 3: to be relevant the results should averages over a number of random instances. Maybe it’s already the case but it is not mentioned. +8. Sec 3.4 about the training should be more precise. It was difficult for me to find the relevant information because it was scattered at different places in the (10 pages) appendix. +9. 
The authors do not mention whether they will share their code. + +Recommendation +------------------------- +I would vote for reject. Although the problem addressed is interesting, the paper is not well-written and there are too many typos and missing explanations to understand the method. + +Arguments for recommendation: +---------------------------------------------- +see weak points + + +Questions to authors +----------------------------- +10. What prevents a vehicle from getting to a customer and then not having enough fuel to go back to a refuelling station? +11. Table 1: are the results averaged over a number of random instances? +12. To improve the results on TSP and VRP, have you tried including instances with only 1 vehicle during training? + + + +Feedback to help improve the paper +--------------------------------------------------- +13. In the formal definition of the mCRVP (Section 2.1), there is no mention that the customers should all be visited. As it is presented, the problem is solved trivially by all vehicles doing nothing. +14. In the state definition, in the vehicle state “x_v^t is the allocated node that vehicle v to sist”. This is important and not clear. I think the authors mean that it’s the node where the vehicle v is currently located. But then in customer state, x_c^t is a location. V^c should depend on t. +15. Action a_t in {V_c union V_R} is not mathematically correct with the definition of V_C and V_R. The update of the v_t^c variable is missing (although in definition there was no dependence on t) +16. In the definition of the refuelling rewards (section 2.2.3) +(a) the index t is missing in q^v. The rewards should also be dependent on t. +(b) What values for \alpha? If F_v < 1 then the reward is negative or might be undefined if F_v =1 +17. Section 2.3: TSP is a special case of mCRVP. The fact that there is no fuel constraint is missing +18. Equation (2) what is R_C. How is it chosen? +19. Equation (3) R_R = F^v for which v? +20. Right after equation (5),\alpha_ij is defined as softmax(e_ij) but then e_ij is defined as a function of \alpha +21. Equation 7, I believe the sum should be over N_C(i) \union N_R(i) + + +",5,5.0,ICLR2021 +S1gAnrEtdr,1,Syx_f6EFPr,Syx_f6EFPr,Official Blind Review #3,"1. Summary +The authors propose a scheme for simultaneous dictionary learning and classification based on sparse representation of the data within the learned dictionary. Their goal is achieved via optimization of a three-part training cost function that explicitly models the accuracy and sparsity of the sparse model, simultaneously with usual classification error minimization. An alternating optimization algorithm is used, alternating between sparse representations and other parameters. + +The problem they want to address with this scheme is training over incomplete/partial feature vectors. A clean theoretical statement is provided that provides conditions under which a classifier trained via partial feature vectors would do no better in terms of accuracy, had it been trained on complete feature vectors. The authors claim this condition can be checked after training, although this ability is not validated/illustrated numerically. + +2. Decision and Arguments +Weak Accept +a) Very nice mathematical result and justification. The proof is clear. However, why wasn’t the result used in the numerical section? You claim that the condition (4) can be evaluated to test optimality: how do your trained dictionaries compare to say another dictionary learning scheme? 
After all, this doesn’t seem to inform your actual learning scheme and is not used to evaluate your results. It would be nice to see a numerical validation/illustration of this result. Also is \delta_K really that easy to compute? +b) The numerical results are good—but lack error bars and comparators. I don’t understand why you considered so many comparators on synthetic data and none on more ‘realistic’ benchmark data. +c) Also I don't feel very satisfied doing image examples, it would be more interesting to work on difficult (eg medical) classification problems with large feature vectors + +4. Additional Feedback +a) Very well and clearly written with intuitive examples and clean math. My only suggestion is to clarify in the abstract that there are missing *features*-- when I first read ""incomplete data"" I think of entire data samples that are missing from the training set. That makes no sense, but it became clear when I got to section 2. +b) Please use markers and dashes etc. with *every* line in your plots. It is hard (or impossible for many) to compare as is with such tiny thin lines. + +5. Questions +a) Could you comment on statistical significance of your results? For synthetic data it should be easy to perform the experiments on a number of mask realizations and include error bars. +b) Why no comparators on benchmark datasets? +c) The proof of Thm 3.2 is nice—but how reasonable is the assumption that you have two dictionaries each with the exact same RIP constant \delta_K? Can that property be enforced (even approximately) during training? Or is it trivial that, given one such dictionary, there exists a second one? +d) See 2.a +",6,,ICLR2020 +HJHEwdBEl,2,SJTQLdqlg,SJTQLdqlg,,"A new memory module based on k-NN is presented. +The paper is very well written and the results are convincing. + +Omniglot is a good sanity test and the performance is surprisingly good. +The artificial task shows us that the authors claims hold and highlight the need for better benchmarks in this domain. +And the translation task eventually makes a very strong point on practical usefulness of the proposed model. + +I am not a specialist in memory networks so I trust the authors to double-check if all relevant references have been included (another reviewer mentioned associative LSTM). But besides that I think this is a very nice and useful paper. I hope the authors will publish their code.",8,3.0,ICLR2017 +3J_u0t9-3tX,2,N5Zacze7uru,N5Zacze7uru,Poorly motivated and unclear contributions,"This paper proposes an MPC algorithm based on a learned (neural network) Lyapunov function. In particular, they learn both the Lyapunov function and the forward model of the dynamics, and then control the system using an MPC with respect to these models. + +Cons +- Poorly written +- Unclear connections to related work +- Weak experiments + +It is unclear exactly what problem the authors are attempting to solve. In general, the authors introduce a large amount of notation and theory, but very little of it appears to be directly related to their algorithm. For example, they refer to the stability guarantees afforded by Lyapunov functions, but as far as I can tell, they never prove that their algorithm actually learns a Lyapunov function (indeed, Lemma 1 starts with “Assume that V(x) satisfies (5) [the Lyapunov condition] ...”). + +Similarly, they allude to “robustness margins to model errors”, but nothing in the algorithm actually takes into account model errors. Is the point of these margins just to show that they exist? 
If so, it’s not clear the results (either theoretical or empirical) are very meaningful, given that they depend on the unknown model error (which they assume to be bounded). + +In addition, the different loss functions they use (e.g., (10)) are poorly justified. Why is this loss the right one to use to learn a Lyapunov function? + +Furthermore, the authors’ approach is closely related to learning the value function and planning over some horizon using the value function as the terminal cost (indeed, the value function is a valid Lyapunov function, but not necessarily vice versa). For instance; + +Buckman et al., Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion. In NeurIPS, 2018. + +The most closely related work I’m aware of is the following: + +Deits et al., Lvis: Learning from value function intervals for contact-aware robot controllers. In ICRA, 2019. + +The authors should clarify their contributions with respect to these papers. More importantly, the authors should discuss their motivation for indirectly learning a Lyapunov function instead of simply learning the value function (which appears to be more natural and potentially more effective). + +Next, the authors’ experiments are very weak. They only consider two environments, the inverted pendulum and car, both of which are very simple. The inverted pendulum starts near the unstable equilibrium, which further trivializes the problem. In addition, they do not even appear to give the dynamics model of the car they are using (or the state space). + +Finally, this paper is poorly written and hard to follow. They provide a lot of definitions and equations without sufficient explanation or justification, and introduce a lot of terminology without giving sufficient background. + +------------------------------------------------------------------------------------------------------------------------------------------------ + +Post rebuttal: While I appreciate the authors' comments, they do not fundamentally address my concerns that the paper is too unclear in terms of the meaning of its technical results to merit acceptance. As a concrete example, in their clarification, the authors indicate that they obtain ""probabilistic safety guarantees"" by checking the Lyapunov condition (5) using sampling. However, at best, sampling can ensure that the function is ""approximately"" Lypaunov (e.g., using PAC guarantees) -- i.e., satisfies (5) on all but 1-\epsilon of the state space. + +Unfortunately, an ""approximately"" Lyapunov function (i.e., satisfies the Lyapunov condition (5) on 1-\epsilon of the state space) provides *zero* safety guarantees (not even probabilistic safety at any confidence level). Intuitively, at each step, the system has a 1-\epsilon chance of exiting a given level set of the Lyapunov function. These errors compound as time progresses; after time horizon T, only 1 - T * \epsilon of the state space is guaranteed to remain in the level set, so eventually the safety guarantee is entirely void. + +One way to remedy this is if the Lyapunov function is Lipschitz continuous. However, then, the number of samples required would still be exponential in the dimension of the state space. At this point, existing formal methods tools for verifying Lyapunov functions would perform just as well if not better, e.g., see: + +Soonho Kong, Sicun Gao, Wei Chen, and Edmund Clarke. dReach: δ-Reachability Analysis for Hybrid Systems. 2015. 
+ +This approach was recently applied to synthesizing NN Lyapunov functions (Chang et al. 2019). My point isn't that the authors' approach is invalid, but that given the current writing it is impossible for me to understand the theoretical properties of their approach. + +Overall, I think the paper may have some interesting ideas, but I cannot support publishing it in its current state",3,4.0,ICLR2021 +m1EA9JHYyjt,2,vkxGQB9f2Vg,vkxGQB9f2Vg,"The authors propose a general framework for deriving stochastic back-propagation rules, reconnecting with known rules and proposing a general method for deriving new ones. ","In the present work, the authors present a general method for deriving stochastic back-propagation rules, using the link between Fourier transforms and the characteristic function associated to the random variable probability distribution and transferring the derivative directly to the random variable. They are thus able to encompass many well known back-propagation rules from the literature and derive new ones for different special cases. +The presented numerical experiments show how this method can be used to match state-of-art performance in simple models. +The authors also briefly discuss the bottlenecks associated to this method and propose some workarounds and rules of thumb (e.g., truncation of the series expansion entailed in the method) that seem to allow robust and efficient optimization. +Finally, the authors highlight the fact that deterministic neural networks and usual back-propagation can also be framed by the same method, by simply considering a Dirac's delta probability distribution for the parameters. +I think the paper is well written and that the presented framework nicely connects many ideas and methods developed in the literature in the past decades. It seems clear that methods based on the reparametrization trick will always be more viable for deep models, but they are based on ad hoc rules. The presented method is instead general and might be more effective in special cases where an early truncation of the series is justified. Personally, I would only put less accent on the link with deterministic back-propagation, which looks more like a sanity check than an important results.",6,3.0,ICLR2021 +B1xhJIFQpm,2,S1gBz2C9tX,S1gBz2C9tX,Simple and interesting method; some questions with the main theoretical results; would like to see the comparison with FQI,"This paper introduces the concept of Sampling Importance Resampling (SIR) and give a simple method to adjust the off-policyness in the TD update rule of (general) value function learning, as an alternative of importance sampling. The authors argue that this resampling technique has several advantages over IS, especially on the stability with respect to step-size if we are doing optimization based the reweighted/resampled samples. In experiment section they show the sensitivity to learning rate of IR TD learning is closer to the on-policy TD learning, comparing with using IS or WIS. + +Main comments: +The proposed IR technique is simple and definitely interesting in RL settings. The advantage about sensitivity of step-size choice in optimization algorithm looks appealing to me, since that is a very common practical issue with IS weighted objective. However I feel both the theoretical analysis and empirical results will be more convinced to me if a more complete analysis is presented. 
Especially considering that the importance resampling itself is well known in another field, in my point of view, the main contribution/duty of this paper would be introducing it to RL, comparing the pros/cons with popular OPPE methods in RL, and characterize what is the best suitable scenario for this method. I think the paper could potentially do a better job. See detailed comments: + +1. The assumption of Thm 3.2 in main body looks a little bit unnatural to me. Why can we assume that the variance is bounded instead of prove what is the upper bound of variance in terms of MDP parameters? I believe there exists an upper bound so that result would be correct, but I’m just saying that this should be part of the proof to make the theorem to be complete. +2. If my understanding to section 3.3 is correct, the variance of IR here is variance of IR just for one minibatch. Then this variance analysis also seems a little bit weird to me. Since IR/IR-BC is computed online (actually in minibatch), I think a more fair comparison with IS/WIS might be giving them the same number of computations over samples. E.g. I would like to see the result of averaged IR/IR-BC estimator (over n/k minibatch’s) in either slicing window (changed every time) or fully offline buffer, where n is the number of samples used in IS/WIS and k the size of minibatch. I think it would be more informative than just viewing WIS as an (upper bound) benchmark since it uses more samples than. +3. From a higher level, this paper considers the problem of learning policy-value function with off-policy data. I think in addition to TD learning with IS adjustment, fitted Q iteration might be a natural baseline to compare with. It is also pretty widely-used and simple. Unlink TD, FQI does not need off-policy adjustment since it learns values for each action. I think that can be a fair and necessary baseline to compare to, at least in experiment section. +4. A relatively minor issue: I’m glad to see the author shows how sensitive each method is to the change of learning rate. I think it would be better to show some results to directly support the argument in introduction -- “the magnitude of the updates will vary less”, and maybe some more visualizable results on how stable the optimization is using IS and IR. I really think that is the most appealing point of IR to me. + +Minor comments: +5. The authors suggest that the second part of Var(IR), stated in the fifth line from the bottom in page 5, is some variability not related to IS ratio but just about the update value it self. I think that seems not the case since the k samples (\delta_ij’s, j=1 to k) actually (heavily) depend on IS raios, unless I missed something here. E.g. in two extreme case where IS weights are all ones or IS weights are all zero except for one (s,a) in the buffer, the variance is very different and that is because of IS ratio but not the variance of updates themselves. +6. Similar with (5), in two variance expressions on the top of page 6, it would be better to point out that the distribution of k samples are actually different in two equations. One of them is sampled uniformly from buffer and the other is proportional to IS ratios. +7. I think it is a little bit confused to readers when sometimes both off-policy learning and off-policy policy evaluation are used to describe the same setting. 
I would personally prefer use off-policy (policy) learning only in the “control” setting: learning the optimal policy or the optimal value function, and use the term off-policy policy evaluation referring to estimating a given policy’s value function. Though I understand that sometimes we may say “learning a policy value function for a given policy”, I think it might be better to clarify the setting and later use the same term in the whole paper. + +Overall, I think there are certainly some interesting points about the IR idea in this paper. However the issues above weakens my confidence about the clarity and completeness of the analysis (in both theory and experiment) in this paper.",5,3.0,ICLR2019 +B1e3e6yTKH,1,BJedt6VKPS,BJedt6VKPS,Official Blind Review #3,"The paper proposes a set of rules for the design and initialization of well-conditioned neural networks by naturally balancing the diagonal blocks of the Hessian at the start of training. Overall, the paper is well written and clear in comparison and explanation. However, the reviewer is concerned with the following questions: +Assumption A2 does not make sense in the context. In particular, it is not praised to assume it only for the convenience of computation without giving any example when the assumption would hold. Also the assumptions are vague and hard to understand, it is better to have concrete mathematical formulation after text description. +Are the experiment results sensitive to the choice of different models with different width and layers or different batch sizes? Does it have a strong improvement than random initialization? It’s less clear the necessity of guaranteeing well-conditioned at initialization since during the training procedure, the condition number is harder to control. +",1,,ICLR2020 +SJxbm19sKB,1,rye5YaEtPr,rye5YaEtPr,Official Blind Review #3,"In the setting of online convex optimization, this paper investigates the question of whether adaptive gradient methods can achieve “data dependent” logarithmic regret bounds when the class of loss functions is strongly convex. To this end, the authors propose a variant of Adam - called SAdam - which indeed satisfies such a desired bound. Importantly, SAdam is an extension of SC-RMSprop (a variant of RMSprop) for which a “data independent” logarithmic bound was found. Experiments on optimizing strongly convex functions and training deep networks show that SAdam outperforms other adaptive gradient methods (and SGD). + +The paper is very well-written, well-motivated and well-positioned with respect to related work. The regret analysis of SAdam is conceptually simple and elegant. The experimental protocol is well-detailed, and the results look promising. In a nutshell, this is an excellent piece of work. + +I have just a minor comment. In the experiments, SAdam was tested using $\beta_1 = 0.9$ and $\beta_{2t} = 1 - \frac{0.9}{t}$. Since Corollary 2 covers a wide range of admissible values for these parameters, it would be interesting to report (for example in Appendix) a sensitivity analysis of SAdam, using different choices of $\beta_1$ and $\beta_{2t}$. +",8,,ICLR2020 +rJlyvMZy5B,2,ryxjnREFwH,ryxjnREFwH,Official Blind Review #1,"This paper discusses an extended DSL language for answering complex questions from text and adding data augmentation as well as weak supervision for training an encoder/decoder model where the encoder is a language model and decoder a program synthesis machine generating instructions using the DSL. 
They show interesting results on two datasets requiring symbolic reasoning for answering the questions. + +Overall, I like the paper and I think it contains simple extensions to previous methods referenced in the paper enabling them to work well on these datasets. + +A few comments: + +1 - Would be interesting to see the performance of the weak supervision on these datasets. In other words, if the heuristics are designed to provide noisy instruction sets for training, we need to see the performance of those on these datasets to determine if the models are generalising beyond those heuristics or they perform at the same level which then may mean we don't need the model. + +2 - From Tab 4, it seems the largest improvements are due to span-selection cases as a result of adding span operators to the DSL. A deep dive on this would be a great insight (in addition to performance improvement statements on page 7). + +3 - Since the span and value operators require indices to the text location, could you please clarify in the text how that is done? Do LSTMs output the indices or are you selecting from a preselection spans as part of preprocessing? + +",6,,ICLR2020 +kswogymT5UZ,1,Bi2OvVf1KPn,Bi2OvVf1KPn,"Simplistic assumptions, misses important prior work","# Summary + +The papers studies the problem of robust machine learning, where the labels of the a fraction of samples are arbitrarily corrupted. The paper proposes an algorithm to tackle this problem and evaluates it on a standard datasets. + +# Positives + +The paper studies an important problem prevalent in modern machine learning, and proposes two algorithms to solve these problems. The experiments suggest that the proposed algorithm is better than the baselines. + +# Negatives + +The paper does not cite highly relevant papers, overclaims its results, and the theoretical results in this paper are immediate. Moreover, the paper is not well-written. More details are given below: ++ Page 1: ""Instead of developing an accurate criterion for detection corrupted samples, we adopt a novel perspective and focus on limiting the collective impact of corrupted samples during the learning process through robust mean estimation of gradients."" ++ This is not a novel perspective and has been known in robust machine learning community for some time [1,2]. These papers have the same underlying idea, but they are not discussed in this paper. [1] is only briefly mentioned in Remark 2, but the comparison is not fair. The results in [1] hold under fairly general conditions, where the results in this paper require the gradient to be uniformly bounded, which makes the problem significantly simple. ++ Theorem 2 is a trivial result, well-known in field. Moreover, the way it is presented is misleading and confusing. The error would depend on the quantile of norms in G, which has been hidden under the O(.) notation. The proof is also missing from the paper. ++ Assumption 1, i.e., Lipschitz continuity of the loss function is very restrictive, which is not satisfied by popular choices of loss function. This assumption trivializes the problem and restricts its applicability. ++ In the same vein, Theorem 3 assumes unrealistic assumptions. The assumption that $||W||_{op} \leq C$ is very restrictive and does not hold for usual learning tasks. This assumption in a sense is restricting that the covariates x in $R^d$ have bounded norms, whereas the norm of a typical vector in $R^d$ increases as $\sqrt{d}$. + +# Score +I propose to reject this paper. 
Prior work ([1,2]) has studied this problem in a much greater generality, which are not discussed in this work. The assumptions in the present work are severely restrictive. +## Other major comments: ++ Robust linear regression, with arbitrary corruptions in responses, has been extensively studied in the literature but they have not been cited. For example, see [3,4]. In particular, the least trimmed squares is an algorithm that removes outliers based on loss values, and comes with a theoretical guarantee via an alternating minimization algorithm [3,4]. ++ Theorem 1 is a folklore, and this should be reflected in main text. Currently, this information is only given in Appendix. ++ The paper is not well written: + 1.Proof of Theorem 2 is missing. + 2. $O(.)$ notation hides the dependence on the important quantity in the papers. + 3. Important notations have not been defined in the paper. + 4. Abbreviations should not be used, for example, Thm., Algo., Asm., etc. + 5. There are numerous typos and grammatical errors. For example, ""has a remarkably impact"". + +## Relevant papers +1. Diakonikolas, I., G. Kamath, D. M. Kane, J. Li, J. Steinhardt, and A. Stewart. “Sever: A Robust Meta-Algorithm for Stochastic Optimization.” In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 97:1596–1606. Proceedings of Machine Learning Research. PMLR, 2019. http://proceedings.mlr.press/v97/diakonikolas19a.html. +2. Prasad, A., A. S. Suggala, S. Balakrishnan, and P. Ravikumar. “Robust Estimation via Robust Gradient Estimation.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82, no. 3 (July 2020): 601–27. https://doi.org/10.1111/rssb.12364. +3. Bhatia, K., P. Jain, P. Kamalaruban, and P. Kar. “Consistent Robust Regression.” In Advances in Neural Information Processing Systems 30, NeurIPS 2017, 2110–2119, 2017. http://papers.nips.cc/paper/6806-consistent-robust-regression. +4. Bhatia, K., P. Jain, and P. Kar. “Robust Regression via Hard Thresholding.” In Advances in Neural Information Processing Systems 28, NeurIPS 2015, 721–729, 2015. http://papers.nips.cc/paper/6010-robust-regression-via-hard-thresholding. +",3,5.0,ICLR2021 +Rg68wP-ai3V,1,_WnwtieRHxM,_WnwtieRHxM,Interesting problem but the paper is not clea,"It is now well-understood that when the data are linearly separable, gradient descent over the linear class of functions converges toward the hard margin solution. It highlights the implicit bias of gradient descent. Among all solutions interpolating the dataset, gradient descent selects the one with larger margin, partly explaining why over-parametrized models may generalize. The picture for non-linear class is a little bit more complicated. +In this paper, the authors study the impact of importance weighting on the implicit bias of gradient descent for both linear and non linear predictors. The problem is interesting and non trivial because importance weighting may affect the geometry of the gradient descent. In particular, a natural question is the following: does the importance weighting affect the margin of the solution (and thus change its generalization properties ?). Note that the authors does not clearly define what is the goal of the paper and to which problem there are trying to answer + + +Here are some questions and remarks about the paper: + +- You study importance weighting for neural-networks: is it used in practice for these class of functions ? Could you give some references ? + +- The introduction is not clear. 
For example the first paragraph aims to define importance sampling. Introduce a formal definition. You sentence ""Importance weighting is a standard tool used to estimate a quantity under a target distribution while +only the source distribution is accessible"" is not clear at all. Then, you present exploratory ideas before clearly presenting the contributions of the paper. We are completely lost ... You should clearly define what is the goal of the paper and how you achieve it. I had to read 3 times before understanding what you really wanted to do + +- Do you focus only on the binary case ? It is one again not clear ""For the sake of notation, we mostly study the binary setting"", if so just say it. + +- You use concepts that you do not define clearly: what does mean that $\mathcal D$ is separated by $f(\theta^{t},x)$ at some point $x$ ? Do you mean that there exists $\theta^{t}$ such that $y_if(\theta^{t},x_i) \geq 0$ ? + +- The paragraph after claim 1 is not clear at all (and you forgot the exponent $\alpha$ on the norm of $\theta$). I really don't get what you are trying to say here ... + +- There are problems with your bibliography (page 3 paragraph beginning section 3, page 12 paragraph beginning A.1.3 for examples) + +- Sometimes you write $\|\cdot \|$ and sometimes $\|\cdot \|_2$. Keep the same notation along the paper + + +The proofs of the paper are a little bit hard to follow but seem to be correct. I also have an other question. Would you be able to generalize the analysis without the condition $w_i \in [1/M,M]$ ? This condition is restrictive and we cannot use $w_i = 0$ which can be very useful for robust purposes. So is it possible to look at the condition $w_i \leq M$ instead ? It is, in my opinion, an interesting question. + + +To summarize, the paper tackles an interesting problem (but in the restrictive setting $w_i \in [1/M,M]$)? However it paper is poorly written and not clear at all. I had the feeling that the authors were completely in a rush and did not organize the paper. Instead they put ideas together, without a clear common thread. I think you should focus on the form of the paper and then resubmit it for publication. There are too many problems right now. Work on the clarity of your ideas. ",7,4.0,ICLR2021 +rJldqyBDnX,1,r1f78iAcFm,r1f78iAcFm,Review,"The paper provides a new system that combines a number of neural networks to predict chemical reactions. The paper brings together a number of interesting methods to create a system that outperforms the state of the art. + +Good about this paper: + - reported performance: the authors report a small but very consistent performance improvement. + - the authors propose an approach that puts together many pieces to become an effective approach to chemical reaction prediction. + - + +Problematic with this paper + - this paper is impossible to understand if you only refer to the ten pages of content. There are at least 5 pointers in the paper where the authors refer to the appendix for details. Details in many of these cases are necessary to even understand what is really being done: p3: rewards p4: message passing functions, p5: updating states, p9: training details. Further, The paper has some details that are unnecessary - e.g. the discussion of global vs. local network on p4 - this could go into the appendix (or be dropped entirely) + + - the model uses a step-wise reward in the training procedure (p3) -> positive reward for each correct subaction. 
It is not clear from the paper whether the model requires this at test time too (which should not be available). It's not clear what the authors do in testing. I feel that a clean RL protocol would only use rewards during training that are also available in testing (and a final reward) + + - eq 7: given there is an expontential in the probability - how often will the sampling not pick the top candidate? feels like it will mostly pick the top candidate. + + - eq 9: it's unclear what this would do if the same pair of atoms is chosen twice (or more often) + + - the results presented in table 3: it appears that GTPN alone (and with beam search) is worse than the previous state of the art. only the various post processing steps make it better than the previous methods. It's not clear whether the state of the art methods in the table use similar postprocessing steps or whether they would also improve their results if the same postprocessing steps were applied. + + +minor stuff: +p2: Therefore, one can view GTPN as RL -> I don't think there is a causality. Just drop ""Therefore"" +p2: standard RL loss -> what is that? +eq. 2: interestin gchoice to add the vectors - wouldn't it be easier to just concatenate? +p4: what does ""NULL"" mean? how is this encoded? +p4 bottom: this is quite uncommon notation for me. Not a blocker but took me a while to parse and decrypt. +p5: how are the coefficients tuned?",5,4.0,ICLR2019 +reSiiImfcIO,1,J7bUsLCb0zf,J7bUsLCb0zf,"Simple idea, good manuscript.","### Paper summary + +This work proposes LeVER, a method that modifies general off-policy RL algorithms with a fixed layer freezing policy for early embedding layers (in this particular case, a few early layers of a CNN). As a direct consequence, the method enables to store embeddings in the experience replay buffer rather than observations, with a potential decrease in memory required, as well as providing a boost in clock time due to fewer gradient computations needed for every update. The method is benchmarked with a couple of off-policy RL algorithms against a few different environments. + +### Good things + +- The approach is extremely relevant to most of the RL community. We are training for longer periods of time and generating significantly more data than we did a few years ago, so any method that enables to increase training efficiency is extremely welcomed. +- The method is simple, but clever, and the manuscript quite nicely details the steps taken to improve learning stability (which can arise due to the obvious possibility of bad model freezes). +- The coverage of related work in the literature throughout the manuscript is excellent, and provides enough pointers for the reader to understand how the manuscript is placed in it. +- The experimental setting clearly states hypotheses and questions that will be answered. +- Section 5.4 convincingly argues that the freezing method is empirically justified. + +### Concerns + +This is a good paper, so generally I don't have any strong negative thoughts, however I think it would be good to report how the method does when different choices are made with respect to how much of the network is frozen. +That is, in the proposed experiment setting the choices were reasonable but nonetheless a little arbitrary, so knowing a little bit more about learning dynamics with this approach would probably make the paper stronger and more robust for future readers. + +### Questions: + +- I wonder whether the authors would shed more details on the transfer learning setting (e.g. 
whether the transfer capacity changes wrt. changes in method hyperparameters such as freezing time, different saved embedding functions, etc.), and whether the results do generally show up in more environments/algorithms. +- The reduction in variance after the freezing is interesting; I wonder if the plots could show all the single runs, and whether the authors have any explanations for this somewhat consistent (!) change in learning behaviour. +",7,4.0,ICLR2021 +By5ZMrqxG,2,SJZ2Mf-0-,SJZ2Mf-0-,Review,"The authors propose a model for QA that given a question and a story adaptively determines the number of entity groups (banks). The paper is rather hard to follow as many task specific terms are not explained. For instance, it would benefit the paper if the authors introduced the definitions of a bank and a story. This will help the reader have a more comprehensive understanding of their framework. + +The paper capitalized on the argument of faster inference and no wall-time for inference is shown. The authors only report the number of used banks. What are the runtime gains compared to Entnet? +This was the core motivation behind this work and the authors fail to discuss this completely.",4,3.0,ICLR2018 +SygZYijTFH,1,ryxsUySFwr,ryxsUySFwr,Official Blind Review #2,"The authors present the empirical observation that regression models trained using weight decay tend to produce high-level features that lie on low-dimensional subspaces. They then use this observation to propose an algorithm for detecting out of distribution data by fitting a simple GMM to the features produced during training. + +The main contribution of the paper is a way of calculating the ""OOD scores"" that are reported in the different figures. However, after reading the paper 3 times I'm still unable to find a definition of these scores. I'm guessing it's negative log-likelihood under the fitted GMM, but I'm unsure. + +The observation (features on low-dimensional manifolds) is interesting, and the proposed algorithm is potentially useful. However the experiments are quite limited, with only 2 training datasets. The analysis of why/when the algorithm is expected to work and how this depends on the model is shallow. The two main results are for models pre-trained on Imagenet: presumably this has a huge impact on the features that end up being used for OOD detection, but this is not discussed in the paper at all. + +Questions to the authors: +- Why focus only on regression? It seems to me that your analysis for the features lying on a low-dimensional subspace should also apply to classification models. +- How do your results depend on the specific models that are used? I can come up with models with very few final features where your method would not work. Do you have any analysis or guidance? +- What do all abbreviations in the tables mean? Please make the captions more informative so the reader does not need to search in the main text.",3,,ICLR2020 +OYvz-Toq7qH,2,cTQnZPLIohy,cTQnZPLIohy,Learning symmetries through Lie-algebra convolutions,"Summary: + +The paper meticulously builds a theoretical framework for Lie-algebra convolutional layers and then goes on to show how CNNs, GCNs and FC layers are a special case of L-convs. The paper also demonstrates how the underlying generators can be learnt from data and provide convincing supporting experimental results. The proposed L-conv layers also use much fewer parameters compared to previous works. 
+ +Key strengths: + +The theoretical framework developed in this paper, starting from Lie groups and equivariance and invariance definitions is very elegant and convincing. I checked the maths at each step and am convinced that it is correct, to the best of my knowledge. I did need to refer to Hall 2015 though. Intuitively as well as mathematically, it makes sense to me. The comparison to MSR (Zhou 2020) seems fair to me. +The experiments, though limited, in the main paper, are quite convincing. The experiments are cleverly constructed and provide enough justification to support the utility of L-conv layers in comparison to CNN and FC layers. + +Questions: + +For Figure 2: For CIFAR100 and FashionMNIST, CNN seems to do better on ""rotated+scrambled"" compared to ""rotated"". What is the reason behind that? This is not seen in any other method or dataset. + +Suggestions for improvements: + +1. The paper inherently assumes familiarity with Lie groups/Lie algebra or even exp/log of matrices which I am familiar with, but not all readers will be. Therefore, instead of citing Hall 2015, it would be good to cite Sections within the textbook. This will aid uptake of an important mathematical sub-field. +2. There is no mention of accompanying code in the manuscript. Would the authors consider making it available upon acceptance? It would help further research in this area. +3. The section on linear regression (3.1) seems to occur again in an expanded form in the supplementary material (Sec C). The derivation in the supplementary material was a little bit clearer. +4. It would be worthwhile checking the paper for typos. A few that I noted: larest, wight, ""a too many"" +5. Figure 1 is not easy to interpret. A substantial caption would be beneficial. It's also not referenced in the text. + +Overall comments: +The main contribution of this paper is the development of the theoretical framework for Lie-algebra convolutions. The paper does so very convincingly and I regard this as an important contribution to the area of deep learning. This may open the door to the field using the correct inductive biases for many problems in vision, speech and physics. I enjoyed reading this paper, including the supplementary material that makes a start on imposing orthogonality during regularization for complex-valued neural networks. ",8,4.0,ICLR2021 +Bkgb0FksYS,1,rJeqeCEtvH,rJeqeCEtvH,Official Blind Review #4,"Overview: + +This paper proposes to use semi-supervised learning to enforce interpretability on latent variables corresponding to properties like affect and speaking rate for text-to-speech synthesis. During training, only a few training items are annotated for these types of properties; for items where these labels are not given, the variables are marginalised out. TTS experiments are performed and the approach is evaluated objectively by training classifiers on top of the synthesised speech and subjectively in terms of mean opinion score. + +I should note that, although I am a speech researchers, I am not a TTS expert, and my review can be weighed accordingly. + +Strengths: + +The proposed approach is interesting. I think it differs from standard semi-supervised training in that at test time we aren't explicitly interested in predicting labels from the semi-supervised labelled classes; rather, we feed in these labels as input to affect the generated model output. 
I agree that this is a principled way to impart interpretability on latent spaces which are obtained through unsupervised modelling aiming to disentangle properties like affect and speaking rate. + +Weaknesses: + +This work misses some essential baselines, specifically a baseline that only makes use of the (small number of) labelled instances. In the experiments, the best performance is achieved when gamma is set very high, which (I think) correspond to the purely supervised case (I might be wrong). Nevertheless, I think a model that uses only the small amount of labelled data (i.e. without semi-supervised learning incorporating unlabelled data) should also be considered. + +As a minor weakness, the evaluation seems lacking in that human evaluations are only performed on the audio quality, not any of the target properties that are being changed. For affect specifically, it would be helpful to know whether the changes can be perceived by humans. As a second minor weakness, some aspects of the paper's presentation can be improved (see below). + +Overall assessment: + +The paper currently does not contain some very relevant baselines, and I therefore assign a ""weak reject"". + +Questions, suggestions, typos, grammar and style: + +- p. 1: ""control high level attributes *of of* speech"" +- p. 2: It would be more helpful to state the absolute amount of labelled data (since 1% is somewhat meaningless). +- p. 2: I am not a TTS expert, but I believe the last of your contributions have already been achieved in other work. +- Figure 2: It would be helpful if these figures are vectorised. +- p. 4: ""*where* summation would again ..."" +- Figure 4: Is there a reason for the gamma=1000 experiment, which performs best in (a), not to be included in (b) to (d)? +- Section 5: Table 1 is not references in the text. +- Section 5.1: ""P(x|y,z_s,z_u)"" -> ""p(x|y,z_s,z_u)"" +- In a number of places, I think the paper meant to cite [1] but instead cited the older Kingma & Welling (2013) paper; for instance before equation (6) (this additional loss did not appear in the original VAE paper). + +References: + +[1] https://arxiv.org/abs/1406.5298 + +Edit: Based on the author's response, I am changing my rating from a 'weak reject' to a 'weak accept'.",6,,ICLR2020 +FgnVP1V4EMF,1,Ogga20D2HO-,Ogga20D2HO-,"Interesting work, good ideas and evaluation. Some aspects need to be improved.","In this work, the authors aim to approach the non-iid data issue in FL by allowing for mean of the local client data to be transmitted in addition to the model parameters. I find this work very interesting and the paper well executed. +First, the authors present the logic for MAFL, which encompasses the sending and receiving of other clients' averaged data, followed by FedMix, a method for augmenting the local data-set with the averaged data from other clients. + +Throughout the method section and their experiments, the authors show the benefits of MAFL+FedMix by ablation to other MixUp inspired approaches. + +My issues with this paper are along some different aspects: +Privacy: +Sending statistics of local data is inherently less private than sending model parameters alone. The authors mention this explicitly, but do not go into more detail. I understand that the notion of privacy in FL is a research topic in itself, but I would wish for a more nuanced discussion of the trade-offs here. Throughout the experiment section, the largest 'federation' of devices is N=100 for Cifar100 and Femnist. 
Taking cifar100 as example, each client has 50k/100 = 500 data-points, the average of which I can agree intuitively to be not very informative (at least visually) and the 'discriminative information' that the authors mention, is presumably not very high. However, 500 data-points can still be considered a large amount of data-points for the federated scenario. As the number of data-points per client $n_k$ decreases, the more information about individual data-points is contained in their average. The problem is increased as $M_k>1$ . Further, 'discriminative information' is not the only privacy-worthy information in FL. Differential Privacy, for example, is trying to quantify if an individual data-point is present in a local data-set. Since a client receives a concatenation $(X_g,Y_g) = ({\bar{x}_1,\bar{x}_2,...,\bar{x}_N},{\bar{y}_1,\bar{y}_2,...,\bar{y}_N})$ of all clients' averaged data-sets, an individual client's participation in the training can also not be hidden from other clients. +Furthermore, the formulation in Algorithm 1 implicitly assumes a continual learning setup where clients might be collecting more data as training progresses. In its current formulation, the authors do not mention if the batches are re-computed randomly, opening up the possibility for attacks on the differences between batches across time. + +Computational Burden: +FedMix requires computing gradients through the Taylor expansion (EQ 4), which increases computation and memory requirements. Especially in a federated setting, computation and memory are constrained resources, so I would expect the authors to provide some estimates over the additional requirements for computing gradients $\nabla_w l_{FedMix}$ + +Experimental Evaluation: +I am missing some details on the setup for the FEMNIST dataset. At the moments, the authors mention selecting 100 clients, however I wonder if they used the writer-id or re-shuffled to create a controlled label-skew. If they used the writer-id, how did they select the subset of 100 clients? + +Some details: +I believe in Figure 1 b), the indices above 'Local data' should be $i$, not $j$. +Directly below Figure 1, the sentence should begin with: ""A more practical approach to..."" +Algorithm 2 could be improved, I believe. I see no space constraint that would prevent including some more detailed information analogous to Algorithm 3 in the Appendix. I am assuming the gradient is calculated mini-batch wise. Ideally, the $LocalUpdate$ would receive the same arguments as those that it is being called with on the server side for example. +Top of page 5: 'meshed' -> 'mashed'. +Just above Eq (2): '... client i has access to ...' (remove 'an'). + + +The experiments that would make this evaluation great in my opinion: +Train on the full FEMNIST set of 3600clients including all those clients with very small number of data-points. Then introduce a cut-off-threshold for the minimum number of data-points that each client has to have in order to send its averaged data to the server. Alternatively, add random noise to these averages in relation to how much data is present. There is probably a differential privacy formulation that would make the required noise-level explicit. This noise level or cut-off-point should give more insight on several dimensions of the proposed work: + +- How sensitive is FedMix to different minimum required data-points as a trade-off with privacy. 
+- How sensitive is $\lambda$ to different number of data-points per client (or the consequence of fixed $\lambda$ generally as number of data-points per client differs). Since no other experiment has different number of data-points per client, I believe this to be relevant. + +Additionally, to further increase privacy, the authors might consider (randomly) averaging some of the elements in $(X_g,Y_g)$ before sending the data to clients and study those effects. + +Summarizing, I want to thank the authors for this very interesting read and interesting insights. If the authors provide a more nuanced/detailed discussion of the privacy aspects of their work and extend their experimental section with the more holistic FEMNIST experiment I described above, I will raise my score! I see no violation of the CoE in this work. + +Finally, I cannot believe that the authors let the opportunity slide to name their algorithm 'FedUp' ;) + + +",7,4.0,ICLR2021 +CV7o8VYpqh8,2,7UyqgFhPqAd,7UyqgFhPqAd,Regularized and Constrained Estimation are pretty similar,"This paper studies mean-squared-error bounds for neural networks with small $\ell_1$-norm. The use of $\ell_1$-norm constraint is analogous to the use of LASSO in sparse linear regression. They give a ""mean-squared-error"" bound because they only analyze a fixed-design setting (where the goal is only to analyze the effect of noise) instead of the random-design setting which requires an additional analysis of generalization to fresh data. + +They emphasize that they study the case where the returned network is regularized using an $\ell_1$-penalty (as in eqn (3)) instead of the case where the minimization is explicitly constrained to an $\ell_1$-ball. As the authors say, regularization is more convenient to implement than constrained optimization in practice. They claim that this makes the problem they analyze quite different from analyzing the explicitly constrained version, because the minimization is technically over an unbounded class of functions. However, we know from the basic theory of Lagrange multipliers that the optimizer of the $\ell_1$-regularized version is also the optimal solution under some $\ell_1$-norm constraint. So conceptually, the difference between the regularized and constrained versions is not that important, although analyzing the regularized version creates some extra technical difficulties. + +This paper makes some other overstated claims about the novelty of their result and analysis. The authors claim that their analysis is based off of some new techniques from high-dimensional statistics which have not appeared in the neural net generalization literature. However, their analysis is in fact almost the same as the usual generalization bounds. For example, the ""generalized noise"" (8) which they are concerned with is basically just the empirical Rademacher complexity of a single neuron of the last hidden layer (consider the case where the noise u is Rademacher). + +Overall, I think this work does not provide much fresh insight into generalization bounds for neural networks, and I would tend towards rejection. + +Minor notes: +- Older related work. The idea of analyzing generalization error based on $\ell_1$-norm (which associates with sparsity) can be traced back at least to the paper 'The Sample Complexity of Pattern Classification with Neural Networks: The Size of the Weights is More Important than the Size of the Network' by Bartlett '98. 
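To make the regularized-versus-constrained point above concrete (a standard argument, not specific to this paper), the two estimators are
$$\hat w_{\mathrm{reg}} \in \arg\min_{w} \frac{1}{n}\sum_{i=1}^n \big(f_w(x_i)-y_i\big)^2 + \lambda \lVert w \rVert_1, \qquad \hat w_{\mathrm{con}} \in \arg\min_{\lVert w \rVert_1 \le B} \frac{1}{n}\sum_{i=1}^n \big(f_w(x_i)-y_i\big)^2,$$
and any global minimizer of the penalized problem is also a global minimizer of the constrained problem with $B = \lVert \hat w_{\mathrm{reg}} \rVert_1$; no convexity is needed, since a strictly better feasible point would also improve the penalized objective. The quantitative fixed-design analysis of the two versions can of course still differ.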
",4,4.0,ICLR2021 +MFoll9c0_VQ,2,7FNqrcPtieT,7FNqrcPtieT,"Insufficient experimental confirmation and there are some apparent contradictions in the proposed explanation, notably with MixUp","# Summary + +This paper proposes a theoretical framework for understanding consistency-based semi-supervised learning. While establishing this framework based on the Hidden Manifold Model, this paper frames the SSL in the context of Manifold Tangent Classifiers. + +# Pros + +1. Formal understanding of SSL is indeed currently limited and theoretical works are needed. +2. The formalization of minimizers as harmonic function leads to the non-obvious prediction that SSL methods are insensitive, or at least very robust, to the weighting of the consistency loss $\lambda$. To me, this is the most important result of the paper. +3. The bibliography is well documented. + +# Cons +1. The experimental verification of insensitivity to the weighting of consistency loss is unconvincing: it is done on a trivial low-dimensional toy dataset. It would be more convincing to take the GitHub code of one or more, possibly recent, SSL methods and vary $\lambda$ on real datasets. +2. I fail to see any other takeaway from the theoretical framework than the predicted insensitivity to $\lambda$. +3. Technically, if I understand correctly, the insensitivity to $\lambda$ is not a takeaway of the Hidden Manifold Model but of the Minimizers are Harmonic Functions. +4. I didn’t see where the claim “... demonstrate that the quality of the perturbations is key to obtaining reasonable SSL performances” has indeed been demonstrated. + +# Questions and nits +1. “Several extensions such as the Mean Teacher (TV17) and the Virtual Adversarial Training (VAT) (MMIK18) schemes have been recently proposed and have been shown to lead to state-of-the-art results in many SSL tasks.” They are far from the current state of the art, especially so for few labels, this would have been true before the advent of MixMatch but the state of the art changed drastically with its introduction. +2. “... consider as well the tangent plane Tx to M at x.” Here I have a hard time visualizing it. Say the manifold M is a 2D Gaussian point cloud for example, what would be the tangent plane for a point x in that cloud? +3. Following on question 2, how is the tangent plane property exploited other than by defining an orthonormal basis on it? And why can’t an orthonormal basis of the Manifold space itself be enough if instead we assume it is dense? +4. “Enriching the set of data-augmentation degrees of freedom with transformations such as elastic deformation or non-linear pixel intensity shifts is crucial to obtaining a high-dimensional local exploration manifold.” This seems in direct contradiction with MixMatch results which does not use any sophisticated augmentation: just pixel shift, mirroring and random pixel-wise linear interpolation between samples and labels (MixUp). +5. Proposition 4.1: Here I was hoping for a prediction for your method. You mention the “sequence of processes converges weakly” and I was hoping it would explain why SSL techniques are much slower than fully supervised to converge. But then, it seems this statement is not exploited in any way other than saying that it does indeed converge to the solution of the ODE. +6. 
“This indicates that the improved performances of the Mean Teacher approach sometimes reported in the literature are either not statistically meaningful, or due to poorly executed comparisons, or due to mechanisms not captured by the η → 0 asymptotic.” This is vague, considering Mean Teacher outperforms VAT, would this mean that the last option holds (due to mechanisms not captured by the η → 0 asymptotic)? Just to clarify what the last option actually means: does it mean the proposed analysis relies on assumptions that don’t capture the reality of the phenomenon? +7. “These results are achieved by leveraging more sophisticated data augmentation schemes such as ... Mixup”. It seems odd to see MixUp being referred to as sophisticated (as in domain specific). MixUp is domain agnostic, it simply linearly combines a pair of samples and their labels. So in fact, it seems even simpler than a pixel shift since it’s linear. +8. “... with a neural network with a single hidden layer with N = 100 neurons”. What non-linearity was used? I didn’t seem to find it, or I may have missed it. +9. “Figure 3 (Left) shows that, contrary to several other types of regularizations such as weight-decay, this method is relatively insensitive to the parameter λ”. I didn’t see it. It indeed shows that (a) the method is relatively insensitive to the parameter λ in the context of the toy task but it doesn’t show that (b) the method is sensitive to other types of regularization such as weight-decay. + +Note: Ultimately I must admit a lot of the maths are above my head and I have no idea whether they are correct or not, therefore I didn't comment on them and only focused on the parts that I understand. On the other hand, I'm pretty confident in my understanding of SSL techniques and MixUp. + +=====POST-REBUTTAL COMMENTS======== +I thank the authors for the response and the efforts in the updated draft. Most of my queries were clarified and I raised my rating accordingly. However, unfortunately, I still think a more realistic validation (e.g. on non-toy dataset) would benefit the paper. + +",6,3.0,ICLR2021 +kWq5gD9gBby,3,4TSiOTkKe5P,4TSiOTkKe5P,"Good work, but I have some concerns in terms of experiments and complexity of the method.","This work proposes latent CCM, a causal discovery method for short, noisy time series with missing values. The method checks whether there exists CCM between latent processes of the time series without computing delay embeddings. Empirical results show the proposed method is more accurate in finding the right causal direction in various datasets. + +One of my major concerns is: Is it necessary to learn continuous time latent representation? As time series are not continuous. It would not be useful to go beyond the granularity of the original time series. And solving the problem of missing value does not necessarily require “continuous time latent representation”. Also in the Double Pendulum results, the authors claim the improvement over the multispatial CCM is because the proposed model shares the same parameters across time windows, not because of the continuity of the model. So, it would be interesting to see an ablation study with a model which is based on GRU but without Neural ODEs. + +In 4.3.2, the authors mentioned, “We used 80% of available windows for training and used the remaining 20% for hyperparameter tuning”. It would be better to specify what criterion is used for model selection. 
In addition, one good practice in this kind of causal direction discovery is to anonymize the two time series in training and validation (avoid using the ground truth in model selection), it is not clear if the authors followed this routine. + +It would be interesting to report the computational complexity (time and space) of the proposed method. In addition, it would be appreciated if the authors can also report the runtime comparison between the algorithms. + +It would also be better if the choice of baselines can be more comprehensive. For example, the authors can try to include algorithms beyond the CCM family such as PCMCI [1] and VARLINGAM [2]. + +In terms of writing and presentation, It would be better to make the content self-contained and make the use of terminologies consistent. For example, the authors did not explain terms like ""synchrony"" and ""confounding case"" etc. + + +[1] Runge, Jakob, Peer Nowack, Marlene Kretschmer, Seth Flaxman, and Dino Sejdinovic. ""Detecting and quantifying causal associations in large nonlinear time series datasets."" Science Advances 5, no. 11 (2019): eaau4996. + +[2] Hyvärinen, Aapo, Kun Zhang, Shohei Shimizu, and Patrik O. Hoyer. ""Estimation of a structural vector autoregression model using non-gaussianity."" Journal of Machine Learning Research 11, no. 5 (2010). +",6,4.0,ICLR2021 +yrLTFRF_ZF,1,9Y7_c5ZAd5i,9Y7_c5ZAd5i,Review of A Sharp Analysis of Model-based Reinforcement Learning with Self-Play,"-Summary +The authors consider self-play in tabular zero-sum episodic Markov game. In this setting, the goal is to learn an \epsilon approximate of the Nash equilibrium of the Markov game while minimizing the sample complexity, i.e. the number of episode played by the agent. They present Optimistic Nash +Value Iteration (Nash-VI) that output with high probability a pair of policies that attains an \epsilon approximate Nash equilibrium in O(H^3SAB/\epsilon^2) episodes where H is the horizon, S the number of states, A and B the number of actions for the max-player respectively min-player. This rate matches the lower bound of order \Omega(H^3S(A+B)/\epsilon^2) by Jin et al. (2018) up to a factor min(A,B). They extend this result to the multi-player setting with Multi-Nash-VI algorithm of sample complexity O(H^4S^2\prod_i A_i/\epsilon^2) where A_i is the size of the action space of player i. The authors also provide VI-Zero an algorithm for reward-free exploration of N-tasks in Markov game with a sample complexity of O(H^4SAB log N/\epsilon^2) and a lower bound of order \Omega(H^2SAB/\epsilon^2). + + +-Contributions +algorithmic: Nash-VI (significance: medium) +theoretical: Nash-VI sample complexity of order O(H^3SAB/\epsilon^2) (significance: high) +algorithmic: VI-zero (significance: low) +theoretical: VI-zeros sample complexity of order O((H^4SAB log N/\epsilon^2) (significance: medium) +algorithmic: Multi-Nash-VI (significance: medium) +theoretical: Multi-Nash-VI sample complexity of order O((H^4S \Prod_i A_i/\epsilon^2) (significance: medium) +theoretical: lower bound for the sample complexity of reward-free exploration in Markov game of order \Omega(H^2SAB/\epsilon^2) + +-Score justification/Main comments +The paper is well written. The authors use the same technical tool as Azar et al. 2017 to get a sharp dependence in the horizon H and the state space size S. 
Precisely, they use Bernstein bonuses in combination with the law of total variance in the Markov game to obtain the H^3 dependence, and they only concentrate the empirical transition along the optimal value function to push the S^2 into second-order terms.
I have mixed feelings concerning this paper. On the one hand, I think the contributions deserve to be published; on the other hand, I am not convinced by the proofs. Indeed, the proofs are dangerously close to proof sketches (see the specific comments below) even when they are not presented as such (see the proof of Theorem 6). Even if I think most of the issues are fixable, considering the number of corrections required, I would need to read the updated version to confirm that the results are correct.

The algorithm that attains a dependence on A+B instead of AB is almost adversarial; do you think it is possible to obtain the same result with a model-based algorithm that uses the stochastic assumption, or do you see a fundamental reason why that would be impossible?

-Specific comments
P1: What do you mean by information-theoretic lower bound?

P2, Table 1: It could be interesting to compare these algorithms on a toy example. At least implement Algorithm 1 to show it is feasible.

P3: Please specify what \pi is when you introduce D_{\pi}Q.

P5: Which (estimation) error is the bonus \gamma compensating for, precisely?

P6, Algorithm 1: The definition of the empirical transitions and the bonuses is not clear when N_h(s_h,a_h,b_h) = 0.

P7, comparison with model-free approaches: could you also compare them in terms of computational complexity?

P12, Non-asymptotic […] assumptions: Please clarify what you mean by highly sub-optimal, because in the end Algorithm 1 is also sub-optimal by a factor of min(A,B).

P15, (Approximate CE): I think that \phi \circ \pi is not a good choice of notation since it is not really a composition.

P15, Th 15: Could we, as in the two-player case, deduce a Nash equilibrium by only computing a CCE and thus get a polynomial-time algorithm?

P19, Lemma 19: I do not understand the statement of the lemma: since \bar{V}^k_{h+1}-… is a random variable, you need to specify in which sense the inequality |V|(s) \leq … holds. In particular, considering how you use it in the sequel, it is not almost surely. Furthermore, in the proof you need to deal with the case where N_h^k = 0, and you cannot apply the empirical Bernstein bound from Maurer and Pontil directly, since you have a random number of samples. An additional union bound over the possible values of N_h^k is required, and you need to prove that, conditionally on this choice, the samples are independent.

P19, proof of Lemma 19: you need to specify which event you consider in order to be able to invoke Lemma 18.
P20, top of the page: the two equalities in (10) are wrong; they are inequalities, because of the clipping for the first one and because \hat{p}_h^k (\bar{V} ….) \geq 0 for the second. What do you mean by “with high probability” exactly? And since it is a Hoeffding inequality, the second term in 1/n is not necessary.

P21, proof of Theorem 3: Again, you need to specify which event you consider in order to be able to call Lemma 18 or 19, and to guarantee that this event has probability at least 1-p. Currently, a lot of union bounds are hidden whereas they should be treated properly.
With respect to which filtration are \zeta_h^k and \xi_h^k martingales? 
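(For reference, the standard Azuma-Hoeffding bound being invoked in this part of the proof: if $(M_k)_{k \ge 1}$ is a martingale difference sequence with respect to a filtration $(\mathcal{F}_k)$ and $|M_k| \le c_k$ almost surely, then
$$\mathbb{P}\Big(\Big|\sum_{k=1}^{K} M_k\Big| \ge t\Big) \le 2\exp\Big(-\frac{t^2}{2\sum_{k=1}^{K} c_k^2}\Big),$$
so both the filtration and a valid almost-sure bound on each increment have to be made explicit.)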
+When you apply Azuma-Hoeffding inequality you implicitly imply that you can upper-bound all the \zeta_h^k \xi_h^k by the same constant (moving the bi O outside the sum) but you cannot because they are not necessarily positive. +For the pigeon-hole argument again you should consider the case N_h^k=0 +After (12) it is \sqrt(H^3SABT). + +P22, Lemma 20: same remarks as Lemma 18 and 19. + +P23, Lemma 21: Because of the big O after “these terms […] separately” it seems that the constants in Lemma 21 are wrong. Furthermore, here you apply Hoeffding inequality plus a union bound over an \epsilon-net or control in KL to obtain the H^2\sqrt{S/n} bounds. You need to explain this properly. + +P24, proof of Theorem 4: same remark as for the proof of Theorem 3. For (i) it is not only the Law of total variation but also because of the Hoeffding Azuma inequality ( and in Azar et al. it is only proved for stationary transition). + +P27, proof of Theorem 6: since you only prove the theorem for H=S=1 you should only state this case and keep the general case as a conjecture. +S=H=1 in Markov game (where there is nothing to learn) is not equivalent to a matrix game. What do you mean exactly by reward-free matrix game? The agent does not observe the rewards, but in this case, there is nothing to learn, no? I do not think it easily generalizes to the Markov games setting. Could you also obtain the factor \log(N) in the lower bound?",4,5.0,ICLR2021 +ByxBHkb3FS,1,HJxhUpVKDr,HJxhUpVKDr,Official Blind Review #3,"This paper proposes a novel soft parameter sharing Multi-task Learning framework based on a tree-like structure. The idea is interesting. + +However, the technique details, the experimental results and the analysis are not as attractive as the idea. The proposed method is a simple combination of existing works without any creative improvement. Furthermore, comparing with the MTL baseline, the experimental performance of proposed method does not get obvious improvement while the computation cost increasing significantly. Besides, there is not enough analysis about the idea this paper proposed. The intuition that more similar tasks share more parameters probably cannot always ensure the improvement of MTL. +",3,,ICLR2020 +pJorKl0ypul,3,65sCF5wmhpv,65sCF5wmhpv,"The paper approaches an interesting problem and proposes a simple, yet reasonable, approach. Unfortunately, the evaluation fails to provide a clear perspective on the potential impact of the proposed approach.","= Overview = + +The paper proposes a reinforcement learning algorithm that enables an agent to ""fine tune"" the quality/accuracy of its sensors to its current task. The paper considers a partially observable MDP setting where the agent, besides the control actions, is endowed with a set of ""tuning actions"" that control the noise in the perception of the different components of the state. Additional reward terms are introduced that discourage the use of ""tuning"". By enabling the agent to fine tune its perception to the current task, the paper seeks to also investigate the relative importance of different state features in terms of the task. + += Positive points = + +The paper is well written and the ideas clearly presented. The ideas seem vaguely related with recent work on ""tuning"" MDPs [a] and some older work on learning state representations in multiagent settings [b,c], where the agents are allowed to ""pay"" to have better models or perceptions. 
The paper proposes the use of similar ideas in a completely different context - to identify relevant information state information in POMDP settings. + += Negative points = + +My main criticism is concerned with the particular domains considered, which I believe are too structured to provide a clear understanding of the potential impact of the proposed approach. + += Comments = + +I believe that the problem considered in the paper is interesting and follows some recent work on ""tuning"" MDPs (see ref[a] below). The approach explored is quite simple but that is not an inconvenient per se. My main criticism lies in the fact that -- in my understanding -- the domains selected are too structured to provide really interesting insights. + +In particular, all domains considered are classical control problems with essentially deterministic dynamics and full observability. The approach in the paper injects artificial additive noise in the state as perceived by the agent (the paper only provides explicit information regarding the noise in the Mountain Car domain, but I'm assuming that is similar in the other domains). + +Now I may be missing something, but it seems to me that, from the agent's perspective, this is equivalent to adding noise to the dynamics of the environment, since the agent treats the observations as state. Therefore, from the agent's perspective, the practical effect of the ""sensor tuning"" is to actually attenuate the noise in the dynamics, which partly explains the results provided. This renders this work particularly close to those on MDP tuning referred above, and more discussion in this direction would be appreciated. + +I think that the paper would greatly benefit from considering richer domains, either where partial observability is a central issue -- such as those from the POMDP literature -- or with richer perceptual inputs --- such as those from game domains. + += References = + +[a] A. Metelli, M. Mutti, M. Restelli. ""Configurable Markov Decision Processes."" Proc. 35th Int. Conf. Machine Learning, pp. 3491-3500, 2018. + +[b] F. Melo, M. Veloso. ""Learning of coordination: Exploiting sparse interactions in multiagent systems."" Proc. 8th Int. Conf. Autonomous Agents and Multiagent Systems, pp. 773-780, 2009. + +[c] Y. De Hauwere, P. Vrancx, A. Nowé. ""Learning multi-agent state space representations."" Proc. 9th Int. Conf. Autonomous Agents and Multiagent Systems, pp. 715-722, 2010.",5,3.0,ICLR2021 +ofnyVOKyhC,2,I6-3mg29P6y,I6-3mg29P6y,A review,"### 1. Brief summary: +The authors look at the Hessian related measures of flatness such as the spectral norm, trace and the Frobenius norm. They ask the question of whether flatness at the end of training is a meaningful metric of generalization and answer no. To to do that they use a simplified model and a mathematical derivation and several numerical experiments where they evaluate the flatness measure for networks trained with varying amounts of L2 generalization, showing that solutions with lower flatness can generalize better. + +### 2. Strengths +* The question the authors are asking is interesting and very relevant. +* The introduction is very good in clarifying prior work, the reasons for why some people believe flatness is relevant and the assumptions it rests on. +* The range of architectures and datasets tested is pretty good. +* The paper does both theory and empirical experiments, which is a great sign and gives more gravity to the theoretical claims. 
+* The paper likely dispels some preconceptions that presumably a part of the community holds about the role of flatness. +* I like that you notice that the regularized Hessian has eigenvalues offset by the regularization strength -- a small thing, but made my appreciate the level of care you take. + + +### 3. Points of my confusion + +#### a) The validity of the shift motivation for flatness as a relevant measure. +In the intro in page 1 you mention that the original motivation for considering flatness is the assumption that the train and validation minima are offset by delta w. This is not a weakness of your paper, I would just like to know why is it that people assume this would be the case? + +#### b) I do not get the Gedanken experiment. +I am not sure I understand the point of the Gedanken experiment. Why is the loss exponential dependent on the output of the net? Where is the target y for the (X,y) pair? Are you just trying to make the output of the net as small as possible and exponentially punishing the deviations from that? I get that the you get the X W1 W2 ... WL polynomial structure there the same way you'd get in normal cases, but what is the motivation for this experiment? And what is the argument here? Is it that for L -> 0 the polynomial w1w2w3 must go to -infty, and in the expression ww exp(wwwX) the exponential decay kills off the powerlaw growth, taking the Trace -> 0? I am just generally confused about this setup and what it shows us. If you wouldn't mind clarifying this, I'd be grateful. + +#### c) How general is the derivation in Section 3? +I see that you are using the fact that the weights go to infinity for the loss to go to 0. How general are the results you derive here for the Loss = epsilon. To my mind that is actually way more interesting because as I rightly say practically we're not going to get to L=0 anyway. I see that you derive that as L->0 so do the second derivatives, but what about L = epsilon << log(# number of classes)? + +### 4. Relevant papers that could be of interest +[1] I found a paper that looks at the Trace(H) and the ration Trace(H) / |H|_F and links this to the weight space norm of the network. It seems very relevant to what you are doing here. The paper is The Goldilocks zone: Towards better understanding of neural network loss landscapes by Stanislav Fort, Adam Scherlis (AAAI 2019, https://arxiv.org/abs/1807.02581). They link this measure to the easy of optimization on random low-dimensional sections of the weight space. + +[2] There is a nice overview of the structure of the Hessian in Traces of Class/Cross-Class Structure Pervade Deep Learning Spectra +by Vardan Papyan (https://arxiv.org/abs/2008.11865). In your spectra you seem to be observing (C-1) outlying eigenvalues as observed / expected in many papers. + +[3] Emergent properties of the local geometry of neural loss landscapes by Stanislav Fort, Surya Ganguli (https://arxiv.org/abs/1906.04724) forms a model of the Hessian based on the logit gradient clustering properties and also looks at measures of flatness in that regime and its dependence on the weight norm. + +### 5. Summary +Overall I like the paper. I am a bit unsure about the generality of the derivations and the gedanken model, but the experiments show that at least in some cases flatness does not correlate well with generalization. I think it is important to appreciate the role of the weight norm and given that (possibly) some do not, this is a useful addition to that effort. 
I'm not clear on whether this insight is original (I am ready to be corrected) but if it were I think this is a valuable paper.",6,3.0,ICLR2021 +AlA5s0LC9cr,3,H38f_9b90BO,H38f_9b90BO,interesting paper but not good enough,"The paper proposes a robust training algorithm for graph neural networks against label noise. The authors assume the labeled nodes are divided into two parts, clean part without noise and train part with some noise. The proposed method contains two parts. Firstly, it leverages label propagation (LP) trained on the clean nodes to assign pseudo labels on train nodes with noisy labels. Secondly, the authors design a learnable weight \lambda to learn the label for those noisy nodes where LP does not agree with the original labels. The final graph neural network is trained with clean nodes, high confidence train nodes, and uncertain train nodes with learned labels. The authors conduct experiments on four graph datasets with manual injected noise and one real-world noisy dataset to validate the proposed method. + +The paper studies an important problem of graph neural networks with noisy label. I have several concerns for the paper. +(1) The idea of using LP (or another algorithm different from GNN) to create pseudo labels for uncertainty nodes is not a new idea (e.g., in [1]). Actually, the LP, original data and GNN are ensembled and the final label comes from the majority vote. When LP agrees with the original data (in noisy part), those labels are retained. +[1] Li, Qimai, Zhichao Han, and Xiao-Ming Wu. ""Deeper insights into graph convolutional networks for semi-supervised learning."" arXiv preprint arXiv:1801.07606 (2018). +(2) Why the joint learning of \theta with GNN parameters is named meta-learning is not very clear. Algorithm 1 shows both \theta and \w_t are fixed to estimate labels for uncertain nodes. Then those parameters are updated with the estimated nodes. Besides, in the experiment part there is no tuned \lambda compared (random is not good, a wrong \lambda can hurt the performance). This makes the improvement from the learnable \lambda less trustful. +(3) The experiments are less convincing because only one real-world noisy dataset is available. Because the authors have a strong assumption that both noisy and clean labels exist, it would be better if the authors can validate such assumption with some real-world data. The reported numbers show the improvement of proposed method is rather limited, I would suggest run the methods more times and report mean performance with variance. Results with manually tuned \lambda should be reported. Besides, since GNNs are good for learning on graphs with few labeled nodes (such as the gcn paper actually only use several labeled nodes per class), I would suggest adding some additional experiments for the size of D_clean. How will the rest RL + learnable \lambda contribute w.r.t the number of clean nodes? +",4,5.0,ICLR2021 +RRCxNpya2qY,1,uys9OcmXNtU,uys9OcmXNtU,Interesting architecture but evaluation should be more thorough to compare with the literature," + +This paper aims at improving accuracy of multi horizon univariate time series forecasting. The authors propose an encoder-decoder attention-based architecture for multi-horizon quantile forecasting. The model encodes a distinct representation of the past for each requested horizon. + +#### Strong points + ++ The encoding of the holidays and special event specific to the time series is elegant. ++ The idea of explicitly using forecast error feedback is interesting. 
++ Ablation study of the architecture's innovations on a large scale forecasting dataset mainly outlines the importance of the horizon specific component of the architecture. + +#### Weak points + +- The evaluation of the new method is conducted on only two public datasets and the new methods outperform TFT only on one. There exists other common datasets (DeepAR evaluates on three-parts and traffic, TFT evaluates on traffic and volatility) that should be considered to place this method in the literature. +- Both manuscripts of Fine and Foster (2020 al b) are not published and I could not find them online. The summary of Fine and Foster (2020 a; b) in Section 2.3 is not enough for me to judge the relevance of the results in Fig 2 and 3. +- The strong claims at the end of the introduction are made compared to the ablated model (MQCNN) on the private dataset, not against alternative methods such as MQRNN or DeepAR which could scale to this dataset size. + + +#### Decision + +I would tend to reject this submission. The proposed model exhibits none of: significant quantitative improvement (on public dataset), speed up, improved simplicity over alternative methods. + +Additionally to the weak points mentioned above, the contribution of this paper is undermined by the following points: +- The contribution of “Positional Encoding from Event Indicators” is rather incremental considering BERT encoding of input segments (e.g. Sentence A, Sentence B, Question, Answers). +- The Horizon-Specific encoding is interesting but does not allow to forecast at inference a horizon that has not been trained on (as parametric method such as DeepAR have). +- The code of the submission is not submitted and there is no mention of future release. + +#### Questions + +- Is your positional encoding method a superset of relative positional encoding? +- Can you provide runtimes of the MQTransformers? How does it compare to (Lim et al. 2019) ? +- In appendix B: What is the percentage of unseen object in the test/valid dataset that would be harder to forecast by TFT? +- Why do you use the large scale private dataset for the ablation of your method and not a public dataset? Are the architectural innovations only useful in the high data regime? + +#### Additional feedback + +- Figure 4 in appendix D should be mentioned in the main text as it is helpful for the reader. +- Typo in Table 1, eq (4), third line: $c_{t,1}^{a}$ should be $c_{t,1}^{hs}$ I believe. +- Section 3.3: what do you call a *bidirectional* 1-D convolution? (As opposed to unidirectional?) +- Caption of Figure 3: Try to be consistent on the short name of retail vs Favorita dataset. +- Eq (2): Consider providing the factorised form which is easier to parse. + + +--- +**Update:** I would like to thank the authors for their answer. I acknowledge the improvement of the manuscript after the review process: + +- Added clarifications on the baselines, +- Added helpful precisions for reproducibility (even though the code cannot be open sourced), +- Evaluation on 2 requested public datasets with good results. + +However, I am still not confident to raise my score to 6 (marginally above the acceptance threshold) given the missing public manuscripts (or Appendix) to explain the martingale diagnostic tools. 
This hinder one of the main selling point of the paper that their model reduces the volatility of the forecast.",5,3.0,ICLR2021 +SJt3bbKgz,1,Syhr6pxCW,Syhr6pxCW,Nice approach on conditional image generation,"Overall I like the paper and the results look nice in a diverse set of datasets and tasks such as edge-to-image, super-resolution, etc. Unlike the generative distribution sampling of GANs, the method provides an interesting compositional scheme, where the low frequencies are regressed and the high frequencies are obtained by ""copying"" patches from the training set. In some cases the results are similar to pix-to-pix (also in the numerical evaluation) but the method allows for one-to-many image generation, which is a important contribution. Another positive aspect of the paper is that the synthesis results can be analyzed, providing insights for the generation process. + +While most of the paper is well written, some parts are difficult to parse. For example, the introduction has some parts that look more like related work (that is mostly a personal preference in writting). Also in Section 3, the paragraph for distance functions do not provide any insight about what is used, but it is included in the next paragraph (I would suggest either merging or not highlighting the paragraphs). + +Q: The spatial grouping that is happening in the compositional stage, is it solely due to the multi-scale hypercolumns? Would the result be more inconsistent if the hypercolumns had smaller receptive field? + +Q: For the multiple outputs, the k neighbor is selected at random? +",8,4.0,ICLR2018 +Skx-rp2qnQ,3,H1fU8iAqKX,H1fU8iAqKX,Interesting contribution to V1 modeling,"In this interesting study, the authors show that incorporating rotation-equivariant filters (i.e. enforcing weight sharing across filters with different orientations) in a CNN model of the visual system is a useful prior to predict responses in V1. After fitting this model to data, they find that the RFs of model V1 cells do not resemble the simple Gabor filters of textbooks, and they present other quantitative results about V1 receptive fields. The article is clearly written and the claims are supported by their analyses. It is the first time to my knowledge that a rotation-equivariant CNN is used to model V1 cells. + +The article would benefit from the following clarifications: + +1. The first paragraph of the introduction discusses functional cell types in V1, but the article does not seem to reach any new conclusion about the existence of well-defined clusters of functional cell types in V1. If this last statement is correct, I believe it is misleading to begin the article with considerations about functional cell types in V1. Please clarify. + +2. For clarity, it would help the reader to mention in the abstract, introduction and/or methods that the CNN is trained on reproducing V1 neuron activations, not on an image classification task as in many other studies (Yamins 2014, etc). + +3. “As a first step, we simply assume that each of the 16 features corresponds to one functional cell type and classify all neurons into one of these types based on their strongest feature weight.” and “The resulting preferred stimuli of each functional type are shown in Fig. 6.“ +Again, I think these statements are misleading because they suggest that V1 cells indeed cluster in distinct functional cell types rather than form a continuum. However, from the data shown, it is unclear whether the V1 cells recorded form a continuum or distinct clusters. 
Unless this question is clarified and the authors show the existence of functionally distinct clusters in their data, they should preferably not mention ""cell types"" in the text. + +Suggestions for improvement and questions (may not necessarily be addressed in this paper): + +4. “we apply batch normalization” +What is the importance of batch normalization for successfully training the model? Do you think that a sort of batch normalization is implemented by the visual system? + +5. “The second interesting aspect is that many of the resulting preferred stimuli do not look typical standard textbook V1 neurons which are Gabor filters. ” +OK but the analysis consists of iteratively ascending the gradient of activation of the neuron from an initial image. This cannot be compared directly to the linear approximation of the V1 filter that is computed experimentally from doing a spike-triggered average (STA) from white noise. A better comparison would be to do a single-step gradient ascent from a blank image. In this case, do the filters look like Gabors? + +6. Did you find any evidence that individual V1 neurons are themselves invariant to a rotation? + +7. The article could be more self-contained. There are a lot of references to Klindt et al. (2017) on which this work is based, but it would be nice to make the article understandable without having to read this other article. + +Typo: Number of fearture maps in last layer + +Conclusion: +I believe this work is significant and of interest for the rest of the community studying the visual system with deep networks, in particular because it finds an interesting prior for modeling V1 neurons, that can probably be extended to the rest of the visual system. However, it would benefit from the clarifications mentioned above.",7,4.0,ICLR2019 +NLAfW2ioxlH,3,iEcqwosBEgx,iEcqwosBEgx,Interesting method but can be more convincing,"Summary: This paper proposed a method to leverage the constrained optimization for policy training to learn diverse policies given some references. Based on a diversity metric defined on policy divergences, the paper employs two constrained optimization techniques for this problem with some modifications. Experiments on mujoco environments suggest that the proposed algorithms can beat existing diversity-driven policy optimization methods to learn both better and novel policies. Generally, the paper is well-written and easy to follow. Some concerns/comments: + +* The state distributions of proposed CTNB and IPD are different: In the CTNB method, the trajectories will keep rollout until they reach some termination conditions such as time limit or failure behavior. However, in the IPD method, if the cumulative novel reward is below some thresholds, then the trajectories will be truncated. It will be helpful to compare the CTNB with that extra termination condition. + +* Using the divergence of policies to quantify the difference between policies seems not a very innovative metric. Some related work could be: + +Hong, Z. W., Shann, T. Y., Su, S. Y., Chang, Y. H., Fu, T. J., & Lee, C. Y. (2018). Diversity-driven exploration strategy for deep reinforcement learning. + +It will be great if the authors can compare and explain the relationship between the proposed metric and some related ones. + +* The experiments can be more convincing if more locomotion environments are included, especially some higher-dimensional environments such as Humanoid and HumanoidStandup. 
Also, some other environments with a long-term/sparse reward setting can be more illustrative such as some mazes or Atari games. For some of those games, since it is stage-based, the IPD might terminate some rollouts if all reasonable policies are similar at the beginning of the trajectory. For a maze example, all good policies should choose to open the door at the beginning and then behave diversely. + +Other/Minor Comments: + +* The choice of r_0 can affect the performance: When sequentially training the policy, should r_0 be adjusted when training each new policy? + +* It can be more interesting if some visualization of hopper policy diversity is included.",6,3.0,ICLR2021 +BJlNf4D827,1,rye4g3AqFm,rye4g3AqFm,not surprising," +The authors make a case that deep networks are biased +toward fitting data with simple functions. + +The start by examining the priors on classifiers obtained by sampling +the weights of a neural network according to different distributions. They do this +in two ways. First, they examine properties of the distribution +on binary-valued functions on seven boolean inputs obtained by +sampling the weights of a small neural network. They also empirically compare +the labelings obtained by sampling the weights of a network with +labelings obtained from a Gaussian process model arising from earlier +work. + +Next, they analyze the complexity of the functions produced, using +different measures of the complexity of boolean functions. A +favorite of theirs is something that they call Lempel-Ziv complexity, +which is measured by choosing an arbitrarily ordering of the +domain, writing the outputs of the function in that ordering, +and looking at how well the Lempel-Ziv algorithm compresses this +sequence. I am not convinced that this is the most meaningful +and fundamental measure of the complexity of functions. +(In the supplementary material, they examine some others. +They show plots relating the different measures in the body +of the paper. None of the measures is specified in detail in the +body of the paper. They provide plots relating these complexity +measures, but they don't demonstrate a very close connection.) + +The authors then evaluate the generalization bound obtained by +applying a PAC Bayes bound, together with the assumption that +the training process produces weights sampled from the distribution +obtained by conditioning weights chosen according to the random +initialization on the event that they fit they fit the training +data perfectly. They do this for small networks and simple datasets. +They bounds are loose, but not vacuous, and follow the same order +of difficulty on a handful of datasets as the true generalization +error. + +In all of their experiments, they stop training when the training +accuracy reaches 100%, where papers like https://arxiv.org/pdf/1706.08947.pdf +have found that continuing training past this point further improves test +accuracy. The experiments all use architectures that are +quite dissimilar to what is commonly used in practice, and +achieve much worse accuracy, so that a reader is concerned +that the results differ qualitatively in other respects. + +I do not find it surprising that randomly sampling parameters +of deep networks leads to simple functions. + +Papers like the Soudry, et al paper cited in this submission are +inconsistent with the assumption in the paper that SGD samples +parameters uniformly. + +It is not clear to me how many hidden layers were used for the +results in Table 1 (is it four?). 
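As a concrete illustration of the Lempel-Ziv-style measure described above, a minimal sketch of how such a score can be approximated is given below; this is my own illustration, using off-the-shelf zlib compression as a stand-in for whatever exact LZ variant the authors use, and it is not their code.

```python
import zlib
import numpy as np

# Crude complexity proxy for a boolean function on n_inputs bits: enumerate all
# inputs in one fixed (arbitrary) order, write the outputs as a byte string, and
# measure how well zlib compresses it.
def lz_complexity_estimate(f, n_inputs=7):
    xs = [[(i >> b) & 1 for b in range(n_inputs)] for i in range(2 ** n_inputs)]
    outputs = bytes(int(f(x)) for x in xs)
    return len(zlib.compress(outputs, 9))

# A simple function compresses well; a random truth table does not.
rng = np.random.default_rng(0)
random_table = rng.integers(0, 2, size=2 ** 7)

print(lz_complexity_estimate(lambda x: 0))            # constant function: small
print(lz_complexity_estimate(lambda x: x[0] ^ x[1]))  # two-bit parity: still small
print(lz_complexity_estimate(
    lambda x: int(random_table[sum(b << k for k, b in enumerate(x))])))  # random: larger
```

The point is only that a constant or low-order function yields a highly compressible output string, while a random truth table does not.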
+ +I did find it interesting to see exactly how concentrated the +distribution of functions obtained in their 7-input experiment +was, and also found results on the agreement of the Gaussian process +models with the randomly sampled weight interesting, as far as they +went. Overall, I am not sure that this paper provided enough +fundamental new insight to be published in ICLR. +",4,4.0,ICLR2019 +HJfuJGINx,3,ryAe2WBee,ryAe2WBee,not very convinced,"The paper proposes a semantic embedding based approach to multilabel classification. +Conversely to previous proposals, SEM considers the underlying parameters determining the +observed labels are low-rank rather than that the observed label matrix is itself low-rank. +However, It is not clear to what extent the difference between the two assumptions is significant + +SEM models the labels for an instance as draws from a multinomial distribution +parametrized by nonlinear functions of the instance features. As such, it is a neural network. +The proposed training algorithm is slightly more complicated than vanilla backprop. The significance of the results compared to NNML (in particular on large datasets Delicious and EUrlex) is not very clear. + +The paper is well written and the main idea is clearly presented. However, the experimental results are not significant enough to compensate the lack of conceptual novelty. + + +",4,4.0,ICLR2017 +1j76PJ57v7,2,ipUPfYxWZvM,ipUPfYxWZvM,A simple modification of Transformer that consistently improve its performance without parameters/computation drawback.,"This paper aims at improving the transformer architecture by reordering the encoder/decoder layers. Improving the performance of transformer is an active area of research and has numerous applications. The authors first show that a single layers' ordering is not better than all others and further demonstrate that per instance layer ordering with parameter sharing consistently improve overall performance (which is quite surprising). Other works considered per instance early return (depth) but I am not aware of per instance layer reordering. + +#### Strong points ++ The layer order decision is made on simple sentence embeddings which does not add significant computation compared to the transformer. ++ The weights are shared between layers of different reordering which does not increase the number of parameters. ++ The authors conduct extensive evaluation on Neural Machine Translation, Code Generation and Summarization and show consistent improvement of their method (IOT). ++ They show that this method is not a form of ensembling/experts (which was my main concern), IOT is orthogonal to ensembling (Section 5.3). ++ The same reordering trick can be applied to other multi-layer architectures. + +#### Weak points + +- The different layer orders process the samples differently and require to split a batch into smaller batches for each ordering. +- Why some samples are better processed by a specific layer ordering remains not understood. + +#### Decision + +I tend to accept this paper as the method is novel and the results are good. The method can be applied without strong drawbacks in term of computation or number of parameters. + +#### Questions + +- Table 15: I am surprised that inference time is not affected by the batching per order (see weak point), did you apply some tricks like masking to be more GPU friendly? +- Can you show the proportion of each ordering? 
(for instance IOT (N=6) on IWSLT14) I am curious if it is balanced and if the decisions made by the classifier are consistent with the best performing order in Table 1 when each transformer order is trained separately.",7,4.0,ICLR2021 +DfnIh_mRH-,5,I-VfjSBzi36,I-VfjSBzi36,Unclear experimental results,"Summary: + +The authors propose a technique for reducing the computational requirements of training BERT early in training to reduce the overall amount of resources required. + +Pros: + +The paper is well written and clear for the most part. The authors do thorough experimental evaluation. + +Cons: + +I have two primary concerns about the paper and the proposed technique. +1. The positioning of the technique is not entirely clear to me. The authors pitch it as a technique for reducing the training time of BERT and use LayerDrop as a baseline technique that also removes network components. However, it feels like another baseline that should be considered is neural architecture search, which also seeks to automatically find a more efficient model to train. The difference here is that the authors find the model early in the training run, but it seems like the EarlyBERT procedure could be run once and the resulting model architecture could be saved and re-trained like NAS models are. +2. I found the experimental results to be lacking detail and breadth necessary to establish the value of the technique. Firstly, the rough time estimates in Table 2 are very odd given the primary value of the proposed technique is to reduce training time. The accuracy of EarlyBERT is close enough to LayerDrop that accurate training cost numbers are needed to differentiate between the techniques. Secondly, quoting training time reductions over the dense baseline when the EarlyBERT mode does not achieve the same accuracy makes the comparison very difficult to make. This problem shows up quite commonly in the model compression literature [1] and I’d encourage the authors to show full accuracy-training time tradeoff curves so that the training time savings for a given accuracy can be more clearly established. Lastly, I found the use of reduced training epochs in EarlyBERT to be odd because you do not evaluate whether or not this can be done for the baseline models and there isn’t clear evidence as to why your model would be able to do this while others (e.g., DropLayer) cannot. Figure 2 also does not seem to corroborate that higher learning rates can be used with shorter training time to achieve better accuracy. The data in your figure shows that the best learning rate achieves the best model quality independent of the number of training epochs. + +References: +1. https://arxiv.org/abs/2003.03033 +",3,4.0,ICLR2021 +8JoPOQ4wVhK,2,fAbkE6ant2,fAbkE6ant2,"Insightful, some shortcomings in empirical evaluation","The paper proposes a strategy for training feed-forward networks in a more memory-efficient manner by employing local as opposed to end-to-end supervision. End-to-end/global (E2E) supervision as the dominant paradigm in training deep networks considers a loss function at the very end of the network for backpropagation of the resulting gradients, whereas local supervision injects supervisory signals (such as the same E2E objective, e.g. classification) at intermediate layers in the network. 
The benefit of such intermediate supervision is the ability to train larger networks in smaller chunks, piece by piece, where each individual training step is more memory efficient due to the reduced need to store activations (and weights and biases) in GPU memory. As a drawback, however, it has been shown earlier that such local training is less optimal than global training in terms of the achievable generalization performance. The authors propose a new training strategy that aims at combining the memory efficiency of local supervision and piecewise training with the error performance of global training. Considering a given intermediate layer, the paper proposes to maximize the mutual information between the activations in this layer and the input signal to retain relevant information, while minimizing the mutual information between the activations and a nuisance variable, where the nuisance is defined as having no mutual information with the target variable (e.g. the classification prediction). The authors argue that this local supervision allows training the features at the intermediate layer such that they carry relevant information from the input to the target variable without resorting to direct supervision with the target variable. Direct computation of the nuisance variable is infeasible, and the authors propose a bounded approximation.
Empirical results are discussed on five common vision datasets and two CNN architectures in relation to the existing state of the art.

### Strengths
**[S1]** The paper is largely written well and addresses a relevant problem. Contributions and claims are laid out well.

**[S2]** The method has potentially multiple benefits: (a) more memory-efficient training without loss of performance, (b) speedups for asynchronous training, and (c) the use of larger batch sizes or larger models.

**[S3]** The method is motivated reasonably well by the argument of minimizing the loss of mutual information with the input variable while minimizing the mutual information of nuisance variables with respect to the target quantity.

**[S4]** The inclusion of Greedy SL+ to account for the additional degrees of freedom of the proxy networks in InfoPro shows good attention to detail in the empirical evaluation.

**[S5]** The paper looks at a reasonable set of tasks and studies the performance on relevant data, for instance ImageNet and Cityscapes semantic segmentation (but consider [W1] below).

### Weaknesses:
**[W1]** Baselines of empirical comparisons: with the exception of Fig. 3, the empirical results are compared only to a comparatively simple baseline (Greedy SL/SL+), but not to the state of the art mentioned in Section 4.1 (DGL, DIB, BoostResNet). This seems a rather odd omission, particularly as Fig. 3/Section 4.1 elaborate on those alternative methods. It would support the paper's case to either include their results in Table 2 or elaborate on their absence.

**[W2]** Fig. 3: the scale of the y-axis is chosen in a way that a casual reader may visually misinterpret the absolute difference in error between the shown methods. For instance, at first glance DGL at K=2 seems twice as bad as InfoPro, while in actuality it is only about 14% worse (the error goes from 7.76% to ~8.8%).

**[W3]** A methodological concern is the choice of $\phi$ and $\psi$ and the limited discussion around their choice (Section 3.3, App. 
E): What is the sensitivity of the optimization with respect to their size and structure? What happens if I make them larger or smaller, and how small can I make them? What would they look like for objectives other than classification?

### Further comments:
**[C1]** It would be insightful to consider what the implications for networks with recurrent structures would be.

**[C2]** It took a while to infer in Section 3.3 that $\mathcal{R}$ is the reconstruction of the input data. A short sentence there may help future readers go through more smoothly.

I feel that I have learned something from the paper. It discusses the contributions, motivations and method reasonably thoroughly. My concerns are largely around the empirical substantiation of the claims, see [W1].",6,3.0,ICLR2021
rypb6tngM,1,S1ANxQW0b,S1ANxQW0b,The paper presents an interesting new algorithm for deep reinforcement learning which outperforms state of the art methods. ,"The paper presents a new algorithm for inference-based reinforcement learning for deep RL. The algorithm decomposes the policy update into two steps, an E-step and an M-step. In the E-step, the algorithm estimates a variational distribution q which is subsequently used in the M-step to obtain a new policy. Two versions of the algorithm are presented, using a parametric or a non-parametric (sample-based) distribution for q. The algorithm is used in combination with the retrace algorithm to estimate the Q-function, which is also needed in the policy update.

This is a well-written paper presenting an interesting algorithm. The algorithm is similar to other inference-based RL algorithms, but it is the first application of inference-based RL to deep reinforcement learning. The results look very promising and define a new state of the art of deep reinforcement learning in continuous control, which is a very active topic right now. Hence, I think the paper should be accepted.

I do have a few comments / corrections / questions about the paper:

- There are several approaches that already use a combination of the KL constraint with reverse KL on a non-parametric distribution and subsequently an M-projection to again obtain a parametric distribution, see HiREPS, non-parametric REPS [Hoof2017, JMLR] or AC-REPS [Wirth2016, AAAI]. These algorithms do not use the inference-based view but the trust region justification. As in the non-parametric case the asymptotic performance guarantees from the EM framework are gone, why is it beneficial to formulate it with EM instead of directly with a trust region on the expected reward?

- It is not clear to me whether the algorithm really optimizes the original maximum a posteriori objective defined in Equation 1. First, alpha changes at every iteration of the algorithm, while the objective assumes that alpha is constant. This means that we change the objective all the time, which is theoretically a bit odd. Moreover, the presented algorithm also changes the prior all the time (in order to introduce the 2nd trust region) in the M-step. Again, this changes the objective, so it is unclear to me what exactly is maximised in the end. Would it not be cleaner to start with the average reward objective (no prior or alpha) and then introduce both trust regions just out of the motivation that we need trust regions in policy search? Then the objective is clearly defined.

- I did not understand whether the additional ""one-step KL regularisation"" is obtained from the lower bound or just added as an additional regularisation? Could you explain?
+ +- The algorithm has now 2 KL constraints, for E and M step. Is the epsilon for both the same or can we achieve better performance by using different epsilons? + +- I think the following experiments would be very informative: + + - MPO without trust region in M-step + + - MPO without retrace algorithm for getting the Q-value + + - test different epsilons for E and M step + + +",7,5.0,ICLR2018 +thd5yZa83t6,3,GIeGTl8EYx,GIeGTl8EYx,"The paper presents a shallow graph sampler combined with a deep graph neural network framework to train large graphs. Overall, this method seems well-motivated, and both theoretical and empirical results support their claims. There are few poins on which clarification from the authors would be helpful. ","This paper proposes a new extension of GNNs to deep GNNs, which use subgraphs to keep the computational costs low for training large graphs. It addresses the two main reasons that GNNs have not previously been extended to deep GNNs: expressivity and computational cost. Increasing the number of layers in a GNN leads to averaging over more nodes, which in turn collapses the learned embeddings. The paper claims that using shallow graphs instead of the full graphs avoids this oversmoothing issue. Additionally, using the full graph is computationally expensive since the neighborhood sizes grow with the number of neighbors. Using shallow subgraphs instead allows the size of the neighborhoods to remain constant as the number of layers increase. To this end, the paper presents SHADOW-GNN, a Deep GNN with shallow sampling. They extend this framework to GraphSAGE and GAT models and show that it improves performance over the original model with a lower computational cost. Overall, this method seems well-motivated, and both theoretical and empirical results support their claims. There are a few points on which clarification from the authors would be helpful. + +Strengths: +++ The paper presents 3 motivations - (1) shallow neighborhood is sufficient to learn graph representation (2) it is necessary to reduce over smoothing issues (3) One still needs a deep GNN model to learn effectively form the shallow neighborhood - and it supports these claims by providing examples and formal proofs in the form of Proposition 3.1, Theorem 3.2, and Theorem 3.3, respectively. + +++ The paper recommends using two main samplers for sampling the shallow neighborhoods of a node - (1) $k$-hop sampler, which randomly selects $b$ neighbors and (2) Personalized PageRank (PPR) sampler, which uses the induced subgraph from the largest PPR scores $\pi$ for a node. According to the paper, both these methods are lightweight and scalable + +++ The method is applied to extend GraphSAGE and GAT models and the paper presents empirical results for 5 different datasets. The results are presented in terms of classification performance (F1-score) and computational cost (Inference cost). Benchmarked against 5 baseline models (including a subsampling algorithm), the SHADOW extension gives SOTA performance at a reduced computational cost. + +++ The ablation study in the paper, further demonstrates that using an ensemble of subgraphs improves performance and is feasible using the SHADOW framework. + +++ The paper is well written and does a good job of putting the work in context and motivating the problem. + +Weaknesses: +-- The hyperparameters of the sampling algorithms, while mentioned in the Appendix tables, are not included in the tuning descriptions. 
I am curious to know if and how these hyperparameter choices affect the performance-cost tradeoff. + +-- Is there a performance-cost tradeoff for the subgraph ensemble setting suggested by the paper? + +-- Description of the inference cost calculation would be useful. + +-- While the paper mentions two other extensions - SHADOW-GCN and SHADOW-GDC, the results do not include them. Is there a reason for that? + +Minor comments: +- Labeling Figure 1 with $v$ and $v'$ would make the example much clearer +- The transitions between the theorems and the discussions are sometimes hard to follow. More connections between notation and interpretation would be helpful. +- The “budget” term in the Appendix tables has not been connected to the hyperparameters of the sampling algorithms + +",7,3.0,ICLR2021 +ryeSKamyoB,3,BJeXaJHKvB,BJeXaJHKvB,Official Blind Review #4,"The proposal is an adapted batch normalization method for path regularization methods used in the optimization of neural networks. For neural networks with Relu activations, there exits a particular singularity structure, called positively +scale-invariant, which may slow optimization. In that regard, it is natural to remove these singularities by optimizing along invariant input-output paths. Yet, the paper does not motivate this type of regularization for batchnormalized nets. In fact, batch normalization naturally remedies this type of singularity since lengths of weights are trained separately from the direction of weights. Then, the authors motivate their novel batch-normalization to gradient exploding (/vanishing) which is a completely different issue. +I am not sure whether I understood the established theoretical results in this paper. Let start with Theorem 3.1: I am not sure about the statement of the theorem. Is this result for a linear net? I think for a Relu net, outputs need an additional scaling parameter that depends on all past hidden states (outputs). Theorem 3.2 and 4.1 do not seem informative to me. Authors are saying that if some terms in the established bound in Theorem 4.1 is small, then exploding gradient does not occur for their novel method. The same argument can be applied to the plain batchnorm result in Theorem 3.2. For me, it is not clear to see the reason why the proposed method remedies the gradient exploding (/vanishing). +",3,,ICLR2020 +rkgZtVmCtS,2,SkeATxrKwH,SkeATxrKwH,Official Blind Review #2,"Update: I thank the auhors for the detailed response. The updated paper does look more convincing, but my main concern remains - all considered datasets and tasks are synthetic and specifically made for the method. I think experiments on more real data would be crucial to show the potential of the method. Some examples I can imagine would be part segmentation in images or 3D models, or robotic control. + +--- + +The paper proposes an unsupervised approach to learning objects and their parts from images. The method is based on the ""Attend, Infer, Repeat"" (AIR) line of work and adds a new hierarchy level to the approach, corresponding to object parts. The method is evaluated on two custom synthetic dataset: one composed of simple 2D geometric shapes and another one with 3D geometric shapes. On these datasets the method successfully infers the object-part structure and can parse and reconstruct provided images. + +While the general topic of the paper is interesting, I do not think it is fit for publication. 
First and foremost, the experiments are very incomplete: the method is only evaluated on two custom synthetic datasets and is not compared against any baselines or ablated versions of the method. Moreover, the proposed approach seems like a relatively minor modification of SPAIR (Crawford & Pineau).

Pros:
1) The topic is interesting and the method generally makes sense.
2) The method works on the tasks studied in the paper, including relatively challenging scenarios with 3D occlusions or ambiguity in assigning parts to objects.
3) The paper shows generalization to a number of objects different from that seen during training.

Cons:
1) Experiments are limited:
1a) The method is evaluated on custom and relatively toy datasets. It is unclear if it would apply to more practical situations. While I agree that at the high level the question of decomposing objects into parts is interesting, it is still important to connect research in this direction to potential practical applications. Where could the proposed method be used? Perhaps in some control tasks, such as robotics? Could it be applied to more realistic data, for instance ShapeNet objects?
1b) There are no comparisons to baselines. One might argue that baselines do not exist since there are no methods addressing the same task. Generally, it is the job of the authors to come up with relevant baselines to show that the proposed model actually improves upon some simpler methods. One variant of obtaining baselines is via ablating the proposed model. Another is by taking prior approaches, for instance object-based ones without the part decomposition, and showing that the part decomposition allows improving upon those in some sense.

2) The novelty is somewhat limited: the method seems like a relatively straightforward extension of SPAIR by adding another hierarchy layer. This might be sufficient if the experimental results were strong, but given that they are not, it becomes somewhat concerning. If the method does indeed include significant technical innovation, it might be helpful to better highlight it.

3) The related work overview is extremely limited. It is the authors' job to provide a comprehensive overview of prior literature and I cannot do it for them here, but below are a few papers that come to mind. This is by no means a complete list.

[1] Zhenjia Xu, Zhijian Liu, Chen Sun, Kevin Murphy, William T. Freeman, Joshua B. Tenenbaum, Jiajun Wu. Unsupervised Discovery of Parts, Structure, and Dynamics. ICLR 2019.
[2] Shubham Tulsiani, Hao Su, Leonidas J. Guibas, Alexei A. Efros, Jitendra Malik. Learning Shape Abstractions by Assembling Volumetric Primitives. CVPR 2017.
[3] Jun Li, Kai Xu, Siddhartha Chaudhuri, Ersin Yumer, Hao Zhang, Leonidas Guibas. GRASS: Generative Recursive Autoencoders for Shape Structures. SIGGRAPH 2017.
[4] Gopal Sharma, Rishabh Goyal, Difan Liu, Evangelos Kalogerakis, Subhransu Maji. CSGNet: Neural Shape Parser for Constructive Solid Geometry. CVPR 2018.
[5] Adam R. Kosiorek, Hyunjik Kim, Ingmar Posner, Yee Whye Teh. Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects. NeurIPS 2018.

4) Presentation is at times suboptimal:
4a) It would be very helpful to have more visuals and intuitions about the functioning of the method, as opposed to equations. Equations are definitely good to have, but they are not the easiest to parse, especially by those not intimately familiar with this specific line of work.
+4b) It is quite unclear to me how and why is the memory used in the model. If this is described in another paper, it would be useful to point there, but still briefly summarize in this paper to make it self-contained.",3,,ICLR2020 +6yWC4oa4iVt,2,HWqv5Pm3E3,HWqv5Pm3E3,"I appreciated the first new part, but I felt disappointed of the second part previously proposed by others, the experiment results and the inadequate ablation studies.","In the work, the authors focus on tackling the problem of source free domain adaptation. The proposed method mainly has two parts, in which the second is nearly the same as the SHOT-IM as in Liang et al., 2020 [1], while the first part aims at coping with this problem from a new perspective to align the distribution of target features extracted by the fine-tuned encoder to that of source features extracted by the pre-trained encoder. To achieve this, they utilize batch normalization statistics stored in the pre-trained model to approximate the distribution of unobserved source data. + +Pros: +1. Tackling the problem of source free domain adaptation from a new perspective of aligning the distribution of target features extracted by the fine-tuned encoder to that of source features extracted by the pre-trained encoder, specifically, the BN statistic. They also provide a roughly promising theoretical analysis. To me, the first part is new, elegant, and interesting. +2. This paper is well-written and crystal-clear, making it enjoyable to read. + +Before reading the second part of information maximization loss proposed by Liang et al. 2020 [1], which also tackles the source free domain adaptation, I really want to accept this paper. However, after reading the remaining part, the drawbacks of this paper are too obvious to be ignored. My major concerns can be concluded as follows: + +1. The second part of information maximization loss is nearly the same as that of SHOT-IM as in Liang et al., 2020 [1], resulting in a limited novelty of the whole paper. To me, the second part should at least provide some insights or some different perspectives to make the technique novelty enough for a top-tier conference. +2. Meanwhile, the experiments section also makes me a little disappointed. Neither state-of-the-art results nor comprehensive ablation studies are seen. First, in most domain adaptation tasks, the performance of the proposed method is obviously lower than SHOT Liang et al., 2020 [1], and Model adaptation (Li et al., 2020) [2]. For a new paper that meets the bottom line of a top-tier conference, at least little improvement should be seen in most tasks. +3. Further, since the proposed method consists of two parts, some basic ablation studies should be conducted to verify the effectiveness of both parts. Without taking apart the whole method, we will never know how much improvement each part contributes. +4. Another concern is that the performance sensitivity experiment is only conducted on a simple task SVHN→MNIST, which is too weak to draw a conclusion that the proposed method keeps almost the same within the wide range of the value of λ, specifically 0.1 ≤ λ ≤ 50. + + +[1] Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In International Conference on Machine Learning, 2020. + +[2] Rui Li, Qianfen Jiao, Wenming Cao, Hau-San Wong, and Si Wu. Model adaptation: Unsupervised domain adaptation without source data. 
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9641–9650, 2020. +",4,4.0,ICLR2021 +ryx2YYTqh7,2,HkeKVh05Fm,HkeKVh05Fm,"Interesting task of multi-grained NER, reasonable models. "," + +Authors propose the “Multi-grained NER (MGNER) task” which aims at detecting entities at both coarse and fine-grained levels. Authors propose a Multi-grained Entity Proposal Network (MGEPN) which comprises (1) a Proposal Network that determines entity boundaries, and (2) a Classification network that classifies each proposed segment of an entity. + +The task is primarily tested against the proposed method itself. The proposed method does outperform traditional sequence-labeling baseline model (LSTM-LSTM-CRF), validating the proposed approach. When the proposed model (trained with extra MG data) is evaluated on the traditional NER task (on test sets), however, no significant improvement is observed -- I believe this result is understandable though, because e.g. MG datasets have slightly different label distributions from original datasets, hence likely to result in lower recall, etc. + + +The task studied is interesting, and can potentially benefit other downstream applications that consume NER results -- although it seems as though similar tasks have been studied prior to this study. The novelty of the proposed architecture is moderate - while each component of the model does not have too much technical novelty, the idea of separating the model into a proposal network and a classifier seems to be a new approach in the context of NER (that diverges from the traditional sequence labelling approaches), and is reasonably designed for the proposed task. + +The details for creating the MG datasets is missing - are they labeled by human labelers, or bootstrapped? Experts or crowd-sourced? By how many people? Will the new datasets be released? Please provide clarifications. + +The proposed approach does not or barely outperform base models when tested on the traditional NER task -- the proposed work thus can be strengthened by better illustrating the motivation of the MGNER task and/or validating its efficacy in other downstream tasks, etc. + +Authors could provide better insights into the new proposed task by providing more in-depth error analysis - especially the cases when MG NER fails as well (e.g. when coarse-grained prediction predicts a false positive named-entity, etc.) +",5,3.0,ICLR2019 +GN46_RX6O81,3,zM6fevLxIhI,zM6fevLxIhI,review 2893,"This paper proposes a unified method to combine spatial and channel attention in a probabilistic framework, so that spatial and channel attention weights and probabilistic variables can be jointly optimized. The proposed method is incoporated in the proposed VISTA-Net to achieve state of the art performance in two dense pixel-wise prediction tasks: monocular depth prediction, semantic segmentation. + +Overall this work is well motivated and organized in a good shape, so that I am on the positive side. I have some concerns as listed below. + +1. My main concern is that this paper combines ideas of (Fu et al., 2019) and (Xu etal., 2017a), i.e., combining spatial and channel attention from (Fu et al., 2019) and Attention-Gated CRFs from (Xu etal., 2017a). + +2. In Eq.1, the attention tensor is limited to T rank. However, it is not clear to me how this can faciliate the inference in Eq.9. + +3. In Eq.2, this paper proposes to additionally model CRF kernels. However, it is not explained why this is necessary. 
This is important as it is listed as a difference from the existing method (Xu et al., 2017a).

4. The approximation used in Eq. 3 needs more details or a supporting reference.

Minor issues.
1. There are a lot of symbols in section 2. It would be better to draw a figure that includes these symbols, while illustrating how the CRF formulation is incorporated in the attention mechanism.
2. The joint learning in section 2.2 could be presented as an algorithmic procedure, so that others can see it more clearly.",6,4.0,ICLR2021
VtRbRkYH7vF,3,#NAME?,#NAME?,Missing references - Not clear what is the improvement over existing architecture ,"I have increased my score to reflect the revisions

---------

This paper presents yet another architecture for fully connected RNNs or infinitely deep networks based on the integration of a continuous-time dynamical system, where a projection of the weights is used to guarantee stability, hence a fixed point and a finite Lipschitz constant.

Positives: The maths presented in the paper is correct and their results are better than the considered (non-exhaustive) baselines.

Negatives:

1) This idea has been widely explored and exploited by now. Adding a linear term and using a different solver is not enough, in my opinion, to make an innovative contribution.

If you wish to present this as an ablation study, then perhaps you need to benchmark against existing solutions.

For instance, in this (missing) reference a very similar network is presented and analysed.

@incollection{NIPS2018_7566,
title = {NAIS-Net: Stable Deep Networks from Non-Autonomous Differential Equations},
author = {Ciccone, Marco and Gallieri, Marco and Masci, Jonathan and Osendorfer, Christian and Gomez, Faustino},
booktitle = {Advances in Neural Information Processing Systems 31},
pages = {3025--3035},
year = {2018},
publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/7566-nais-net-stable-deep-networks-from-non-autonomous-differential-equations.pdf}
}

I will refer to this as [1]. While that paper was about unrolling the stable RNN to generate a deep Lipschitz classifier, and was not used for sequence-to-sequence tasks, the architecture you propose and the claims are so similar to [1] that a direct comparison is warranted.

In the above paper, stability projections are presented for an architecture that is essentially the same, minus the additional linear component here. Paper [1] should be included as a baseline. You have it already implemented for fully connected layers.

2) It is not very clear how this additional linear component would help in practice, as your architecture is fundamentally discretised with forward Euler, which results in yet another generalised ResNet. It has been shown that ResNets work much better than their predecessors, Highway networks, because of the direct skip connection and better gradient flow.
While the missing reference [1] (NAIS-Net) preserves that connection, it feels like your linear term would get rid of the skip connection and prevent the technique from being used in very deep networks or on very long sequences due to vanishing gradients.

3) What is the motivation for using a continuous-time approach? Your results are invalidated by the forward Euler scheme unless a very small hyperparameter epsilon is introduced. Why not just compute the projection in discrete time as done in [1]?
Your stability will not hold for sparse in time data-points, because the Euler step would become too big and this is effectively an RNN that cannot handle different sampling time while preserving stability. + +4) It seems strange to compare to NODEs as they are meant to be used for something else. In particular, NODEs are designed to work without inputs but just by estimating the initial condition for the ODE and then ""unrolling"". +If you add input signals to the NODE, which I guess is what you mean by ""NODE RNN"", how do you train it with the adjoint method? This should be made very clear in the paper. + +5) Your results are limited to a fully connected architecture, while [1] has shown a method to have a Lipschitz RNN for convolutional layers. Can you generalize to that as well? + +I don't feel the contribution here is relevant enough to be included in the conference. + +If the above points are clarified in a convincing way and both the theoretical justification and the ablations performed with respect to [1], then I could consider improving my score. +",5,4.0,ICLR2021 +B1gxS8ARFS,2,S1xLuRVFvr,S1xLuRVFvr,Official Blind Review #2,"This paper proposes a visualization method for deep metric learning, which derived by analyzing the inner product of two globally averaged activations. The proposed method can generate an overall activation map which highlights the regions contributing most to the similarity. Also, it can generate a partial activation map that lights the regions in one image that have significant activation responses on a specific potion in the other image. The authors also analyzed the linearly of the fully connected layers and global max pooling. These contributions make the applicability of CAM to many CNN architectures. Further, the metric learning architecture is extended to Grad-CAM map, and the problem of Grad-CAM map is pointed out. To the best of my knowledge, these contributions are novel, and derivations seem to be correct. + +Experiments on weakly-supervised localization, model diagnosis, and the applications of the proposed decomposition model in cross-view pattern discovery and interactive retrieval are promising. + +Overall, this paper is well written, and contributions are good. + +Minor problems. +In Sec.1 and Sec.2.2, the authors wrote the Grad-CAM has been used for visualization of re-ID (Gordo & Larlus (2017)). However, this paper seems to be not the works of Grad-CAM nor re-ID. + +In my understanding, Decomposition+Bias is a more accurate model than Decomposition. +In the experiments of the Sec5.1 and 5.2, the performances of Decompsotion+Bias are lower than Decomposition. However, there are no explanations for this reason. +",8,,ICLR2020 +NMItdzk6Yuy,3,EGVxmJKLC2L,EGVxmJKLC2L,"Review of ""Learning not to Learn"" for ICLR 2021","**Summary.** The authors investigate the question of when the optimal behavior for an agent is to learn from experience versus when the optimal behavior is to apply the same (memorized) policy in every scenario. They begin by introducing a simple bandits environment wherein they derive the optimal policy and identify regimes in which it involves memorization vs. learning. Then they train an RL^2 agent and verify that it behaves as expected in these regimes. Next, they expand their approach to a slightly more complicated gridworld environment which does not have an analytic solution to the question. The agent behaves as expected in the gridworld environment. 
+ +**Strong points.** This paper tackles a novel question which is fundamental to the field of metalearning. By carefully analyzing the two regimes of learning and memorization in the context of metalearning, this paper will increase awareness about the fact that the two regimes exist. The paper is clearly written and does an excellent job of putting experiments in the context of past ML research. The experimental setup is simple but goes straight to the heart of the issue. Figures and text do a good job of analyzing results and communicating them to the reader. Overall, this paper was an interesting read. + +**Weak points.** The idea that “sometimes memorization is best and other times learning is best” does border on the obvious. Indeed, as soon as the authors derive their analytical solution, it becomes clear that we can expect the RL^2 agent to learn the same behavior. For me, there were no surprises in the experimental sections. To the authors: was there anything that was surprising or not obvious to you? What additional information can the experiments tell us, apart from confirming theoretical predictions? + +Having said that, I also believe that very simple, well-executed research ideas sometimes make the best papers. This paper appears to be one of those cases. And even though the ideas are simple, they are significant and they are not a major part of the dialogue in the meta-learning community yet. So even if the ideas seem obvious, I think there is value in communicating them well. + +I have one concern about the bandit task setup: the authors adjust $\sigma_l$, the width of the Gaussian from which they are sampling the reward, as a proxy for aleatoric uncertainty and hence task complexity. In doing so, they essentially equate “stochasticity of the environment” with “task complexity.” And yet, there are many other ways in which a task can be complex. Sometimes, all the information needed to perform a task is present and yet the task is difficult to solve because one needs to interpret/integrate the information in a particular way. This is why, for example, puzzles are considered difficult tasks. It is also why simulating the 3-body problem is a complex task. To the authors: can you clarify what you mean by “task complexity”? + +In the closing paragraph of the paper, the authors claim that their approach “allows us to study the emergence of inductive biases in biological systems” but this claim is not supported by the rest of the paper, which makes almost no connections to biological systems. There are certainly ways in which these results are relevant to learning in biological systems, but the authors did not explore them in this paper, and so this claim is not well supported. In the same paragraph, they bring in contrasting notions of Darwinian and Lamarkian inheritance. Since they do this in one sentence -- the last sentence -- it is hard to understand what their claim is. And it was not clear that this was one of the main takeaways of the paper, as these concepts do not appear anywhere else in the paper. If the authors want to draw these conclusions, then they should add additional discussion on these topics. Otherwise, they risk misleading readers. + +One additional minor suggestion would be to invert the color scale of Figure 6, as “white -> red” signifies values of increasing size in all preceding plots, but in Figure 6 it currently signifies values of decreasing size. 
+ +Minor grammatical suggestions +-- “the question which aspects of behavior“ -> “the question of which aspects of behavior“ +-- When typing quotes in LaTex, use `` and ‘’ instead of “” so as to make them open & close correctly +-- “interplay of the agent’s lifetime,” -> “interplay between the agent’s lifetime,” +-- “We numerically show” -> “We show numerically” +-- “as well as explicit models of memory” -> ”and explicit models of memory” (same issue occurs later) +-- “the agents does not have” -> “the agent does not have” + +**Recommendation.** 6 : Marginally above acceptance threshold + +**Reasoning.** This paper is well written and the experimental setup is simple, well-executed, and produces results that are relevant to the main question of the paper. The main question of the paper -- when does it make more sense to learn vs. memorize a behavior -- is significant to ICLR and to the field of machine learning. There are a number of relatively minor weaknesses (as described above) but this is overall a nice paper and would be a good contribution to ICLR 2021.",6,4.0,ICLR2021 +RAclMCQYkeW,1,W3Wf_wKmqm9,W3Wf_wKmqm9,Review,"The paper highlights a problem in existing goal-reaching RL agents, in that they do not explicitly allow for trading off speed (how fast you reach the goal) and reliability (how often you reach the goal). While this tradeoff is implicitly determined by the discount factor in training, the paper asserts that in practice the ability to more flexibly determine this during inference is more desirable. Given this shortcoming of existing work, the paper then proposes ""C-Learning"" which learns a policy conditioned on both a goal and a desired horizon (h) -- i.e., a time limit on the policy. The paper presents favorable results of C-Learning on a few simulated domains compared to existing goal-reaching RL agents. + +Strengths: + +-- The proposed problem in standard goal-reaching agents is convincing. + +-- I appreciate the level of detail in the algorithmic description along with the pseudocode. It was mostly easy to follow, except in a few small places (see weaknesses). + +-- The experiments appear to have been performed carefully, with a helpful demonstration of the trade-off enabled by the proposed algorithm. + +Weaknesses: + +-- I found the small section about ""Horizon Independent Policies"" very confusing. For example, in the definition of M, since we know C* is increasing with h, does the max_h just reduce to setting h=infty. And if so, isn't the maximal value simply either 0 or 1 (measuring reachability)? And how is this max efficiently computed? More generally, why is this specific gamma-conditioned behavior policy necessary? + +-- The various algorithmic details in 3.1 are worrying, as they suggest that a number of things beyond the basic C-Learning paradigm are necessary in practice. While I understand that this is unavoidable in any algorithm, some of these algorithmic details are exceedingly specific to the domain. For example, ""C-function clipping"" relies on knowing reachability in the environment, which is arguably as difficult as learning a goal-reaching policy in many cases! + +-- The introduction mentions evaluating on domains from Nachum 2018, Peng 2018, and Zhang 2020. However, as far as I can tell, none of these domains are actually evaluated on? If you're not going to evaluate on these domains, I suggest removing this sentence from the paper. 
+ +-- While I am convinced of the problem the paper claims to solve, I am not convinced that C-learning is necessarily the best solution. I can imagine a number of other approaches which may perform better, worse, or about the same. For example, why not simply learn a policy pi(a|s, g, gamma) for all possible gamma in [0, 1)? Or alternatively, why not use one of the risk-sensitive policy learning approaches in the safe RL or constrained MDP literature (e.g., Lyapunov Safe RL)? + +-- Learning a horizon-conditioned Q function was also proposed in TDM (https://arxiv.org/abs/1802.09081). How does C-learning relate and compare to this existing technique?",6,4.0,ICLR2021 +4NLO22ZbHwT,1,IUYthV32lbK,IUYthV32lbK,Difficult to Follow," +This manuscript provides proofs for certified robustness for ensembles and proposes a new approach called Diversity Regularized Training (DRT) based on the theoretical findings. In addition to the standard loss term, DRT contains two regularization terms: (i) gradient diversity (GD) loss, and (ii) confidence margin loss (CM). These loss terms encourage the joint gradient difference for each model pair and large margin between the true and runner-up classes for base models. The authors discuss some theoretical findings in detail and demonstrate the performance of DRT on several datasets. I find the work valuable in the sense that it nicely combines ideas of ensembling and certified robustness. However, it was difficult to follow all the theoretical results and how they motivated the main finding (DRT) was not clear. Below are my comments/questions: + +- I want to thank the authors for nice summary in Appendix B1 and B2 on certified robustness and their map to ensembles. + +- Theorem 1, which serves as the foundation for the paper, assumes that either best prediction or the runner-up prediction is true class for any base model. I wonder how realistic this assumption is. + +- Also in theorem 1, I am confused that both f and y have index i. The number of base models does not need to match the number of classes. + +- In proof for Theorem 2, it could be helpful to mention each step. For instance, it was not clear to me where Lagrangian reminder was applied. Also, I recommend proving necessary and sufficient conditions separately. + +- In theorem 3, N is assumed to be 2. Is it possible to extend this to N > 2? So, discussion on such this could be helpful. + +- In the first paragraph on page 5, I do not follow which equation was meant in RHS. + +- These statements on page 5 are not clear to me: ""... and leads to higher certified ensemble robustness."" and ""Thus, increasing confidence margins can lead to higher ensemble robustness."" + +- For GD loss, why only pairwise summations were considered? Isn't it easier to look at overall diversity of gradients? + +- All pair-wise computation of quantities in regularizer terms should come with some computational complexity. It would be nice to include a discussion on this. + +- Results for Salman et al., 2019 on Table 2 do not match the paper of Salman et al., 2019 and imagenet results are not convincing. I understand ImageNet data can take too long but should we worry that DRT requires more hyper-parameter tuning? Some discussion on this would be helpful.",5,2.0,ICLR2021 +C4imHvOcWH,3,UAAJMiVjTY_,UAAJMiVjTY_,Review,"Summary. + +In this paper, the authors have presented a framework that combines meta-interpretive learning and the abductive learning of neural networks. 
The high-level idea is to formulate a unified probabilistic interpretation of the entire algorithm so that both the inductive logic programming module and the neural network modules can be trained jointly from data. The authors have demonstrated the application of the proposed algorithm to learning arithmetic operations and sorting operations by looking at input-output mnist digits. + +Comments. +The key idea of the paper has been presented clearly. The authors demonstrated two tasks: cumulative sum/product, and sorting. Both tasks require learning recursive rules, and the bogosort task requires predicate invention. These are challenging tasks for both neural networks and ILP algorithms. + +However, my major comments about the paper is that the experiment sections are relatively weak and they have definitely missed some important baseline comparisons. Concretely, taking the cumulative summation task as an example, the MetaAbd model has very strong inductive biases, because of the builtin ""add"" operation and the metarules built into the system, which strongly favors recursive rules of specific forms. However, at least the ""add"" operation was not built into other baselines. + +Second, there have also been many other works trying to solve this task: +- partial ILP (Evans & Grefenstette, 2018) and machine apperception (Evans et al., +2019) that can learn mnist digits with much weaker assumptions: they can even learn the ""succ"" relationship between digits. +- Neural GPU (Kaiser and Sutskever 2015) that can learn to add multi-digit numbers without any builtin ""add"" operations. +- Differentiable Neural Computer (https://deepmind.com/blog/article/differentiable-neural-computers) +- Neural Programmer-Interpreters (Reed et al 2015) and its follow-ups: they support integrating human-written primitive functions (such as the ""add"" operation) with neural networks. +The authors are encouraged to make comparisons with these methods as well. + +Third, the learned logic rules are relatively simple. This makes me less convinced about the applicability of the paper. The authors have made very strong claims in the abstract/intro about ""To the best of our knowledge, MetaAbd is the first system that can jointly learn neural networks and recursive first-order logic theories with predicate invention."" For example, partial ILP and machine apperception can do that, too. Recently, there have also been other trials on using relational neural networks for bridging perception and rule learning, such as, +- Graph Neural Networks (https://arxiv.org/abs/1806.01261) +- Neural Logic Machines (https://arxiv.org/abs/1904.11694) + +Overall, I think this paper is not matching the publication standard of ICLR. + +Minor: +Please change the latex formatting of the model name. There is currently an extra space between M and e.",4,5.0,ICLR2021 +N3su8ujqA09,1,dx4b7lm8jMM,dx4b7lm8jMM,An Efficient Representation of Sequences by Low-Rank Tensor Projections,"The paper presents a method to map static feature space to a space containing sequences of different lengths. The idea is worth of interest and the appendix gives large amount of information on both theoretical and experimental sides of the work. + +The experiments are the main drawback for me of the paper. Indeed, the experiments for sequence discrimination are not convincing enough due to the lack of datasets and framework employed (only the multivariate time series classification problem), and the results obtained. 
The main results achieved during the TSC classification task are located with moderate or high prior (> 0.6). Moreover, the experiments are conduced with both small models and datasets. + +The experiment for sequential data imputation for time series and videos is not clear, and the way of writing and giving explanation is quite confusing for the reader. Since the appendix is large, more details from the appendix in the paper itself will help the reader to more easily follow the idea. + +Overall, the experiments have to be larger and considering trickier tasks such as language processing related experiments (speech recognition, language modeling, etc.) that consider sequence mapping to extract robust features. +",5,4.0,ICLR2021 +R7iSCVPAks,3,D3PcGLdMx0,D3PcGLdMx0,"Good paper, proper experimental evaluation.","### Summary +This paper proposes a way to exploit relationships across tasks in episodic training with the goal of improving the trained models who might be susceptible to poor sampling in for few-shot learning scenarios. The proposed model consists of two components: a cross-attention transformer (CEAM) which is used to observe details across two episodes, and a regularization term (CECR) which imposes that two different instances of the same task (which have the exact same classes) are consistent in terms of prediction. Cross-attention is computed via a scaled-attention transformer using both support and query set. The consistency loss is a knowledge distillation that imposes an agreement on the two episodes. The soft target is chosen among the two predictions selecting the classifier with the highest accuracy. +### Considerations: +- I like the idea of exploiting the information across tasks to improve the performance of episodic meta training. This is an interesting direction that should might definitely help disambiguate in the case of poor sampling. +- The ablation study is accurately performed giving the impression of a careful examination of the components of the model proposed. +- I'm not sure the authors can claim sota results: here some of the latest models that perform best on mini-imagenet https://paperswithcode.com/sota/few-shot-image-classification-on-mini-1 I would prefer to restate the contribution as an improvement of x% over the baseline. It is obvious that sota performance requires higher capacity models such as dense-net. I think that other experiments are needed in order to make the claim of achieving sota, otherwise, if the claim is changed, I'm satisfied with the experiments. +- I suggest the authors moving algorithm 1 in the main paper, maybe replacing the verbose description of each step with actual formulas and pseudocode. +- I think there is still room for improvement on the manuscript. The paper might be a good contribution to the scientific community, but I'll wait for the authors' response on my doubts before my final decision. + +### Questions: +- Q1: Why only considering tasks with the same classes for consistency? Why not considering also partial overlapping of classes across tasks? I guess it is only for simplicity, but it might be beneficial to consider other types of relationships. +- Q2: It is not exactly clear how the meta-test evaluation is performed. I understand that during training you always consider a pair of episodes that are used to transform the features, but how does this translate at meta-test time? Do you always need a pair of episodes? 
My guess is no, but not using the transformer should change the distribution of the features at test-time and I don't find it trivial to see how this is taken into account. Maybe I'm just misreading the paper. I suggest the authors clarifying this point in the paper. +",6,5.0,ICLR2021 +lQsimEZqETL,4,onxoVA9FxMw,onxoVA9FxMw,Interesting Take on Comparing Position Embeddings,"This paper proposes a formal framework to compare position embeddings (PEs) and presents an empirical study comparing variants of absolute position embeddings (APEs) and relative position embeddings (RPEs) on three properties: 1) monotonicity, 2) translation invariance, and 3) symmetry and evaluates their performance on classification (GLUE) and span prediction tasks (SQUAD). The authors also report results on learnable sinusoidal APEs and learnable sinusoidal RPEs, PE variants which had not been previously proposed. + +The first three properties seem well-motivated (monotonicity and translation invariance), but it is not obvious that symmetry should be a property of an ideal PE, or at least the paper is not convincing on this front. In a sentence (ABCD), doesn’t the word A typically have a different relationship to B than B does to A? + +The identical word probing test was a clever way to disentangle the impact of the word from that of the PEs. + +While it does seem valuable to more rigorously compare PEs as they are critical components of SOTA language models, the experimental results were not particularly convincing (e.g. although it’s a very appealing story, it didn’t seem so clear from the tables that APEs did better at classification and that RPEs did better at span prediction.) + +The writing quality was borderline and there were a number of small errors: +- fully-learnable APEs nearly meet all properties even under no constrains” -> “constraints” +- Under the equation at the beginning of Section 3.1, “word-word correspondence” is repeated four times, which I am sure was not the intention. +- nit: “since relative distance with the same offset will be embedded as a same embedding.” -> “the same embedding” +- “compared to far-way” -> “faraway” +- “attends more on forwarding tokens than backward tokens” -> “forward tokens”? +- nit: “In Transformer, where attention calculation does not…” -> “where the attention calculation” +- “allows PEs o better perceive word order” -> “to”",6,4.0,ICLR2021 +2at9Wr4q_qW,2,otuxSY_QDZ9,otuxSY_QDZ9,Presentation of the papers needs improvement ,"Summary: +This paper proposes a novel module on top of ConvNet, multi-layer dense connectivity, for learning hierarchical concepts in image classification. + +Pros: + +This paper proposes to use the label hierarchy (with ancestor concepts from a label) instead of the label itself to learn the image recognition system. To achieve this, it has made two major contributions: +1. Building label hierarchy with a simplified set of categories, to remove the redundant and meaningless categories +2. With the constructed label hierarchy, this paper proposes a dense connectivity module to leverage the label hierarchy to model category abstractions over high-level visual embedding, on top of commonly used convolutional neural networks. + +With the proposed techniques, this paper builds up its recognition system using two standard deep ConvNets and achieved strong results on large-scale image recognition benchmarks. + +Cons: + +1. 
In general, the paper is not very well written for a few reasons: A) The motivation of the proposed method over previous methods is not clear (intro paragraph#2). B) Section 3.1 is very hard to follow. C) Some notations in Section 3.2 seems unnecessary and there are things being used before it is formally defined. + +2. The design of this dense connectivity module in Section 3.2 seems quite arbitrary, there is no good explanation on why we need to use the z to multiply the output x and h. + +3. Experiments of naturally adversarial examples are not motivated earlier in the introduction. It's quite hard for me to understand why using a label hierarchy would improve this task. + + +Detailed Comments: + +1. Paragraph#2 in Intro: why training a neural network as multinomial/softmax logistic regression from images to labels can not acquire a comprehensive knowledge about the input entity? For instance, in some of the prior works (e.g. Hu et. al. 2018), they learn models to simultaneously classify categories on a predefined label hierarchy, including both abstracted classes such as ""Dog"" and concrete class such as the ""English Setter"". +2. It seems that from Section 3 on, it uses the term ""Category"" to stand for the leaf concept (most specific) and the term ""Concept"" as the shorthand of ` ""Ancestor Concept"". It would be better to mention this explicitly to avoid confusion. +3. Example in Figure 2 is not very clear and hard to follow. It might be better to simplify the figure by using a smaller hierarchy as an example. Also, it would be good to have a paragraph in section 3.1 to describe what in the right figure has been modified using the concrete examples of Figure 2. +4. Equation 1, why do we need a \psi activation function which is linear? What it means by linear, is there an additional linear weight in \psi besides v? +5. Why are we using an MSE for the concept classifiers? I assume we can use binary cross-entropy for them? + + +Minor: +* The aspect ratio of Figures 2 and 3 need to be adjusted. It is hard to recognize text and symbols on the stretched figures. +* Notation \hat{h} in the text is bolded but the ones in the equation (1) is not bolded +* A recent work that also leverages hierarchical information in the label text to learn visual concept embeddings, which is closely related to the topic of this paper: Learning to Represent Image and Text with Denotation Graph. EMNLP 2020 +",5,3.0,ICLR2021 +rylCRiKCYS,3,SJldu6EtDS,SJldu6EtDS,Official Blind Review #3,"=========== +Summary: +This paper proposes a new regularization scheme inspired from (virtual) adversarial training to tackle the problem of learning with noisy labels. While based on the adversarial training (AR), it was found that AR does not directly transferable to deal with noisy labels. The author then proposed the Wasserstein version of AR replacing the KL with the Wasserstein distance and its approximate. This gives the proposed Wasserstein Adversarial Regularization (WAR) which provide considerable robustness improvement on 5 datasets (both classification and segmentation). The correlation between WAR regularization and boundary smoothing is justified both theoretically and empirically with toy examples. The advantage of WAR regularization over existing methods is the flexibility to incorporate intra-class divergence, making it plausible against asymmetric label noise, which is more common in real-world datasets. The authors have done solid work in this paper. 
The experiment is complete, in terms of the scale, noise settings and comparison to existing works. + +I recommend to accept this paper. + +Minor suggestions: +It would be interesting to see the performance on the other common type of real-world noise: open-set label noise [1], and may be applied to adversarial training against adversarial examples. +The adversarial regularization was also used in a recent adversarial training paper [2]. A similar idea is adversarial logit pairing which is regularization on logits [3]. + + +References: +[1] Iterative learning with noisy labels. CVPR 2018 +[2] Theoretically principled trade-off between robustness and accuracy. ICML 2018 +[3] Adversarial logit pairing. arXiv preprint arXiv:1803.06373 (2018).",6,,ICLR2020 +Skl0dCZk5r,1,rJgQkT4twH,rJgQkT4twH,Official Blind Review #1,"the paper uses model interpretation techniques to understand blackbox CNN fit of zebrafish videos. they show that, relying on the technique of deep taylor decomposition, their CNN relies its prediction on a different part of zebra fish than existing understanding. it is also able to detect the use of experimental artifacts, whose removal improves predictive performance. + +the idea of a case study about the usefulness of model interpretation techniques is interesting. while the experimental studies rely on our belief that the interpretation technique indeed interprets, the result that removing experimental features and improving predictive performance is convincing and interesting. it illustrates how model interpretability and human intuition and domain knowledge can be useful.",6,,ICLR2020 +#NAME?,4,LcPefbNSwx_,LcPefbNSwx_,The paper proposed to project input to factor (low-dimensional) and residual features (high-dimensional) to improve neural network training.,"**Strong points** + +The paper provides a very detailed theoretical analysis of motivation. + +**Weak points** + +Analysis in section 2 is based on linear regression, but the proposed method is based on deep models. + +The proposed method is more similar to feature extraction instead of a training method. + +Many models in the chart are not fully converged. It is important to compare convergence speed but also the final accuracy. I think they are not converged, because the training and testing curve is perfectly smoothing without any fluctuations. + +Comparison in chart 3 needs improvement. Since the two models have different input signals and model structure, they inherently need different hyper-parameters (not only learning rate) for best performance. Only comparing them with the same learning rate may not adequate to prove the significance of the proposed method. +",4,3.0,ICLR2021 +jrai3Is1mht,3,udaowxM8rz,udaowxM8rz,A principled approach towards robustness evaluation with some shortcomings,"## Paper summary +The paper considers the problem of measuring the robustness of image classification models to common image perturbations. Datasets of corrupted images, such as ImageNet-C, have been created for this purpose. However, these datasets have been created from an ad-hoc, heuristic selection of perturbations. The present paper proposes a systematic approach to select types of perturbations in a way that spans a large variety of perturbations and assigns similar importance to each perturbation. Similarity of perturbations is measured based on how much training on one perturbations confers robustness against another perturbation. 
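+To make the notion concrete, here is my paraphrase of the overlapping score as a minimal sketch (my own normalization; the exact definition in the paper may differ):
+```python
+def overlapping_score(acc_clean, acc_cross, acc_self):
+    # acc_clean: accuracy on corruption c2 of a model trained on clean data only
+    # acc_cross: accuracy on c2 of a model trained on corruption c1
+    # acc_self:  accuracy on c2 of a model trained on c2 itself
+    # Returns the fraction of the achievable robustness gain on c2 that
+    # training on c1 already confers (1 = redundant pair, 0 = disjoint pair).
+    return (acc_cross - acc_clean) / (acc_self - acc_clean)
+```
+Under this reading, a candidate corruption would be added to the benchmark only if its overlap with the corruptions already selected is low.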
The paper provides an algorithm for selecting perturbations to include, and uses the algorithm to create a variant of ImageNet-C with improved coverage and balance. + + +## Arguments for acceptance +1. Robustness of computer vision models to image perturbations is a topic of great interest, but methods to measure robustness are either highly artificial (adversarial robustness) or ad hoc heuristics. The present paper takes an important step in the direction of making evaluation of robustness to non-adversarial corruptions more principled. +2. The paper is written clearly. +3. The overlapping score is based on a practically relevant quantity, namely the performance of a network on the corrupted data. +4. A simple algorithm is provided for selecting perturbations for new datasets. +5. A new alternative to ImageNet-C is created with the proposed algorithm. The new dataset has improved coverage and balance properties. Some evidence is provided that these improvements can affect the results of robustness evaluations. + +## Arguments against acceptance +6. The computational cost of the method is high, and not stated or discussed. As far as I can tell, at least one neural network needs to be trained for each candidate dataset. Although the authors state that results transfer across architectures, such that a small architecture can be used, it is known that robustness properties depend significantly on model size and training procedure (e.g. see https://arxiv.org/abs/2007.08558). This should be addressed further. +7. This method only really works for synthetic corruptions for which many new examples can be generated. Otherwise, it may be difficult to obtain enough examples to train the network used for computing the overlapping score. +8. It is not discussed whether Algorithm 1 is guaranteed to provide the optimal combination of datasets (in terms of overlap and balance). +9. It is not entirely clear if the improvements provided by ImageNet-NOC make a significant difference in practice. Table 3 starts to address this question, but it would be useful to compare ImageNet-C and ImageNet-NOC mCE across a wider range of models (e.g. pretrained models available online). What is the rank correlation between ImageNet-C and ImageNet-NOC mCE? + +## Conclusion and suggestions +This is a borderline submission. Because of the principled approach to an important and topical problem, I tend towards accepting it. + +Suggestions for improvement: +10. Discuss the computational cost of the method. +11. Discuss the optimality of Algorithm 1. +12. Compare ImageNet-C and ImageNet-NOC on a wider range of models.",6,4.0,ICLR2021 +S1c4VEXWz,3,HJtEm4p6Z,HJtEm4p6Z,"Detailed tech report, missing some motivation and comparison experiments","This paper provides an overview of the Deep Voice 3 text-to-speech system. It describes the system in a fair amount of detail and discusses some trade-offs w.r.t. audio quality and computational constraints. Some experimental validation of certain architectural choices is also provided. + +My main concern with this work is that it reads more like a tech report: it describes the workings and design choices behind one particular system in great detail, but often these choices are simply stated as fact and not really motivated, or compared to alternatives. This makes it difficult to tell which of these aspects are crucial to get good performance, and which are just arbitrary choices that happen to work okay. 
+ +As this system was clearly developed with actual deployment in mind (and not purely as an academic pursuit), all of these choices must have been well-deliberated. It is unfortunate that the paper doesn't demonstrate this. I think this makes the work less interesting overall to an ICLR audience. That said, it is perhaps useful to get some insight into what types of models are actually used in practice. + +An exception to this is the comparison of ""converters"", model components that convert the model's internal representation of speech into waveforms. This comparison is particularly interesting because some of the results are remarkable, i.e. Griffin-Lim spectrogram inversion and the WORLD vocoder achieving very similar MOS scores in some cases (Table 2). I wish there would be more of that kind of thing in the paper. The comparison of attention mechanisms is also useful. + +I'm on the fence as I think it is nice to get some insight into a practical pipeline which benefits from many current trends in deep learning research (autoregressive models, monotonic attention, ...), but I also feel that the paper is a bit meager when it comes to motivating all the architectural aspects. I think the paper is well written so I've tentatively recommended acceptance. + + +Other comments: + +- The separation of the ""decoder"" and ""converter"" stage is not entirely clear to me. It seems that the decoder is trained to predict spectrograms autoregressively, but its final layer is then discarded and its hidden representation is then used as input to the converter stage instead? The motivation for doing this is unclear to me, surely it would be better to train everything end-to-end, including the converter? This seems like an unnecessary detour, what's the reasoning behind this? + +- At the bottom of page 2 it is said that ""the whole model is trained end-to-end, excluding the vocoder"", which I think is an unfortunate turn of phrase. It's either end-to-end, or it isn't. + +- In Section 3.3, the point of mixing of h_k and h_e is unclear to me. Why is this done? + +- The gated linear unit in Figure 2a shows that speaker embedding information is only injected in the linear part. Has this been experimentally validated to work better than simpler mechanisms such as adding conditioning-dependent biases/gains? + +- When the decoder is trained to do autoregressive prediction of spectrograms, is it autoregressive only in time, or also in frequency? I'm guessing it's the former, but this means there is an implicit independence assumption (the intensities in different frequency bins are conditionally independent, given all past timesteps). Has this been taken into consideration? Maybe it doesn't matter because the decoder is never used directly anyway, and this is only a ""feature learning"" stage of sorts? + +- Why use the L1 loss on spectrograms? + +- The recent work on Parallel WaveNet may allow for speeding up WaveNet when used as a vocoder, this could be worth looking into seeing as inference speed is used as an argument to choose different vocoder strategies (with poorer audio quality as a result). + +- The title heavily emphasizes that this model can do multi-speaker TTS with many (2000) speakers, but that seems to be only a minor aspect that is only discussed briefly in the paper. And it is also something that preceding systems were already capable of (although maybe it hasn't been tested with a dataset of this size before). 
It might make sense to rethink the title to emphasize some of the more relevant and novel aspects of this work. + + +---- + +Revision: the authors have adequately addressed quite a few instances where I feel motivations / explanations were lacking, so I'm happy to increase my rating from 6 to 7. I think the proposed title change would also be a good idea.",7,4.0,ICLR2018 +2LjybOQx-ek,2,sebtMY-TrXh,sebtMY-TrXh,"Interesting idea, but not practical for other applications","This paper proposes AriEL, a sentence encoding method onto the compact space [0, 1]^d. It leverages essences of arithmetic coding and kd-tree to encode/decode sentences with a fixed region of the space. With the property of arithmetic coding, in theory, it can map sentences with any lengths into individual values, and any points on [0, 1]^d can map back into corresponding sentence. Although the method relies on neural network based LMs to assign sentences into corresponding regions, the generality of mapping between any sentences/points is kept while changing the LM's behavior. The idea is interesting. + +However, there is some disadvantages of the proposed encoding which are not always mentioned in the paper. First, due to the topological difference between the proposed encoding and other spaces (e.g., Euclidian space), the proposed encoding could not be treated as embeddings in some usual meanings, e.g., it is hard to calculate the ""similarity"" between two encodings by arithmetics on real numbers as many deep learning methods implicitly does. Actually, there seems no evidence of advantages of the proposed encodings on other tasks which are not designed for this encoding unlike experiments on the paper. + +Second, the resulting encodings will be affected directly by the capability of the actual representation of real numbers. E.g., if we used float32 for each dimension, the [0, 1] space can contain only information up to 30 bits long in the most efficient case, which may be insufficient to encode ""all"" sentences into the compact space. It will be problematic when we encode very long sentences (|sentence| >> d). Experiments in the paper did not figure out this point enough because the mean number of words in the corpus is too small (9.9 and 5.9 words, whereas d = 16). + +There are also several presentation errors in the paper: + +- Using different format of all citations. +- Table 3 clearly exceeds the width limit.",4,4.0,ICLR2021 +e65ePnScK6N,1,poH5qibNFZ,poH5qibNFZ,"A new training schema for KD with various applications, but inadequate experiments","This paper introduces Neighbourhood Distillation (ND), a new training pipeline for knowledge distillation (KD), which splits the student network into smaller neighbourhoods and trains them independently. The authors breaks away from the end-to-end paradigm in previous KD methods and provides empirical evidence to reveal feasibility and effectiveness of ND. Specially, ND can: 1) speed up convergence, 2) reuse in neural architecture search and 3) adapt to the synthetic data. + +Strengths +1) The paper is well written and easy to follow. An empirical evidence of the thresholding effect is provided to explain the motivation. +2) The idea is simple and intuitive. The ND seems more like an initialization method for DNN’s local components and a finetuning procedure is sometimes needed for recovering the accuracies. Benefit from parallelism and small training components, such training schema can speed up the convergence of standard KD. 
+3) Several different applications are conducted to demonstrate the flexibility of ND. + +Weaknesses +1) Missing a relevant paper. [1] proposes a similar blockwise knowledge distillation method. The authors should cite and explain the differences between ND and [1]. +2) What is sparsification in Sec. 4? Is it the sparseness of the convolutional kernel or the channel? +3) The authors mention that the work seeks to overcome the limitations of training very deep networks. However, the ResNet50 (the deepest model in experiments) is not deep enough. Usually, it is easy to converge. +4) In Sec. 5, only the width search experiments are conducted, which is more like layer-wise or block-wise pruning. However, architecture search is a general method that can not only search the widths but also the operations. Why only mentions “This set could contain variants of the same architecture...”? Is there any limitation when the searched candidates contain different architectures/operations? +5) All the experiments are done on ResNet series. Different teacher and/or student architectures, such as VGG, ShuffleNet etc., should be considered. +6) Does the observation of thresholding effect benefit from the shortcut in Resblok?Is it suitable for plain CNN, such as VGG? Which blocks are chosen in Fig. 1(a)? Does the shallow and deep blocks have the same phenomenon when perturb small number of blocks? +7) How to record the GPU time of ND? Is it the time of paralleling on multi-GPUS or on single GPU? + +I am currently leaning towards a slightly negative score but would like to see the authors' responses and other reviewer's comments. + + +[1] Hui Wang, Hanbin Zhao, Xi Li, Xu Tan. Progressive Blockwise Knowledge Distillation for Neural Network Acceleration. IJCAI, 2018. +",5,4.0,ICLR2021 +H1gOFYTK3m,1,Hyl_vjC5KQ,Hyl_vjC5KQ,Interesting ideas and analysis but somewhat unclear motivation and limited empirical evaluation,"Revision: The authors addressed most of my concerns and clearly put in effort to improve the paper. The paper explains the central idea better, is more precise in terminology in general, and the additional ablation gives more insight into the relative importance of the advantage weighting. I still think that the results are a bit limited in scope but the idea is interesting and seems to work for the tasks in the paper. I adjusted my score to reflect this. + +Summary: +The paper proposes an HRL system in which the mutual information of the latent (option) variable and the state-action pairs is approximately maximized. To approximate the mutual information term, samples are reweighted based on their estimated advantage. TD3 is used to optimize the modules of the system. The system is evaluated on continuous control task from OpenAI gym and rllab. + +For the most part, the paper is well-written and it provides a good overview of related work and relevant terminology. The experiments seem sound even though the results are not that impressive. The extra analysis of the option space and temporal distribution is interesting. + +Some parts of the theoretical justification for the method are not entirely clear to me and would benefit from some clarification. Most importantly, it is not clear to me why the policy in Equation 7 is considered to be optimal. Given some value or advantage function, the optimal policy would be the one that picks the action that maximizes it. 
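+(Writing Equation 7 in my own notation, it appears to be the familiar advantage-weighted form
+$$\pi^*(a\mid s,o)\;\propto\;\exp\big(A(s,a,o)/\beta\big),$$
+which is only optimal in a maximum-entropy or KL-regularized sense and only for a particular temperature $\beta$; my comments below assume this reading.)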
The authors refer to earlier work in which similar equations are used, but in those papers this is typically in the context of some entropy maximizing penalty or KL constraint. A temperature parameter would also influence the exploration-exploitation trade-off in this ‘optimal’ policy. I understand that the rough intuition is to take actions with higher advantage more often while still being stochastic and exploring but the motivation could be more precise given that most of the subsequent arguments are built on top of it. However, this is not the policy that is used to generate behavior. In short, the paper is clear enough about how the method is constructed but it is not very clear to me *why* the mutual information should be optimized with respect to this 'optimal' policy instead of the actual policy one is generating trajectories from. + +HRL is an interesting area of research with the potential to learn complicated behaviors. However, it is currently not clear how to evaluate the importance/usefulness of hierarchical RL systems directly and the tasks in the paper are still solvable by standard systems. That said, the occasional increase in sample efficiency over plain TD3 looks promising. It is somewhat disappointing that the number of beneficial option is generally so low. To get more insight in the methods it would have been nice to see a more systematic ablation of related methods with different mutual information pairings (action or state only) and without the advantage weighting. Could it be that the number of options has to remain limited because there is no parameter sharing between them? It would be interesting to see results on more challenging control problems where the hypothesized multi-modal advantage structure is more likely to be present. + +All in all I think that this is an interesting paper but the foundations of the theoretical motivation need a bit more clarification. In addition, experiments on more challenging problems and a more systematic comparison with similar models would make this a much stronger paper. + +Minor issues/typos: +- Contributions 2 and 3 have a lot of overlap. +- The ‘o’ in Equation 2 should not be bold font. +- Appendix A. Shouldn’t there be summations over ‘o’ in the entropy definitions? + + +",7,4.0,ICLR2019 +BJL4vxu4e,3,Hk-mgcsgx,Hk-mgcsgx,"Interesting idea, need further work","This paper proposes a multiview learning approach to finding dependent subspaces optimized for maximizing cross-view similarity between neighborhoods of data samples. The motivation comes from information retrieval tasks. Authors position their work as an alternative to CCA-based multiview learning; note, however, that CCA based techniques have very different purpose and are rather broadly applicable than the setting considered here. Main points: + +- I am not sure what authors mean by time complexity. It would appear that they simply report the computational cost of evaluating the objective in equation (7). Is there a sense of how many iterations of the L-BFGS method? Since that is going to be difficult given the nature of the optimization problem, one would appreciate some sense of how hard or easy it is in practice to optimize the objective in (7) and how that varies with various problem dimensions. 
Authors argue that scalability is not their first concern, which is understandable, but if they are going to make some remarks about the computational cost, it better be clarified that the reported cost is for some small part of their overall approach rather than “time complexity”. + +- Since authors position their approach as an alternative to CCA, they should remark about how CCA, even though a nonconvex optimization problem, can be solved exactly with computational cost that is linear in the data size and only quadratic with dimensionality even with a naive implementation. The method proposed in the paper does not seem to be tractable, at least not immediately. + +- The empirical results with synthetic data are a it confusing. First of all the data generation procedure is quite convoluted, I am not sure why we need to process each coordinate separately in different groups, and then permute and combine etc. A simple benchmark where we take different linear transformations of a shared representation and add independent noise would suffice to confirm that the proposed method does something reasonable. I am also baffled why CCA does not recover the true subspace - arguably it is the level of additive noise that would impact the recoverability - however the proposed method is nearly exact so the noise level is perhaps not so severe. It is also not clear if authors are using regularization with CCA - without regularization CCA can be have in a funny manner. This needs to be clarified. +",4,4.0,ICLR2017 +rJ205zPlG,2,Hk99zCeAb,Hk99zCeAb,"Mixed - great results on image generation, but not properly anonymized","Before the actual review I must mention that the authors provide links in the paper that immediately disclose their identity (for instance, the github link). This is a violation of double-blindness, and in any established double-blind conference this would be a clear reason for automatic rejection. In case of ICLR, double-blindness is new and not very well described in the call for papers, so I guess it’s up to ACs/PCs to decide. I would vote for rejection. I understand in the age of arxiv and social media double-blindness is often violated in some way, but here the authors do not seem to care at all. + +— + +The paper proposes a collections of techniques for improving the performance of Generative Adversarial Networks (GANs). The key contribution is a principled multi-scale approach, where in the process of training both the generator and the discriminator are made progressively deeper and operate on progressively larger images. The proposed version of GANs allows generating images of high resolution (up to 1024x1024) and high visual quality. + +Pros: +1) The visual quality of the results is very good, both on faces and on objects from the LSUN dataset. This is a large and clear improvement compared to existing GANs. +2) The authors perform a thorough quantitative evaluation, demonstrating the value of the proposed approach. They also introduce a new metric - Sliced Wasserstein Distance. +3) The authors perform an ablation study illustrating the value of each of the proposed modifications. + +Cons: +1) The paper only shows results on image generation from random noise. The evaluation of this task is notoriously difficult, up to impossible (Theis et al., ICLR 2016). The authors put lots of effort in the evaluation, but still: +- it is unclear what is the average quality of the samples - a human study might help +- it is unclear to which extent the images are copied from the training set. 
The authors show some nearest neighbors from the training set, but very few and in the pixel space, which is known to be pointless (again, Theis et al. 2016). Interpolations in the latent space is a good experiment, but in fact the interpolations do not look that great on LSUN +- it is unclear if the model covers the full diversity of images (mode collapse) +It would be more convincing to demonstrate some practical results, for instance inpainting, superresolution, unsupervised or semi-supervised learning, etc. +2) The general idea of multi-scale generation is not new, and has been investigated for instance in LapGAN (Denton et al., ICLR 2015) or StackGAN (Zhang et al., ICCV2017, arxiv 2017). The authors should properly discuss this. +3) The authors mention “unhealthy competition” between the discriminator and the generator several times, but it is not quite clear what exactly they mean - a more specific definition would be useful. + +(This conclusion does not take the anonymity violation into account. Because of the violation I believe the paper should be rejected. Of course I am open to discussions with ACs/PCs.) +To conclude, the paper demonstrates a breakthrough in the quality and resolution of images generated with a GAN. The experimental evaluation is thorough, to the degree allowed by the poorly defined task of generating images from random noise. Results on some downstream tasks, such as inpainting, image processing or un-/semi-supervised learning would make the paper more convincing. Still, the paper should definitely be accepted for publication. Normally, I would give the paper a rating of 8.",1,4.0,ICLR2018 +rklaWryJoH,4,rJehVyrKwH,rJehVyrKwH,Official Blind Review #1,"The suggested method proposes a technique to compress neural networks bases on PQ quantization. The algorithm quantizes matrices of linear operations, and, by generalization, also works on convolutional networks. Rather than trying to compress weights (i.e. to minimize distance between original and quantized weights), the algorithm considers a distribution of unlabeled inputs and looks for such quantization which would affect output activations as little as possible over that distribution of data. The algorithm works by splitting each column of W_ij into m equal subvectors, learning a codebook for those subvectors, and encoding each of those subvectors as one of the words from the codebook. + +The method provides impressive compression ratios (in the order of x20-30) but at the cost of a lower performance. Whether this is a valuable trade-off is highly application dependent. + +Overall I find the paper interesting and enjoyable. However, as I am not an expert in the research area, I can not assess how state of the art the suggested method is. + +There are a few other questions that I think would be nice to answer. I will try to describe them below: + +Suppose we have a matric W_{ij} with dimensions NxM where changing i for a given j defines a column. By definition, linear operation is defined +y_i = sum_j W_ij x_j . Now say each column of matrix W is quantized into m subvectors. We can express W_ij in the following way: +W_ij = (V^1_ij + V^2_ij + ... V^m_ij)x_j where V^m_ij is zero everywhere except for the rows covering a given quantized vector. +For example, if W had dimensions of 8x16 and m=4, +V^2_{3,j}=0, for all j, V^2_{4,j}=non_zero, V^2_{7,j}=non_zero, V^2_{8,j}=0, V^2_{i=4:8,j}=one_of_the_quantized_vectors. 
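+(A minimal numpy sketch of the block structure I have in mind, my own construction rather than anything taken from the paper, is:
+```python
+import numpy as np
+N, M, m = 8, 16, 4              # output dim, input dim, number of row blocks
+W = np.random.randn(N, M)
+x = np.random.randn(M)
+V = np.zeros((m, N, M))
+for k in range(m):
+    rows = slice(k * N // m, (k + 1) * N // m)
+    V[k, rows, :] = W[rows, :]  # V^k keeps only the k-th band of rows of W
+assert np.allclose(V.sum(axis=0), W)
+z = np.einsum('knm,m->kn', V, x)    # partial products z^k
+assert np.allclose(z.sum(axis=0), W @ x)
+```
+so each V^k is supported on a single band of N/m rows and the partial products z^k sum to y.)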
+ +y_i = sum_j W_ij x_j = sum_k sum_j (V^k_ij) x_j =def= sum_k z^k_i where z^k are partial products: z^k_i=0 for i(k+1)N/m + +Thus, the suggested solution effectively splits the output vector y_i into m sections, defines sparse matrices V^k_{ij} 1<=k<=m, and performs column-wise vector quantization for these matrices separately. + +Generally, it is not ovious or given that the current method would be able to compress general matrices well, as it implicitly assumes that weight W_{ij} has a high ""correlation"" with weights W_{i+kN/m,j} (which I call ""vertical"" correlation), W_{i,k+some_number} (which I call ""horizontal"" correlation) and W_{i+kN/m,k+some_number} (which I call ""other"" correlation). It is not given that those kind of redundancies would exist in arbitrary weight matrices. + +Naturally, the method will work well when weight matrices have a lot of structure and then quantized vectors can be reused. Matrices can have either ""horizontal"" or ""vertical"" redundancy (or ""other"" or neither). It would be very interesting to see which kind of redundancy their method managed to caprture. + +In the 'horizontal' case, it should work well when inputs have a lot of redundancy (say x_j' and x_j'' are highly correlated making it possible to reuse code-words horizontally within any given V^k: V^k_ij'=V^k_ij''). However, if thise was the case, it would make more sense to simply remove redundancy by prunning input vector x_j by removing either x_j' or x_j'' from it. This can be dome by removing one of the outputs from the previous layer. This can be a symptom of a redundant input. + +Another option is exploiting ""vertical"" redundancy: this happens when output y_i' is correlated with output y_{i'+N/m}. This allows the same code-word to be reused vertically. This can be a symptom of a redundant output. It could also be the case that compressibility could be further subtantially improved by trying different matrix row permutations. Also, if one notices that y_i' ir correlated with y_i'', it might make sense to permute matrix rows in such a way that both rows would end up a multiple N/m apart. It would be interesting to see how this would affect compressibility. + +The third case is when code words are reused in arbitrary cases. + +Generally, I think that answering the following questions would be interesting and could guide further research: +1. It would be very interesting to know what kind of code-word reusa patterns the algorithm was able to capture, as this may guide further research. +2. How invariance copressibility is under random permutations of matrix rows (thus also output vectors)? +",8,,ICLR2020 +7SSzHDP8S2O,1,u8APpiJX3u,u8APpiJX3u,A potentially interesting topic but studied on a too small scale,"This paper studies the influence of the use of shared vs independent parameters in re-used blocks of neural networks. This is achieved for the task of semantic segmentation, with a network architecture that iteratively refines its prediction. + +Strengths: +- Studying the effect of parameter sharing vs the use of independent parameters in recurrent types of architecture is interesting and could lead to a better understanding of these architectures +- The paper is clearly written, and the methodology would be reproducible + + +Weaknesses: + +Contribution: +- While I do believe that such a study could provide the community with a better understanding of architectures with re-used blocks, the scale of the study performed in this paper is too small to draw conclusions. 
The paper tackles a single task (semantic segmentation), and, more critically, evaluates a single architecture. There is therefore no evidence that the conclusions drawn here will generalize to other architectures/tasks, which significantly limits the potential impact of this paper on the community. + +Related work: +- While I am not aware of similar studies, several important references tackling the task of recurrent semantic segmentation are missing, e.g., Pinheiro & Collobert, ICML 2014; Wang et al., ICCV 2019. These methods rely on different architectures, and studying the effect of parameter sharing in their frameworks would broaden the scope of this work. +- Similarly, video segmentation has also been tackled with recurrent architectures, e.g., Ballas et al., ICLR 2016; Valipour et al., WACV 2017. Studying the use of parameter sharing in this context would thus also increase the potential impact of this work. + +Empirical results: +- While the results indeed evaluate the effect of parameter sharing on the chosen architecture, they seem to be somewhat disappointing in terms of absolute performance. In particular, on CityScapes (Fig. 1(b)), CNMM yields significantly higher mIoUs than the proposed method for the same number of MACs. Wouldn't it be possible to also extend this study to the CNMM architecture? + +Minor comments: +- In the caption of Fig. 1, the authors mention that the parameters of the classification block are never shared. Does this mean that there is one classification block for each recurrent iteration? +- In Section 2.2, when explaining the normalization of the weight factors, I suggest using a different notation for the weights before and after normalization, e.g., w_n = w_n'/(\sum_i w_i'). +- The title of Section 3.1 (Neural Architecture Search) is a bit misleading, as the authors simply perform a grid search of a few hyper-parameters, not a proper search over a large space of different architectures as in the NAS sub-field. +- The meaning of the different colors in Fig. 4 is not always explicitly defined. +",4,4.0,ICLR2021 +r1g_66GAKr,1,H1lac2Vtwr,H1lac2Vtwr,Official Blind Review #2,"This paper proposes a novel BERT based neural architecture, SESAME-BERT, which consists of “Squeeze and Excitation” method and Gaussian blurring. “Squeeze and Excitation” method extracts features from BERT by calculating a weighted sum of layers in BERT to feed the feature vectors to a downstream classifier. To capture the local context of a word, they apply Gaussian blurring on output layers of the self-attention layer in BERT. The authors show their model’s performance on GLUE and HANS dataset. + +Strengths +*This paper claims the importance of the local context of a word and shows an effect of their method on the various datasets: GLUE, and HANS. + +Weaknesses +* It seems like the self-attention layer can learn the local context information. Finding important words and predicts contextual vector representation of a word is what self-attention does. +So, if using local-context information, which is information in important near words, is an important feature for some downstream tasks, then the self-attention layer can learn such important near words by training the key, query, and value weight parameters to connect the near important words. +It would be nice if the authors provide some evidence that self-attention can't learn such a local-context feature. + +*In table 1, their experimental results show a slight improvement by using their method, but it's not significant. 
+ +* On HANS dataset, they show using local-context can prevent models from easily adopting heuristics. How Gaussian blurring can prevent that problem? More explanation about the relation between local-context and adopting heuristics is required. + + + +",3,,ICLR2020 +B1ggQEC6tS,1,r1gfweBFPB,r1gfweBFPB,Official Blind Review #2,"This paper addresses a very good question - can we do better in terms of model learning, so that we can find the much sought after middle ground between model free and model based RL. In particular, the authors ask, can we find a way to learn a model that is reward/task independent, so that a new task can be equally well handled. This is timely and the general thrust of the thinking, in terms of learning from perturbation around trajectories, is good but I am not sure the proposed methods are sufficiently well developed to merit publication. I am also concerned that the authors do not consider numerous issues with the setup that are fairly well understood as issues for system identification. + +The main idea, as laid out in §1.1, is to observe that the parameter update depends mainly on the way a small perturbation in parameters is reflected as a variation in the optimal trajectory (by asking for the probability of a trajectory, this variation becomes the probability of a nearby trajectory). The authors then approach the approximation of this in terms of a discrete finite differences estimate. There are some extensions, such as using a local GP model instead of a local linear model and consideration of ways in which the system might not be exactly repeatable given initial states. These are all proper questions but there are many more important unanswered ones: + +1. Starting with where the model setup begins, it is not clear why a complex nonlinear dynamical system, i.e., the typical multi-jointed robot taken as a dynamical system (so, not just kinematics and quasi-static movements), can be sufficiently well approximated using a discretised finite point set that is used at the start of §2 - how does one find the correct T, the correct step size, how does one change these for the local nature of the dynamics (some places might be smoother than others, in phase space), etc.? Even more importantly, are we assuming we know the proper state space ahead of time so that there is no history dependence due to unobserved variables? + +2. As such, the authors are proposing to perform closed-loop system identification in a completely data-driven manner. It is well known that this is hard because in the absence of suitable excitation, not all necessary modes in the dynamics will be observed. The only controlled example considered, in §4.3, and subsequent discussion about 'zero-shot' generalisation is getting at this. However, neither at the conceptual level nor in terms of the detailed experiment do I see a good account of what allows this approach to learn all aspects of the dynamics of the system from just small perturbations around a closed loop trajectory. + +3. In light of all this, I find the evaluation really weak. 
Some experiments I would have liked to have seen include - (i) a control experiment based on a standard multi-link arm to show how bad the issue of model mis-match is for the task being considered (I suspect, not much), (ii) experiments with local linearizations, and perhaps piecewise local linearizations, to show how much innovation is needed or is being achieved by the proposed advances, (iii) for us to be talking about 'zero shot' generalisation and the like, more sophisticated tasks beyond merely changing the reaching point (as I say before, it is not even clear that a good PID controller with a roughly plausible linearization is not sufficient to achieve similar effects, and certainly there is a plethora of more sophisticated baselines one could have drawn upon). + +4. Some of the discussion comes across as a bit naive, e.g., we have a lemma 3 whose proof is simply a geometric argument about cubes without sufficient consideration of properties of dynamics. I don't doubt the result but in the way it is presented here, it seems shoddy. + +Also, some smaller questions not properly explained: +a. How do you know which kernels for good for the GP in equations 9-10? +b. Why should we expect the correlation procedure in §3.0.1 to always work without aliasing and what is the way to get at the suitable domain? + + +",1,,ICLR2020 +BJlOAwOpYr,1,HJgcvJBFvB,HJgcvJBFvB,Official Blind Review #3,"This paper proposes methods to improve generalization in deep reinforcement learning with an emphasis on unseen environments. The main contribution is essentially a data augmentation technique that perturbs the input observations using a noise generated from the range space of a random convolutional network. The empirical results look impressive and demonstrate the effectiveness of the method. The experiments are thorough (includes even adversarial attack) and the core method is novel as far as I am aware. + +That said, I have a couple of concerns regarding this paper and I would be willing to change my score if authors can address these. + +1) Feature matching loss (Eq 2) is presented as a novel contribution without referring to related work in semisupervised learning literature. This is essentially consistency training. See: +a) Miyato, Takeru, et al. ""Virtual adversarial training: a regularization method for supervised and semi-supervised learning."" IEEE transactions on pattern analysis and machine intelligence 41.8 (2018): 1979-1993. +b) Xie, Qizhe, et al. ""Unsupervised data augmentation."" arXiv preprint arXiv:1904.12848 (2019). + +2) The main contribution appears to be a data augmentation technique where we add a random neural net based perturbation to the state. My question is: + +*Why don't you first evaluate this on computer vision tasks given that the core idea is data augmentation for images?* + +If this technique is so powerful, shouldn't this do a great job in CIFAR10, Imagenet etc? Instead authors only provide a niche example (bright vs dark cat/dogs). + +If this can compete with top augmentation techniques on Imagenet (e.g. autoagument), then it can explain the RL performance. Otherwise, please provide some intuition on why this works so well on RL but not as well on computer vision tasks. Is it the unseen environment diversity of RL challenges? + +3) While proposed method performs well on the benchmarks, it is not clear whether authors compare to the state-of-the-art algorithms. For each task (CoinRun, DeepMind Lab, etc), please explicitly state the best prior result (e.g. 
Espeholt et al, Tobin et al, Cobbe et al etc) so that proposed method's performance can be better assessed. + +------------------------- + +After rebuttal: Authors addressed most of my comments. I also found the new experimental results (Fig 5 and 7) very insightful. I increase my score to Weak Accept. + +For future improvement: More realistic experiments on computer vision tasks (besides cats and dogs) would be welcome. Otherwise, please justify why proposed strategy is particularly good for RL (rather than traditional computer vision benchmarks) in boosting robustness to new domains.",6,,ICLR2020 +rJeSu1F5hQ,2,B1eO9oA5Km,B1eO9oA5Km,Review,"This paper make two contributions: (1) it propose a new framework for semi-supervised training for NMT by introduce constraint of encoder and decoder states. (2) It apply Q-learning to schedule the updates of different components. I personally highly believe find the relation between encoder and decoder hidden states is a very good direction for utilizing pair data. Model scheduling is also an important problem for multilingual-NMT. + +However, this paper is very hard to follow. +1. It has lots of acronyms, e.g. section 3.1. It also try to over-complicated the algorithm and I don't think these acronyms are necessarily to be defined. +2. It try to link it to information theory but most of study is just empirical (which is fine, but avoid it can simplify the writing and make it more readable), e.g. "" According to information theory and the attention mechanism +(Bahdanau et al., 2014), it is clear that we.."" I agree with the intuition but how it can be ""if and only if""? +3. It said Figure 2 shows BDE better aligned with BLUE, is there a quantitative measure, e .g. correlation? Or I missed something. +4. What is the NMT network structure? +5. I have trouble to understand ""In this process, one monolingual data Si of language i would first be translated to hidden states (ISD) of deci through NMTi , then ISDi is used to reconstruct..."" ""Guided Dual Learning"" part. + +The experimental results looks good, especially for low-resource case. But addressing of similarity and comparison with some previous methods could be improved. At least there is simply baseline which use pre-training. Adding some published SOTA results in the table can also help to understand how well it is. + +In summary, the paper provide some interesting perspectives. However, it's hard to follow on the algorithm part and lack of relevant baseline.",5,2.0,ICLR2019 +BygyoQSKYB,1,SklgfkSFPH,SklgfkSFPH,Official Blind Review #1,"The authors replace the empirical risk term in a PAC-Bayes bound by its second-order Taylor series approximation, obtaining an approximate (?) PAC-Bayes bound that depends on the Hessian. Note that the bound is likely overoptimistic unless the minimum is quadratic. They purpose to study SGD by centering the posterior at the weights learned by SGD. The posterior variance that minimizes this approximate PAC-Bayes bound can then be found analytically. They also solve for the optimal prior variance (assuming diagonal Gaussian priors/posteriors), producing a hypothetical ""best possible bound"" (at least under the particular choices of priors/posteriors, and under this approximation of the empirical risk term). The authors evaluate their approximate bound and ""best bound possible"" empirically on MNIST and CIFAR. This requires computing approximations of the Hessian for small fully connected neural networks trained on MNIST and CIFAR10. 
There are some nice visualizations (indeed, these may be one of the most interesting contributions.) + +The direction taken by the authors is potentially interesting. However, there are a few issues that would have to be addressed carefully for me to recommend acceptance. First, the comparison to (some very) related work is insufficient, and so the actual novelty is misrepresented (see detailed comments below). Further, the paper is full of questionable vague claims and miscites/attributes other work. At the moment, I think the paper is below the acceptance threshold: the authors need to read and understand (!) related work, and expand their theoretical and/or empirical results to produce a contribution of sufficient novelty/impact. + +DETAILED FEEDBACK. + +I believe the authors missed some related work by Tsuzuki, Sato and Sugiyama (2019), where a PAC-Bayes bound was derived in terms of the Hessian, via a second-order approximation. How are the results presented in this submission relate to Tsuzuki et al approach? + +When the posterior, Q, is a Gaussian (or any other symmetric distribution), \eta^T H \eta is the so-called Skilling-Hutchinson trace estimator. Thus E(\eta^T H \eta) is the Trace(H) scaled by the variance of \eta. The authors seem to have completely missed this connection, which simplifies the final expression considerably. + +Why is the assumption that the higher order terms are negligible reasonable? Citation or experiments required. + +Regarding the off-diagonal Hessian approximation: how does the proposed layer-wise approximation relate to k-FAC (Martens and Grosse 2015)? + +IB Lagrangian: I am not sure why the authors state the result in Thm 4.2 as a lower bound on the IB Lagrangian. What’s the significance of having a lower bound on IB Lagrangian? + +Other comments: + +Introduction: “At the same time neither the non-convex optimization problem solved in .. nor the compression schemes employed in … are guaranteed to converge to a global minimum.”. This is true but it is really not clear what the point being made is. Essentially, so what? Note that PAC-Bayes bounds hold for all posteriors, even ones not centered at the global minimum (of any objective). The claims made in the rest of the paragraph are also questionable and their purposes are equally unclear. I would be grateful if the authors could clarify. + +First sentence of Section 3.1: “As the analytical solution for the KL term in 1 obviously underestimates the noise robustness of the deep neural network around the minimum...”. I have no idea what is being claimed here. The statement needs to be made much less vague. Please explain. + +Section 4: “..while we will be minimizing an upper bound on our objective we will be referring with a slight abuse of terminology to our results as a lower bound.”. I would appreciate if the authors could clarify what they mean here. + +Section 4.1 beginning: “We make the following model assumptions...”. Choosing a Gaussian prior and posterior is not an assumption. It's simply a choice. The PAC-Bayes bound is valid for any choices of Gibbs classifiers. On the other hand, it is an assumption that such distributions will yield ""tight"" bounds, related to the work of Alquier et al. + +Section 4.1 “In practice we perform a grid search over the parameters..”. The authors should mention that such a search should be accounted for via a union bound (or otherwise). The ""cost"" of such a union bound should be discussed. + +The empirical risk of Q is computed using 5 MCMC samples. 
This seems like a very low number, as it would not even give you one decimal point of accuracy with reasonable confidence! The authors should either use more samples, or account for the error in the upper bound using a confidence interval derived from a Chernoff bound. + +Section 4.2: “The concept of a valid prior has been formalized under the differential privacy setting...”. I am not sure what the authors mean by that. + +Section 5: “There is ambiguity about the size of the Hessians that can be computed exactly.” What kind of ambiguity? + +Same paragraph in Section 5 discusses why there are few articles on Hessian computation. The authors claim that “the main problem seems to be that the relevant computations are not well supported...”. This is followed by another comment that is supposed to contrast the previous claim, saying that storing the Hessian is infeasible due to memory requirements. I am not sure how this claim about memory requirements shows a contrast with the claim on computation not being supported. + +First sentence in Section 5.1: I believe this is only true under some conditions. + +Section 5.1: The authors should explain why they add a damping term, alpha, to the Hessian, and how alpha affects the results. + +*** +Additional citation issues: + +The connections between variational inference, PAC-Bayes and IB Lagrangian have been pointed out in previous work (e.g. Germain, Bach, Lacoste, Lacoste-Julien (2016); Achille and Soatto 2017). + +In the introduction, the authors say “...have been motivated simply by empirical correlations with generalization error; an argument which has been criticized …” (followed by a few citations). Note, that this was first criticized in Dziugaite and Roy (2017). + +“Both objectives in … are however difficult to optimize for anything but small scale experiments.”. It seems peculiar to highlight this, since the approach that the authors are presenting is actually more computationally demanding. + +Citations for MNIST and CIFAR10 are missing. + +*** +Minor: +Theorem 3.1 “For any data distribution over..”, I think it was meant to be \mathcal{X} \times (and not \in ) +Theorem 4.2: “For our choice of Gaussian prior and posterior, the following is a lower bound on the IB-Lagrangian under any Gaussian prior covariance”. I assume only the mean of the Gaussian prior is fixed. + + +Citations are misplaced (breaking the sentences, unclear when the paper of the authors are cited). +There are many (!) missing commas, which makes some sentences hard to follow. + +*** +Positive feedback: I thought the visualizations in Figure 2 and 3 were quite nice. +",1,,ICLR2020 +6GzUTk847K,3,X76iqnUbBjz,X76iqnUbBjz,Interesting work and solid analysis,"This work proposes that the transferability of adversarial attacks has a negative correlation with the interaction within an input perturbation. By defining the interaction of perturbations with the Sharpley value, it can quantify the interactions and demonstrate the negative correlation with the transferability. Furthermore, this work shows that prior work on adversarial attacks (e.g., VR attack and MI attack) can be explained by the (expected) interaction scores. This work further demonstrates that the way of enhancing transferability by minimizing the interaction within input perturbations, with the experiments on the image classification task. + +Overall, I think this work provides a new perspective of understanding transferability and presents solid analysis/experiments to verify the hypothesis. 
+ +Only one comment about the definition of interaction scores. In some literature [Lundberg et al., 2019], it is called the Shapley interaction index, which uses the definition in equation (13) of Appendix D. Shapley interaction index has mainly been used in the machine learning literature recently for explaining feature interactions within models. E.g., + +1. Lundberg et al. Consistent Individualized Feature Attribution for Tree Ensembles. 2019 +2. Chen and Ji. Learning Variational Word Masks to Improve the Interpretability of Neural Text Classifiers. 2020 +",10,4.0,ICLR2021 +SJx-bmHbnm,1,S1lwRjR9YX,S1lwRjR9YX,Stability of Stochastic Gradient Method with Momentum for Strongly Convex Loss Functions,"Comments: + +The author(s) provide stability and generalization bounds for SGD with momentum for strongly convex, smooth, and Lipschitz losses. + +This paper basically follows and extends the results from (Hardt, Recht, and Singer, 2016). Section 2 is quite identical but without mentioning the overlap from Section 2 in (Hardt et al, 2016). The analysis closely follows the approach from there. + +The proof of Theorem 2 has some issues. The set of assumptions (smooth, Lipschitz and strongly convex) is not valid on the whole set R^d, for example quadratic function. In this case, your Lipschitz constant L would be arbitrarily large and could be damaged your theoretical result. To consider projected step is true, but the proof without projection (and then explaining in the end) should have troubles. + +From the theoretical results, it is not clear that momentum parameter affects positively or negatively. In Theorem 3, what is the advantage of this convergence compared to SGD? It seems that it is not better than SGD. Moreover, if \mu = 0 and \gamma > 0, it seems not able to recover the linear convergence to neighborhood of SGD. Please also notice that, in this situation, L also could be large. + +The topic could be interesting but the contributions are very incremental. At the current state, I do not support the publications of this paper. +",4,4.0,ICLR2019 +rkxn9hTqhX,2,S1MB-3RcF7,S1MB-3RcF7,"The idea is natural and interesting, the presentation is clear, but short of analysis on the computational cost (FLOPS and memory consumption)","This paper studies the problem of training of Generative Adversarial Networks employing a set of discriminators, as opposed to the traditional game involving one generator against a single model. Specifically, this paper claims two contributions: +1. We offer a new perspective on multiple-discriminator GAN training by framing it in the context of multi-objective optimization, and draw similarities between previous research in GANs variations and MGD, commonly employed as a general solver for multi-objective optimization. +2. We propose a new method for training multiple-discriminator GANs: Hypervolume maximization, which weighs the gradient contributions of each discriminator by its loss. + +Overall, the proposed method is empirical and the authors show its performance by experiments. + +First, I want to discuss the significance of this work (or this kind of work). As surveyed in the paper, the idea of training of Generative Adversarial Networks employing a set of discriminators has been explored by several previous work, and showed some performance improvement. However, this idea (methods along this line) is not popular in GAN applications, like image-to-image translation. 
I guess that the reason may be that: the significant computational cost (both in FLOPS and memory consumption) increase due to multiple discriminators destroys the benefit from the small performance improvement. Maybe I’m wrong. In Appendix C Figure 10, the authors compares the wall-lock time between DCGAN, WGAN-GP and multiple-discriminator, and claims that the proposed approach is cheaper than WGAN-GP. However, WGAN-GP is more expensive due to its loss function involves gradients, while the proposed method does not. If directly compared with DCGAN, we can see an obvious increase in wall-clock time (FLOPS). In addition, the additional memory consumption is hidden there, which is a bigger problem in practice when the discriminators are large. SN-GAN have roughly the same computational cost and memory consumption of DC-GAN, but inception and FID are much higher. From my perspective, a fair comparison is under roughly the same FLOPS and memory consumption. + +The paper is well-written. The method is well-motivated by the multi-objective optimization perspective. Although the presentation of the Hypervolume maximization method (Section 3.2) is not clear, the resulting loss function (Equation 10) is simple, and shares the same form with other previous methods. The hyperparameter \eta is problematic in the new formulation. The authors propose the Nadir Point Adaption to set this parameter. + +The authors conduct extensive experiments to compare different methods. The authors emphasize that the performance is improved with more discriminators, but it’s good to contain comparison of the computational cost (FLOPS and memory consumption) at the same time. There are some small questions for the experiments. The reported FID is computed from a pretrained classifier that is specific to the dataset, instead of the commonly used Inception model. I recommend the authors also measure the FID with the Inception model, so that we have a direct comparison with existing reported scores. + +Overall, I found that this work is empirical, and I’m not convinced by its experiments about the advantage of multiple-discriminator training, due to lacking of fair computational cost comparison with single-discriminator training. ",5,3.0,ICLR2019 +A8vA0SBZUIV,1,DC1Im3MkGG,DC1Im3MkGG,ICLR 2021 Conference Paper3730 AnonReviewer2,"Summary: +This paper studies the connections between algorithmic fairness and domain generalization. As discussed in Section 2, the “environment” in domain generalization plays a similar role as the “group membership” in algorithmic fairness. The paper shows in Table 2 that the methods of each field can apply to the other field. + +The paper develops its own algorithm EIIL which extends the Invariant Risk Minimization (IRM) of domain generalization to work in the situation when the prior knowledge of environments is not available. And this extension is mainly based on the idea from algorithmic fairness literature which considers the worst-case environments and solves a bi-level optimization. + +The paper shows empirically that their algorithm EIIL outperforms IRM with handcrafted environments in terms of test accuracy on CMNIST. + +Strength: +(1) The connection between domain generalization and algorithmic fairness shown by the paper is interesting. +(2) The paper demonstrates the performance of EIIL via empirical results. + +Weakness: +(1) Other than the high level intuitions and examples, the paper does not provide any theoretical analysis of the performance of the EIIL for domain generalization. 
What guarantees can EIIL get in terms of the test error and how does it compare to IRM (when making reasonable assumptions about the training and test distributions)? +(2) Similarly, the paper does not provide any theoretical analysis of EIIL for algorithmic fairness. +(3) On top of page 6, after explaining the bi-level optimization, the paper switches to the sequential approach (EIILv1) without much explanation. Why is the bi-level optimization not practical? How well can the proposed sequential approach approximate the bi-level optimization results and how does this affect the performance of EIILv1? + +Reasons for score: +Overall I vote for rejection since the weakness outweighs the strength. The lack of theoretical analysis of the algorithm makes the paper incomplete. + +Typo: +Page 5: two periods after word “poorly”. +",4,3.0,ICLR2021 +rkgtgQXRYB,1,HkeMYJHYvS,HkeMYJHYvS,Official Blind Review #2,"Summary: The suggest two improvements to boundary detection models: (1) a curriculum learning approach, and (2) augmenting CNNs with features derived from a wavelet transform. For (1), they train half of the epochs with a target boundary that is the intersection between a Canny edge filter and the dilated groundtruth. The second half of epochs is with the normal groundtruth. For (2), they compute multiscale wavelet transforms, and combine it with each scale of CNN features. They find on a toy MNIST example that the wavelet transform doesn’t impact results very much and curriculum learning seems to provide some gains. On the Aerial Road Contours dataset, they find an improvement of ~15% mAP over the prior baseline (CASENet). + +I have several concerns with this work: +* The idea of using wavelet transforms to augment CNNs has been more thoroughly explored in prior work (e.g., see [1]). +* No comparison to existing SOTA segmentation models (e.g., [2]). These semantic / instance segmentation models can easily be adapted to the task of boundary detection. I suspect the baseline here is weak. +* Section 6 is severely unfinished. The explanation is sparse and there are no quantitative results -- just the output of the model overlaid on one example. +* The choice of curriculum learning task is arbitrary, and there are no ablations explaining why this is a reasonable task. For example, what about random subsets of pixels? At the moment, it offers no insight for practitioners. +* There are no ablations for the Aerial Road Contours experiments. This seems necessary because it is the only realistic dataset evaluated in this work. The MNIST experimental results appear qualitatively different from the Contours experiment. For example, they show that wavelet features do not make much of a difference, but does it make a difference for Contours? + +Altogether, this work unfortunately offers few insights to vision practitioners, let alone general practitioners. Substantial work needs to be devoted to expanding experimental coverage. + +[1] Wavelet Convolutional Neural Networks. Shin Fujieda, Kohei Takayama, Toshiya Hachisuka +[2] TensorMask: A Foundation for Dense Object Segmentation. Xinlei Chen, Ross Girshick, Kaiming He, Piotr Dollár",1,,ICLR2020 +g2YJDUhcz-b,3,6xHJ37MVxxp,6xHJ37MVxxp,"Review for ""domain generalization with MixStyle"" ","** Paper Summary ** + +This paper proposed a simple regularization technique for domain generalization tasks, termed MixStyle, based on the observation that domains are determined by image styles. 
By mixing the styles of different instances, which synthesizes samples from new domains while preserving the content features, the proposed method improves the generalizability of the trained model. MixStyle was applied to numerous applications, such as category classification, instance retrieval, and reinforcement learning, and attained state-of-the-art results. MixStyle is relatively simple to implement, yet effective. + +** Paper Strength ** ++ Simple methodological design, so it is easy to implement. ++ Understanding the domain shift problem as a style variation makes sense. ++ Randomizing the styles might be a solution to alleviate the domain generalization problem, but searching over all possible styles and applying them would be challenging and infeasible. So, using different instance samples to extract the styles is a nice idea. ++ It makes sense to introduce \lambda to mix an instance's own style with that of a different instance. ++ The paper is well organized and well written. + +** Paper Weakness ** + +I have no major comments on this paper, but some minor comments follow: +- Even though the authors have shown an ablation study to analyze the levels at which MixStyle should be applied, it is still not clear to me. The authors applied MixStyle after the 1st, 2nd, and 3rd residual blocks for category classification problems, but after the 1st and 2nd residual blocks for the instance retrieval task. In the Section 3.4 analysis, they only showed the ablation studies on category classification. Thus, one might think the optimal combinations vary according to the application. In addition, other combinations, e.g., conv34 or conv25, would be more interesting. +- Fig 4 is hard to understand: what do the corresponding style statistics mean? And why does only (d) show different legends? +- In Table 1, some experimental settings, e.g., Cartoon or Photo, show that MixStyle w/ random shuffle was better -- why? A discussion of this would be interesting. 
",7,4.0,ICLR2021
+ryx5EZ5htB,2,HygkpxStvr,HygkpxStvr,Official Blind Review #2,"This paper tackles the problem of learning to label individual timesteps of sequential data, when given labels only for the sequence as a whole. The authors take an approach derived from the multiple-instance learning (MIL) literature that involves pooling the per-timestep predictions into a sequence-level prediction, so that the per-timestep predictions can be learned without explicit labels. 
They evaluate several pooling techniques and conclude that the log-sum-exp pooling approach is superior. The learned segmentations are used to train policies for multiple control skills, and these are used to solve hierarchical control tasks where the correct skill sequence is known. + +This is a good application of the MIL approach. However, I have settled on a weak reject because in my view, the novelty and results are minor. + +The main point of comparison is the log-sum-exp() pooling as compared to max() and neighborhood-max() pooling. However, if I understand correctly, the log-sum-exp() approach has been used successfully in several other domains including its original domain of semantic image segmentation. So I view the novelty of the approach to be fairly low. + +In addition, although the superior pooling method (which already exists in the literature) does outperform the alternatives evaluated here, the results are somewhat underwhelming, at only ~35-60% validation accuracy. How does this compare to a fully-supervised oracle method trained with per-timestep labels? + +The behavioral cloning results are also fairly underwhelming, and the experiments are not very clearly described. Am I correct in my understanding that the learned skills are composed to solve a task where the correct sequence of skills is known, but is longer than the training sequences? A success rate of 50% on this task seems rather low. How does this compare, as above, to a fully-supervised oracle baseline? Why is there no success rate reported for the CCNN baseline? + +I think this is a good application of weakly-supervised MIL, but I find the specific contributions to be lacking in novelty and impressiveness of results. There are several directions that I think could improve the work: +- oracle fully-supervised results, to indicate the gap between the fully- and weakly-supervised case +- more thorough baselines on the behavior task, such as Policy Sketches [1] +- perhaps the temporal aspect of the problem could be incorporated into the pooling approach more directly to produce a more novel algorithmic contribution + +[1] Andreas, Jacob, Dan Klein, and Sergey Levine. ""Modular multitask reinforcement learning with policy sketches."" Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017.",3,,ICLR2020 +ylfTwYbIIi0,1,MIDckA56aD,MIDckA56aD,"Interesting work, weak theoretical justification","**Summary.** In this work, the author(s) have presented an approach to identify valid perturbation operations that can be applied to the model inputs, which can be exploited to boost model robustness via purposefully corrupting the inputs during training. A weakly supervised setting has been assumed, such that pairs of valid perturbations (x, x') are available to support the identification. Technically, a conditional variational auto-encoder is trained to capture possible variations, where the perturbated inputs are used for reconstruction. The latent codes are treated as source space disruptions, concatenated with the original input x to reproduce x'. The author(s) provided empirical evidence to support their claim, along with some theoretical justifications. + +**Quality.** While the presentation is okay, I have concerns wrt the technical novelty and significance of this work. My detailed arguments are provided below. + +**Clarity.** The first three pages of this paper are very well written. 
The author(s) have clearly defined the problem setup, adequately reviewed relevant literature, and heuristically motivated the technical route. However, confusion arises when the author(s) try to provide some theoretical underpinnings. While I do not particularly mind if a paper has incredible empirical performance but offers very limited theoretical insights, I do find it annoying if some not-so-well-developed ""theories"" have been imposed upon the paper to boost technical significance. The notation for perturbed distributions is not clearly defined (Eqn (3)). Also, the necessary subset property (Defn. 1) and sufficient likelihood (Defn. 2) are heuristically defined, and the subsequent development seems to use this notation inconsistently. For example, the reconstruction error in Thm 2 is actually the expected log-likelihood, rather than the expected likelihood as defined in Defn. 2. It is also not clear what exactly ""Let r be the Mahalanobis distance which captures 1 − α of the probability mass for a k-dimensional standard multivariate normal for some 0 < α < 1."" means. + +**Originality.** While the idea of applying CVAE to learn valid perturbations is new, the technical contributions seem very incremental. It feels like a direct application of CVAE to this problem, rather than developing specialized treatments to tailor the solution. + +**Significance.** This is an interesting proposal. However, I encourage the author(s) to include additional comparisons with related techniques so that the reviewers can better evaluate its significance. For example, baselines such as AutoAug should be compared, as these methods also try to identify valid perturbations that do not affect the label. I would also love to see an adversarial variant of the proposed approach compared as well. + +**Correctness.** The main reason I am leaning towards a rejection is that I have serious concerns about the correctness of the theories presented. The author(s) seem to imply the KL terms should go away in order to bound the approximation error, but this is definitely not the case for VI. Such a phenomenon is more commonly known as KL vanishing in the VAE literature, and it is typically associated with uninformative latents -- i.e., cases where VAE learning has failed. The limit of R also does not make much sense. +",5,4.0,ICLR2021
+rJq18W9eG,2,rklhb2R9Y7,rklhb2R9Y7,heuristic combining environment rewards with an IRL-style rewards,"The draft proposes a heuristic combining environment rewards with an IRL-style reward recovered from expert demonstrations, seeking to extend the GAIL approach to IRL to the case of mismatched action spaces between the expert and the learner. The interesting contribution is, in my opinion, the self-exploration parameter that reduces the reliance of learning on demonstrations once they have been learned sufficiently well. + +Questions: + +- In general, it's known that behavioral cloning, of which this work seems to be an example insofar as it learns state distributions that are indistinguishable from the expert ones, can fail spectacularly because of distribution shift (Kaariainen@ALW06, Ross&Bagnell@AISTATS10, Ross&Bagnell@AISTATS11). Can you comment on whether GAN-based methods are immune or susceptible to this? + +- Would this work for tasks where the state-space has to be learned together with the policy? E.g. image captioning tasks or Atari games. + +- Is it possible to quantify the ease of learning or the frequency of use of the ""new"" actions, i.e. $A^l \setminus A^e$? 
Won't learning these actions effectively be as difficult as RL with sparse rewards? Say, in a grid world where 4-way diagonal moves allow reaching the goal faster, learner is king 8-way, demonstrations come from a 4-way expert, rewards are sparse and each step receives a -1 reward and the final goal is large positive -- does the learner's final policy actually use the diagonals and when? + +Related work: + +- Is it possible to make a connection to (data or policy) aggregation methods in IL. Such methods (e.g. Chang et al.@ICML15) can also sometimes learn policies better than the expert. + +Experiments: +- why GAIL wasn't evaluated in Fig. 3 and Fig. 4? + +Minor: +- what's BCE in algorithm 1? +- Fig.1: ""the the"" +- sec 3.2: but avoid -> but avoids +- sec 3.2: be to considers -> be to consider +- sec 3.2: any hyperparameter -> any hyperparameters +- colors in Fig 2 are indistinguishable +- Table 1: headers saying which method is prior work and which is contribution would be helpful +- Fig. 3: if possible try to find a way of communicating the relation of action spaces between expert and learner (e.g. a subset of/superset of). Using the same figure to depict self-exploration make it complicated to analyse. +- sec 3.2: wording in the last paragraph on p.4 (positive scaling won't _make_ anything positive if it wasn't before)",6,2.0,ICLR2019 +rJxC9rZ2FH,1,r1lnigSFDr,r1lnigSFDr,Official Blind Review #2,"This paper introduces two novel techniques to help long term signal propagation in RNNs. One is an initialization strategy which uses inverse sigmoid function to avoid the decay of the contribution of the input earlier in time and another is a new design of a refine gate which pushes the value of the gate closer to 0 or 1. The authors conduct exhaustive ablation and empirical studies on copy task, sequential MNIST, language modeling and reinforcement learning. + +Though the experiment section is solid, I still vote for rejection for the following reasons: + +1. The writing in the description of the UGI and refine fate is not clear. +a. The authors compares UGI to standard initialization but where is the standard initialization? I do not see ""standard initialization"" clearly defined in the paper. +b. I am not convinced how the UGI gate help avoid decaying of the input. There is a proposition 2 trying to explain some part of the mechanism of UGI. But the proposition is never proved anywhere and I am not sure why this proposition is important. More explanations are needed. Also this proposition is far away from the place the authors introduce the UGI. The authors may want to refer it in the place introducing UGI. +c. Similar to proposition 2, proposition 3 is not explained and proved in the paper. It is hard for me to analyze the importance of these two propositions. Overall, propositions 2 and 3 look isolated in the section. +d. Proposition 1 looks like a definition. Not sure why the authors name it as a proposition. + +2. Even though the title of the paper is ""improving the gating mechanism of recurrent neural networks"", the authors try to solve signal propagation problems. It is unclear why ""gate"" is important. Maybe other designs of the recurrent neural network can satisfy better the desiderata the authors want. Based on my limited knowledge, the initialization the authors mention (saturation) is exactly from the need of using a sigmoid gate. The importance of using ""gate"" should be discussed. + +3. The authors shrink the space before and after headings in the paper. 
I think this is not allowed in ICLR. It would be better that the authors correct the spacing in the revised version. + + +Minors: +1. page 1 second paragraph: repeated “proven” +2. page 1 second last paragraph: “due” -> “due to” +3. page 2 second last paragraph: repeated “be” +4. page 2 Equation (4) and (5): using some symbols like \odot for element wise multiplication will be good for the readers. + + +",3,,ICLR2020 +HJeb-tr93Q,3,HyzMyhCcK7,HyzMyhCcK7,An interesting addition to the (large) literature on methods to learn deep networks with quantized weights.,"This paper proposes a new approach to learning quantized deep neural networks, which overcome some of the drawbacks of previous methods, namely the lack of understanding of why straight-through gradient works and its optimization instability. The core of the proposal is the use of quantization-encouraging regularization, and the derivation of the corresponding proximity operators. Building on that core, the rest of the approach is reasonably standard, based on stochastic proximal gradient descent, with a homotopy scheme. + +The experiments on benchmark datasets provide clear evidence that the proposed method doesn't suffer from the drawbacks of straight-through gradient, does contributing to the state-of-the-art of this class of methods. + +",8,4.0,ICLR2019 +H1lpyw_atB,2,rJxWxxSYvB,rJxWxxSYvB,Official Blind Review #1,"summary + +This paper considers the ""weight transport problem"" which is the problem of ensuring that the feedforward weights $W_{ij}$ is the same as the feedback weights $W_{ji}$ in the spiking NN model of computation. This paper proposes a novel learning method for the feedback weights which depends on accurately estimating the causal effect of any spiking neuron on the other neurons deeper in the network. Additionally, they show that this method also minimizes a natural cost function. They run many experiments on FashionMNIST and CIFAR-10 to validate this and also show that for deeper networks this approaches the accuracy levels of GD-based algorithms. + + + +comments + +Overall I find this paper to be well-written and _accessible_ to someone who is not familiar with the biologically plausible learning algorithms. To overcome the massive computational burden, they employ a novel experimental setup. In particular, they use a separate non-spiking neural network to train the feedforward weights and use the spiking neurons only for alignment of weights. They have experimental evidence to show that this method is a legitimate workaround. I find their experimental setup and the results convincing to the best of my knowledge. The experimental results indeed show the claim that the proposed algorithm has the properties stated earlier (i.e., learns the feedback weights correctly and that using this to train deep neural nets provide better performance than weight alignment procedure). I must warn that I am not an expert in this area and thus, might miss some subtleties. Given this, it is also unclear to me why this problem is important and thus, would leave the judgement of this to other reviewers. Here I will score only based on the technical merit of the method used to solve the problem. + +I had one minor comment on the arrangement of the writing of the paper. Section 4 starts off with ""Results"" but the earlier sub-sections are not really about the results. I would split section 4 as methodology/algorithm and include the everything until section 4.4. From sub section 4.5 onwards are the actual results. 
+ + +overall decision + +Without commenting on the importance of this problem, I think this paper merits an acceptance based on the technical content. The paper provides convincing experiments to test the properties the author claim the new algorithm has.",6,,ICLR2020 +BWhOnU82YNU,2,LNtTXJ9XXr,LNtTXJ9XXr,An interesting extension of AdvProp with limited evidence of practicality,"## Overview + +The paper focuses on the generalization issue with adversarial training that various work has recently demonstrated. The paper studies the role of batch normalization (BN) in adversarial robustness and generalizability. The authors single out the rescaling operator in BN to significantly impact the clean and robustness trade-off in CNNs. They then introduce Robust Masking (Rob-Mask), which is shares similarities to the CVPR2020 paper by Xie et al. (2020). Xie et al. use an auxiliary BN in adversarial training, which uses different batch normalization parameters for adversarial samples, to improve the generalizability of CNNs. Still, the authors clearly state the differences between Rob-Mask and AdvProp. + +## Contributions + +The contributions of the paper are as follows: + +1. Showing the effect of BN (and, more specifically, the scale parameter of BN together with ReLU) as adversarial masking. + + a. Authors show that *adversarial fine-tuning* of only the BN parameters of a *vanilla-trained* network provides some adversarial robustness, although at the trade-off losing test accuracy. + + b. Authors show that *standard fine-tuning* of only the BN parameters of an *adversarially-trained* network increases the network's generalizability, although at the trade-off losing robustness. +2. Showing that interpolating between the BN parameters in Contribution 1 provides a smooth trade-off between generalizability and robustness. + +3. Devising an approach for utilizing different perturbation strengths for model training. The authors build on their Contribution 2 and propose $k$ basic (or better to say principle) rescaling parameters, the linear combination of which leads to a rescaling parameter. + +4. Providing a short yet informative, ablation study to show the effectiveness of Contribution +5. Showing experimental benefits over AdvProp on CIFAR10 and CIFAR100 datasets. + +Contribution 3 turns AdvProp into a particular case of RobMask. In fact, Xie et al. (2020) mention in their paper that ""a more general usage of multiple BNs will be further explored in future works,"" which seems to be the inspiration behind this paper. + +## Weaknesses + +1. The main limiting factor for the impact of this paper is the experiments. The paper only reports performance on CIFAR10 and CIFAR100. Given that the paper can be considered an extension/improvement over AdvProp, it is desirable to have similar largescale experiments in Xie et al. (2020) on ImageNet and its variations. A head-to-head comparison with the experiments in Xie et al. (2020) would provide a clearer picture to show the proposed method's power. +2. Regarding the practicality of the approach, I am missing a computational analysis of the approach to compare it against BN and AdvProp, e.g., it would be great if the authors provided a head-to-head comparison of training curves. Does your method take much longer to train? +3. How many times did you run each experiment? What are the standard deviations in Table 3 (and other tables)? Providing this information, at least in the supplementary materials, could clarify your results' statistical significance. 
+ +## Questions and comments for the authors + +1. The notation $\gamma_i$ is used both for BN's scaling parameter and for the learning rate, which turns the equations hard to follow. +2. On the bottom of page 7, you wrote: ""It is because both AdvProp and Adversarial training models are trained with adversarial examples generated with $\epsilon= 8/255$, while our methods use a random perturbation where $\epsilon_{max}=8/255$."" The term ""random perturbation"" is misleading here, as I believe you are also using PGD attack, but the adversarial perturbation's strength is randomized. Is that correct? +3. Please refer to Weaknesses 2. +4. I don't find Figure 2 informative at all. I suggest that the authors remove the figure and use the space to address the raised concerns. + +## Evaluation logic + +I find the paper an interesting extension of the CVPR2020 paper by Xie et al. However, the paper's experimental section does not provide enough information to the reader to see the concrete benefit of the proposed method in training a large scale CNNs. I think the paper could significantly benefit from a more extensive experimental setting. Given the limited novelty and lack of concrete evidence of practicality, I score the paper as a 5. + +## Post rebuttal evaluation + +I thank the authors for providing answers to the raised questions and providing further experiments. Regarding Figure 3, I suggest that the authors provide accuracy as a function of wallclock instead of epochs currently reported in the paper. As a result of the authors' responses, I increase my score to 6. ",6,4.0,ICLR2021 +tPVKmIR-xV-,3,nzLFm097HI,nzLFm097HI,Review,"The paper proposes a neuro-symbolic model for sample-efficient VQA, which turns each question into a probabilistic program which is then softly executed. The problem explored in the paper and its background and context presented clearly and it does a good job in motivating its importance and trade-offs between possible solutions. While the use of a probabilistic program to represent the questions might be too stiff / inflexible in my opinion and may not generalize well to less constrained natural language, this direction is still of course important and interesting. It also does a great job in presenting the existing approaches and comparing their properties. The writing is good and the model is presented clearly with a very useful diagram. + +However, the novelty of the paper seems limited to me, as it mainly combines together ideas that have been extensively explored in many prior works which are mentioned by the paper. Turning the question into a series of attentions over semantic factors appears in the NSM and partially in MAC models. The iterative memory updates appeared in MAC. Combining together small operations and functions defined by hand as dictated by programs as in page 6 of the model is the main idea of Module network. End-to-end differentiability for VQA models has also been extensively explored and multiple solutions have been proposed: relations networks, soft variants of NMN, and also MAC and NSM, etc. The use of stacks has been explored too in the stack-NMN model. I therefore feel the paper mostly recombines and tunes together these ideas rather than offering one particular new idea or insight. + +The paper presents results on CLEVR only, which goes back into my concern about the inflexibility of the probabilistic programs. 
Especially for this type of models, it will be useful to explore it on tasks beyond CLEVR such as VQA/GQA to show whether it can work for natural or richer language. The use of Mask R-CNN on CLEVR is also quite unreasonable in my opinion: the task has meant to be visually simple, so using a very strong visual model on it nullifies the visual aspect of it completely, making the model working on perfect semantic scene graph inputs rather than on ""real/natural"" uncertain and more noisy inputs. It also gives unfair advantage to the model when comparing to baselines which didn’t use object detectors on CLEVR but rather work directly with the image, e.g. MAC and others (presented in the table in the experiments section). + +At the same time, it is important to mention the model does get quantitative improvements in scores and especially sample efficiency, but the paper doesn’t make it clear what is the particular property or part of the model that allows for the improved numbers, and so the paper doesn’t leave the reader with a clear new takeaway message. +",5,5.0,ICLR2021 +SyqWgxzxf,1,SJzMATlAZ,SJzMATlAZ,Authors of this paper presented a clustering algorithm by jointly solving deep autoencoder and clustering as a global continuous objective. Experiments demonstrate better results than state-of-the-art clustering schemas.,"As authors stated, the proposed DCC is very similar to RCC-DR (Shah & Koltun, 2007). The only difference in (3) from RCC-DR is the decoding part, which is replaced by autoencoder instead of linear transformation used in RCC-DR. Authors claimed that there are three major differences. However, due to the highly nonconvex properties of both formulations, the last two differences hardly support the advantages of the proposed DCC comparing with RCC-DR because the solutions obtained by both optimization approaches are local solutions, unless authors can claim that the gradient-based solver is better than alternating approach in RCC-DR. Hence, DCC is just a simple extension of RCC-DR. + +In Section 3.2, how does the optimization algorithm handle the equality constraints in (5)? It is unclear why the existing autoencoder solver can be used to solve (3) or (5). It seems that the first term in (5) corresponds to the objective of autoencoder, but the last two terms added lead to different objective with respect to variables y. It is better to clarify the correctness of the optimization algorithm. + +Authors claimed that the proposed method avoid discrete reconfiguration of the objective that characterize prior clustering algorithms, and it does not rely on a priori knowledge of the number of ground-truth clusters. However, it seems not true since the graph construction at every epoch depends on the initial parameter delta_2 and the graph is constructed such that f_{i,j}=1 if distance is less than delta_2. As a result, delta_2 is a fixed threshold for graph construction, so it is indirectly related to the number of clusters generated. In the experiments, authors set it as the mean of the bottom 1% of the pairwise distances in E at initialization, and clustering assignment is given by connected component in the last graph. This parameter might be sensitive to the final results. + +Many terms in the paper are not well explained. For example, in (1), theta are treated as parameters to optimize, but what is the theta used for? Does the Omega related to encoder and decoder of the parameters in autoencoder. What is the scaled Geman-McClure function? Any reference? 
Why should this estimator be used? + +From the visualization results in Figure 1, it is interesting to see that K-means++ can achieve much better results on the space learned by DCC than that by SDAE from Table 2. In Figure 1, the embedding by SDAE (Figure 1(b)) seems more suitable for kmeans-like algorithm than DCC (Figure 1(c)). That is the reason why connected component is used for cluster assignment in DCC, not kmeans. The results between Table 2 and Figure 1 might be interesting to investigate. +",6,3.0,ICLR2018 +Bkl-NUGTtB,1,rklw4AVtDH,rklw4AVtDH,Official Blind Review #3,"Summary: + + +This work proposed a new variant of AMSGrad called Optimistic-AMSGrad, which makes use of the ideas from Optimistic Online learning. The authors showed that Optimistic-AMSGrad enjoys lower regret compared with AMSgrad in online learning. Experiment results backup their theory. + +Pros: + +This work proposed a new variant of AMSGrad called Optimistic-AMSGrad. In the paper the authors showed that by predicting the future gradient using m_t, the regret of Optimistic-AMSGrad can be lowered from \sum |g_t| to \sum |g_t - m_t|, which improves AMSGrad directly. The authors also gave a practical way to compute m_t based on history information with the underlying assumption on input x_t. The authors provided detailed experiment results to backup their theory. + +Cons: + +- There is no discussion about the choice of parameters. From equation 2, Corollary 1, it seems that to set \beta_2 = 1 achieves the best regret, which implies that to keep v_t unchanged achieves the best result. That sounds a bit strange because it suggests that the coordinate correction is useless. I recommend the authors to add some explanation for their corollary here. +- The intuition behind Algorithm 3 should be demonstrated more clear. Right now I do not understand how the correlation between x_t affects the prediction of m_t. The authors should add more explanation in Section 3. +- The experiment results are not well aligned with theoretical results, since the authors considered convex loss in their proof, while the optimization on neural network is a highly non-convex task. I suggest the authors add some simple convex examples to demonstrate the superiority of Optimistic-AMSGrad. +",3,,ICLR2020 +r1gllwP937,1,r14Aas09Y7,r14Aas09Y7,Interesting idea but needs more work,"This paper proposes to constrain the Generator of a WGAN-GP on patches locations to generate small images (“micro-patches”), with an additional smoothness condition so these can be combined into full images. This is done by concatenating micro-patches into macro patches, that are fed to the Discriminator. The discriminator aims at classifying the macro-patches as fake or real, while additionally recovering the latent noise used for generation as well as the spatial prior. + +There are many grammar and syntax issues (e.g. the very first sentence of the introduction is not correct (“Human perception has only partial access to the surrounding environment due to the limited acuity area of fovea, and therefore human learns to recognize or reconstruct the world by moving their eyesight.”). The paper goes to 10 pages but does so by adding redundant information (e.g. the intro is highly redundant) while some important details are missing + +The paper does not cite, discuss or compare with the related work “Synthesizing Images of Humans in Unseen Poses”, by G. Lalakrishan et al. in CVPR 2018. + +Page. 
3, in the overview the authors mention annotated components: in what sense, and how are these annotated? +How are the patches generated? By random cropping? + +Still in the overview, page 3, the first sentence states that D has an auxiliary head Q, but later it is stated that D has two auxiliary prediction heads. Why is the content prediction head trained separately while the spatial one is trained jointly with the discriminator? Is this based on intuition or the result of experimentations? + +What is the advantage in practice of using macro-patches for the Discriminator rather than full images obtained by concatenating the micro-patches? Has this comparison been done? + +While this is done by concatenation for micro-patches, how is the smoothness between macro-patches imposed? + +How would this method generalise to objects with less/no structure? + +In section 3.4, the various statements are not accompanied by justification or citations. In particular, how do existing image pinpointing frameworks all assume the spatial position of remaining parts of the image is known? + +How does figure 5 show that model can be misled to learn reasonable but incorrect spatial patterns? + +Is there any intuition/justification as to why discrete uniform sampling would work so much better than continuous uniform sampling? Could these results be included? + +How were the samples in Figure.2 chosen? Given that the appendix. C shows mostly the same image, the reader is led to believe these are carefully curated samples rather than random ones.",4,5.0,ICLR2019 +Bygy2gmWhX,1,SJzYdsAqY7,SJzYdsAqY7,Winograd-aware pruning of convolutions,"The paper proposes a technique (well, two) to prune convolutional layers to reduce the required amount of computation when the convolutions are done using the winograd algorithm. Winograd convolutions first transform the image and the filter, apply a multiplication in the transformed space, and then retransform the image back to the intended image space. The transformation of the filter, however, means that sparsity in the regular domain does not translate to sparsity in the winograd domain. + +This paper presents two techniques to achieve sparsity in the winograd domain: approximating winograd sparsity based on sparsity in the regular domain (thereby pruning with a non uniform cost model) and pruning in winograd space directly. The actual implementation alternates the first pruning technique and retraining the network with fixed sparsity followed by alternating winograd-space pruning and retraining. The tricky part is retraining in winograd space, which seems to require fine tuned per coordinate learning rates. + +My main concern is that the method feels fairly fragile and hyperparameter-heavy: tuning all the learning rates and sparsity rates for all these iterated levels of pruning doesn't seem easy. Similarly, it's unclear why the first stage of pruning is even needed if it's possible to prune and fine tune in winograd space directly. It's unclear from reading the paper how, given a computational budget, to decide the time spent in each phase of the process. + +",6,3.0,ICLR2019 +rkq18W9eG,2,SJFM0ZWCb,SJFM0ZWCb,Deep Temporal Clustering,"This paper proposes an algorithm for jointly performing dimensionality reduction and temporal clustering in a deep learning context. 
An autoencoder is utilized for dimensionality reduction alongside a clustering objective - that is, the autoencoder optimizes the MSE (LSTM layers are utilized in the autoencoder for modelling temporal information), while the latent space is fed into the temporal clustering layer. The clustering/autoencoder objectives are optimized in an alternating optimization fashion. + +The main con lies in this work being very closely related to t-SNE, i.e. compare the temporal clustering loss based on kl-div (eq 6) to t-SNE. If we consider, e.g., a linear 1-layer autoencoder to be equivalent to PCA (without the RNN layers), then in essence this formulation is closely related to applying PCA to reduce the initial dimensionality and then t-SNE. + +Also, do the cluster centroids appear to be roughly stable over many runs of the algorithm? As the authors mention, the method is sensitive to initialization. Since the averaged results over 5 runs are shown, the standard deviation would be helpful towards showing this empirically. + +On the positive side, it is likely that richer representations can be obtained via this architecture, and results appear to be good in comparison to other metrics + +The section of the paper that discusses heat-maps should be written more clearly. Figure 3 is commented on with respect to detecting an event vs. a non-event, but the process itself is not clearly described as far as I can see. + +minor note: dynamic time warping is formally not a metric",5,4.0,ICLR2018
+HyZh5Vgrg,3,SkCILwqex,SkCILwqex,Great start; recommended as workshop paper.,"This paper proposes the Layerwise Origin Target Synthesis (LOTS) method, which entails computing a difference in representation at a given layer in a neural network and then projecting that difference back to input space using backprop. Two types of differences are explored: linear scalings of a single input's representation and difference vectors between representations of two inputs, where the inputs are of different classes. + +In the former case, the LOTS method is used as a visualization of the representation of a specific input example, showing what it would mean, in input space, for the feature representation to be suppressed or magnified. 
I was left wondering whether, say, in Fig 2 the CONV2_1 layer was immediately after the CONV1_1 layer and whether the FC8 layer was the last layer in the network. + - In Fig 1, 2, 3, and 4, results of the application of LOTS are shown for many intermediate layers but miss for some reason applying it to the input (data) layer and the output/classification (softmax) layer. Showing the full range of possible results would reinforce the interpreatation (for example, in Fig 3, are even larger perturbations necessary in pixel space vs CONV1 space? And does operating directly in softmax space result in smaller perturbations than IP2?) + - The PASS score is mentioned a couple times but never explained at all. E.g. Fig 1 makes use of it but does not specify such basics as whether higher or lower PASS scores are associated with more or less severe perturbations. A basic explanation would be great. + - 4.2 states “In summary, the visualized internal feature representations of the origin suggest that lower convolutional layers of the VGG Face model have managed to learn and capture features that provide semantically meaningful and interpretable representations to human observers.” I don’t see that this follows from any results. If this is an important claim to the paper, it should be backed up by additional arguments or results. + + + +1/19/17 UPDATE AFTER REBUTTAL: +Given that experiments were added to the latest version of the paper, I'm increasing my review from 5 -> 6. I think the paper is now just on the accept side of the threshold.",6,4.0,ICLR2017 +ixmuJV47EU,1,Atpv9GUhRt6,Atpv9GUhRt6,This paper proposes a multiscale superpixel algorithm that tries to circumvent the limitation of SLIC. It is novel but the experiments are not so convincing. ,"Quality: The motivation of this paper is great, and the proposed WaveMesh is interesting. In general, it is a high-quality work. It will be better to supplement more experiments. + +Clarity: The expression is clear, and especially, the figures are very exquisite, and helpful for understanding. + +Originality: The proposed image-specific superpixeling algorithm WaveMesh is novel and try to circumvent the limitation of SLIC. However, the insufficient experiments affect its persuasiveness. + +Significance: The WaveMesh could filter the unimportant information and focus on the important information which is a significant research. It may be applied to more image-based task, such as image segmentation. Consequently, I think this work has a certain significance, but it needs more experiments to evaluate. + +Pros: +1. It is a novel idea that non-uniformly downsamples the images and gets a multiscale superpixel representation. +2. The figures are concise and precise, and they provide an intuitive understanding about the WaveMesh and WavePool. + +Cons: +1. The experiment is not sufficient. Take MNIST as example, this work only compares the WaveMesh and SLIC when the number of nodes is ~57 which is not convincing enough to conclude that the WaveMesh is effective. What is the acc about SLIC when the number of nodes is ~238. Besides, it seems that SLIC+SplineCNN achieves the best acc when the number of nodes is small. The experiments on Fashion-MNIST and CIFAR-10 have the same problem. +2. WaveMesh is a superpixeling algorithm that is not coupled with classification algorithm, but all the experiments are performed on SplineCNN. What is the performance of WaveMesh with other GNN algorithms? +3. How to evaluate the effectiveness of the image-dependent threshold T? 
+",5,4.0,ICLR2021 +pOkj6QSTCjJ,1,LkFG3lB13U5,LkFG3lB13U5,"Review of ""Adaptive Federated Optimization""","Summary: This paper presents an adaptive federated optimization framework that induces three different adaptive federated learning algorithms, which are proposed to address the issues of client drift due to data heterogeneity and lack of adaptivity. The authors presented thorough literature survey on the federated learning and formulated the meta-algorithm, FedOpt, introducing both server and client optimizers. Different from the popular FedAvg, the server optimizer in this work is an adaptive protocol which is originated from the client drift. The authors analyzed mathematically the convergence rates of the proposed framework and showed extensive experimental results on different benchmark datasets to validate the efficacy of the developed algorithms. + +Overall, this paper is well organized and technically sound. It is also easy to follow. The overall proof ideas are also correct, though I didn’t derive step by step. However, the authors need to pay attention to the following points to improve the current draft. + +1. Server optimizer can be confusing. The server optimizer developed in this study generalizes to a more adaptive one. But it is not really an optimization in the server, instead just another form of local model averaging in order to obtain the global model. I completely understand the motivation for this is to improve the performance beyond FedAvg, which may not necessarily perform well with non-IID data. However, without using zeroth order or first order information of objectives, to me, the so-called server optimizer is just a nonlinear local model averaging. +2. The server learning rate is bit counterintuitive. In Corollary 1 and 2, when defining the learning rate for the server, it looks like $\eta$ can be quite large. Of course, given the fact that the server optimizer is not really an optimization, that could be understandable. Additionally, the authors empirically investigated the relationship between the server and client learning rates in the appendix. But to me, it is naturally a parameter that plays a similar role as learning rate, not exactly the learning rate. +3. The communication efficiency needs more discussion. In Section 3, the authors presented one discussion on the communication efficiency, particularly quantifying $K$. I wonder how the authors obtained that. Also, it would be great to see more quantitative results on the communication efficiency. +4. Client heterogeneity is only qualitative in the paper. Based on the empirical implementation, I couldn’t find out how the authors simulated the non-IID data distributions for different clients for each benchmark task. This needs more detail. Also, the authors mentioned that the proposed framework can work effectively for moderate and naturally arising heterogeneity. But under how much heterogeneity does the algorithms work well? For example, for CIFAR 10, if having 10 clients, each client has only one class without any overlapping, is FedOpt still effective? +5. The experimental results look promising, but not showing significant outperforming capability. From Table 1, we can still observe that for some tasks, the AvgM is still favorably comparable. Also, when to select which method among FedAdam, FedAdaGrad, and FedYOGI is unclear. The authors need to give more discussion. +6. Based on the Corollary 1 and 2, it looks like when $T$ is sufficiently large, the FedOpt can achieve linear speed up, correct? 
However, the authors failed to give the specific lower bound for $T$, even if they only mentioned the dominating term that induced a sublinear rate. + +*********************************** +After carefully considering the rebuttal from the authors, I think I am more positive about the paper so I raised my score. The rebuttal clarifies most of my confusion about the paper, though more improvement can be done. ",6,4.0,ICLR2021 +SylOb8hgaX,2,HJG1Uo09Fm,HJG1Uo09Fm,"Very minor contribution, a manuscript that is lacking important details and does not relate it's technical section to existing work, with very thin evaluation. ","This work addresses the problem of learning a policy-learning-procedure, through meta-learning, that can adapt quickly to new tasks. This work uses MAML for meta-learning, and with this choice, the problem can be broken down into two loops: + +1) inner loop: adapting a policy \pi_phi based on unseen rollouts, where initial parameters phi were provided by the meta-trainer in the outer loop +2) outer loop: the meta-trainer tries to learn parameters phi on batches of tasks that provide good initial parameters + +In prior work on meta-reinforcement learning via MAML, both the outer as well as inner objective attempt to minimize a RL objective, leading to an algorithm that has very high sample-complexity. This work uses imitation learning for the outer loop procedure, to significantly decrease sample-complexity. + +Technical Contribution: +----------------------- +The idea of using imitation learning for reinforcement learning is well explored in the literature, and so using this idea in itself is not real contribution. There are several issues with the presentation of this work, that make it incredibly difficult to identify a technical contribution: + +1. overreaching statements without details to backup: you are writing the paper as if you are learning a ""RL algorithm"" that can be used to quickly learn new tasks. your manuscript does not really provide a description for this ""algorithm"". After re-reading several other papers I concluded that what you mean is that you learn an initial set of policy parameters that can quickly adapt to new related tasks and an update rule with which you update these parameters. However, standard MAML uses SGD as an update rule so there is really nothing to be learned here. Unfortunately, your paper provides zero detail on these claims of learning a ""RL procedure"", so for now I have to assume that you are simply learning a good initial set of policy parameters through meta-learning. If that is the case, then using imitation learning in this setting is really not novel, this has been done by a lot of other people before (you're just using MAML to learn ""better"" initial parameters). +2. you're technical section (section 4) provides some details on the technical challenges of using demonstrations to perform the outer loop optimization step. Unfortunately, you are not putting your work in the context of existing work ([1], [2]), that discuss and address the importance/issue of sampling in meta-rl with MAML. So it's impossible to know whether there is any new insight here + +Experimental Evaluation: +------------------------- +The experimental evaluation is very ""thin"", other than the original MAML-RL and pure imitation learning no other more recent baselines ([1], [2]) have been compared to. And only 2 relatively simple simulation settings are tested. 
+ +Summary: +----------- +Very minor contribution, a manuscript that is lacking important details and does not relate it's technical section to existing work, with very thin evaluation. + + +[1] The Importance of Sampling in Meta-Reinforcement Learning, NIPS 2018 +[2] CONTINUOUS ADAPTATION VIA META-LEARNING IN NONSTATIONARY AND COMPETITIVE ENVIRONMENTS, ICLR 2018",2,5.0,ICLR2019 +H1d32I-Ee,1,B1ElR4cgg,B1ElR4cgg,"interesting extension of GANs, promising results","This paper extends the GAN framework to allow for latent variables. The observed data set is expanded by drawing latent variables z from a conditional distribution q(z|x). The joint distribution on x,z is then modeled using a joint generator model p(x,z)=p(z)p(x|z). Both q and p are then trained by trying to fool a discriminator. This constitutes a worthwhile extension of GANs: giving GANs the ability to do inference opens up many applications that could previously only be addressed by e.g. VAEs. + +The results are very promising. The CIFAR-10 samples are the best I've seen so far (not counting methods that use class labels). Matching the semi-supervised results from Salimans et al. without feature matching also indicates the proposed method may improve the stability of training GANs.",8,4.0,ICLR2017 +Skx_mWw7TX,2,S1x8WnA5Ym,S1x8WnA5Ym,The connection between the propsoed regularizer and the DPP is not precise.,"For training GANs, the authors propose a regularizer that is inspired by DPP, which encourage diversity. This new regularizer is tested on several benchmark datasets and compared against other mode-collapse mitigating approaches. + + +In section 2, important references to recent techniques to mitigate mode collapse are missing, e.g. +BourGAN (https://arxiv.org/abs/1805.07674) +PacGAN (https://arxiv.org/abs/1712.04086) +D2GAN (https://arxiv.org/abs/1709.03831) + +Also related is evaluation of mode collapse as in +On GANs and GMMs (https://arxiv.org/abs/1805.12462) + +The actual loss that is proposed as in (5) and (6), seems far from the motivation that is explained as in Eq (3), using generator as a point process that resembles DPP. This conceptual gap makes the proposed explanation w.r.t DPP unsatisfactory. A more natural approach would be simply add $det(L_{S_B})$ itself as a regularizer. Extensive experimental comparisons with this straightforward regularizer is in order. + +It is not immediate if the proposed diversity regularizer $L_g^{DPP}$ in (5) is differentiable in general, as it involves computing the eigen-vectors. Elaborate on the implementation of the gradient update with respect to this new regularizer. + + +Experiments: + +1. The results in Table 3 for stacked-MNIST are very different from VEEGAN paper. Explain why a different setting was used compared to VEEGAN experiments. + +2. Similar experiments have been done in Unrolled-GAN paper. Add the experiment from that setting also. + +3. In general, split the experiments to two parts: one where the same setting is used as in the literature (e.g. VEEGAN, Unrolled GAN) and the results are compared against those reported in those papers. Another where new settings are studied, and the experiments of the baseline methods are also run by the authors. This is critical to differentiate such cases, as hyper parameters of competing algorithms could have not been tuned as rigorously as the proposed method. This improves the credibility of the experimental results, eventually leading to reproducibility. 
+ +",5,5.0,ICLR2019 +huPKGfzAdO,4,Ef1nNHQHZ20,Ef1nNHQHZ20,An interesting paper which needs major enhancement,"This paper proposed a layer-wise adversarial defense which added perturbations in each hidden layer considering the influence of hidden features in latent space from the ODE perspective. It is essential to enhance the adversarial model robustness by stabilizing both inputs and hidden layers. The proposed method leveraged two operator splitting theory w.r.t. the Lie-Trotter and the Strang-Marchuk splitting schemes to discretize the specially designed ODE formulation by integrating the continuous limit of back-propagated gradients into the forward process. The main contribution of this paper is to generate perturbations with the idea of ODE in each layer. Empirical studies were performed to show the effectiveness of the proposed method on two benchmarks with two attack methods. + +There are several concerns suggesting that this paper may not be accepted at its current form. +1. Novelty is a concern. +a) To improve the limitation that the existing adversarial training approaches mainly focus on perturbations to inputs, this paper added perturbations in each hidden layer with the ODE perspective, which is not convincing enough to me. +b) The authors did not state the significance compared with the other layer-wise perturbation methods. Why it is efficient to use the ODE method adding perturbation in each layer. +c) At least, the authors should compare with the other adversarial training methods in a layer-wise way, e.g. the following two references. + +Ref1. Sankaranarayanan, S.; Jain, A.; Chellappa, R.; and Lim, S. N. 2018. Regularizing deep networks using efficient layerwise adversarial training. In 32nd AAAI Conference on Artificial Intelligence. + +Ref2. Kumari, Nupur, et al. ""Harnessing the Vulnerability of Latent Layers in Adversarially Trained Models."" IJCAI. 2019 +2. Another concern is that the paper may be in lack of sufficient and convincing experimentation to support the usefulness of the proposed method. +a) The authors only selected one baseline method. As for the defense against attack method, this paper just used FGSM and IFGSM which are quite weak attacking methods. The proposed method may also be compared under strong attack methods, such as PGD and CW methods, since this paper targets defense. +b) Furthermore, the results may not be significantly enough to verify the effectiveness of the proposed method. +c) In Table1, did you use the same ResNet with the baseline, ResNet-110 or ResNet-164? It appears that the results in the paper are quite different from those reported in the original paper of Yang et al. +3. The authors mentioned that they leveraged a two-stage approach to solve the time-consuming problem during training in section 3.1. There are not clear descriptions about this approach how it works and the authors did not explain why it could save time. +4. In equation (1) and (5), some notations of x are different from each other. The authors did not give clear annotation which is confusing to me. The function f is not defined as well. +5. In the experimental part, the authors did not conduct black-box experiments which are import in adversarial training. +6. The paper could be further polished. There are quite a few typos. Figure 1 is not mentioned in the full paper. Table 4.1 should be Table 1 in the first paragraph in section 4.1. 
+",4,4.0,ICLR2021 +BkZRhnbxz,1,H1MczcgR-,H1MczcgR-,Interesting work on meta-optimization,"The paper discusses the problems of meta optimization with small look-ahead: do small runs bias the results of tuning? The result is yes and the authors show how differently the tuning can be compared to tuning the full run. The Greedy schedules are far inferior to hand-tuned schedules as they focus on optimizing the large eigenvalues while the small eigenvalues can not be ""seen"" with a small lookahead. The authors show that this effect is caused by the noise in the obective function. + +pro: +- Thorough discussion of the issue with theoretical understanding on small benchmark functions as well as theoretical work +- Easy to read and follow + +cons: +-Small issues in presentation: +* Figure 2 ""optimal learning rate"" -> ""optimal greedy learning rate"", also reference to Theorem 2 for increased clarity. +* The optimized learning rate in 2.3 is not described. This reduces reproducibility. +* Figure 4 misses the red trajectories, also it would be easier to have colors on the same (log?)-scale. + The text unfortunately does not explain why the loss function looks so vastly different + with different look-ahead. I would assume from the description that the colors are based + on the final loss values obtaine dby choosing a fixed pair of decay exponent and effective LR. + +Typos and notation: +page 7 last paragraph: ""We train the all"" -> We train all +notation page 5: i find \nabla_{\theta_i} confusing when \theta_i is a scalar, i would propose \frac{\partial}{\partial \theta_i} +page 2: ""But this would come at the expense of long-term optimization process"": at this point of the paper it is not clear how or why this should happen. Maybe add a sentence regarding the large/Small eigenvalues?",7,4.0,ICLR2018 +HygubeqpFS,2,S1x522NFvS,S1x522NFvS,Official Blind Review #3,"The anonymous authors consider the problem of training of classifiers in an unsupervised way. They propose an extension to a one-class based approach that can do anomaly detection in an unsupervised fashion. + +The main contribution is a modification of the target function for the training of one-class NN. The experiments are not convincing and the modification doesn't seem to provide much inside into representation learning and anomaly detection area. + +1. Figure 3: no axis labels +2. ROC AUC is not the best quality to measure the quality of imbalanced classification problems or anomaly detection, PR AUC (average precision) is better +3. In Table 2 and other experiments, there is a comparison to only one existed method e.g. authors don't reproduce results for OC-NN +4. Table 1: why compare a supervised method to an unsupervised one and don't compare to other methods? +",3,,ICLR2020 +RCXPwwy5aMt,3,6KZ_kUVCfTa,6KZ_kUVCfTa,Interesting work using information theory for model-based RL,"Summary: +The paper introduces a new method for model based RL that learns a dynamical latent representation from pixel data (images) using a maximum mutual information criterion together with a predictability loss. The core idea is that maximizing mutual information between states bias the encoder toward learning predictable (and therefore potentially task relevant) latent features while discarding unpredictable features. + +Relevance: +The paper addresses the very relevant problem of learning representations usable in a control task. 
+ +Originality: +The core novelty of the paper is to combine the mutual information approach introduced with PC3 in the context of control theory with the ""dream to control"" approach for model-based reinforcement learning in pixel space. + +Scientific quality: +- The proposed approach is in general well motivated. However. I am not convinced by the emphasis that the authors put in the proposition that the maximum mutual information loss helps to learn task relevant features. While it is true that the proposed approach is biased towards temporally predictable features, most distinctive features both in the real wold and in most games and simulations have as much temporal predictability as the task relevant ones. For example, the video backgrounds in the experiments are completely task irrelevant and at the same time highly predictable. In general, the encoder cannot trulely promote task-relevant features without having access to the reward structure. +- The experiment section offers a decently wide range of experiments. However, the authors should include more baselines, possibly including other model-based methods such as [1] and model-free methods such as some variant of DQNs. + +Pros: +- Very relevant research area +-Rather original combination of methods +-Clear and well-written paper + +Cons: +- The main claim that the method is biased towards learning task-relevant features is questionable. +- The experiments should contain more baselines including other model based approaches and some model free approach. + +References: +[1] Hafner, Danijar, et al. ""Learning latent dynamics for planning from pixels."" International Conference on Machine Learning. PMLR, 2019. + +",6,3.0,ICLR2021 +S1AQa7uxz,2,HJhIM0xAW,HJhIM0xAW,Interesting approach to learning neural response metrics,"In their paper, the authors propose to learn a metric between neural responses by either optimizing a quadratic form or a deep neural network. The pseudometric is optimized by positing that the distance between two neural responses to two repeats of the same stimulus should be smaller than the distance between responses to different stimuli. They do so with the application of improving neural prosthesis in mind. + +First of all, I am doubtful about this application: I don't think the task of neural prosthesis can ever be to produce idential output pattern to the same stimuli. Nevertheless, a good metric for neural responses that goes beyond e.g. hamming distance or squared error between spike density function would be clearly useful for understanding neural representations. + +Second, I find the framework proposed by the authors interesting, but not clearly motivated from a neurobiological perspective, as the similarity between stimuli does not appear to play a role in the optimized loss function. For two similar stimuli, natural responses of neural population can be more similar than the responses to two repetitions of the same stimulus. + +Third, the results presented by the authors are not convincing throughout. For example, 4B suggests that indeed the Hamming distance achieves lower error than the learned representation. + +Nevertheless, it is an interesting approach that is worthwhile pursuing further. ",6,3.0,ICLR2018